Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057)
This is to improve the performance for hybrid sparse coo tensor on CPU path. This case is appeared at the DLRM terabyte test.
With this fix, according to the previous performance test data, it got ~10x performance improvement on DLRM execution.
without this, the DLRM will run as
Finished training it 100/1000 of epoch 0, 2969.25 ms/it, loss 0.220505, accuracy 0.000 %
with this, the DLRM will run as
Finished training it 100/1000 of epoch 0, 270.71 ms/it, loss 0.220505, accuracy 0.000 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23057
Approved by: https://github.com/VitalyFedyunin, https://github.com/malfet