[C2] Small improvement for elementwise_mul operator. (#33537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33537
For embeddings smaller than 128, we can get a bit more compute by
allocating fewer threads per block.
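
A rough sketch of the idea, not the code from this change: assuming each
block processes one embedding row and the default block size is 128 threads,
the launch could pick a smaller, warp-aligned block for narrow embeddings.
The helper name, the 128-thread default and the rounding rule are all
assumptions made for illustration.

```cpp
#include <algorithm>

// Hypothetical constants, assumed for illustration only.
constexpr int kDefaultThreadsPerBlock = 128;  // assumed default block size
constexpr int kWarpSize = 32;                 // warp size on current NVIDIA GPUs

// Hypothetical helper: choose threads per block for a kernel in which one
// block handles one embedding row. Rows narrower than the default block size
// get a block rounded up to a whole number of warps, so fewer threads idle.
inline int ThreadsPerBlockForEmbedding(int embedding_dim) {
  if (embedding_dim <= 0) {
    return kWarpSize;  // degenerate input: fall back to a single warp
  }
  // Round the row width up to the next multiple of the warp size.
  const int rounded =
      ((embedding_dim + kWarpSize - 1) / kWarpSize) * kWarpSize;
  // Never exceed the default block size.
  return std::min(rounded, kDefaultThreadsPerBlock);
}
```

With these assumptions, a 64-wide embedding would launch with 64 threads per
block instead of 128, leaving room for more resident blocks per multiprocessor.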
Test Plan: Unit-test, benchmark.
Reviewed By: xianjiec
Differential Revision: D19969594
fbshipit-source-id: 6cc6b14fc61302804bed9093ea3591f21e3827d8