Support elementwise add / mul for [B, *] nested, [B, 1] dense (CUDA only) (#95620)
Small hack to reuse the 3D custom kernel from #88289 for elementwise add / mul between a [B, *] nested tensor and a [B, 1] dense tensor: the inputs are simply treated as [B, *, 1] and [B, 1, 1]. This was added to satisfy an internal ask.
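
For reference, the intended semantics can be sketched in plain Python. This is only an illustration of the broadcast, not the CUDA kernel or the PyTorch API: the nested tensor is modeled as a list of ragged rows, the dense tensor as a list of one-element rows, and the helper name `nested_dense_elementwise` is hypothetical.

```python
import operator


def nested_dense_elementwise(nested, dense, op):
    """Apply op between a [B, *] ragged list and a [B, 1] list (hypothetical helper)."""
    if len(nested) != len(dense):
        raise ValueError("batch sizes differ")
    if any(len(d) != 1 for d in dense):
        raise ValueError("dense must have shape [B, 1]")
    # Viewing the inputs as [B, *, 1] and [B, 1, 1] means the single dense
    # value of each batch entry is broadcast across that entry's ragged dim.
    return [[op(x, d[0]) for x in row] for row, d in zip(nested, dense)]


nested = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]  # [B, *] with B = 3
dense = [[10.0], [20.0], [30.0]]               # [B, 1]

added = nested_dense_elementwise(nested, dense, operator.add)
multiplied = nested_dense_elementwise(nested, dense, operator.mul)
```

Each output row keeps the length of the corresponding nested row, so the result is again a [B, *] ragged structure.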
Future work: fully general broadcasting support between mixed nested / dense inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95620
Approved by: https://github.com/cpuhrsch, https://github.com/drisspg