Dense->BSR performance improvement (#83085)
Applies the algorithm for re-batching compressed indices to avoid launching a separate kernel for each batch. This is an optimization for inputs with `dim() >= 3` and does not change behavior in any way.
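
For illustration only, here is a minimal pure-Python sketch of the general re-batching idea. It is not the code from this PR. It assumes element-wise (1x1) blocks and a batch of equally shaped matrices, and the helper name `rebatch_crow_indices` is hypothetical. The batch is flattened into one tall matrix, the compressed row indices are computed in a single pass, and each batch's starting offset is then subtracted to recover per-batch indices.

```python
from itertools import accumulate


def rebatch_crow_indices(batched_dense):
    """Compute per-batch compressed row indices in one pass (hypothetical helper).

    batched_dense: list of B matrices, each a list of R rows of numbers.
    Returns a list of B index lists, each of length R + 1.
    """
    n_batch = len(batched_dense)
    n_rows = len(batched_dense[0])

    # Treat the batch as one tall matrix with B * R rows.
    flat_rows = [row for matrix in batched_dense for row in matrix]

    # Single pass: count nonzeros per row, then take a running sum.
    counts = [sum(1 for value in row if value != 0) for row in flat_rows]
    flat_crow = [0] + list(accumulate(counts))

    # Re-batch: subtract each batch's starting offset so its indices start at 0.
    return [
        [flat_crow[b * n_rows + i] - flat_crow[b * n_rows] for i in range(n_rows + 1)]
        for b in range(n_batch)
    ]


batch = [
    [[1, 0], [0, 2]],
    [[0, 0], [3, 4]],
]
print(rebatch_crow_indices(batch))  # [[0, 1, 2], [0, 0, 2]]
```

The point of the sketch is that the per-row counting and the running sum happen once over the flattened input, rather than once per batch.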
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83085
Approved by: https://github.com/bhosmer, https://github.com/nikitaved