Use dynamically allocated shared memory and a dynamic thread count in scan with indices (#103502)
This PR continues #103435 and makes the following changes:
- Uses a dynamic number of threads for the innermost-dimension scan-with-indices function.
- Uses dynamically allocated shared memory to remove the `num_threads` template arguments.
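
For readers unfamiliar with the pattern, here is a minimal sketch of the general technique. It is not the code in this PR: the kernel `row_cummax_with_index`, the host helper `cummax_rows`, and the one-block-per-row layout are hypothetical names and choices made only for illustration. The kernel reads its thread count from `blockDim.x` and takes its scratch buffer from `extern __shared__` memory sized at launch, so no thread-count template argument is needed. NaN handling and CUDA error checking are omitted.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical kernel (illustration only): each block scans one row and
// produces the running maximum and the index where it occurs. On ties the
// later index wins.
__global__ void row_cummax_with_index(const float* in, float* out_val,
                                      int* out_idx, int row_size) {
  // Dynamically sized shared memory: the byte count is the third launch
  // parameter, so the buffer size is not baked into a template argument.
  extern __shared__ float vals[];
  const int n = blockDim.x;  // runtime thread count
  int* idxs = reinterpret_cast<int*>(vals + n);
  const int t = threadIdx.x;
  const float* row = in + blockIdx.x * row_size;

  // Running result of all previous chunks, kept identically by every thread.
  float carry_val = -INFINITY;
  int carry_idx = 0;

  for (int base = 0; base < row_size; base += n) {
    const int i = base + t;
    vals[t] = (i < row_size) ? row[i] : -INFINITY;  // pad the tail chunk
    idxs[t] = (i < row_size) ? i : 0;
    __syncthreads();

    // Hillis-Steele inclusive scan over the chunk held in shared memory.
    for (int off = 1; off < n; off <<= 1) {
      float ov = 0.0f;
      int oi = 0;
      const bool has = (t >= off);
      if (has) { ov = vals[t - off]; oi = idxs[t - off]; }
      __syncthreads();
      if (has && ov > vals[t]) { vals[t] = ov; idxs[t] = oi; }
      __syncthreads();
    }

    // Combine the chunk-local result with the carry from earlier chunks.
    if (i < row_size) {
      const bool own = (vals[t] >= carry_val);
      out_val[blockIdx.x * row_size + i] = own ? vals[t] : carry_val;
      out_idx[blockIdx.x * row_size + i] = own ? idxs[t] : carry_idx;
    }
    const float last_val = vals[n - 1];
    const int last_idx = idxs[n - 1];
    if (last_val >= carry_val) { carry_val = last_val; carry_idx = last_idx; }
    __syncthreads();  // finish all reads before the next chunk overwrites
  }
}

// Hypothetical host helper: the thread count is an ordinary runtime argument.
void cummax_rows(const std::vector<float>& in, int rows, int row_size,
                 int num_threads, std::vector<float>& vals,
                 std::vector<int>& idxs) {
  const size_t n = in.size();
  float* d_in = nullptr;
  float* d_val = nullptr;
  int* d_idx = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_val, n * sizeof(float));
  cudaMalloc(&d_idx, n * sizeof(int));
  cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  // One float and one int of scratch space per thread.
  const size_t smem_bytes = num_threads * (sizeof(float) + sizeof(int));
  row_cummax_with_index<<<rows, num_threads, smem_bytes>>>(d_in, d_val, d_idx,
                                                           row_size);

  vals.resize(n);
  idxs.resize(n);
  cudaMemcpy(vals.data(), d_val, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(idxs.data(), d_idx, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_val);
  cudaFree(d_idx);
}
```

Because both the block size and the shared-memory size are launch-time values in this sketch, a caller can pick `num_threads` per input (for example, fewer threads for short rows) without instantiating a separate kernel for each choice.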
@ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103502
Approved by: https://github.com/ngimel