COO intersection primitives: performance improvement (#92976)
This PR improves COO intersection primitives by:
* making them sync-less (for dims <= 8; the limit can be changed to any value that fits on the stack).
* improving performance by issuing far fewer kernel calls.
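
For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what a COO index intersection computes. It is an illustration only, not the implementation in this PR: the function names `flatten_indices` and `coo_intersection` are hypothetical, the inputs are assumed to be coalesced (no duplicate indices), and the dictionary lookup stands in for whatever search strategy the real kernels use.

```python
from typing import Dict, List, Sequence, Tuple


def flatten_indices(indices: Sequence[Sequence[int]], shape: Sequence[int]) -> List[int]:
    """Map each multi-dimensional COO index to one row-major linear key."""
    keys = []
    for idx in indices:
        key = 0
        for i, size in zip(idx, shape):
            key = key * size + i
        keys.append(key)
    return keys


def coo_intersection(
    a_indices: Sequence[Sequence[int]],
    b_indices: Sequence[Sequence[int]],
    shape: Sequence[int],
) -> List[Tuple[int, int]]:
    """Return (position in a, position in b) for every index present in both.

    Assumption: both inputs are coalesced, so each key appears at most once.
    """
    a_keys = flatten_indices(a_indices, shape)
    b_keys = flatten_indices(b_indices, shape)
    # Build a lookup from linear key to its position in b.
    b_pos: Dict[int, int] = {k: j for j, k in enumerate(b_keys)}
    return [(i, b_pos[k]) for i, k in enumerate(a_keys) if k in b_pos]


if __name__ == "__main__":
    a = [(0, 1), (1, 2), (2, 0)]
    b = [(1, 2), (2, 0), (2, 3)]
    print(coo_intersection(a, b, shape=(3, 4)))
```

The returned position pairs are what an elementwise operation such as a sparse-sparse multiply would use to pick matching values from each operand.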
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92976
Approved by: https://github.com/cpuhrsch, https://github.com/pearu