Improve boxed dispatch performance (#33313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33313
Instead of only remembering the number of arguments and inspecting every one of them on the stack,
the DispatchKeyExtractor now remembers the exact locations of the dispatch-relevant arguments
(i.e., Tensor arguments) and only looks at those.
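
A toy Python sketch of the idea, not the actual C++ implementation: the names
Tensor, extract_by_scanning, and PositionalExtractor are hypothetical, and
dispatch keys are modelled as plain strings. It assumes the arguments sit on
top of the stack in declaration order.

```python
class Tensor:
    """Hypothetical stand-in for a tensor carrying a set of dispatch keys."""

    def __init__(self, keys):
        self.keys = frozenset(keys)


def extract_by_scanning(stack, num_args):
    """Old approach: look at all of the last num_args stack entries."""
    keys = set()
    for value in stack[len(stack) - num_args:]:
        if isinstance(value, Tensor):
            keys |= value.keys
    return keys


class PositionalExtractor:
    """New approach: precompute where the Tensor arguments are, once."""

    def __init__(self, arg_types):
        num_args = len(arg_types)
        # Offsets are counted from the top of the stack, so they stay valid
        # no matter how many unrelated values sit below the arguments.
        self.reverse_offsets = [
            num_args - i for i, t in enumerate(arg_types) if t == "Tensor"
        ]

    def extract(self, stack):
        keys = set()
        for offset in self.reverse_offsets:
            # Only the dispatch-relevant positions are touched.
            keys |= stack[-offset].keys
        return keys
```

Both functions return the same key set; the second simply skips the
non-Tensor arguments instead of type-checking each stack entry per call.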
ghstack-source-id: 101908386
Test Plan: unit tests, benchmarks
Differential Revision: D19748549
fbshipit-source-id: b5b9ff2233b3507e0b600460f422912cfa9e3f0f