Improve c10 dispatcher lookup perf (#24882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24882
Previously, looking up a kernel accidentally copied the DispatchTableEntry, which holds a std::function (the cache creator function) as a member.
Because a std::function is expensive to copy, this cost us more than 50ns on each op call.
This diff fixes the problem by no longer copying the DispatchTableEntry.
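
For illustration only, here is a minimal sketch of the kind of accidental copy described above. It is not the actual c10 code: `Entry`, `Table`, `CopyCounter`, `lookupByValue`, and `lookupByRef` are hypothetical names, and the real DispatchTableEntry has a different layout. The sketch counts how often the callable stored in the std::function is copy-constructed during a lookup.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

// Counts copy constructions of the callable held inside the std::function.
static int g_target_copies = 0;

// Hypothetical callable standing in for a cache creator function.
struct CopyCounter {
  CopyCounter() = default;
  CopyCounter(const CopyCounter&) { ++g_target_copies; }
  CopyCounter(CopyCounter&&) noexcept = default;
  int operator()() const { return 0; }
};

// Hypothetical stand-in for a dispatch table entry (not the real c10 type).
struct Entry {
  int kernel_id;
  std::function<int()> cache_creator;
};

// Hypothetical table offering both lookup styles for comparison.
class Table {
 public:
  void set(int key, Entry e) { entries_.emplace(key, std::move(e)); }

  // Returns by value: copies the Entry, including its std::function member.
  Entry lookupByValue(int key) const { return entries_.at(key); }

  // Returns by const reference: no copy is made.
  const Entry& lookupByRef(int key) const { return entries_.at(key); }

 private:
  std::unordered_map<int, Entry> entries_;
};

// Number of callable copies triggered by one by-value lookup.
int copiesDuringByValueLookup() {
  Table t;
  t.set(1, Entry{42, CopyCounter{}});
  const int before = g_target_copies;
  Entry e = t.lookupByValue(1);
  assert(e.kernel_id == 42);
  return g_target_copies - before;
}

// Number of callable copies triggered by one by-reference lookup.
int copiesDuringByRefLookup() {
  Table t;
  t.set(1, Entry{42, CopyCounter{}});
  const int before = g_target_copies;
  const Entry& e = t.lookupByRef(1);
  assert(e.kernel_id == 42);
  return g_target_copies - before;
}
```

In this sketch the by-value lookup copies the stored callable at least once per call, while the by-reference lookup copies nothing. The same effect can appear at the call site: writing `auto e = t.lookupByRef(1);` instead of `const auto& e = t.lookupByRef(1);` also makes a copy.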
ghstack-source-id: 88611173
Differential Revision: D16910530
fbshipit-source-id: 44eeaa7f6ffead940b4a124f0c31d8ef71404db3