[PyTorch] Reduce template expansion in call_functor_with_args_from_stack (#51313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51313
The problem here is similar to the one described in
https://devblogs.microsoft.com/cppblog/build-throughput-series-more-efficient-template-metaprogramming/
in that we are iterating over an integer sequence of length N, where N
is the number of argument types to our function, and specializing
`TypeListAt` (which we call `element_t`) for each Ith element of the
typelist. Each such lookup instantiates O(I) template specializations,
for a total of O(N^2).
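
For illustration, here is a minimal, self-contained sketch of that costly
pattern. IValue, Stack, ivalue_to_arg, typelist, and element_t are
simplified stand-ins for the real c10 machinery, not the actual code:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct IValue { int payload; };
using Stack = std::vector<IValue>;

template <class... Ts> struct typelist {};

// Recursive typelist indexing (the role played by TypeListAt / element_t):
// instantiating element_t<I, List> costs O(I) nested specializations.
template <std::size_t I, class List> struct element;
template <std::size_t I, class Head, class... Tail>
struct element<I, typelist<Head, Tail...>>
    : element<I - 1, typelist<Tail...>> {};
template <class Head, class... Tail>
struct element<0, typelist<Head, Tail...>> { using type = Head; };
template <std::size_t I, class List>
using element_t = typename element<I, List>::type;

template <class T>
T ivalue_to_arg(const IValue& v) { return T(v.payload); }

// Before: for each index I in the index_sequence we instantiate
// element_t<I, ArgList>, so a functor with N arguments costs
// O(0 + 1 + ... + (N-1)) = O(N^2) specializations in total.
template <class Functor, class ArgList, std::size_t... Is>
auto call_functor_with_args_from_stack(Functor& f, Stack& stack,
                                       std::index_sequence<Is...>) {
  return f(ivalue_to_arg<element_t<Is, ArgList>>(stack[Is])...);
}
```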
The solution is also similar: we iterate over the typelist
directly. Unlike in the blog post, we still need each argument's index
in the sequence, so we retain the index_sequence.
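
Under the same stand-in definitions as the sketch above, the reshaped
expansion might look roughly like this (the actual diff differs in details
such as naming and the typelist utilities):

```cpp
// After: deduce ArgTypes... from a typelist tag parameter and expand it in
// lockstep with the index_sequence. Each argument type is named directly by
// the pack expansion, and the index is only used to address the stack, so
// instantiation cost drops from O(N^2) to O(N).
template <class Functor, class... ArgTypes, std::size_t... Is>
auto call_functor_with_args_from_stack(Functor& f, Stack& stack,
                                       std::index_sequence<Is...>,
                                       typelist<ArgTypes...>*) {
  return f(ivalue_to_arg<ArgTypes>(stack[Is])...);
}

// Hypothetical usage: unbox two stack entries into an int and a double and
// call the functor with them.
int main() {
  auto add = [](int a, double b) { return static_cast<int>(a + b); };
  Stack stack{IValue{1}, IValue{2}};
  return call_functor_with_args_from_stack(
      add, stack, std::make_index_sequence<2>{},
      static_cast<typelist<int, double>*>(nullptr));
}
```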
ghstack-source-id: 121363464
Test Plan:
Inspect -ftime-trace output for RegisterCPU.cpp.
Before: P168220187
After: P168220294
We can see that we spend less time instantiating
call_functor_with_args_from_stack and a similar amount of time compiling
it. The win is modest, but it's a win and I've already written it, so
I'm sending it out. (I was hoping it would reduce compilation time for
make_boxed_from_unboxed_functor.)
Reviewed By: bhosmer
Differential Revision: D26136784
fbshipit-source-id: c91a523486e3019bd21dcd03e51a58aa25aa0981