Fix copy kernel speed regression introduced in #29631 (#31279)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31271
This fixes the copy kernel speed regression introduced in https://github.com/pytorch/pytorch/issues/29631.
The previous implementation forces the compiler to instantiate `static_cast_with_inter_type` as a standalone function because it is passed as an argument to another function. This prevents compiler optimizations such as automatic vectorization, and the function call itself is expensive compared to a single cast instruction.
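To illustrate the difference, here is a minimal sketch, not the actual PyTorch code. `cast_sketch`, `copy_via_pointer`, `copy_direct`, and `self_check` are hypothetical names made up for this example; `cast_sketch` only stands in for `static_cast_with_inter_type`. Whether a given compiler manages to inline the pointer variant depends on the compiler and optimization level.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for c10::static_cast_with_inter_type (not the real definition).
template <typename dest_t, typename src_t>
struct cast_sketch {
  static dest_t apply(src_t src) { return static_cast<dest_t>(src); }
};

// Variant A: the cast is handed over as a function argument. Taking its
// address requires apply() to exist as a real function, and unless the
// compiler can see through the pointer, every element pays for a call.
template <typename dest_t, typename src_t>
void copy_via_pointer(const src_t* src, dest_t* dst, std::size_t n,
                      dest_t (*cast)(src_t)) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = cast(src[i]);
  }
}

// Variant B: the cast is called directly in the loop body, so the compiler
// is free to inline it down to a conversion and vectorize the loop.
template <typename dest_t, typename src_t>
void copy_direct(const src_t* src, dest_t* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = cast_sketch<dest_t, src_t>::apply(src[i]);
  }
}

// Both variants must produce identical results; only the generated code differs.
void self_check() {
  std::vector<double> src = {1.5, -2.25, 3.0};
  std::vector<int> a(src.size()), b(src.size());
  copy_via_pointer(src.data(), a.data(), src.size(),
                   &cast_sketch<int, double>::apply);
  copy_direct(src.data(), b.data(), src.size());
  assert(a == b);
}
```

The two variants are functionally equivalent; the point is only that the direct call leaves no reason for a standalone `apply` symbol to be emitted, which is what the `readelf` check below looks for.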
To check the change, run:
```
readelf -Ws /home/xgao/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep static_cast_with_inter_type
```
On the nightly build, the output is:
```
168217: 0000000001852bf0 5 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIsdE5applyEd
168816: 0000000001852d30 33 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEaE5applyEa
168843: 00000000018531f0 7 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIblE5applyEl
168930: 0000000001852c20 3 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIslE5applyEl
168935: 00000000018528d0 124 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIfNS_4HalfEE5applyES1_
169023: 0000000001852f30 17 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEhE5applyEh
169713: 00000000018525c0 3 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIahE5applyEh
170033: 0000000001852c10 3 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIsiE5applyEi
170105: 0000000001852bd0 5 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIshE5applyEh
170980: 0000000001852fc0 27 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeISt7complexIdES1_IfEE5applyES3_
171398: 0000000001852810 13 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIdbE5applyEb
171574: 00000000018532e0 35 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIbNS_8BFloat16EE5applyES1_
171734: 0000000001852b20 6 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIlSt7complexIdEE5applyES2_
172422: 0000000001853350 54 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EaE5applyEa
172704: 00000000018533c0 38 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EfE5applyEf
172976: 0000000001852890 10 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIflE5applyEl
173038: 0000000001852f80 9 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEfE5applyEf
173329: 00000000018531c0 20 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIbfE5applyEf
173779: 00000000018524d0 3 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIhiE5applyEi
174032: 0000000001852960 14 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIfNS_8BFloat16EE5applyES1_
174334: 0000000001852d60 29 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEdE5applyEd
174470: 0000000001852c60 124 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIsNS_4HalfEE5applyES1_
174770: 0000000001852bc0 15 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIlNS_8BFloat16EE5applyES1_
176408: 0000000001853980 144 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeINS_4HalfEbE5applyEb
176475: 0000000001852790 128 FUNC LOCAL DEFAULT 9 _ZN3c1027static_cast_with_inter_typeIdNS_4HalfEE5applyES1_
....
```
After this PR, the same command produces empty output:
```
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31279
Differential Revision: D19075587
Pulled By: ngimel
fbshipit-source-id: c20088241f39fa40c1d055f0a46eb5b9ece52e71