[PyTorch] Add set_data_ptr_noswap & use where possible (#52244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52244
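`StorageImpl::set_data_ptr` returns the old pointer, so it has to swap
the stored pointer with its argument, which is extra work when the
caller discards the result. Add `set_data_ptr_noswap`, which does not
return the old pointer, and use it where possible. Found because
`std::swap<at::DataPtr>` was showing up in profiling, although at < 1%.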
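A minimal sketch of the difference between the two setters, not the
actual `StorageImpl` code. `Storage` and the `DataPtr` alias below are
hypothetical stand-ins (a `std::unique_ptr<int>` instead of
`at::DataPtr`), chosen only to show why the variant that returns the
old pointer needs a swap while the other needs only a move assignment.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for at::DataPtr: a move-only owning pointer.
using DataPtr = std::unique_ptr<int>;

// Hypothetical stand-in for StorageImpl, reduced to the one member
// that matters for this comparison.
class Storage {
 public:
  // Swapping variant: exchanges the stored pointer with the argument
  // so the previous pointer can be handed back to the caller.
  DataPtr set_data_ptr(DataPtr&& data_ptr) {
    std::swap(data_ptr_, data_ptr);
    return std::move(data_ptr);
  }

  // Non-swapping variant: the previous pointer is released as part of
  // the move assignment and nothing is returned.
  void set_data_ptr_noswap(DataPtr&& data_ptr) {
    data_ptr_ = std::move(data_ptr);
  }

  const int* data() const { return data_ptr_.get(); }

 private:
  DataPtr data_ptr_;
};
```

Callers that ignore the return value of `set_data_ptr` are the ones
that can switch to the non-swapping variant.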
ghstack-source-id: 121795131
Test Plan:
Run AdIndexer benchmark under `perf stat`.
Before:
```
17,990.01 msec task-clock # 0.998 CPUs utilized ( +- 0.43% )
6,550 context-switches # 0.364 K/sec ( +- 31.42% )
3 cpu-migrations # 0.000 K/sec ( +- 7.14% )
103,820 page-faults # 0.006 M/sec ( +- 2.47% )
35,610,511,494 cycles # 1.979 GHz ( +- 0.40% ) (50.03%)
71,651,045,779 instructions # 2.01 insn per cycle ( +- 0.07% ) (50.02%)
11,679,947,910 branches # 649.246 M/sec ( +- 0.10% ) (50.03%)
69,088,927 branch-misses # 0.59% of all branches ( +- 0.24% ) (50.06%)
```
After:
```
17,896.20 msec task-clock # 0.999 CPUs utilized ( +- 0.24% )
4,011 context-switches # 0.224 K/sec ( +- 27.77% )
3 cpu-migrations # 0.000 K/sec
100,350 page-faults # 0.006 M/sec ( +- 1.58% )
35,418,702,208 cycles # 1.979 GHz ( +- 0.23% ) (50.05%)
71,449,334,935 instructions # 2.02 insn per cycle ( +- 0.09% ) (50.03%)
11,652,819,899 branches # 651.134 M/sec ( +- 0.12% ) (50.04%)
69,744,411 branch-misses # 0.60% of all branches ( +- 0.53% ) (50.06%)
```
The cycles difference is within the noise, but it looks like we have a
0.28% instruction count win, which is outside the noise (and fits the
intuition that this should be better).
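For reference, the 0.28% figure can be reproduced from the `perf stat`
counters above. The helper name `percent_reduction` and the constant
names are illustrative only; the values are copied from the Before and
After output.

```cpp
// Relative reduction from `before` to `after`, in percent.
constexpr double percent_reduction(double before, double after) {
  return (before - after) / before * 100.0;
}

// Counter values copied from the perf stat output in this message.
constexpr double kInstructionsBefore = 71651045779.0;
constexpr double kInstructionsAfter = 71449334935.0;
constexpr double kCyclesBefore = 35610511494.0;
constexpr double kCyclesAfter = 35418702208.0;

// Roughly 0.28%, against reported run-to-run variation of 0.07% to 0.09%.
constexpr double kInstructionWin =
    percent_reduction(kInstructionsBefore, kInstructionsAfter);

// Roughly 0.54%, against reported run-to-run variation of 0.23% to 0.40%.
constexpr double kCycleWin = percent_reduction(kCyclesBefore, kCyclesAfter);
```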
Reviewed By: hlu1
Differential Revision: D26437297
fbshipit-source-id: bf0fceccf6ad78f1497b03ccb4cdfd1a21c6846c