[PyTorch] Back scalar value to pinned memory for .item() (#119202)
Summary: This diff optimizes the .item() call by backing the scalar value's storage with pinned memory, so that we don't trigger an implicit synchronization inside the libcuda library.
Test Plan:
# Prod VDD model on H100
Vanguard runs
9.8k qps -> 10.1k qps (~3% improvement)
# .item() Benchmark
1 thread, 50k iterations per run
Consistent ~2-3% improvement
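The benchmark script itself is not part of this description. A minimal sketch of what a single-threaded, 50k-iteration timing loop might look like is given below. The function name `time_item` and the tensor used are assumptions for illustration, not the script that produced the numbers in this test plan. It falls back to CPU when CUDA is unavailable, in which case the pinned-memory change is not exercised.

```python
import time

import torch


def time_item(iterations: int = 50_000) -> float:
    """Time `iterations` back-to-back .item() calls on a one-element tensor.

    Hypothetical helper: the real benchmark script is not shown in this PR.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.ones(1, device=device)

    # Make sure tensor creation has finished before the timer starts.
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iterations):
        x.item()  # each call copies the scalar back to the host
    return time.perf_counter() - start


if __name__ == "__main__":
    # Ten runs, mirroring the ten timings listed per configuration.
    for _ in range(10):
        print(f"item() took {time_item()} seconds")
```

Comparing the two configurations would mean running this once on a build with the change and once on a build without it.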
With pinned memory
item() took 1.627608060836792 seconds
item() took 1.635591983795166 seconds
item() took 1.6398141384124756 seconds
item() took 1.6378591060638428 seconds
item() took 1.618534803390503 seconds
item() took 1.6467158794403076 seconds
item() took 1.6278800964355469 seconds
item() took 1.6205573081970215 seconds
item() took 1.64951753616333 seconds
item() took 1.6286702156066895 seconds
Without pinned memory
item() took 1.6783554553985596 seconds
item() took 1.6670520305633545 seconds
item() took 1.6748230457305908 seconds
item() took 1.6708712577819824 seconds
item() took 1.6836023330688477 seconds
item() took 1.6518056392669678 seconds
item() took 1.6769678592681885 seconds
item() took 1.661888837814331 seconds
item() took 1.6627326011657715 seconds
item() took 1.6908581256866455 seconds
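As a sanity check on the ~2-3% figure, the ten timings in each group can be averaged directly. This is a small sketch over the numbers listed above. The variable names are illustrative only.

```python
from statistics import mean

# Timings in seconds, copied from the benchmark output above.
with_pinned = [
    1.627608060836792, 1.635591983795166, 1.6398141384124756,
    1.6378591060638428, 1.618534803390503, 1.6467158794403076,
    1.6278800964355469, 1.6205573081970215, 1.64951753616333,
    1.6286702156066895,
]
without_pinned = [
    1.6783554553985596, 1.6670520305633545, 1.6748230457305908,
    1.6708712577819824, 1.6836023330688477, 1.6518056392669678,
    1.6769678592681885, 1.661888837814331, 1.6627326011657715,
    1.6908581256866455,
]

mean_with = mean(with_pinned)
mean_without = mean(without_pinned)

# Relative reduction in wall-clock time for 50k .item() calls.
improvement_pct = (mean_without - mean_with) / mean_without * 100

print(f"mean with pinned memory:    {mean_with:.4f} s")
print(f"mean without pinned memory: {mean_without:.4f} s")
print(f"improvement:                {improvement_pct:.2f} %")
```

The means come out to about 1.633 s with pinned memory and 1.672 s without, an improvement of roughly 2.3%.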
Differential Revision: D53431148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119202
Approved by: https://github.com/xw285cornell