[CUDA Pinned Memory] [Retry] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability (#69299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69299
The earlier attempts, https://github.com/pytorch/pytorch/pull/68906 and https://github.com/pytorch/pytorch/pull/68749, plugged one correctness hole (non-blocking copies of offset pinned memory tensors) while introducing another (non-blocking copies of pinned memory tensors with a non-standard DataPtr context).
In this revision, we use both the tensor data pointer and context to attempt to identify the originating block in the pinned memory allocator.
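The two-key lookup described above can be illustrated with a small sketch. This is not the allocator code from this PR. It is a toy Python model, and it assumes that a standard DataPtr context equals the base address of the originating block and that blocks are also indexed by base address. The names `Block`, `PinnedBlockIndex`, `add`, and `find` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Block:
    base: int  # start address of the pinned allocation (hypothetical model)
    size: int  # allocation size in bytes


class PinnedBlockIndex:
    """Toy index of live pinned blocks, keyed by base address."""

    def __init__(self) -> None:
        self._blocks: Dict[int, Block] = {}

    def add(self, base: int, size: int) -> Block:
        block = Block(base, size)
        self._blocks[base] = block
        return block

    def find(self, data_ptr: int, ctx: Optional[int]) -> Optional[Block]:
        # First try the context. Under the assumption above, it still names
        # the block when the tensor's data pointer is offset into the block.
        if ctx is not None:
            block = self._blocks.get(ctx)
            if block is not None:
                return block
        # Otherwise fall back to the data pointer. This covers a
        # non-standard context whose data pointer is the block base.
        return self._blocks.get(data_ptr)


index = PinnedBlockIndex()
block = index.add(base=0x1000, size=256)

# Offset tensor with a standard context: resolved through the context.
offset_hit = index.find(data_ptr=0x1040, ctx=0x1000)
# Non-standard context: resolved through the data pointer.
custom_ctx_hit = index.find(data_ptr=0x1000, ctx=0xDEAD)
# Memory the index does not own: no block is found.
miss = index.find(data_ptr=0x2000, ctx=None)
```

In this model, either key alone misses one of the two cases named in the summary, which is why the lookup consults both.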
Test Plan: Added new unit tests to cover the previously missing case.
Reviewed By: yinghai
Differential Revision: D32787087
fbshipit-source-id: 0cb0d29d7c39a13f433eb1cd423dc0d2a303c955
(cherry picked from commit 297157b1a13b5c75d860cac9eba4fe7fe1ad5e6f)