[CUDA Pinned Memory] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability (#68906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68906
The existing PyTorch pinned memory allocator has been a scalability bottleneck in multi-GPU inference workloads. It was designed mostly in the context of training, where the process-per-GPU setup naturally shards the global locks and allocation rates are low (perhaps O(100 allocs/sec) per process). In inference we might instead see on the order of O(200k allocs/sec) globally - e.g. 20k QPS with 10 allocs/query - which is a different domain entirely.
In the existing allocator, we also observe that the tail latencies of cudaEventCreate and cudaEventDestroy (incurred while holding the lock) can completely stall all allocations, which is undesirable.
The idea here is to retain a design similar to the existing PyTorch allocator - eager collection of used memory, no lock-free or deferred tricks, identical semantics around events - but to:
a) split up the locks around the various critical data structures,
b) do as little work as possible while holding any process-global mutex (importantly, no CUDA runtime API calls), and
c) pool CUDA events manually (since CUDA event creation is a bottleneck at high rates from multiple threads).
This does require a bit of care, but I believe the implementation is correct. In general the threading and state transitions are fairly simple.
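To make the scheme concrete, here is a minimal sketch of the locking structure, not the actual PR code; all names (EventPool, Block, HostAllocatorSketch, record_event, process_events) are illustrative:
```
#include <cuda_runtime.h>

#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// (c) Pool of cudaEvent_t objects: events are created lazily with no
// allocator-wide lock held and recycled afterwards, so the hot path never
// pays for cudaEventCreate/cudaEventDestroy while other threads are blocked
// behind a mutex.
struct EventPool {
  cudaEvent_t get() {
    {
      std::lock_guard<std::mutex> g(mutex_);  // narrow, pool-local lock
      if (!free_.empty()) {
        cudaEvent_t e = free_.back();
        free_.pop_back();
        return e;
      }
    }
    cudaEvent_t e;
    // Created with no lock held: cudaEventCreate has high tail latency.
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
    return e;
  }
  void put(cudaEvent_t e) {
    std::lock_guard<std::mutex> g(mutex_);
    free_.push_back(e);  // recycle instead of cudaEventDestroy
  }

 private:
  std::mutex mutex_;
  std::vector<cudaEvent_t> free_;
};

struct Block {
  void* ptr;
  size_t size;
};

struct HostAllocatorSketch {
  // (a) One mutex per data structure instead of one allocator-wide lock.
  std::mutex free_mutex_;
  std::vector<Block> free_blocks_;
  std::mutex events_mutex_;
  std::deque<std::pair<cudaEvent_t, Block>> outstanding_;
  EventPool pool_;

  void record_event(Block b, cudaStream_t stream) {
    cudaEvent_t e = pool_.get();
    // (b) The CUDA runtime call happens before the mutex is taken.
    cudaEventRecord(e, stream);
    std::lock_guard<std::mutex> g(events_mutex_);
    outstanding_.emplace_back(e, b);
  }

  // Eagerly reclaim blocks whose recorded events have completed.
  void process_events() {
    while (true) {
      std::pair<cudaEvent_t, Block> item;
      {
        std::lock_guard<std::mutex> g(events_mutex_);
        if (outstanding_.empty()) {
          return;
        }
        item = outstanding_.front();
        outstanding_.pop_front();
      }
      // (b) Query with no lock held: a slow query stalls only this thread.
      if (cudaEventQuery(item.first) == cudaErrorNotReady) {
        std::lock_guard<std::mutex> g(events_mutex_);
        outstanding_.push_front(std::move(item));  // not ready; retry later
        return;
      }
      pool_.put(item.first);
      std::lock_guard<std::mutex> g(free_mutex_);
      free_blocks_.push_back(item.second);
    }
  }
};
```
The important property is that every cuda* call happens with at most a narrow, pool-local lock held, so one slow event operation no longer stalls every allocating thread.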
With these improvements, microbenchmarks show significant gains (1.5x-3x). Importantly, real workloads also improve significantly, especially with respect to tail latency and stalls.
Test Plan:
Unit tests all pass.
With a synthetic benchmark such as:
```
#include <cmath>
#include <cstdint>

#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CachingHostAllocator.h>

#include <benchmark/benchmark.h>
#include <folly/Random.h>

static void BM_copies_baseline(benchmark::State& state) {
  auto N = state.range(0);
  auto scale = state.range(1);
  // Allocation sizes are drawn log-uniformly from [N, scale * N].
  auto object_size_min = N;
  auto object_size_max = scale * N;
  // Touch the device so the CUDA context is initialized before timing.
  auto device = at::Device(at::kCUDA, at::cuda::current_device());
  uint64_t bytes_copied = 0;
  uint64_t allocs = 0;
  auto stream = at::cuda::getCurrentCUDAStream();
  for (auto _ : state) {
    auto object_size = static_cast<int64_t>(expf(folly::Random::randDouble(
        logf(object_size_min), logf(object_size_max))));
    // Allocate pinned memory, then record an event on the current stream so
    // the allocator must track the allocation as in use.
    auto tensor = at::empty(
        {object_size},
        at::TensorOptions().dtype(at::kByte).pinned_memory(true));
    at::cuda::CachingHostAllocator_recordEvent(
        tensor.storage().data_ptr().get_context(), stream);
    bytes_copied += object_size;
    allocs += 1;
  }
  state.counters["BW"] =
      benchmark::Counter(bytes_copied, benchmark::Counter::kIsRate);
  state.counters["Allocs"] =
      benchmark::Counter(allocs, benchmark::Counter::kIsRate);
}

BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(1)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(4)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(16)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(64)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(128)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(256)->UseRealTime();

BENCHMARK_MAIN();
```
I observe roughly 1.5-3x improvements.
End-to-end application testing also shows significant improvements in the contended scenario.
Reviewed By: jianyuh, ngimel
Differential Revision: D32588784
fbshipit-source-id: ee86c3b7ed4da6412dd3c89362f989f4b5d91736