[CUDA Pinned Memory] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability (#68906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68906
The existing PyTorch pinned memory allocator has been a scalability bottleneck in multi-GPU inference workloads. It was designed mostly in the context of training, where the process-per-GPU setup naturally shards the global locks and allocation rates are low (perhaps O(100 allocs/sec) per process). In inference we might instead see on the order of O(200k allocs/sec) globally - e.g. 20k QPS with 10 allocs/query - which is a different domain entirely.
In the existing allocator, we also observe that the tail latencies of cudaEventCreate and cudaEventDestroy (incurred while holding the lock) can completely stall all allocations, which is undesirable.
The idea here is to retain a design similar to the existing PyTorch allocator - eager collection of used memory, no lock-free or deferred tricks, identical semantics around events - but to:
a) split up the locks around the various critical data structures,
b) do as little work as possible while holding any process-global mutex (importantly, no CUDA runtime API calls), and
c) pool CUDA events manually (since CUDA event creation is a bottleneck at high rates from multiple threads).
This does require a bit of care, but I believe the implementation is correct. In general the threading and state transitions are fairly simple.
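To make the scheme concrete, here is a minimal sketch of the locking structure, not the actual PR code; all names (EventPool, Block, HostAllocatorSketch, record_event, process_events) are illustrative:
```
#include <cuda_runtime.h>

#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// (c) Pool of cudaEvent_t objects: events are created lazily with no
// allocator-wide lock held and recycled afterwards, so the hot path never
// pays for cudaEventCreate/cudaEventDestroy while other threads are blocked
// behind a mutex.
struct EventPool {
  cudaEvent_t get() {
    {
      std::lock_guard<std::mutex> g(mutex_);  // narrow, pool-local lock
      if (!free_.empty()) {
        cudaEvent_t e = free_.back();
        free_.pop_back();
        return e;
      }
    }
    cudaEvent_t e;
    // Created with no lock held: cudaEventCreate has high tail latency.
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
    return e;
  }
  void put(cudaEvent_t e) {
    std::lock_guard<std::mutex> g(mutex_);
    free_.push_back(e);  // recycle instead of cudaEventDestroy
  }

 private:
  std::mutex mutex_;
  std::vector<cudaEvent_t> free_;
};

struct Block {
  void* ptr;
  size_t size;
};

struct HostAllocatorSketch {
  // (a) One mutex per data structure instead of one allocator-wide lock.
  std::mutex free_mutex_;
  std::vector<Block> free_blocks_;
  std::mutex events_mutex_;
  std::deque<std::pair<cudaEvent_t, Block>> outstanding_;
  EventPool pool_;

  void record_event(Block b, cudaStream_t stream) {
    cudaEvent_t e = pool_.get();
    // (b) The CUDA runtime call happens before the mutex is taken.
    cudaEventRecord(e, stream);
    std::lock_guard<std::mutex> g(events_mutex_);
    outstanding_.emplace_back(e, b);
  }

  // Eagerly reclaim blocks whose recorded events have completed.
  void process_events() {
    while (true) {
      std::pair<cudaEvent_t, Block> item;
      {
        std::lock_guard<std::mutex> g(events_mutex_);
        if (outstanding_.empty()) {
          return;
        }
        item = outstanding_.front();
        outstanding_.pop_front();
      }
      // (b) Query with no lock held: a slow query stalls only this thread.
      if (cudaEventQuery(item.first) == cudaErrorNotReady) {
        std::lock_guard<std::mutex> g(events_mutex_);
        outstanding_.push_front(std::move(item));  // not ready; retry later
        return;
      }
      pool_.put(item.first);
      std::lock_guard<std::mutex> g(free_mutex_);
      free_blocks_.push_back(item.second);
    }
  }
};
```
The important property is that every cuda* call happens with at most a narrow, pool-local lock held, so one slow event operation no longer stalls every allocating thread.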
With these improvements, microbenchmarks show significant gains (1.5x-3x). Importantly, real workloads also improve significantly, especially with respect to tail latency and stalls.
Test Plan:
Unit tests all pass.
With a synthetic benchmark such as:
```
#include <cmath>
#include <cstdint>

#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CachingHostAllocator.h>

#include <benchmark/benchmark.h>
#include <folly/Random.h>

static void BM_copies_baseline(benchmark::State& state) {
  auto N = state.range(0);
  auto scale = state.range(1);
  // Allocation sizes are drawn log-uniformly from [N, scale * N].
  auto object_size_min = N;
  auto object_size_max = scale * N;
  // Touch the device so the CUDA context is initialized before timing.
  auto device = at::Device(at::kCUDA, at::cuda::current_device());
  uint64_t bytes_copied = 0;
  uint64_t allocs = 0;
  auto stream = at::cuda::getCurrentCUDAStream();
  for (auto _ : state) {
    auto object_size = static_cast<int64_t>(expf(folly::Random::randDouble(
        logf(object_size_min), logf(object_size_max))));
    // Allocate pinned memory, then record an event on the current stream so
    // the allocator must track the allocation as in use.
    auto tensor = at::empty(
        {object_size},
        at::TensorOptions().dtype(at::kByte).pinned_memory(true));
    at::cuda::CachingHostAllocator_recordEvent(
        tensor.storage().data_ptr().get_context(), stream);
    bytes_copied += object_size;
    allocs += 1;
  }
  state.counters["BW"] =
      benchmark::Counter(bytes_copied, benchmark::Counter::kIsRate);
  state.counters["Allocs"] =
      benchmark::Counter(allocs, benchmark::Counter::kIsRate);
}

BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(1)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(4)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(16)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(64)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(128)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(256)->UseRealTime();

BENCHMARK_MAIN();
```
I observe roughly 1.5-3x improvements.
End-to-end application testing also shows significant improvements in the contended scenario.
Reviewed By: jianyuh, ngimel
Differential Revision: D32588784
fbshipit-source-id: ee86c3b7ed4da6412dd3c89362f989f4b5d91736