[PyTorch] Avoid atomic refcounting in intrusive_ptr::make (#47100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47100
Profiling with Linux `perf` shows that we spend at least 1% of our time on this atomic refcount increment in our framework overhead benchmark. Here's the inlined-function breakdown for `empty_cpu`, which accounts for 6.91% of the total time:
```
- at::native::empty_cpu
   - 1.91% at::detail::make_tensor<c10::TensorImpl, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&> (inlined)
      - 0.98% c10::make_intrusive<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl>, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&> (inlined)
           0.97% c10::intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> >::make<c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&>
              0.84% intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> > (inlined)
   - 1.44% c10::make_intrusive<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl>, c10::StorageImpl::use_byte_size_t, long&, c10::DataPtr, c10::Allocator*&, bool> (inlined)
      - 1.44% c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::make<c10::StorageImpl::use_byte_size_t, long&, c10::DataPtr, c10::Allocator*&, bool> (inlined)
           1.02% std::__atomic_base<unsigned long>::operator++ (inlined)
   - 0.80% ~DataPtr (inlined)
        ~UniqueVoidPtr (inlined)
        ~unique_ptr (inlined)
   - 0.78% c10::TensorOptions::memory_format (inlined)
      - c10::TensorOptions::set_memory_format (inlined)
         - c10::optional<c10::MemoryFormat>::operator bool (inlined)
              c10::optional<c10::MemoryFormat>::initialized (inlined)
```
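For orientation, here's a minimal sketch of the pattern the profile is pointing at (simplified stand-in types, not the actual c10 code): `intrusive_ptr::make` constructs a fresh target whose refcount starts at zero, then bumps it with an atomic read-modify-write, which is what shows up above as `std::__atomic_base<unsigned long>::operator++`.
```cpp
#include <atomic>
#include <cstddef>
#include <utility>

// Simplified stand-in for c10::intrusive_ptr_target: the refcount starts
// at zero and is bumped when the object is wrapped in a smart pointer.
struct Target {
  std::atomic<std::size_t> refcount_{0};
  virtual ~Target() = default;
};

// Simplified stand-in for c10::intrusive_ptr (C++17; copy support and
// weak-count handling omitted for brevity).
template <class T>
class Ptr {
  T* ptr_;
  explicit Ptr(T* p) : ptr_(p) {}

 public:
  template <class... Args>
  static Ptr make(Args&&... args) {
    T* p = new T(std::forward<Args>(args)...);
    // The hot spot from the profile: a sequentially consistent atomic
    // increment (a lock-prefixed RMW instruction on x86), paid on every
    // allocation even though `p` is brand new.
    ++p->refcount_;
    return Ptr(p);
  }

  ~Ptr() {
    if (ptr_ != nullptr && ptr_->refcount_.fetch_sub(1) == 1) {
      delete ptr_;
    }
  }
  Ptr(const Ptr&) = delete;
  Ptr& operator=(const Ptr&) = delete;
};

struct Impl : Target {};

int main() {
  Ptr<Impl> p = Ptr<Impl>::make();  // one atomic RMW here, pre-change
}
```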
This change comes with a caveat: if we have constructors where `this` escapes to another thread before returning, we cannot make this assumption, because that other thread may already be manipulating the refcount. I chose to simply mandate that `intrusive_ptr_target` ctors hand back exclusive ownership of `this`, which seems like a reasonable requirement for a ctor anyway. If that turns out to be unacceptable, we could provide an opt-out from this optimization via a traits struct or a similar template metaprogramming shenanigan.
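Concretely, the optimization amounts to replacing that atomic read-modify-write in `make` with a plain relaxed store of 1, which is only legal under the exclusive-ownership requirement above. A hedged sketch of the idea, repeating the simplified types from the previous block for completeness (the actual c10 implementation may differ in mechanism):
```cpp
#include <atomic>
#include <cstddef>
#include <utility>

struct Target {
  std::atomic<std::size_t> refcount_{0};
  virtual ~Target() = default;
};

template <class T>
class Ptr {
  T* ptr_;
  explicit Ptr(T* p) : ptr_(p) {}

 public:
  template <class... Args>
  static Ptr make(Args&&... args) {
    T* p = new T(std::forward<Args>(args)...);
    // Legal only because T's ctor is required to hand back exclusive
    // ownership of `this`: no other thread can observe the refcount yet,
    // so a relaxed store of 1 replaces the atomic read-modify-write.
    // Whatever mechanism later publishes `p` to another thread (mutex,
    // release store, etc.) provides the needed synchronization.
    p->refcount_.store(1, std::memory_order_relaxed);
    return Ptr(p);
  }

  ~Ptr() {
    if (ptr_ != nullptr && ptr_->refcount_.fetch_sub(1) == 1) {
      delete ptr_;
    }
  }
  Ptr(const Ptr&) = delete;
  Ptr& operator=(const Ptr&) = delete;
};
```
On x86 the relaxed store compiles to an ordinary `mov` instead of a lock-prefixed increment, which is where the per-allocation savings come from.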
ghstack-source-id: 116368592
Test Plan: Run the framework overhead benchmark. Results look promising, ranging from a tiny regression (presumably noise?) on the InPlace benchmark, through 2.5%-4% improvements on OutOfPlace, to 9% on the empty benchmarks and 10-12% on the view benchmarks.
Reviewed By: ezyang
Differential Revision: D24606531
fbshipit-source-id: 1cf022063dab71cd1538535c72c4844d8dd7bb25