aa9ee8d0 - [Static Runtime] Avoid copying function objects per StaticRuntime instance (#68368)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368

Currently, each instance of `StaticRuntime` holds its own copy of the `std::function` object wrapped in `ProcessedNode::Function`, used to invoke the actual op implementation. However, all `StaticRuntime` instances derived from the same `StaticModule` object invoke exactly the same op implementations, so this duplication is avoidable.

This change adds a `StaticModule::functions_` member variable that keeps a list of unique `ProcessedFunction` instances. A newly constructed `StaticRuntime` takes pointers to these `ProcessedFunction`s instead of copying the whole function objects, which saves a substantial amount of memory per `StaticRuntime` instance.

The saving comes with a small sacrifice in execution time: since a `ProcessedNode` instance now keeps a pointer to its function object, executing a node involves an extra pointer dereference. Local performance tests showed this cost to be negligible.

Thanks to hlu1 for proposing this non-intrusive improvement idea :D

Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) for CMF/local, and by 8% for CMF/local_ro (measured by patching D32181666 to print the memory turnover from instantiating a StaticRuntime instance). No noticeable latency regression was observed.

AFTER:
* CMF/local: memory turnover: 393608; latency: 15.6965 ms per iter (63.7087 iters per second)
* CMF/local_ro: memory turnover: 387288; latency: 7.51308 ms per iter (133.101 iters per second)

BEFORE:
* CMF/local: memory turnover: 459888; latency: 15.8278 ms per iter (63.18 iters per second)
* CMF/local_ro: memory turnover: 420832; latency: 7.43756 ms per iter (134.453 iters per second)

Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr before and after (the three graphs below are identical in both runs):

Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)

Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)

Reviewed By: swolchok

Differential Revision: D32337548

fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a