Fix TP agent not recording outgoing tensors with caching allocator (#58384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58384
When the caller sends tensors within a request, it does so on fresh streams obtained from the stream pool. However, it wasn't recording those tensors with the caching allocator. This carried the risk that, if those tensors were deleted before the async CUDA ops had completed, the caching allocator could reuse their storage and thus overwrite the previous data while it was still being used.
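For illustration, a minimal sketch of the pattern this fix applies (not the actual agent code; the function name and the placement of the send are hypothetical, but `recordStream` is the real caching-allocator API):

```cpp
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <ATen/ATen.h>

// Hypothetical helper mirroring what the agent does when sending a tensor.
void sendTensorOnSideStream(const at::Tensor& tensor) {
  // Obtain a fresh stream from the pool so the send overlaps with compute.
  c10::cuda::CUDAStream stream = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(stream);

  // ... enqueue the async CUDA copy/send of `tensor` on `stream` here ...

  // Record the tensor's storage with the caching allocator: even if the
  // caller drops its last reference before the async op completes, the
  // allocator will now wait on `stream`'s pending work before handing
  // that memory out for reuse.
  c10::cuda::CUDACachingAllocator::recordStream(
      tensor.storage().data_ptr(), stream);
}
```

Without the `recordStream` call, the allocator only tracks the tensor's original allocation stream, so freeing the tensor early could let the storage be recycled while the side-stream op is still reading it.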
ghstack-source-id: 129107582
Test Plan: eyes
Reviewed By: mrshenli
Differential Revision: D28473429
fbshipit-source-id: 3f2617048d984cec7a270858d282cecf1140ecf0