Add sync-point insertions and block/thread-local memory allocations (#36563)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36563
Test Plan: Imported from OSS
Differential Revision: D21014238
Pulled By: zheng-xq
fbshipit-source-id: 4d61ff2f76345ea2825f2d5f60a771f65b24ad69