[iOS GPU] Use thread buffer to store indices for transpose (#56706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56706
We've seen the transpose op fail on iOS 12 devices. This is because the index buffer is allocated in the device address space, which is shared across multiple threads, and write operations to it are not guaranteed to be atomic. Using a thread buffer, which is private to each thread, solves the issue.
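The idea can be illustrated with a rough CPU-side analogy in C++. This is not the Metal kernel from this change. The function name `transpose2d`, the 2-D layout, and the worker count are all hypothetical. Each worker keeps its index scratch in a local array, in the same spirit as a buffer in the thread address space, so no other worker can overwrite it between the write and the read.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical analogy: transpose a row-major rows x cols matrix using
// several workers. Each worker decomposes the output index into a
// private scratch array instead of a scratch buffer shared by all workers.
std::vector<int> transpose2d(const std::vector<int>& in, std::size_t rows,
                             std::size_t cols, unsigned workers = 4) {
    const std::size_t total = rows * cols;
    std::vector<int> out(total);
    workers = std::max(1u, workers);

    auto work = [&](std::size_t begin, std::size_t end) {
        // Private per-worker scratch (assumption: stands in for a
        // thread-address-space index buffer in a GPU kernel).
        std::array<std::size_t, 2> idx{};
        for (std::size_t o = begin; o < end; ++o) {
            idx[0] = o / rows;  // output row, which is the input column
            idx[1] = o % rows;  // output column, which is the input row
            out[o] = in[idx[1] * cols + idx[0]];
        }
    };

    // Split the output range into contiguous chunks, one per worker.
    const std::size_t chunk = (total + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min<std::size_t>(w * chunk, total);
        const std::size_t end = std::min(begin + chunk, total);
        if (begin < end) pool.emplace_back(work, begin, end);
    }
    for (auto& t : pool) t.join();
    return out;
}
```

If `idx` were instead a single array shared by all workers, one worker could overwrite it between another worker's write and read, which is the kind of failure described above.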
ghstack-source-id: 127365795
Test Plan: CI
Reviewed By: SS-JIA
Differential Revision: D27941353
fbshipit-source-id: 5f09f0a085081b7c5e8019ebe711e36394cdde92