[iOS GPU] [BE] use channel-last to transform the weights (#59113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59113
Manually permuting the weights is slower than calling `at::contiguous()`.
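For context, a minimal sketch of the kind of manual permutation this change moves away from. It is not the code from this PR: `permute_to_channels_last` is a hypothetical helper written against plain `std::vector`, and it only illustrates the index arithmetic for reordering an NCHW buffer into NHWC (channels-last) order.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): copies a dense NCHW buffer into
// NHWC (channels-last) order, one element at a time.
std::vector<float> permute_to_channels_last(
    const std::vector<float>& src,
    std::size_t N, std::size_t C, std::size_t H, std::size_t W) {
  std::vector<float> dst(src.size());
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t c = 0; c < C; ++c) {
      for (std::size_t h = 0; h < H; ++h) {
        for (std::size_t w = 0; w < W; ++w) {
          // Source offset in NCHW layout; destination offset in NHWC layout.
          const std::size_t from = ((n * C + c) * H + h) * W + w;
          const std::size_t to = ((n * H + h) * W + w) * C + c;
          dst[to] = src[from];
        }
      }
    }
  }
  return dst;
}
```

With ATen, the same reordering can be requested through `Tensor::contiguous(at::MemoryFormat::ChannelsLast)`; whether the PR uses exactly that overload is an assumption here, since the summary only names `at::contiguous()`.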
ghstack-source-id: 130374487
Test Plan: CI
Reviewed By: SS-JIA
Differential Revision: D28762278
fbshipit-source-id: 1dde3ef82018bc2507d0ca5132b1ee97dc99787f