Enable channels_last_3d on SyncBatchNorm (#88401)
This PR enables the fast channels_last kernels for SyncBatchNorm when the input uses the channels_last_3d memory format.
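For context, here is a minimal sketch of the setup this change targets. It is not the benchmark script linked in this PR; the toy model and tensor sizes are hypothetical and chosen only for illustration. It shows the conversion to SyncBatchNorm and a 5D input laid out as channels_last_3d, and it does not run a forward pass, since the synchronized path needs CUDA and an initialized multi-process group.

```python
import torch
import torch.nn as nn

# Hypothetical toy 3D model; layer sizes are illustrative only.
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm3d(8),
    nn.ReLU(),
)

# Replace every BatchNorm*d layer in the model with SyncBatchNorm.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# 5D input (N, C, D, H, W) stored in channels_last_3d (NDHWC) order.
x = torch.randn(2, 3, 4, 8, 8).to(memory_format=torch.channels_last_3d)

print(type(model[1]).__name__)
print(x.is_contiguous(memory_format=torch.channels_last_3d))
print(x.stride())
```

In a real run the model would be moved to a GPU and wrapped in DistributedDataParallel before calling it on `x`.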
With the small benchmark script at https://github.com/pytorch/pytorch/issues/88021#issuecomment-1299059859, on a V100, I got:
master:
```
DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec
DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec
```
This PR:
```
DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec
DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec
```
In this run, the channels_last=True case drops from about 1.47 sec to about 0.49 sec (roughly 3x faster), while the channels_last=False case is unchanged.
This PR is a follow-up to https://github.com/pytorch/pytorch/pull/46906
Close https://github.com/pytorch/pytorch/issues/88021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88401
Approved by: https://github.com/ngimel