Add channels last 3d support for MaxPool3d on CPU (#97775)
### Testing

Single socket (28 cores):

shape | fp32 forward / ms | bf16 forward / ms | fp32 backward / ms | bf16 backward / ms
-- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 3.959584 | 5.493402 | 0.557232 | 0.568485
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 0.815511 | 1.351261 | 5.710506 | 10.57506
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 10.63426 | 15.28637 | 2.67656 | 1.71365
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.63570 | 2.05532 | 2.55452 | 2.33923
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.375469 | 0.479748 | 0.066364 | 0.065155
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.112197 | 0.112326 | 0.111697 | 0.145364

Single core:

shape | fp32 forward / ms | bf16 forward / ms | fp32 backward / ms | bf16 backward / ms
-- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 92.16582 | 128.6513 | 6.684325 | 12.21541
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 10.14318 | 29.80297 | 7.350142 | 11.25323
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 238.55453 | 331.89967 | 19.694657 | 32.78853
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 30.17079 | 32.75628 | 22.44543 | 30.17796
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 7.474389 | 9.937217 | 0.236015 | 0.434229
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.318954 | 2.469444 | 0.262125 | 0.401361
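
For readers who want to try a comparable measurement, here is a minimal sketch of how the 5D rows (contig vs. CL3d) could be timed. It is not the script used to produce the numbers above: the helper name `time_max_pool3d`, the iteration counts, and the use of `time.perf_counter` are assumptions made for illustration. It relies only on public PyTorch calls (`torch.nn.functional.max_pool3d`, `torch.channels_last_3d`, `torch.set_num_threads`).

```python
import time

import torch
import torch.nn.functional as F


def time_max_pool3d(x, iters=10, warmup=3):
    """Hypothetical helper: average forward/backward time in ms for max_pool3d.

    The input keeps whatever memory format `x` already has
    (contiguous or channels_last_3d).
    """
    # clone() defaults to preserve_format, so the strides of x are kept.
    x = x.detach().clone().requires_grad_(True)
    fwd = bwd = 0.0
    y = None
    for i in range(warmup + iters):
        x.grad = None
        t0 = time.perf_counter()
        y = F.max_pool3d(x, kernel_size=3, stride=1)
        t1 = time.perf_counter()
        grad_out = torch.ones_like(y)  # built outside the timed backward region
        t2 = time.perf_counter()
        y.backward(grad_out)
        t3 = time.perf_counter()
        if i >= warmup:
            fwd += t1 - t0
            bwd += t3 - t2
    return fwd / iters * 1e3, bwd / iters * 1e3, y.detach()


if __name__ == "__main__":
    torch.set_num_threads(1)  # assumption: approximates the "Single core" setup
    base = torch.randn(4, 19, 10, 16, 16)
    formats = [("contig", torch.contiguous_format), ("CL3d", torch.channels_last_3d)]
    for name, fmt in formats:
        fwd_ms, bwd_ms, _ = time_max_pool3d(base.contiguous(memory_format=fmt))
        print(f"{name}: forward {fwd_ms:.3f} ms, backward {bwd_ms:.3f} ms")
```

Both layouts should give identical pooled values; only the memory layout and the timing differ. Absolute numbers will depend on the machine, thread settings, and PyTorch build, so they are not expected to match the tables above.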

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97775
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki