[MPS] Add native bitwise_[and|or|xor] (#82307)
Implement bitwise operators as Metal kernels
Dynamically compile a Metal library for each triplet of input and output tensor types.
Use `dispatchThreads:threadsPerThreadgroup:` to dispatch work (relies on the fact that the MPS device is at least `MTLGPUFamilyMac2`, which will be explicitly checked in https://github.com/pytorch/pytorch/pull/82507)
Perf improvements: Add support for non-contiguous tensors and broadcasting
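
As a rough illustration of the per-type-triplet compilation described above, the following Python sketch generates Metal source text for one (result, lhs, rhs) type triplet and caches it so each triplet is generated only once. It is a sketch under stated assumptions, not the code in this PR: `METAL_TYPES`, `KERNEL_TEMPLATE`, `OPS`, and `kernel_source` are hypothetical names, the kernel body is illustrative, and the actual compilation step (handing the source to the Metal runtime) is omitted.

```python
from functools import lru_cache

# Hypothetical mapping from tensor dtype names to Metal scalar type names.
METAL_TYPES = {
    "bool": "bool",
    "int8": "char",
    "uint8": "uchar",
    "int16": "short",
    "int32": "int",
    "int64": "long",
}

# Bitwise operations and their operator symbols.
OPS = {"and": "&", "or": "|", "xor": "^"}

# Illustrative kernel template (an assumption, not the PR's source).
# One thread handles one element; the grid is assumed to match the
# element count exactly, as dispatchThreads allows.
KERNEL_TEMPLATE = """
kernel void bitwise_{name}(device {res_t}* out [[buffer(0)]],
                           device {lhs_t}* a [[buffer(1)]],
                           device {rhs_t}* b [[buffer(2)]],
                           uint offset [[thread_position_in_grid]]) {{
  out[offset] = a[offset] {op} b[offset];
}}
"""


@lru_cache(maxsize=None)
def kernel_source(res_dtype, lhs_dtype, rhs_dtype):
    """Return Metal source for all three ops, specialized to one type triplet.

    lru_cache keys on the (res, lhs, rhs) triplet, so repeated requests for
    the same triplet reuse the previously generated string.
    """
    header = "#include <metal_stdlib>\nusing namespace metal;\n"
    body = "".join(
        KERNEL_TEMPLATE.format(
            name=name,
            op=op,
            res_t=METAL_TYPES[res_dtype],
            lhs_t=METAL_TYPES[lhs_dtype],
            rhs_t=METAL_TYPES[rhs_dtype],
        )
        for name, op in OPS.items()
    )
    return header + body
```

Calling `kernel_source("uint8", "uint8", "uint8")` twice returns the same cached string, while a different triplet such as `("int64", "int32", "int64")` produces a separate specialization.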
Test Plan:
Already tested in `test_mps.py`, for example by `TestConsistencyCPU.test_output_match_bitwise_xor_cpu_uint8`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82307
Approved by: https://github.com/albanD