[IntraNodeComm] fix a hybridCubeMeshAllReduceKernel breakage caused by a recent refactor (#121575)
`hybridCubeMeshAllReduceKernel` uses the latter half of p2p buffers as relay buffers. The relay buffer address is calculated using a bf16 base pointer and the buffer size in byte. The breakage was caused by not taking element size into account.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee