[performance] ensure `causal_mask` is created directly on device (#22378)
* ensure causal_mask is created directly on device
* add copy tag to opt, update bart implementation
* add device to all _make_causal_mask copies
* formatting fixes
* more manual fixes due to unlinked versions of _prepare_decoder_attention_mask
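
For context, a minimal sketch of the idea behind this change, assuming a simplified helper (`make_causal_mask` and its signature are hypothetical and do not reproduce the library's `_make_causal_mask`): passing `device=` at construction time allocates the mask where it will be used, which avoids building it on CPU and copying it over afterwards.

```python
import torch


def make_causal_mask(tgt_len: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    """Hypothetical helper: build a (tgt_len, tgt_len) additive causal mask on `device`."""
    # Allocate directly on the target device rather than creating the
    # tensor on CPU and moving it later with .to(device).
    mask = torch.full(
        (tgt_len, tgt_len), torch.finfo(dtype).min, dtype=dtype, device=device
    )
    # The index tensor is also created on the same device, so the
    # comparison below needs no cross-device transfer.
    positions = torch.arange(tgt_len, device=device)
    # Query i may attend to key j when j <= i; those entries become 0,
    # future positions keep the large negative fill value.
    mask.masked_fill_(positions[None, :] <= positions[:, None], 0)
    return mask


# Usage: pick whatever device is available and build the mask there.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
causal_mask = make_causal_mask(4, torch.float32, device)
```

The resulting tensor can be added to attention scores before the softmax; entries above the diagonal hold the dtype's minimum value and so suppress attention to future positions.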