[CUDA] DecoderMaskedMultiHeadAttention files consolidation (#27688)
This deletes the three per-head-size .cu files and merges their content
into a single file to avoid a cross-file dependency during CUDA
compilation.
Currently, the masked_multihead_attention_kernel template is implemented
in decoder_masked_multihead_attention_impl.cu. The other three .cu files
use the template but do not include its implementation, which causes
problems when they are built in the CUDA plugin EP.
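
For illustration only, the consolidated layout can be sketched in plain
host-side C++ as below. This is a minimal sketch, not the actual kernel
code: masked_attention_sketch, dispatch_by_head_size, and the head sizes
32/64/128 are hypothetical stand-ins chosen for the example. The point it
shows is that the template definition and every per-size instantiation
live in the same translation unit, so no instantiation depends on a
definition that is only visible in another file.

```cpp
// Hypothetical stand-in for the kernel template (not the real signature).
// The full definition is visible in this file.
template <int HeadSize>
int masked_attention_sketch(const int* values) {
  int sum = 0;
  for (int i = 0; i < HeadSize; ++i) {
    sum += values[i];
  }
  return sum;
}

// Per-head-size explicit instantiations, placed in the same file as the
// definition above. The sizes here are assumptions for the example.
template int masked_attention_sketch<32>(const int*);
template int masked_attention_sketch<64>(const int*);
template int masked_attention_sketch<128>(const int*);

// Hypothetical runtime dispatcher that selects an instantiation.
// Returns -1 for a head size that has no instantiation.
int dispatch_by_head_size(int head_size, const int* values) {
  switch (head_size) {
    case 32:  return masked_attention_sketch<32>(values);
    case 64:  return masked_attention_sketch<64>(values);
    case 128: return masked_attention_sketch<128>(values);
    default:  return -1;
  }
}
```

With this layout, adding another head size means adding one
instantiation and one dispatch case in the same file, rather than a new
.cu file that relies on a definition elsewhere.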