[CUDA] Special case for K==0 in CUDA MatMul (#21525)
### Description
This change addresses the case where two matrices are multiplied and
their inner dimension (K) is 0.
NumPy and Eigen, which is used in our CPU EP implementation, both
handle this case correctly
and output an [M, N] matrix filled with zeros.
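
For reference, here is a minimal sketch of why the expected result is all zeros. It is not the CUDA EP code; `matmul_reference` is a hypothetical helper written only for illustration, and it assumes row-major flat buffers. When K is 0 the inner accumulation loop never runs, so every output element keeps its initial value of 0.

```python
def matmul_reference(a, b, m, k, n):
    """Naive row-major matmul (illustrative helper, not ONNX Runtime code).

    a holds m*k values, b holds k*n values, the result holds m*n values.
    """
    if len(a) != m * k or len(b) != k * n:
        raise ValueError("buffer sizes do not match the given dimensions")
    out = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):  # executes zero times when k == 0
                acc += a[i * k + p] * b[p * n + j]
            out[i * n + j] = acc
    return out


# K == 0: both inputs are empty, yet the output is still an M x N block of zeros.
result = matmul_reference([], [], m=2, k=0, n=3)
print(result)
```

A GPU implementation that skips the GEMM call for K == 0 would still need to zero-fill the [M, N] output explicitly to match this behavior.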
### Motivation and Context
This is required to support the GenAI LoRA implementation with empty inputs.
Addresses: https://github.com/microsoft/onnxruntime/issues/21483