Adding CUDA kernel (optimized for sm80) for block-wise 4-bit quantized float16 GEMM. (#18619)
### Description
Add a CUDA kernel for block-wise 4-bit quantized float16 GEMM, specifically
optimized for NVIDIA Ampere GPUs (sm80).
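For context, here is a minimal host-side sketch of what block-wise 4-bit quantization means for one column of the weight matrix. The block size, nibble packing order, and presence of per-block zero points are assumptions for illustration only; the function and parameter names are hypothetical and not the kernel's actual API or data layout.

```cpp
// Illustrative sketch: dequantize one K-length column of a block-wise
// 4-bit quantized weight matrix. Each block of `block_size` consecutive
// elements shares one scale and one 4-bit zero point (layout assumed).
#include <cstdint>
#include <vector>

void dequantize_column(const uint8_t* packed,    // two 4-bit weights per byte, ceil(K/2) bytes
                       const float* scales,      // one scale per block (fp16 in the real kernel)
                       const uint8_t* zero_pts,  // one 4-bit zero point per block
                       int K, int block_size,
                       std::vector<float>& out) {
  out.resize(K);
  for (int k = 0; k < K; ++k) {
    int block = k / block_size;
    uint8_t byte = packed[k / 2];
    // Assumed packing: low nibble holds the even element, high nibble the odd one.
    uint8_t q = (k % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    out[k] = (static_cast<float>(q) - static_cast<float>(zero_pts[block])) * scales[block];
  }
}
```

The GEMM then multiplies the fp16 activation matrix by these dequantized weights; the optimized kernel fuses the dequantization into the CUTLASS mainloop rather than materializing the full-precision weights as done above.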
### Motivation and Context
This aims to improve quantized LLM inference performance on NVIDIA Ampere
GPUs.
### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in the current build system, loading of the CUTLASS
dependency is guarded by:
(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)
If both of these options are turned off, compilation will fail.
Why is the CUTLASS dependency guarded at all? It is a header-only library
that adds no binary size unless its templates are instantiated. What is the
downside of removing all the guards and including CUTLASS unconditionally?