CUDA: use shared mem for ssm_conv (#20128)
* CUDA: use shared mem for ssm_conv
* fuse silu + ssm_conv
* fuse unary + mul
* enable for fp16
* formatting
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>