[CUDA] Correct after_gather_dim for nibbled uint4 index (#26484)
### Description
In the CUDA backend, after_gather_dim currently supports only the uint8 dtype.
This PR corrects the indexing in gather_block_quantized so that it matches
nibbled 4-bit weights, where two 4-bit values are packed into each uint8 byte.
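
For illustration only, here is a minimal Python sketch of nibbled indexing. It is not the CUDA kernel from this PR. The helper names (`pack_uint4`, `read_uint4`, `gather_row`) are hypothetical, and the low-nibble-first packing order is an assumption. The point it shows is that the index math has to count 4-bit elements rather than bytes.

```python
def pack_uint4(values):
    """Pack 4-bit values (0..15) two per byte; low nibble first (assumed order)."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even element count
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )


def read_uint4(packed, idx):
    """Read the logical 4-bit element at position idx from the packed bytes."""
    byte = packed[idx // 2]  # two elements share one byte
    return (byte >> 4) & 0xF if idx % 2 else byte & 0xF


def gather_row(packed, row, after_gather_dim):
    """Gather one row; after_gather_dim is counted in 4-bit elements, not bytes."""
    start = row * after_gather_dim
    return [read_uint4(packed, start + j) for j in range(after_gather_dim)]


# A 3 x 4 table of 4-bit values, stored in 6 bytes instead of 12.
table = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
packed = pack_uint4([v for r in table for v in r])
```

Calling `gather_row(packed, 1, 4)` reads back the second row of `table`. Passing the byte count (2) as `after_gather_dim` instead would start at the wrong element and return a truncated row.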
### Motivation and Context
This allows token_embeddings and lm_head to be tied when stored as 4-bit
weights, which saves memory and compresses models further.