Fix Build Error when tensor dumping is enabled (#25414)
### Description
Fix a CUDA build error when `DEBUG_GENERATION` is defined.
### Motivation and Context
In https://github.com/microsoft/onnxruntime/pull/24821, a dumping API
was removed:
`void Print(const char* name, int index, bool end_line)`
but the code that called it was not updated.
In MatMulNBits, a recent change added bfloat16 support, but the CUDA
tensor dumper only supported `BFloat16`, not `__nv_bfloat16`. This PR
adds functions to support `__nv_bfloat16` in the CUDA tensor dumper.
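For context, the snippet below is a minimal host-side C++ sketch of what dumping a bfloat16 tensor involves. It assumes a bfloat16 value is stored as the upper 16 bits of an IEEE-754 float32, so it can be widened to `float` before printing. `Bf16Bits`, `Bf16ToFloat`, `FormatValue`, and `DumpTensor` are hypothetical names chosen for illustration and are not the onnxruntime tensor dumper API. In real CUDA code the element type would be `__nv_bfloat16` from `cuda_bf16.h`, where `__bfloat162float` performs the widening.

```cpp
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical stand-in for a raw 16-bit bfloat16 value (like __nv_bfloat16 on the host).
struct Bf16Bits {
  uint16_t bits;
};

// bfloat16 keeps the sign, exponent, and top 7 mantissa bits of a float32,
// so widening it is a 16-bit left shift followed by a bit-for-bit reinterpretation.
inline float Bf16ToFloat(Bf16Bits v) {
  uint32_t widened = static_cast<uint32_t>(v.bits) << 16;
  float f;
  std::memcpy(&f, &widened, sizeof(f));
  return f;
}

// Hypothetical per-type formatting overloads: one for each element type the dumper accepts.
inline std::string FormatValue(float v) {
  std::ostringstream os;
  os << v;
  return os.str();
}

// Without an overload like this one, a dumper templated on the element type
// has nothing to call for bfloat16 elements and fails to compile.
inline std::string FormatValue(Bf16Bits v) {
  return FormatValue(Bf16ToFloat(v));
}

// Hypothetical generic dumper: formats `count` elements as "name: v0 v1 ...".
template <typename T>
std::string DumpTensor(const char* name, const T* data, int count) {
  std::ostringstream os;
  os << name << ":";
  for (int i = 0; i < count; ++i) {
    os << " " << FormatValue(data[i]);
  }
  return os.str();
}
```

For example, `Bf16Bits data[] = {{0x3F80}, {0xC000}, {0x3F00}};` passed to `DumpTensor("x", data, 3)` yields `"x: 1 -2 0.5"`, because 0x3F80, 0xC000, and 0x3F00 are the upper halves of the float32 encodings of 1.0, -2.0, and 0.5.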