(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)
* CUDA: Fixed obj byte size being passed to pool alloc instead of obj count (fattn-common, dst_tmp_meta)
* CUDA: Explicitly cast some of the int alloc counts before multiplication in argsort
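
A minimal host-side sketch of the two bug classes addressed above. It is an illustration, not the actual ggml-cuda code: `typed_pool_alloc`, `meta`, and `alloc_count` are hypothetical names, and the sketch assumes a typed allocator whose `alloc()` takes an element count and multiplies by `sizeof(T)` itself.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a typed pool allocator: alloc() expects an
// element COUNT and converts it to bytes internally.
template <typename T>
struct typed_pool_alloc {
    T *    ptr   = nullptr;
    size_t bytes = 0; // bytes actually requested from the backing allocator

    typed_pool_alloc() = default;
    typed_pool_alloc(const typed_pool_alloc &) = delete;
    typed_pool_alloc & operator=(const typed_pool_alloc &) = delete;

    T * alloc(size_t count) {
        bytes = count * sizeof(T);
        ptr   = static_cast<T *>(std::malloc(bytes));
        return ptr;
    }

    ~typed_pool_alloc() { std::free(ptr); }
};

// Hypothetical element type, used only to make sizeof(T) > 1.
struct meta {
    float max;
    float sum;
};

// Widen each int operand before multiplying, so the product is computed
// in size_t rather than in int (where a large product would overflow).
inline size_t alloc_count(int ncols, int nrows) {
    return static_cast<size_t>(ncols) * static_cast<size_t>(nrows);
}

// Correct:  buf.alloc(n);                 requests n * sizeof(meta) bytes
// Mistake:  buf.alloc(n * sizeof(meta));  requests sizeof(meta) times too much
```

Under these assumptions, passing a byte size where a count is expected over-allocates by a factor of `sizeof(T)`, and casting after the multiplication instead of before it would not prevent the `int` overflow.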
---------
Co-authored-by: pl752 <maximpl752@gmail.com>