Fold transpose into matmul in Gram NS for tall matrices
Replace (Q @ X).mT.contiguous() with X.mT @ Q.mT, which produces a
contiguous result directly. cuBLAS handles transposed inputs natively
via transpose flags, so the matmul cost is identical while the extra
memcpy from .contiguous() is eliminated.
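A minimal sketch of the identity being exploited, (A @ B).T == B.T @ A.T;
the shapes below are illustrative, not taken from the actual Gram NS code:

```python
import torch

# Hypothetical shapes: Q square (m x m), X tall (m x n) with m > n.
# float64 keeps the numerical comparison exact enough for allclose.
m, n = 128, 32
Q = torch.randn(m, m, dtype=torch.float64)
X = torch.randn(m, n, dtype=torch.float64)

# Before: matmul, then a transposed (non-contiguous) view,
# then an extra copy to materialize it contiguously.
before = (Q @ X).mT.contiguous()

# After: fold the transpose into the matmul via (AB)^T = B^T A^T.
# The BLAS backend consumes the transposed views via transpose flags,
# and the fresh n x m output is already contiguous, so no memcpy.
after = X.mT @ Q.mT

assert after.is_contiguous()
assert torch.allclose(before, after)
```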
Benchmark (Qwen2.5-3B, 2xA100, ZeRO-2, avg of 3 runs):
Before: 936.6ms/step (backward: 628.4ms)
After: 931.5ms/step (backward: 612.8ms)
Speedup vs standard NS: 11.2% -> 11.7%
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>