Use fused addmm and eliminate eye allocation in Gram NS
Replace the separate scalar-multiply + matmul + add operations with single
torch.addmm calls for the Q and R updates, reducing kernel launch overhead.
Remove the torch.eye allocation by adding the identity term in place via
diagonal().add_() instead.
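
A minimal sketch of the pattern, assuming a quintic Newton-Schulz iteration
on the Gram matrix (the function name, coefficients, and shapes below are
illustrative, not the patched code itself):

```python
import torch

def ns_orthogonalize(X: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """Hypothetical quintic Newton-Schulz orthogonalization illustrating the
    fused-addmm / no-eye pattern from this commit; the real code may use
    different coefficients and update order."""
    # Classic quintic sign-iteration coefficients: p(s) = (15s - 10s^3 + 3s^5)/8
    a, b, c = 15.0 / 8.0, -10.0 / 8.0, 3.0 / 8.0
    X = X / X.norm()  # scale so singular values fall in the convergence region
    for _ in range(steps):
        G = X.T @ X  # Gram matrix
        # Fused R update: R = b*G + c*(G @ G) in one addmm kernel
        # instead of separate scalar-multiply, matmul, and add launches.
        R = torch.addmm(G, G, G, beta=b, alpha=c)
        # Fused Q update: X = a*X + X @ R in one addmm kernel. The a*I term
        # of the polynomial is absorbed into beta, so no torch.eye is built.
        X = torch.addmm(X, X, R, beta=a)
    return X

# When an identity term cannot be folded into addmm's beta, it can still be
# applied in place on the diagonal view, avoiding a torch.eye allocation:
Q = ns_orthogonalize(torch.randn(4, 3))
residual = Q.T @ Q
residual.diagonal().add_(-1.0)  # residual = Q^T Q - I, no eye() needed
```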
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>