Use accelerator API for dtype selection in Newton-Schulz iterations

Both Newton-Schulz (NS) functions now query the accelerator to choose
the compute dtype instead of hardcoding it. Standard NS calls
is_bf16_supported() to select bf16 or fp32; Gram NS calls
is_fp16_supported() to select fp16 or fp32.
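
A minimal sketch of the selection logic described above, not the actual
patch. StubAccelerator, standard_ns_dtype, and gram_ns_dtype are
hypothetical names used only for illustration; dtypes are plain strings
so the sketch runs without any framework, and the fp32 fallback when the
reduced-precision type is unsupported is an assumption drawn from the
"vs fp32" wording.

```python
from dataclasses import dataclass


@dataclass
class StubAccelerator:
    """Hypothetical stand-in for the accelerator object (not a real API)."""
    bf16: bool
    fp16: bool

    def is_bf16_supported(self) -> bool:
        return self.bf16

    def is_fp16_supported(self) -> bool:
        return self.fp16


def standard_ns_dtype(accel) -> str:
    # Standard NS: use bf16 if the accelerator reports support,
    # otherwise fall back to fp32 (assumed fallback).
    return "bf16" if accel.is_bf16_supported() else "fp32"


def gram_ns_dtype(accel) -> str:
    # Gram NS: use fp16 if the accelerator reports support,
    # otherwise fall back to fp32 (assumed fallback).
    return "fp16" if accel.is_fp16_supported() else "fp32"
```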

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>