Compute global gradient norm according to 'enable_grad_norm_clip' (#5728)
* Introduce PassThrough op to wait until all gradients are ready before the weight update
* Compute gradient norm for fp32 runs
* Update FE UT expected value
* Respect enable_grad_norm_clip
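
For reviewers, the intended behavior can be sketched roughly as below. This is an illustration in plain Python, not the code in this change: the helper names, `max_norm`, and `eps` are hypothetical, and it assumes that the global norm is only computed when `enable_grad_norm_clip` is set and that "global" means one L2 norm over all gradients together.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradients, treated as one flat vector."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def maybe_clip(grads, enable_grad_norm_clip, max_norm=1.0, eps=1e-6):
    """Return (gradients, norm); the norm is None when clipping is disabled.

    max_norm and eps are illustrative defaults, not values from this change.
    """
    if not enable_grad_norm_clip:
        # Flag is off: skip the norm computation and leave gradients untouched.
        return grads, None
    norm = global_grad_norm(grads)
    # Scale every gradient by the same factor so the global norm is <= max_norm.
    scale = min(1.0, max_norm / (norm + eps))
    return [[g * scale for g in grad] for grad in grads], norm


if __name__ == "__main__":
    grads = [[3.0], [4.0]]
    print(maybe_clip(grads, enable_grad_norm_clip=False))
    print(maybe_clip(grads, enable_grad_norm_clip=True))
```

In this sketch every gradient must be available before `global_grad_norm` can run, which is the same ordering constraint the PassThrough op is described as enforcing ahead of the weight update.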