Disable nvtx decorator to avoid graph break (#5697)
`instrument_w_nvtx` causes a graph break because `range_push` and
`range_pop` return a non-tensor int.
This PR disables the decorator to avoid the graph break.
The graph break has a measurable performance cost. In my environment, the
training iteration time for Llama-3-8B on 4 GPUs with ZeRO1 improves from
3.02s to 2.54s.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>