Use new CUDA kernel launch code in nvprof parsing (#35016)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/33986.
The meanings of cbid 13 and 211 can be found here:
https://github.com/ezyang/nvprof2json/blob/837c094852c9c5164344db7c19432da37d9a8b09/nvprof2json.py#L238
https://github.com/ezyang/nvprof2json/blob/837c094852c9c5164344db7c19432da37d9a8b09/nvprof2json.py#L436
or in the header file at `/usr/local/cuda/extras/CUPTI/include/cupti_runtime_cbid.h`.
See also [this Stack Overflow question](https://stackoverflow.com/questions/48552390/whats-the-difference-between-launching-with-an-api-call-vs-the-triple-chevron-s). I also ran the profiling code from the issue on CUDA 9.2, where the cbid is already 211. In case someone builds PyTorch against an older CUDA version, the assertion accepts both 13 and 211.
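A minimal sketch of the relaxed check, for illustration only. The constant and function names are hypothetical and not taken from the PyTorch source. The mapping of 13 to the legacy `cudaLaunch` callback and 211 to `cudaLaunchKernel` is an assumption based on the links above. Only the `cbid` key and the two accepted values come from this PR.

```python
# Hypothetical names; only the values 13 and 211 come from this PR.
CBID_CUDA_LAUNCH = 13          # assumed: legacy cudaLaunch callback id
CBID_CUDA_LAUNCH_KERNEL = 211  # assumed: cudaLaunchKernel callback id
KERNEL_LAUNCH_CBIDS = (CBID_CUDA_LAUNCH, CBID_CUDA_LAUNCH_KERNEL)


def is_kernel_launch(row):
    """Return True if a parsed nvprof runtime row is a kernel launch."""
    return row["cbid"] in KERNEL_LAUNCH_CBIDS


def check_launch_rows(rows):
    """Assert that every row is a kernel launch; return the row count."""
    for row in rows:
        assert is_kernel_launch(row), "unexpected cbid %r" % row["cbid"]
    return len(rows)
```

Accepting both ids keeps the parser working whether the trace was recorded with an older or a newer CUDA runtime.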
cc csarofeen ptrblck ezyang ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35016
Differential Revision: D20550879
Pulled By: ezyang
fbshipit-source-id: 968efc5e1126f1dd31acc9f5f4463f351d8a4c4f