[inductor] use torch.prifiler in the triton wrapper (#97405)
I think it's helpful to use torch.profiler to profile the triton wrapper.
E.g., I tried it for nvidia_deeprecommender's infernece graph.
Even with max-autotune, we see the majority of the time the GPU is running 2 mm/addmm op. That's why max autotune does not help for this model since tuning does not affect the external mm ops.
<img width="711" alt="Screenshot 2023-03-22 at 5 49 28 PM" src="https://user-images.githubusercontent.com/52589240/227072474-2f0d7205-4a10-4929-b1b7-551214788c61.png">
next step I'll check why the triton mm kernels are not picked.
EDIT: the above screenshot is captured without max-autotune due to a typo. below is the trace with max-autotune enabled:
<img width="712" alt="Screenshot 2023-03-22 at 6 43 26 PM" src="https://user-images.githubusercontent.com/52589240/227077624-fdccf928-be08-4211-871b-a9e3d7b76fbe.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97405
Approved by: https://github.com/ngimel