Check if `multi_tensor_apply_kernel` was called (#92077)
Replace every hard-coded count of CUDA kernel launches with a check that `multi_tensor_apply_kernel` was called. The dependency on the Kineto profiler is kept.
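For illustration only, a check of this kind could look roughly like the sketch below. It is not the code from this PR: the helper names `kernel_was_called` and `profiled_event_names` are hypothetical, and matching the kernel by substring on the profiled event name is an assumption.

```python
from typing import Callable, Iterable, List

# Kernel named in this PR; matching it by substring is an assumption.
KERNEL_NAME = "multi_tensor_apply_kernel"


def kernel_was_called(event_names: Iterable[str], kernel: str = KERNEL_NAME) -> bool:
    """Return True if any profiled event name mentions the given kernel."""
    return any(kernel in name for name in event_names)


def profiled_event_names(fn: Callable[[], None]) -> List[str]:
    """Run fn under the Kineto-backed profiler and return the event names.

    Hypothetical helper. It needs a CUDA build of PyTorch, so torch is
    imported lazily and kernel_was_called stays usable without it.
    """
    import torch
    from torch.profiler import ProfilerActivity, profile

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn()
        # Wait for queued kernels so they are recorded before the profiler stops.
        torch.cuda.synchronize()
    return [event.name for event in prof.events()]
```

A test would then assert `kernel_was_called(profiled_event_names(run_op))`, where `run_op` is whatever operation is under test, instead of comparing the number of launches to a fixed constant. The assertion no longer breaks when the launch count changes for unrelated reasons.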
Rel: https://github.com/pytorch/pytorch/pull/91844#issuecomment-1379844523
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92077
Approved by: https://github.com/ngimel