Emit torch.cuda.synchronize() after every kernel call in inductor (#90472)
Debugging illegal memory accesses is hard; even setting CUDA_LAUNCH_BLOCKING=1
and using C10_CUDA_KERNEL_LAUNCH_CHECK doesn't necessarily guarantee that
you'll get a stack trace pointing to the right kernel. This diff adds a config
option to force a CUDA synchronize after every kernel call in inductor, for
debugging those tricky cases.
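
A rough sketch of the idea, not the actual inductor code: when a debug flag is on, the generated wrapper gets a `torch.cuda.synchronize()` line after each kernel launch, so an asynchronous CUDA error is reported right after the kernel that caused it. The names `DebugConfig`, `sync_after_kernel`, and `emit_kernel_calls` are hypothetical and chosen only for illustration; the real option name and codegen entry point may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DebugConfig:
    # Hypothetical flag; the real inductor config option may be named differently.
    sync_after_kernel: bool = False


def emit_kernel_calls(kernel_calls: List[str], config: DebugConfig) -> List[str]:
    """Return wrapper source lines, adding a sync after each call when enabled."""
    lines: List[str] = []
    for call in kernel_calls:
        lines.append(call)
        if config.sync_after_kernel:
            # Forces any pending CUDA error to surface before the next launch.
            lines.append("torch.cuda.synchronize()")
    return lines


# Illustrative kernel call strings; these are made up, not real inductor output.
calls = ["kernel0.run(buf0, 1024)", "kernel1.run(buf0, buf1, 1024)"]
generated = emit_kernel_calls(calls, DebugConfig(sync_after_kernel=True))
print("\n".join(generated))
```

With the flag off, the emitted lines are unchanged, so the extra synchronization cost is only paid while debugging.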
Differential Revision: [D41744967](https://our.internmc.facebook.com/intern/diff/D41744967/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90472
Approved by: https://github.com/jansel