[inductor] Support using the 'stream' param in AOT mode (#105589)
Summary:
Make use of the existing stream param when in AOT mode:
- Pass the stream param through to the launchKernel helper function and use it there.
- In non-AOT mode, assign the stream param in the caller and pass it to launchKernel.
- Use a CUDAStreamGuard so that all fallback ops execute on the stream.
- CUDAStreamGuard subsumes CUDAGuard in AOT mode, since it sets both the stream and the device.
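To illustrate why a single stream guard can replace a separate device guard, here is a minimal, CUDA-free sketch of the RAII pattern. It is a toy model only: ToyState, ToyStreamGuard, current_state, toy_fallback_op_stream and run_with_stream are hypothetical names invented for this example, and the code does not reproduce how at::cuda::CUDAStreamGuard is implemented.
```
#include <cassert>

// Toy stand-in for the per-thread "current device" / "current stream"
// bookkeeping. Hypothetical: this is not ATen or CUDA runtime code.
struct ToyState {
    int device = 0;
    int stream = 0;  // 0 plays the role of the default stream
};

inline ToyState& current_state() {
    static thread_local ToyState state;
    return state;
}

// RAII guard: switches BOTH device and stream on construction and
// restores the previous values on destruction.
class ToyStreamGuard {
public:
    ToyStreamGuard(int device, int stream) : saved_(current_state()) {
        current_state().device = device;
        current_state().stream = stream;
    }
    ~ToyStreamGuard() { current_state() = saved_; }
    ToyStreamGuard(const ToyStreamGuard&) = delete;
    ToyStreamGuard& operator=(const ToyStreamGuard&) = delete;

private:
    ToyState saved_;
};

// Stand-in for a fallback op that implicitly picks up the current stream.
inline int toy_fallback_op_stream() { return current_state().stream; }

// Runs the "fallback op" under the guard and reports the stream it saw.
inline int run_with_stream(int device, int stream) {
    ToyStreamGuard guard(device, stream);
    return toy_fallback_op_stream();
}
```
In this model, any op called while the guard is alive sees the caller-provided stream, and the previous device and stream come back once the guard goes out of scope, so no separate device guard is needed around the same region.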
Test Plan:
- Ran cpp_wrapper tests: pytest test/inductor/test_cpp_wrapper.py
- Manually inspected cpp output from the alexnet benchmark:
a) In AOT mode:
```
static inline void launchKernel(
        CUfunction func,
        int gridX,
        int gridY,
        int gridZ,
        int numWraps,
        int sharedMemBytes,
        void* args[],
        cudaStream_t stream) {
    AT_CUDA_DRIVER_CHECK_OVERRIDE(cuLaunchKernel(
        func, gridX, gridY, gridZ, 32*numWraps, 1, 1, sharedMemBytes, stream, args, nullptr));
}
...
at::cuda::CUDAStreamGuard stream_guard(at::cuda::getStreamFromExternal(stream, 0));
...
launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream);
...
```
b) In regular (non-AOT) cpp wrapper mode:
```
...
at::cuda::CUDAGuard device_guard(0);
cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
...
launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream0);
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105589
Approved by: https://github.com/desertfire