[caffe2][cuda] Fix instrumentation of malloc/free SDTs for `CUDACachingAllocator` (#108907)
Summary:
There's currently a bug in `CUDACachingAllocator` (introduced in D48229150) which makes it impossible to determine whether a `malloc`ed sample has been deallocated.
It happens because we currently instrument the `malloc` SDT **before** a block of memory has been allocated by either `cudaMalloc` or the local caching allocator's `malloc` call. Since this is a static tracepoint, it receives its argument values at the point of instrumentation. Currently, it receives the memory pointer, `void* p`, which is still NULL at that point, so the sample can never be matched to the later `free` of the real address.
Changes in this diff:
1) Move this SDT to right before the `allocate` function returns, so that memory has been allocated already and `p` pointer points to a valid, non-NULL address.
2) Enable tracing of `cudaMalloc` calls, in addition to `NativeCachingAllocator::malloc`.
3) Rename a poorly named local variable: `r` --> `devPtr` (pointer to the allocated memory block).
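
The effect of change 1 can be sketched in isolation. The snippet below is a simplified illustration, not the actual allocator code: `trace_malloc`, `trace_free`, and the two `allocate_*` functions are hypothetical names, the event vectors stand in for what a tracer would collect from the SDT probes, and `std::malloc` stands in for `cudaMalloc` or the caching allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical stand-ins for the malloc/free SDT probes: each one records the
// pointer value it is given at the moment it fires, as a static tracepoint would.
static std::vector<std::uintptr_t> g_malloc_events;
static std::vector<std::uintptr_t> g_free_events;

inline void trace_malloc(void* p) {
  g_malloc_events.push_back(reinterpret_cast<std::uintptr_t>(p));
}
inline void trace_free(void* p) {
  g_free_events.push_back(reinterpret_cast<std::uintptr_t>(p));
}

// Before the fix (simplified): the probe fires before the allocation,
// so the recorded address is always NULL.
void* allocate_probe_too_early(std::size_t size) {
  void* devPtr = nullptr;
  trace_malloc(devPtr);        // records 0
  devPtr = std::malloc(size);  // stand-in for cudaMalloc / caching allocator
  return devPtr;
}

// After the fix (simplified): the probe fires right before returning,
// once devPtr holds the allocated address.
void* allocate_probe_before_return(std::size_t size) {
  void* devPtr = std::malloc(size);
  trace_malloc(devPtr);        // records the real address
  return devPtr;
}

void deallocate(void* p) {
  trace_free(p);               // the free probe always sees the real address
  std::free(p);
}

// A tracer pairs malloc and free samples by address.
bool last_malloc_matches_last_free() {
  return !g_malloc_events.empty() && !g_free_events.empty() &&
         g_malloc_events.back() == g_free_events.back();
}
```

With the early probe, the recorded malloc address is NULL and never equals the address seen by the free probe, so the sample cannot be marked as freed; with the probe placed before the return, the two addresses match.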
Test Plan:
Tested with a local PyTorch script that leaks memory. Verified the following:
* prior to this fix (prod), malloc samples are **not** marked as "freed"
* with the fix (branch), samples **are** marked as "freed"
* results are comparable with those from the current uprobe-based implementation that samples PyTorch malloc events in `gpusnoop`
Reviewed By: chaekit
Differential Revision: D48873734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108907
Approved by: https://github.com/chaekit