Fix nanmedian result using more CUDA memory than necessary (#68591)
Summary:
CUDA's `at::nanmedian` creates a sorted copy of the input, then indexes into it to create a single-element view. This view necessarily keeps the entire `sorted` tensor's storage alive, which can be avoided by returning a copy instead — this is what `at::median` already does indirectly via `at::where`.
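The lifetime problem is the same one a Python `memoryview` exhibits: a tiny view into a large buffer pins the whole buffer's memory, while a copy does not. A minimal stdlib sketch (illustrative analogy only, not the actual CUDA code):

```python
# A large buffer stands in for the sorted temporary tensor.
big = bytearray(10**6)
# A one-byte view stands in for the single-element "median" view.
view = memoryview(big)[500_000:500_001]

# While the view exists, the underlying buffer cannot be released:
pinned = False
try:
    big.clear()          # resizing is blocked by the exported view
except BufferError:
    pinned = True

copy = bytes(view)       # an independent one-byte copy
view.release()           # once the view is gone...
big.clear()              # ...the large buffer can be freed
```

Returning the copy rather than the view lets the large sorted temporary be deallocated as soon as the kernel returns.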
This also changes the index variable `k` to a plain `int64_t` instead of the CUDA tensor used before. That avoids the extra host and device operations incurred by calling `Tensor`'s `operator-`, which helps offset the cost of the `clone` added here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68591
Reviewed By: dagitses
Differential Revision: D32538538
Pulled By: ngimel
fbshipit-source-id: abe9888f80cf9d24d50a83da756e649af1f6ea3b