Refine GatherND CPU/CUDA Kernels & Add UTs (#3688)
* Refactor GatherND CPU Kernel (Rename & Simplify)
* Add tests for batch_dim=1 and 2 and for negative slices
* Rename gather_nd_gard_impl.cu
* Use dispatcher to refactor CUDA GatherND/GatherNDGrad
* Rename GatherNDBase::CommonComputeKernel to GatherNDBase::PrepareCompute
* Use HasCudaEnvironment instead of __CUDA_ARCH__ for some double type tests