Add nightly pipeline for MI100 to run convergence and batch size tests, similar to V100. (#6611)
* Partial updating of ROCM reduction code.
* Update reduction_all.cu
* Add reduce template parameters.
* miopen common
* Reuse CUDA's reduction_functions.cc
* Reduction ops.
* Update remaining reduction ops to use MIOpen. The double datatype is not supported, so disable the double-typed kernels.
* Disable a couple more unsupported tests.
* Code formatting.
* Delete ROCM-specific reduction code that is identical to CUDA reduction code.
* Fix scratch buffer early free.
* Fix merge conflict.
* first attempt nightly amd ci pipeline
* try to fix bad yaml file
* try again with corrected model directory
* add convergence test as well
* update reference loss for amd mi100
* include mi100 test results csv
* update the mi100 convergence test reference values
* update batch sizes for mi100 32g
* fix gpu sku for run_convergence_test.py
* undo unrelated changes to master
* address pr comments
* address pr comment
Co-authored-by: Jesse Benson <jesseb@microsoft.com>