Fix multi-tensor LAMB reduction to be deterministic (#6028)
* define the ordering of the reduction across thread blocks so results are reproducible (see the reduction sketch after this list)
* save state
* remove debug code
* address review comments
* significant correction: reduce only over blocks belonging to the same tensor (also covered by the reduction sketch below)
* address more review comments
* update rocm/lamb.cc so it builds as well
* remove the 2048 * size factor from the multi-tensor test until the threshold error on ROCm is resolved
* convert the tuple to a struct, as recommended in review (see the struct sketch after this list)
* update comment
* apply perfect forwarding in launch_multitensor to permit passing a reference rather than a pointer (see the forwarding sketch after this list)
* remove excess template arguments from the ROCm lamb.cc launch_multitensor as well
* fixes for the AMD build
* address remaining PR comments
* run formatter from VS Code
* run formatter on CUDA files
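
For context on the determinism fix, the sketch below shows the general two-pass pattern implied by the first bullet: each block writes its partial sum to a fixed slot instead of using atomicAdd (whose accumulation order varies between runs), and a second pass combines the partials in ascending block order, making the floating-point result bitwise reproducible. All names here are illustrative, not the actual kernels touched by this PR; in the multi-tensor case each tensor would own its own slice of partial_sums, matching the correction that blocks are reduced only with other blocks of the same tensor.

```cpp
#include <cuda_runtime.h>

// Pass 1: every block reduces its grid-stride slice into shared memory,
// then writes the block's partial sum to a slot indexed by blockIdx.x.
// (Assumes blockDim.x is a power of two for the tree reduction.)
__global__ void PartialSumKernel(const float* data, int n, float* partial_sums) {
  extern __shared__ float smem[];
  float acc = 0.f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    acc += data[i];
  }
  smem[threadIdx.x] = acc;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial_sums[blockIdx.x] = smem[0];  // fixed slot, no atomics
}

// Pass 2: one thread folds the partials in ascending block order, so the
// summation order (and thus the rounded result) is identical every run.
__global__ void FinalSumKernel(const float* partial_sums, int num_blocks, float* out) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    float acc = 0.f;
    for (int b = 0; b < num_blocks; ++b) acc += partial_sums[b];
    *out = acc;
  }
}
```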
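
The tuple => struct conversion is a pure readability change; below is a generic before/after with hypothetical field names (the real fields in lamb.cc differ):

```cpp
#include <tuple>

// Before: positional std::get<> calls obscure meaning at the call site.
using ChunkTuple = std::tuple<int, int, int>;  // tensor_idx, offset, size
int ChunkEndBefore(const ChunkTuple& t) {
  return std::get<1>(t) + std::get<2>(t);
}

// After: a named struct documents itself everywhere it is used.
struct ChunkInfo {
  int tensor_idx;
  int chunk_offset;
  int chunk_size;
};
int ChunkEndAfter(const ChunkInfo& c) {
  return c.chunk_offset + c.chunk_size;
}
```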
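
Likewise, the perfect-forwarding change is a standard C++ pattern; this sketch uses an invented signature, not launch_multitensor's real one, to show why taking TFunctor&& with std::forward lets callers pass a reference (or a temporary) where a pointer was previously required:

```cpp
#include <utility>

// A deduced TFunctor&& binds to both lvalues and rvalues, so no explicit
// pointer parameter is needed; std::forward preserves the value category.
template <typename TFunctor, typename... TArgs>
void launch_multitensor(TFunctor&& functor, TArgs&&... args) {
  std::forward<TFunctor>(functor)(std::forward<TArgs>(args)...);
}

struct LambStage1 {
  void operator()(int chunk_count) const { /* launch kernel per chunk */ }
};

void Example() {
  LambStage1 stage;
  launch_multitensor(stage, /*chunk_count=*/4);  // lvalue reference, no pointer
  launch_multitensor(LambStage1{}, 4);           // a temporary also works
}
```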