[BE] Update cutlass with NVIDIA upstream changes to 3.1 (#100333)
Updates cutlass with additional upstream changes that went into the 3.1 rc. We already merged 3.1, so it is best to get these performance and other fixes into master as well. Follow-up to #94188.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100333
Approved by: https://github.com/ezyang, https://github.com/jansel