Run mm decomposition tests for CPU and GPU (#108620)
Summary: Run mm decomposition tests for CPU and GPU
One nit: this will suppress CPU tests on hosts that have CUDA (i.e., TEST_CUDA is True) but don't have Triton, because we can't tell whether a given test is actually for CPU or CUDA (that would require reading the device argument).
(This is a general limitation of torch.compile tests, because on CUDA they require Triton in the std config.)
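A minimal sketch of why the CPU tests get suppressed, assuming the skip is decided by a decorator that only sees host-level flags. `TEST_CUDA` mirrors the flag named above; `HAS_TRITON`, `skip_if_cuda_without_triton`, `device_aware_skip`, and `MmDecompTests` are hypothetical names for illustration, not the actual PyTorch test helpers.

```python
import unittest

# Hypothetical host-level flags; a real suite would probe the environment.
TEST_CUDA = True    # host has CUDA
HAS_TRITON = False  # but Triton is not installed


def skip_if_cuda_without_triton(fn):
    """Hypothetical skip decorator that never sees the test's device.

    Only host-level flags are available here, so the CPU variant is
    skipped too whenever CUDA is present and Triton is missing.
    """
    return unittest.skipIf(
        TEST_CUDA and not HAS_TRITON, "requires Triton on CUDA hosts"
    )(fn)


def device_aware_skip(device: str) -> bool:
    """Hypothetical alternative: only CUDA variants need Triton.

    This needs the device argument, which the decorator above lacks.
    """
    return device.startswith("cuda") and not HAS_TRITON


class MmDecompTests(unittest.TestCase):
    @skip_if_cuda_without_triton
    def test_mm_cpu(self):
        pass  # placeholder for a CPU mm decomposition check

    @skip_if_cuda_without_triton
    def test_mm_cuda(self):
        pass  # placeholder for a CUDA mm decomposition check
```

On a host with CUDA but no Triton, both tests are reported as skipped, including the CPU one. A device-aware check like `device_aware_skip` would skip only the CUDA variant, but it depends on the device argument that the decorator does not see.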
Test Plan: sandcastle, github
Differential Revision: D48998215
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108620
Approved by: https://github.com/bertmaher