Jack cao g/dynamo 2.0 cherry pick (#4653)
* Update the test to use the more standard torch.compile API (#4634)
* Move to the new torch.compile API
* Use torch.compile for inference test
* Use torch.compile for training too
* Make WaitDeviceOps block until device execution finishes (#4626)
* Make WaitDeviceOps block until device execution finishes
* Add comment
* Add test
* Use shared_mutex instead
* Update comment
* Fix typo
* Remove assertNotIn in the test since it is too unstable
* Handle op-by-op
* Update test_operations.py