[SR][easy] CPU fuser uses native control flow (#72544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72544
Now that Static Runtime supports control flow, the CPU fuser no longer needs to fall back to the JIT. Native control flow performs better because it avoids the heap allocations and ref count bumps that stack construction incurs.
I've left the old `prim::TensorExprDynamicGroup` around in case we need to support it in the future. I've also added native support for a few scalar ops that are used inside the control flow sub-blocks.
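For illustration only, here is a minimal C++ sketch of the stack-construction cost described in this summary. `Value`, `refs_seen_via_stack`, and `refs_seen_in_place` are hypothetical names, and `std::shared_ptr` stands in for a ref-counted value; none of this is the actual Static Runtime or JIT code.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a ref-counted runtime value (assumption, not c10::IValue).
using Value = std::shared_ptr<long>;

// Fallback-style call: inputs are copied into a freshly built stack.
// The vector allocates its buffer on the heap, and every copied element
// bumps the ref count of the value it shares.
long refs_seen_via_stack(const std::vector<Value>& inputs) {
  std::vector<Value> stack(inputs.begin(), inputs.end());
  return stack.empty() ? 0 : stack.front().use_count();
}

// Native-style call: the inputs are read in place, so no extra
// container is allocated and no ref counts change.
long refs_seen_in_place(const std::vector<Value>& inputs) {
  return inputs.empty() ? 0 : inputs.front().use_count();
}
```

With a single owner per input, the stack-based version should observe a ref count of 2 for the first input while the in-place version observes 1. The sketch only shows where the extra work comes from; it does not measure the speedup.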
ghstack-source-id: 148825816
Test Plan: New unit tests
Reviewed By: d1jang
Differential Revision: D34083080
fbshipit-source-id: a7ffc0fda39ab3df3ba47e44a03d857131dc1e50
(cherry picked from commit 2ef39e0e54d5e9da76af9e617a11233ffc81b011)