[nvFuser] Transform replay refactor and minor updates (#39579)
Summary:
This PR bundles several changes into one push back to upstream so we don't get too desynced.
- Major refactor of transform replay. It is now far more robust and fixes bugs discovered in reductions. It also prepares for the extension to explicit broadcast ops, which will be the last major memory pattern needed for op coverage. Broadcast ops will let us express norms and gemms, and potentially more.
- Initial runtime expression evaluator. It evaluates expressions at runtime, which will be useful for determining the grid/block layout at runtime so we don't have to compute it manually from the code we're trying to generate.
- Moving to int64 and double for scalar representations to match PyTorch JIT.
- Improvements to the codegen interface: we now return a Tensor-like object instead of the parent class Val.
- Add `addcmul` and `lerp` ops.
- General updates, fixes, test additions, and test improvements.
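
A rough illustration of the runtime expression evaluator idea from the list above: a symbolic size expression is built once, then evaluated when concrete sizes are known, for example to pick a grid dimension. This is a minimal Python sketch, not nvFuser's actual API; the names `Sym`, `BinOp`, `evaluate`, and `block_x` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Sym:
    """A named scalar whose value is only known at runtime (hypothetical)."""
    name: str


@dataclass(frozen=True)
class BinOp:
    """A binary integer expression; op is one of "add", "mul", "ceildiv"."""
    op: str
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[int, Sym, BinOp]


def evaluate(expr: Expr, bindings: Dict[str, int]) -> int:
    """Recursively evaluate expr using the runtime values in bindings."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, Sym):
        return bindings[expr.name]
    lhs = evaluate(expr.lhs, bindings)
    rhs = evaluate(expr.rhs, bindings)
    if expr.op == "add":
        return lhs + rhs
    if expr.op == "mul":
        return lhs * rhs
    if expr.op == "ceildiv":
        # Ceiling division on integers without going through floats.
        return -(-lhs // rhs)
    raise ValueError(f"unknown op: {expr.op}")


# Built once, before sizes are known: grid_x = ceildiv(N * M, block_x).
numel = BinOp("mul", Sym("N"), Sym("M"))
grid_x = BinOp("ceildiv", numel, Sym("block_x"))

# Evaluated later, once the concrete tensor sizes are available.
launch_grid_x = evaluate(grid_x, {"N": 1000, "M": 3, "block_x": 128})
```

With these bindings the expression covers 3000 elements with blocks of 128 threads, so the grid dimension is rounded up instead of being computed by hand for each generated kernel.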
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39579
Differential Revision: D21974001
Pulled By: soumith
fbshipit-source-id: 7f7ccc91593466e948f3ce90f8f9b7fbc5c28de2