Reapply D25859132: [te] Optimize allocation of kernel outputs (#50546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50546
This reapplies D25859132 and also fixes the ROCm build.
ghstack-source-id: 119837166
Test Plan: CI
Reviewed By: ZolotukhinM
Differential Revision: D25912464
fbshipit-source-id: 023e1f6c9fc131815c5a7a31f4860dfe271f7ae1