62f676f5 - [te] Optimize allocation of kernel outputs (#50318)

3 years ago
[te] Optimize allocation of kernel outputs (#50318)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50318

We can skip the dispatcher and go to the device-specific `at::native::empty_strided` implementation. Also, unpacking the TensorOptions struct at kernel launch time actually takes a bit of work, since the optionals are encoded in a bitfield. Do this upfront and use the optionals directly at runtime.

ghstack-source-id: 119735738

Test Plan:

Before:
```
-------------------------------------------------------
Benchmark         Time             CPU      Iterations
-------------------------------------------------------
FusedOverhead     2143 ns          2142 ns      332946
UnfusedOverhead   2277 ns          2276 ns      315130
```

After:
```
-------------------------------------------------------
Benchmark         Time             CPU      Iterations
-------------------------------------------------------
FusedOverhead     2175 ns          2173 ns      321877
UnfusedOverhead   2394 ns          2394 ns      307360
```

(The noise in the baseline makes this really hard to read; it seemed to be about 3-5% faster in my local testing.)

Reviewed By: eellison

Differential Revision: D25859132

fbshipit-source-id: 8753289339e365f78c790bee076026cd649b8509