add AutoNonVariableTypeMode for USE_STATIC_DISPATCH on JIT->ATen path (#27274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27274
This is yet another fix to address #26764.
PR #26908 toggles NonVariableTypeMode in the ATen dispatcher, which is where
USE_STATIC_DISPATCH takes place and is thus the most logically sound place to
apply such a tweak.
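For context, a minimal sketch of the RAII guard pattern behind
AutoNonVariableTypeMode (simplified, hypothetical names in a `sketch`
namespace, not the actual ATen implementation, which saves and restores a
thread_local flag consulted at dispatch time):
```
namespace sketch {

thread_local bool non_variable_type_mode = false; // read at dispatch time

struct AutoNonVariableTypeMode {
  // Save the previous value and enable the mode: one thread_local write.
  explicit AutoNonVariableTypeMode(bool enabled = true)
      : prev_(non_variable_type_mode) {
    non_variable_type_mode = enabled;
  }
  // Restore the previous value on scope exit: one more thread_local write.
  ~AutoNonVariableTypeMode() {
    non_variable_type_mode = prev_;
  }
  bool prev_;
};

} // namespace sketch
```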
However, we observed a nontrivial perf regression due to this fix. It turns
out the numel() tensor method gets called in several for-loops, incurring ~7M
thread_local updates in a single forward call:
```
7173330 numel
558 size
416 q_scale
302 _empty_affine_quantized
288 contiguous
257 q_zero_point
216 qscheme
173 empty
110 set_
105 as_strided
104 permute
...
```
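For intuition on the cost, here is an illustrative sketch (hypothetical
`hot_loop` function, not code from this PR): because the loop condition
re-dispatches numel() on every iteration, a guard constructed inside each
dispatched call is built and torn down ~7M times in one forward pass.
```
#include <ATen/ATen.h>

// Illustrative only: each numel() dispatch toggles the thread_local flag
// twice (guard construction + destruction) when the guard sits inside the
// dispatched call rather than at the op boundary.
void hot_loop(const at::Tensor& tensor) {
  for (int64_t i = 0; i < tensor.numel(); ++i) {
    // ... per-element work ...
  }
}
```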
Since numel() is not called from a single place, a natural workaround is to
update function_wrapper.py so that it only adds the guard in the
gen_namespace_function() case and skips the gen_tensor_method() case. But some
tensor methods are actually called from the JIT side directly (e.g.
"aten::eq_" -> "(self).eq_"), so the only "band aid" left on the table is to
insert the guard on the JIT->ATen path as originally done in #26868. This is a
simplified version of it, as it doesn't hurt to extend the NonVariableTypeMode
scope a little bit to also cover the stack drop/pack calls.
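Roughly, the result looks like the following hedged sketch of a generated JIT
operator wrapper (the wrapper name `run_eq_` is hypothetical; peek/drop/pack
are the JIT stack helpers, and header paths and exact generated code may
differ by version):
```
#include <ATen/ATen.h>
#include <ATen/core/LegacyTypeDispatch.h> // at::AutoNonVariableTypeMode
#include <ATen/core/stack.h>              // torch::jit::peek/drop/pack

int run_eq_(torch::jit::Stack& stack) {
  // One guard per op call: the thread_local flag is toggled once on entry
  // and restored once on exit, instead of once per dispatched method.
  at::AutoNonVariableTypeMode non_var_type_mode(true);
  auto self = std::move(torch::jit::peek(stack, 0, 2)).toTensor();
  auto other = std::move(torch::jit::peek(stack, 1, 2)).toTensor();
  auto result = self.eq_(other); // tensor method invoked directly from JIT
  // drop/pack also run inside the guard's scope; extending the scope to
  // cover them is harmless and keeps the generated code simple.
  torch::jit::drop(stack, 2);
  torch::jit::pack(stack, std::move(result));
  return 0;
}
```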
On Android we only expose the JIT API, so we don't need to worry about
TensorMethods being called directly. On iOS we don't provide a wrapper yet,
but we can mention this caveat in the doc. Hopefully by the time it's widely
used we can finish the Variable/Tensor unification and remove all these hacks.
Test Plan:
- Verified it runs quantized/fp32 MobileNetV2 models;
- Verified it fixes the perf regression (reverting #26908 separately);
Differential Revision: D17732489
Pulled By: ljk53
fbshipit-source-id: c14ca66aebc6b6f17ad6efac7ca47f9487c98de5