make nanogpt work with both compiled autograd and _LazyGraphModule (#118981)
@xmfan and @fegin reported that `_LazyGraphModule` (https://github.com/pytorch/pytorch/pull/117911) makes nanogpt training fail with compiled autograd.
We have a repro:
```
python benchmarks/dynamo/torchbench.py --training --backend=inductor --disable-cudagraphs --accuracy --only nanogpt --repeat 1 --compiled-autograd
```
but we have not yet found a way to trigger the issue with a toy model.
The error message for the failure is https://gist.github.com/shunting314/6402a6388b3539956090b6bc098952fb . In compile_fx we call `detect_fake_mode`. This function looks for an active FakeTensorMode in both the TracingContext and the example inputs. The error is triggered because these two sources yield different FakeTensorModes.
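For intuition, the consistency check behaves roughly like the minimal sketch below. This is not the real implementation of `detect_fake_mode`; the helper parameters and the stand-in `FakeTensorMode` class are hypothetical, used only to show why two disagreeing sources trip an assertion:
```python
from typing import Optional


class FakeTensorMode:  # stand-in for torch._subclasses.FakeTensorMode
    pass


def detect_fake_mode_sketch(
    tracing_context_mode: Optional[FakeTensorMode],
    input_modes: list[FakeTensorMode],
) -> Optional[FakeTensorMode]:
    # Gather candidate modes from both sources: the active
    # TracingContext and the example inputs.
    candidates: list[FakeTensorMode] = []
    if tracing_context_mode is not None:
        candidates.append(tracing_context_mode)
    candidates.extend(input_modes)
    if not candidates:
        return None
    # All sources must agree on a single FakeTensorMode; a mismatch
    # is exactly the failure seen in the nanogpt repro.
    first = candidates[0]
    for mode in candidates[1:]:
        assert mode is first, "conflicting FakeTensorModes detected"
    return first
```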
Although I don't know the root cause of the FakeTensorMode discrepancy above, the fix here is to force _LazyGraphModule recompilation when compiled autograd is enabled. Most of the time this does not hurt compilation time, because when compiled autograd is enabled we call the graph module in the backward pass anyway: https://github.com/pytorch/pytorch/blob/855d5f144efc1db50316b9fcad1e62bf37caed10/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L705
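A minimal sketch of the shape of the fix, assuming `_LazyGraphModule` exposes a way to run its deferred `recompile()` eagerly (the `real_recompile` call and the wrapper function below are assumptions for illustration, not the exact code from this PR):
```python
import torch


def prepare_gm_for_compiled_autograd(
    gm: torch.fx.GraphModule,
    compiled_autograd_enabled: bool,
) -> torch.fx.GraphModule:
    # _LazyGraphModule defers recompile() until the module is actually
    # called. Under compiled autograd, force it eagerly so later tracing
    # does not observe a stale, not-yet-recompiled module.
    from torch.fx._lazy_graph_module import _LazyGraphModule  # internal API

    if compiled_autograd_enabled and isinstance(gm, _LazyGraphModule):
        gm.real_recompile()  # assumed helper that runs the deferred recompile
    return gm
```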
Let me know if there is a better fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118981
Approved by: https://github.com/jansel