[export] improve binary op fast path broadcast check (#121546)
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.
Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
break
```
As of right now, `op.shape == final_shape` checks whether one of the binary op's operands is the same as the binay op's output shape.
* If one of them is a dynamic shape, then we'll create a guard via`SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.
This is a problem when the # of dimensions aren't the same between `op.shape` & `final_shape`. Take the case above for example, `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although, the shapes aren't the same, it doesn't necessarily mean that `s0 != 10`.
Some thoughts (feel free to ignore). What if the # of dimensions are equal but one of the shapes has symbols. Here's three cases:
1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.
# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov