Add environment variable to force flattening of 3D input tensor (#96761)
Adds an environment variable, `TORCH_LINEAR_FLATTEN_3D`, to force flattening of a 3D input tensor even when it is non-contiguous.
Today, the `Linear` op flattens a 3D input tensor only if it is contiguous.
It was found that for some non-contiguous inputs (especially with the BF16 data type), flattening also yields higher performance.
For example:
```
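import torch
x_size = (3072, 1196, 128)
x = torch.rand(x_size, device="cuda", dtype=torch.bfloat16)
x = torch.transpose(x, 1, 2)  # non-contiguous, shape (3072, 128, 1196)
# weight/bias shapes are illustrative; out_features=1024 is an assumption
weight = torch.rand((1024, 1196), device="cuda", dtype=torch.bfloat16)
bias = torch.rand(1024, device="cuda", dtype=torch.bfloat16)
torch._C._nn.linear(x, weight, bias)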
```
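For readers unfamiliar with what "flattening" means here, the following is a minimal sketch of the idea, not the actual implementation inside the `Linear` op. The helper name `linear_flattened` and the small shapes are hypothetical and chosen only for illustration. It collapses the first two dimensions of a 3D input, applies a 2D linear, and restores the original leading dimensions.

```python
import torch
import torch.nn.functional as F


def linear_flattened(x, weight, bias=None):
    """Hypothetical helper: apply a linear layer via an explicit 2D flatten."""
    b, m, k = x.shape
    # reshape() copies the data when x is non-contiguous, yielding a 2D matrix
    x2d = x.reshape(b * m, k)
    y2d = F.linear(x2d, weight, bias)
    # Restore the leading (batch, rows) dimensions
    return y2d.reshape(b, m, weight.shape[0])


torch.manual_seed(0)
# Small CPU tensors for illustration; transpose makes the input non-contiguous
x = torch.rand(4, 16, 8).transpose(1, 2)  # shape (4, 8, 16)
weight = torch.rand(5, 16)
bias = torch.rand(5)

y_ref = F.linear(x, weight, bias)
y_flat = linear_flattened(x, weight, bias)
```

Both paths should produce the same values up to floating-point rounding; only the performance characteristics are expected to differ.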
Because the details of the underlying auto-tuning are not known, this PR adds an environment variable so that users can make the choice themselves.
(The default value is still 0.)
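A hedged usage sketch follows. It assumes that a value of `1` enables the forced flattening (the description above only states that the default is 0) and that the variable must be visible to the process before the `Linear` op runs; setting it before importing `torch` is a conservative choice, not a documented requirement. The shapes are small and hypothetical.

```python
import os

# Assumption: "1" enables forced flattening; set it before torch is imported
os.environ["TORCH_LINEAR_FLATTEN_3D"] = "1"

import torch
import torch.nn.functional as F

# Fall back to CPU/float32 so the sketch runs without a GPU
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
dtype = torch.bfloat16 if use_cuda else torch.float32

# Non-contiguous 3D input, shape (4, 8, 16) after the transpose
x = torch.rand(4, 16, 8, device=device, dtype=dtype).transpose(1, 2)
weight = torch.rand(5, 16, device=device, dtype=dtype)
bias = torch.rand(5, device=device, dtype=dtype)

y = F.linear(x, weight, bias)
```

The same effect can be had by exporting the variable in the shell before launching the script. Whether it helps depends on the input shape and data type, so it is worth timing both settings.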
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96761
Approved by: https://github.com/ngimel