[functorch] Generate 2^n tests, not 3^n tests for vmap (pytorch/functorch#937)
Previously, our vmap tests generated 3^n tests per OpInfo sample, where
n is the number of tensor arguments. Each tensor argument independently
took a bdim from (0, -1, None), and we tested every combination.
This is redundant and slow. The original purpose was to make sure
functorch's batching rules work with a bdim other than 0 (it's really
easy to forget that the bdim is not always at the front of the tensor).
The new strategy is to generate every combination of bdim = (-1, None)
across the tensor arguments and also include the case where all bdims
are 0 as a sanity check. This leads to roughly 2^n tests.
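As a rough illustration of the two strategies, here is a minimal sketch.
The function names are hypothetical and are not the actual test helpers.
It also assumes the all-None combination is dropped, since vmap needs at
least one batched input; under that assumption the new strategy yields
exactly 2^n choices.

```python
import itertools


def old_bdim_choices(num_tensors):
    # Hypothetical helper: every combination of (0, -1, None), one bdim
    # per tensor argument. The all-None combination is dropped (an
    # assumption of this sketch), leaving 3^n - 1 choices.
    all_none = (None,) * num_tensors
    return [
        choice
        for choice in itertools.product((0, -1, None), repeat=num_tensors)
        if choice != all_none
    ]


def new_bdim_choices(num_tensors):
    # Hypothetical helper: the all-zeros sanity check, plus every
    # combination of (-1, None) except all-None, for 2^n choices total.
    all_none = (None,) * num_tensors
    choices = [(0,) * num_tensors]
    choices.extend(
        choice
        for choice in itertools.product((-1, None), repeat=num_tensors)
        if choice != all_none
    )
    return choices


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, len(old_bdim_choices(n)), len(new_bdim_choices(n)))
```

With n = 10 this sketch produces 59048 old-style choices versus 1024
new-style ones, which is why the savings grow quickly with n.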
On my machine, test_vmap goes from 3m25s to 2m45s, which is promising.
However, the biggest wins are going to be in test_ops, where n can be as
high as 10.