[composite compliance] Add test for forward-mode AD
Fixes https://github.com/pytorch/pytorch/issues/74678
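For context, here is a minimal sketch of the kind of computation a forward-mode AD test exercises, using the public `torch.autograd.forward_ad` API. The helper name `jvp_via_forward_ad` is hypothetical and chosen for illustration; this is not the test added in this PR, and it does not perform the composite compliance checks themselves.

```python
import torch
import torch.autograd.forward_ad as fwAD


def jvp_via_forward_ad(fn, primal, tangent):
    """Hypothetical helper: run `fn` on a dual tensor and return (output, output tangent)."""
    with fwAD.dual_level():
        # Attach the tangent to the primal to form a dual tensor.
        dual_input = fwAD.make_dual(primal, tangent)
        dual_output = fn(dual_input)
        # Split the result back into its primal and tangent parts.
        unpacked = fwAD.unpack_dual(dual_output)
        return unpacked.primal, unpacked.tangent


x = torch.tensor([0.0, 1.0, 2.0])
v = torch.ones_like(x)
out, out_tangent = jvp_via_forward_ad(torch.sin, x, v)
# For sin, the Jacobian-vector product with v is cos(x) * v.
```

An op without a forward-mode AD formula would be expected to raise inside `fn` rather than return a tangent.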
Test timings:
```
======================================= 756 passed, 99 skipped, 13864 deselected, 76 xfailed, 16 warnings in 278.35s (0:04:38) =======================================
```
Slowest ops:
```
======================================================================== slowest 20 durations ========================================================================
32.16s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_instance_norm_cuda_float32
30.51s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad_nn_functional_instance_norm_cpu_float32
9.89s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad__masked_norm_cuda_float32
8.54s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad__masked_norm_cpu_float32
8.52s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_diff_cuda_float32
8.33s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_linalg_solve_triangular_cuda_float32
8.08s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad_linalg_solve_triangular_cpu_float32
8.03s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad_diff_cpu_float32
6.52s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_cov_cuda_float32
5.77s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad_cov_cpu_float32
4.12s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_lu_solve_cuda_float32
3.78s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad__masked_std_cuda_float32
3.67s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_gradient_cuda_float32
3.55s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad__masked_var_cuda_float32
3.47s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_max_pool2d_cuda_float32
3.42s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_batch_norm_without_cudnn_cuda_float32
3.40s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad_nn_functional_max_pool2d_cpu_float32
3.30s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad__masked_std_cpu_float32
3.30s call test/test_ops.py::TestCompositeComplianceCPU::test_forward_ad_gradient_cpu_float32
3.28s call test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_batch_norm_cuda_float32
====================================================================== short test summary info =======================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75178
Approved by: https://github.com/zou3519