Add padded, torch.sum benchmark for jagged_mean operator (#2354)
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2354
Add a `jagged_mean` reduction operator benchmark for nested tensors to TritonBench, using the PyTorch `torch.sum` function and [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26).
This diff implements a benchmark for reducing along the ragged dimension of 3-dimensional jagged tensors. For a 3-dimensional tensor of shape `(B, *, M)`, where `*` is the ragged dimension, the benchmark pads each variable-length 2-dimensional slice with zeros up to the maximum ragged length, yielding a dense tensor. It then divides the `sum` of the padded tensor along the ragged dimension by each slice's true length along `*`, computed with `x.offsets().diff()`; a sketch of this approach is shown below.
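A minimal sketch of the padded `torch.sum` approach, not the exact benchmark code: it assumes `x` is a nested tensor with `layout=torch.jagged` on a CUDA device, that the aten op accepts `(values, offsets, max_lengths, padding_value)`, and a hypothetical `max_seqlen` parameter giving the maximum length along `*`:

```python
import torch

def jagged_mean_padded_sum(x: torch.Tensor, max_seqlen: int) -> torch.Tensor:
    # `x` is assumed to be a nested tensor with layout=torch.jagged and shape
    # (B, *, M): `x.values()` is (total_rows, M), `x.offsets()` is (B + 1,).
    # Pad each variable-length (seqlen_i, M) slice with zeros up to max_seqlen,
    # producing a dense (B, max_seqlen, M) tensor.
    padded = torch.ops.aten._jagged_to_padded_dense_forward(
        x.values(), [x.offsets()], [max_seqlen], 0.0
    )
    # Zero padding contributes nothing to the sum along the ragged dimension,
    # so dividing by each slice's true length yields the mean. The lengths come
    # from an on-device `diff()` over the offsets, so no GPU/CPU sync occurs.
    lengths = x.offsets().diff().unsqueeze(1)  # (B, 1)
    return torch.sum(padded, dim=1) / lengths  # (B, M)

# Example: a jagged tensor with B = 2, ragged lengths 3 and 5, M = 8.
x = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)],
    layout=torch.jagged,
    device="cuda",
)
print(jagged_mean_padded_sum(x, max_seqlen=5).shape)  # torch.Size([2, 8])
```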
This benchmark avoids a GPU/CPU sync and is therefore faster than the two previous PyTorch benchmarks: D59144906, which incurs a GPU/CPU sync, and D59146024, which uses the unoptimized `torch.nanmean` function.
Reviewed By: davidberard98
Differential Revision: D59245842
fbshipit-source-id: f860b0d8bc98e27bb4dbea8dc44fac185ce5529f