e26cd75d - Add torch.sum benchmark for jagged_layernorm operator (#2376)

Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2376

Add to TritonBench a `jagged_layernorm` reduction operator benchmark for nested tensors using the PyTorch `torch.sum` function, [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), and [`torch.ops.aten._padded_dense_to_jagged_forward`](https://www.internalfb.com/code/fbsource/[16a15f9537d5a41100caaf394a398a0ab447d865]/xplat/caffe2/torch/_inductor/jagged_lowerings.py?lines=251).

This diff implements two benchmarks:

1. The baseline PyTorch benchmark uses `unbind` to call `torch.nn.LayerNorm` on each variable-length tensor in the nested tensor. This implementation is extremely slow, resulting in very high latency for all input shapes.

2. The more efficient PyTorch benchmark leverages `torch.sum` and ATen lowerings to pad a jagged tensor, perform the operations that make up a layer normalization, and unpad the result back into jagged format (see the sketch at the end of this message). First, the benchmark creates two padded dense tensors using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26): the first pads the input tensor, and the second pads a tensor of `1`s in the shape of the input to create a mask for the normalization. Layer normalization requires reductions over a layer of the input, dimensions `(1, 2)` in this case (as opposed to something like instance normalization, which reduces over dimension `1`). The `mean` is the `torch.sum` of the padded inputs divided by the number of valid elements per batch entry, i.e. the ragged length (derived from the offsets) times the size of the last dimension. To calculate the `variance`, padded values in the ragged dimension are masked out. Lastly, the layer normalization is completed by dividing the padded, mean-centered input by the square root of the variance plus a small epsilon (preventing division by zero); the result is then unpadded using [`torch.ops.aten._padded_dense_to_jagged_forward`](https://www.internalfb.com/code/fbsource/[16a15f9537d5a41100caaf394a398a0ab447d865]/xplat/caffe2/torch/_inductor/jagged_lowerings.py?lines=251).

The latter PyTorch benchmark avoids a GPU/CPU sync and is faster than the `unbind` implementation, with the margin varying by input shape.

Notes
- This [article](https://wandb.ai/wandb_fc/LayerNorm/reports/Layer-Normalization-in-Pytorch-With-Examples---VmlldzoxMjk5MTk1) was helpful in understanding layer normalization (as opposed to other normalization variants)!
- This implementation also handles nested tensors in which some variable-length tensors have `seqlen = 0`.

Reviewed By: davidberard98

Differential Revision: D59332017

fbshipit-source-id: b54b8afe064e2adb2ab5dbfc20f0a08b178e354e
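For reference, below is a minimal sketch of the padded-dense approach described in item 2, assuming a CUDA device and the ATen jagged/padded-dense ops named in the summary. The function name `jagged_layer_norm`, the `max_seqlen` argument, and the `eps` default are illustrative, not the actual TritonBench operator code; exact argument handling in the benchmark may differ.

```python
import torch


def jagged_layer_norm(x: torch.Tensor, max_seqlen: int, eps: float = 1e-6) -> torch.Tensor:
    """Layer-normalize each (seqlen_i, M) component of a jagged nested tensor by
    padding to a dense (B, max_seqlen, M) tensor, reducing over dims (1, 2), and
    unpadding the result. `max_seqlen` is assumed known from input generation,
    which avoids a GPU/CPU sync."""
    values, offsets = x.values(), x.offsets()  # values: (total_L, M), offsets: (B + 1,)
    B, M = x.size(0), values.size(1)

    # Pad the jagged input and a tensor of ones (the validity mask) to dense form.
    padded = torch.ops.aten._jagged_to_padded_dense_forward(
        values, [offsets], max_lengths=[max_seqlen]
    )
    mask = torch.ops.aten._jagged_to_padded_dense_forward(
        torch.ones_like(values), [offsets], max_lengths=[max_seqlen]
    )

    # Valid elements per batch entry: ragged length (from offsets) * M.
    # clamp_min(1) guards seqlen == 0 entries, whose rows are dropped on unpadding anyway.
    counts = ((offsets[1:] - offsets[:-1]) * M).clamp_min(1).view(B, 1, 1)

    # Layer norm reduces over dims (1, 2); padded positions are zeros, so
    # dividing the sum by `counts` yields the per-entry mean.
    mean = torch.sum(padded, dim=(1, 2), keepdim=True) / counts
    centered = (padded - mean) * mask  # mask out padded positions before the variance
    var = torch.sum(centered * centered, dim=(1, 2), keepdim=True) / counts

    normalized = centered / torch.sqrt(var + eps)

    # Convert the padded, normalized result back to jagged values.
    return torch.ops.aten._padded_dense_to_jagged_forward(
        normalized, [offsets], total_L=values.size(0)
    )


if __name__ == "__main__":
    # Example: a jagged nested tensor with embedding dim M = 128 and ragged lengths 3, 5, 7.
    nt = torch.nested.nested_tensor(
        [torch.randn(s, 128, device="cuda") for s in (3, 5, 7)],
        layout=torch.jagged,
    )
    out_values = jagged_layer_norm(nt, max_seqlen=7)  # shape: (3 + 5 + 7, 128)
    print(out_values.shape)
```

Because the padded positions are zeroed before both reductions, dividing by the per-entry count rather than `max_seqlen * M` gives the same statistics as normalizing each variable-length tensor individually, without ever unbinding the nested tensor.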