Unify attention output handling (#2343)

Commit

1 year ago

Unify attention output handling (#2343)

- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention` functions.

This removes the difference in how the output is handled: attention previously wrote to an output parameter, while paged attention returned its result. It also removes the assumption that the attention implementation can write to a caller-supplied output tensor (in preparation for FlashInfer).
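The difference between the two calling conventions can be sketched in plain Python. This is not the code from the commit: the names `attention_out_param` and `_attend`, the list-of-lists "tensors", and the unmasked single-head softmax attention are all assumptions made for illustration. Only the shape of the API change (output parameter versus return value) is taken from the commit message.

```python
import math
from typing import List

Matrix = List[List[float]]  # stand-in for a 2-D tensor (assumption)


def _softmax(row: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def _attend(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    # Hypothetical helper: unmasked scaled dot-product attention, one head.
    scale = 1.0 / math.sqrt(len(query[0]))
    result = []
    for q in query:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        weights = _softmax(scores)
        result.append(
            [sum(w * v[j] for w, v in zip(weights, value))
             for j in range(len(value[0]))]
        )
    return result


def attention_out_param(query: Matrix, key: Matrix, value: Matrix,
                        out: Matrix) -> None:
    # Old convention (illustrative): the caller preallocates `out` and the
    # function writes into it. This assumes the backend can write in place.
    for i, row in enumerate(_attend(query, key, value)):
        out[i][:] = row


def attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    # Unified convention (illustrative): the output is created inside the
    # function and returned, matching how paged attention returns its result.
    return _attend(query, key, value)


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

out = [[0.0, 0.0]]
attention_out_param(q, k, v, out)  # result lands in `out`
hidden = attention(q, k, v)        # result is the return value
```

With the return-value convention, a backend that allocates its own output no longer has to be adapted to a caller-owned buffer, which is the constraint the commit message mentions for FlashInfer.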

References

#2343 - Unify attention output handling

Author

danieldk

Parents

22fb1be5

text-generation-inference 47447ef0 - Unify attention output handling (#2343)