Add optional torchembed RoPE backend to apply_rotary_pos_emb (#8052)
Adds `torchembed` as an optional fused RoPE backend for
`deepspeed.sequence.layer.apply_rotary_pos_emb()`, following the same
pattern used in transformers and vLLM.
## Changes
- **`deepspeed/sequence/layer.py`**: Add a `try/except ImportError` guard
around the import of `torchembed._triton.fused_rope_forward`. When
`torchembed` is installed, the tensor is on CUDA, and `rotary_dim` is
even, the function dispatches to the fused Triton kernel; otherwise it
falls back to the PyTorch reference path.
- **`setup.py`**: Add `torchembed` extras key (`pip install
deepspeed[torchembed]`).
- **`tests/unit/sequence/test_apply_rotary_pos_emb.py`**: Check numerical
correctness against the PyTorch reference across `seq_len` (1/17/128),
`dim` (32/64/128), and several `rotary_dim` values, plus a gradient-flow
test.
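
The guard and dispatch conditions listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `layer.py`: the import path is taken from this description, and `HAS_TORCHEMBED` and `use_fused_rope` are hypothetical names introduced for the example.

```python
# Optional-backend guard: use the fused kernel only if torchembed imports.
try:
    # Import path as named in this change description (assumed importable as-is).
    from torchembed._triton import fused_rope_forward
    HAS_TORCHEMBED = True
except ImportError:
    fused_rope_forward = None
    HAS_TORCHEMBED = False


def use_fused_rope(has_backend: bool, is_cuda: bool, rotary_dim: int) -> bool:
    """Hypothetical predicate mirroring the three dispatch conditions."""
    return has_backend and is_cuda and rotary_dim % 2 == 0


# A CPU tensor always takes the PyTorch reference path.
print(use_fused_rope(HAS_TORCHEMBED, is_cuda=False, rotary_dim=64))
```

Keeping the conditions in one predicate means an install without `torchembed` behaves exactly as before, since the predicate is false and the reference path runs.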
## Implementation details
The torchembed kernel processes `(*leading, seq_len, dim)` tensors with
`RotaryEmbedding(use_fused=True)`, applying NeoX-style RoPE via Triton.
The helper reshapes arbitrary leading dims, calls the kernel, and
restores the original shape, so the dispatch is transparent to callers.
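
For readers unfamiliar with the NeoX layout, here is a pure-Python sketch of the rotation on a single vector, plus the shape arithmetic for collapsing leading dims. It is illustrative only: `neox_rope` and `flatten_leading` are hypothetical helpers, the base of 10000 is the conventional default rather than something stated in this change, and collapsing to exactly three dimensions is an assumption about how the helper reshapes.

```python
import math


def neox_rope(x, pos, rotary_dim, base=10000.0):
    """Rotate the first `rotary_dim` entries of one vector, NeoX-style."""
    if rotary_dim % 2 != 0 or rotary_dim > len(x):
        raise ValueError("rotary_dim must be even and at most len(x)")
    half = rotary_dim // 2
    out = list(x)  # entries past rotary_dim pass through unchanged
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]  # NeoX pairs index i with i + half
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


def flatten_leading(shape):
    """Collapse (*leading, seq_len, dim) to (batch, seq_len, dim)."""
    *leading, seq_len, dim = shape
    return (math.prod(leading), seq_len, dim)


print(flatten_leading((2, 8, 17, 64)))
print(neox_rope([1.0, 0.0, 0.0, 0.0], pos=1, rotary_dim=4))
```

NeoX-style RoPE pairs element `i` with element `i + rotary_dim / 2`, in contrast to the interleaved layout that pairs adjacent elements, so a reference implementation and a fused kernel must agree on this layout for the numerical comparison to be meaningful.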
## Testing
```bash
pytest tests/unit/sequence/test_apply_rotary_pos_emb.py -v
```
---------
Signed-off-by: py-ai-dev <py.oss.ml@gmail.com>
Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>