Tune elementwise ops for ROCm (#21754)
Summary:
```
The stride calculation using OffsetCalculator performs poorly with
MAX_DIMS=25. This reduces MAX_DIMS (after coalescing) to 16 on ROCm.
I think it's unlikely that anyone will exceed this limit. If they do,
we can add additional specializations for ROCm with more dimensions.
```
I'm not sure about the underlying cause. With MAX_DIMS=25, the add kernel's parameters
are ~648 bytes vs. ~424 bytes with MAX_DIMS=16. The kernel's instruction footprint is
larger too, but most of those instructions are never executed and most kernel parameters
are never loaded, because the typical dimensionality is much smaller.
Mini benchmark here:
https://gist.github.com/colesbury/1e917ae6a0ca9d24712121b92fed4c8f
(broadcasting operations are much faster with this change)
cc iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21754
Reviewed By: bddppq
Differential Revision: D15811906
Pulled By: colesbury
fbshipit-source-id: 063f92c083d26e2ef2edc98df7ff0400f9432b9d