GptOss experts implementation (#43227)
* experts impl gpt oss
* no need to transpose dequantized experts
* skip test_reverse_loading_mapping
* fix custom gating
* revert transposition and simply support transposed experts to avoid modifying eager
* style
* don't rely on weight shapes as they can be square matrices
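A minimal sketch of why shape checks are unreliable here (sizes are hypothetical, not the model's real dimensions): when an expert weight is square, its stored layout and its transpose have identical shapes, so shape inspection cannot tell (in_features, out_features) apart from (out_features, in_features).

```python
import torch

d = 16                      # hypothetical hidden size
w = torch.randn(d, d)       # a square expert weight

# Shape alone is ambiguous: the matrix and its transpose look identical,
# so code that infers layout from .shape silently breaks on square experts.
shapes_ambiguous = w.shape == w.t().shape
```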
* no need to reload
* fallback to eager
* Update src/transformers/models/gpt_oss/modeling_gpt_oss.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix
* force 16-byte alignment during weight loading
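A small sketch of the failure mode this guards against (the 16-byte requirement is the commit's stated constraint; the tensors below are illustrative): fresh PyTorch allocations are over-aligned, but a sliced view into a loaded buffer can start at an arbitrary byte offset, which misaligned-pointer kernels reject.

```python
import torch

buf = torch.empty(64, dtype=torch.bfloat16)

# Fresh allocations from PyTorch's allocator are at least 16-byte aligned...
fresh_aligned = buf.data_ptr() % 16 == 0

# ...but a view offset by one bf16 element (2 bytes) is not, which is why
# weights must be forced onto aligned offsets at load time.
sliced_aligned = buf[1:].data_ptr() % 16 == 0
```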
* simplify logic
* quantization conversions should be applied first
* avoid baddbmm as it is less performant / less optimizable by max-autotune
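The two formulations this commit chooses between can be sketched as follows (sizes are hypothetical): `baddbmm` fuses the bias add into the batched matmul, while the plain `bmm` + add form is mathematically identical and, per the commit, easier for `torch.compile`'s max-autotune to optimize.

```python
import torch

E, T, D_in, D_out = 4, 8, 16, 32            # hypothetical expert/token/feature sizes
hidden = torch.randn(E, T, D_in)            # per-expert token activations
weight = torch.randn(E, D_in, D_out)        # per-expert weights
bias = torch.randn(E, 1, D_out)             # per-expert bias, broadcast over tokens

# Fused form: bias + hidden @ weight in one op.
out_fused = torch.baddbmm(bias, hidden, weight)

# Unfused form preferred here: same result, friendlier to max-autotune.
out_plain = torch.bmm(hidden, weight) + bias
```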
* no need for logger
* add comment explaining limitation
* standardize operations and only reshape when needed
* fixup conversion and test
* Update src/transformers/conversion_mapping.py
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* force alignment docstring
* move default apply gate
* offsets
* add docs and make kernel_config optional
* use reshapes as they are equivalent to views when memory is contiguous
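The equivalence this commit relies on can be shown in a few lines (toy tensor, not the model's weights): `.view` demands contiguous memory and fails on a transposed tensor, while `.reshape` always succeeds and degrades to a zero-copy view exactly when the memory is contiguous.

```python
import torch

w = torch.randn(4, 6)       # contiguous weight
wt = w.t()                  # transposed view, non-contiguous

# .view requires contiguous memory, so flattening the transpose raises...
view_failed = False
try:
    wt.view(24)
except RuntimeError:
    view_failed = True

# ...while .reshape on a contiguous tensor is a free view (no data copy).
flat = w.reshape(24)
```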
* fix and better notes
* reshapes instead of views
* keep model saving and reloading in grouped_mm test to catch misalignment issues
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>