[`GPTNeoX`] Faster rotary embedding for GPTNeoX (based on llama changes) (#25830)
* Faster rotary embedding for GPTNeoX
* there might be un-necessary moves from device
* fixup
* fix dtype issue
* add copied from statements
* fox copies
* oupsy
* add copied from Llama for scaled ones as well
* fixup
* fix
* fix copies