Change dropout on the PrivateUse1 device to use the fused kernel (#106774)
Similar to the issue in #97894, dropout is dispatched to the fused kernel (native_dropout) only on certain devices such as CUDA. On a custom (PrivateUse1) device, dropout is ultimately decomposed into bernoulli and mul, which makes performance hard to optimize when using AOT. This PR changes that behavior so the PrivateUse1 backend also dispatches to the fused kernel.
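For context, a minimal sketch of the dispatch gate involved (modeled on ATen's aten/src/ATen/native/Dropout.cpp; the exact merged code may differ, and the fallback body below is illustrative rather than verbatim). Eager dropout only forwards to the fused at::native_dropout kernel when this predicate accepts the input's device, so extending the predicate with the PrivateUse1 check lets custom backends take the fused path instead of the decomposed bernoulli + mul fallback:

```cpp
#include <ATen/ATen.h>
#include <tuple>

using at::Tensor;

// Sketch: the fused path is taken only for devices accepted here.
bool is_fused_kernel_acceptable(const Tensor& input, double p) {
  return (input.is_cuda() || input.is_xpu() || input.is_lazy()
          || input.is_privateuseone())  // <-- the PrivateUse1 check this PR adds
      && p > 0 && p < 1 && input.numel() > 0;
}

Tensor dropout_sketch(const Tensor& input, double p, bool train) {
  if (train && is_fused_kernel_acceptable(input, p)) {
    // Fused path: one kernel produces the output (and mask) together.
    return std::get<0>(at::native_dropout(input, p, train));
  }
  // Illustrative fallback: decomposes into bernoulli_ + mul, which is what
  // made performance hard to optimize for custom devices before this change.
  return input.mul(at::empty_like(input).bernoulli_(1 - p)).div_(1 - p);
}
```

With this check in place, a backend that registers an implementation for aten::native_dropout under the PrivateUse1 dispatch key sees a single fused op in the trace rather than the decomposed sequence.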
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106774
Approved by: https://github.com/ezyang