Avoid repeated GemmSoftmaxGemmPermuteTunableOp<HipT> ctor invocation (#16518)
`GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct.
Avoiding repeated ctor invocation substantially improves launch time and
gives better performance during decoding. This yields a <7% e2e time
reduction for whisper large.
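
For illustration only, a minimal sketch of the general pattern: construct the
op once and reuse it across launches. `ExpensiveOp`, `LaunchRebuilding`,
`LaunchCached` and `g_ctor_calls` are hypothetical names, and the
`static thread_local` caching shown here is an assumption, not necessarily how
this change implements it.

```cpp
// Hypothetical stand-in for a tunable op whose constructor is costly.
// The counter only exists to make the number of constructions observable.
static int g_ctor_calls = 0;

template <typename T>
struct ExpensiveOp {
  ExpensiveOp() { ++g_ctor_calls; }  // imagine heavy setup work here
  T operator()(T x) const { return x * 2; }
};

// Before: the op is constructed on every launch.
template <typename T>
T LaunchRebuilding(T x) {
  ExpensiveOp<T> op;
  return op(x);
}

// After: the op is constructed once per thread, on first use, then reused.
template <typename T>
T LaunchCached(T x) {
  static thread_local ExpensiveOp<T> op;
  return op(x);
}
```

With this sketch, N calls to `LaunchRebuilding<int>` run the constructor N
times, while N calls to `LaunchCached<int>` on one thread run it once.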