Use the standard LayerNorm CUDA kernel for SkipLayerNorm. (#15076)
* The existing SkipLayerNorm kernel did not use a numerically stable
algorithm, which caused correctness issues.
* Extend the existing LayerNorm kernel to accept bias and residual inputs.
* Tune the standard LayerNorm threads.y according to the number of elements
and the device properties.
* Remove the existing SkipLayerNorm CUDA implementation.
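
For reviewers, a minimal CPU reference sketch of what the combined computation looks like. It is not the kernel from this change: the function name `SkipLayerNormRow` and its signature are hypothetical, and it assumes the stable algorithm in question is a Welford-style running mean/variance update rather than computing the variance as E[x^2] - E[x]^2.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not the CUDA kernel): normalizes one row of
// input + skip (+ bias), using Welford's single-pass update so the variance
// is never formed as a difference of two large numbers.
// Assumes input, skip and gamma have the same non-zero length;
// bias and beta may be empty.
std::vector<float> SkipLayerNormRow(const std::vector<float>& input,
                                    const std::vector<float>& skip,
                                    const std::vector<float>& bias,
                                    const std::vector<float>& gamma,
                                    const std::vector<float>& beta,
                                    float epsilon) {
  const std::size_t n = input.size();
  std::vector<float> sum(n);
  float mean = 0.0f;
  float m2 = 0.0f;  // running sum of squared deviations from the mean
  for (std::size_t i = 0; i < n; ++i) {
    float v = input[i] + skip[i];
    if (!bias.empty()) v += bias[i];
    sum[i] = v;
    // Welford update: adjust the mean first, then accumulate m2.
    const float delta = v - mean;
    mean += delta / static_cast<float>(i + 1);
    m2 += delta * (v - mean);
  }
  const float inv_std = 1.0f / std::sqrt(m2 / static_cast<float>(n) + epsilon);
  std::vector<float> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    const float shift = beta.empty() ? 0.0f : beta[i];
    out[i] = (sum[i] - mean) * inv_std * gamma[i] + shift;
  }
  return out;
}
```

Because the update works on deviations from the running mean, a row shifted by a large constant (for example 10000 added to every element) normalizes to the same values as the unshifted row.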