[CUDA] Fix SkipLayerNorm vectorized kernel out-of-bounds read (#17943)
Fix a bug in https://github.com/microsoft/onnxruntime/pull/11803:
When the hidden size is not exactly the same as the next size (for
example, ld=320 in Stable Diffusion), the current vectorized kernel might
read out of bounds, which can cause a CUDA failure.
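
To illustrate the failure mode, here is a minimal host-side sketch in C++. It is not the ONNX Runtime kernel: the function name `CountOutOfBoundsReads`, the parameter names, the next size of 384, and the vector width of 4 are all assumptions chosen for illustration. Only ld=320 comes from the description above. The sketch models threads that each read `ilp` contiguous elements of one row and counts the reads that land past the end of the row.

```cpp
#include <cstddef>

// Hypothetical host-side model of a vectorized row load (illustration only).
// The launch is sized for `next_size`, but the row only holds `ld` elements.
// Each simulated thread reads `ilp` contiguous elements starting at t * ilp.
// Returns the number of element reads that fall past the end of the row.
std::size_t CountOutOfBoundsReads(std::size_t ld, std::size_t next_size,
                                  std::size_t ilp, bool guard) {
  std::size_t oob = 0;
  const std::size_t threads = next_size / ilp;  // threads launched per row
  for (std::size_t t = 0; t < threads; ++t) {
    const std::size_t start = t * ilp;
    // Bounds check: a thread whose first element is past the row does nothing.
    if (guard && start >= ld) continue;
    for (std::size_t i = 0; i < ilp; ++i) {
      if (start + i >= ld) ++oob;  // this element is outside the row
    }
  }
  return oob;
}
```

Under these assumptions, `CountOutOfBoundsReads(320, 384, 4, false)` counts 64 reads past the row, and the guarded variant counts none. Note that the guard in this sketch is only sufficient when `ld` is a multiple of `ilp`; otherwise the last active thread still reads past the end.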
This also resolves another issue: for the first and last sizes, the
current macro generates dead code (branches that never run). The macro is
changed to avoid those branches at the boundary sizes.
Performance tests with Stable Diffusion show that performance is on par
before and after this fix.