[CUDA] Fix Alignment of SkipLayerNorm Vectorized Kernel (#15054)
Some of our vectorized kernels (including SkipLayerNorm) don't check
the alignment of their data pointers. ORT's allocator may guarantee the
alignment, but training uses PyTorch's allocator, which cannot
guarantee it, so we need to check the data pointers before we call any
vectorized kernel.
This PR fixes the data pointer alignment issue for SkipLayerNorm's
vectorized kernel. We found the issue when running Hugging Face's swinv2
model. The PR also refactors the SkipLayerNorm kernel code to make
it simpler.
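
For illustration only, here is a minimal host-side sketch of the kind of check described above. It is not the code from this PR: `IsAligned` and `ChooseVectorWidth` are hypothetical names, and the candidate widths (4 and 2 elements) are an assumption. The idea is to pick a vectorized path only when every pointer is aligned to the vector size and the row length is divisible by the vector width, and to fall back to the scalar path otherwise.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the PR): true if `ptr` lies on a
// `bytes`-byte boundary.
inline bool IsAligned(const void* ptr, std::size_t bytes) {
  return reinterpret_cast<std::uintptr_t>(ptr) % bytes == 0;
}

// Hypothetical dispatcher (not from the PR): returns how many elements of
// type T can be loaded/stored per vectorized access (4 or 2), or 1 when
// the scalar kernel must be used.
template <typename T>
int ChooseVectorWidth(int hidden_size, const T* input, const T* skip,
                      const T* output) {
  for (int width = 4; width > 1; width /= 2) {
    const std::size_t bytes = width * sizeof(T);
    // The row length must be divisible by the width so that every row
    // starts on the same boundary as the base pointer.
    if (hidden_size % width == 0 && IsAligned(input, bytes) &&
        IsAligned(skip, bytes) && IsAligned(output, bytes)) {
      return width;
    }
  }
  return 1;  // Unaligned pointer or odd row length: scalar fallback.
}
```

A caller would branch on the returned width to launch the matching kernel instantiation. A tensor that PyTorch hands over as a view at an odd element offset would then take the scalar path instead of faulting on a misaligned vector load.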