Make training CUDA kernels adhere to established code structure patterns (#10735)
Current training optimizer kernels include CPU headers, which restricts
the changes we can make in the CPU code with a C++14 compiler and
complicates other refactoring efforts. Rearrange the kernels according to
the established patterns and do not include headers that are not needed.