Tune layer_norm block size based on row count and GPU resources (#15410)
Fine-tune the CUDA LayerNorm block size by considering the number of
rows to process together with the number of columns and the available
hardware resources (number of SMs, etc.).
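For illustration only, here is a minimal host-side sketch of how such a heuristic could look. It is not the code from this change. The function name pick_block_size, the constants, and the halving rule are all assumptions made for this example. It also assumes one thread block per row, and that num_sms and max_threads_per_sm are filled from cudaDeviceProp::multiProcessorCount and cudaDeviceProp::maxThreadsPerMultiProcessor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical heuristic (not the actual implementation in this change):
// choose a block size for a LayerNorm kernel that assigns one block per row.
int pick_block_size(int64_t num_rows, int64_t num_cols,
                    int num_sms, int max_threads_per_sm) {
  constexpr int kWarpSize = 32;    // threads per warp
  constexpr int kMinBlock = 128;   // assumed lower bound when shrinking
  constexpr int kMaxBlock = 1024;  // CUDA limit on threads per block
  assert(num_rows > 0 && num_cols > 0 && num_sms > 0);
  assert(max_threads_per_sm >= kMaxBlock);

  // Column bound: smallest power of two that covers the row length,
  // clamped to [kWarpSize, kMaxBlock]. More threads than columns would idle.
  int block = kWarpSize;
  while (block < kMaxBlock && block < num_cols) {
    block *= 2;
  }

  // Row/SM bound: if there are more rows than blocks that can be resident
  // on the device at once, halve the block so more rows run concurrently.
  while (block > kMinBlock) {
    const int64_t resident_blocks =
        static_cast<int64_t>(num_sms) * (max_threads_per_sm / block);
    if (num_rows <= resident_blocks) {
      break;  // every row already gets its own resident block
    }
    block /= 2;
  }
  return block;
}
```

Under these assumptions, a device with 80 SMs and 2048 threads per SM keeps 1024 threads per block for 8 rows of 4096 columns, and drops to 128 threads per block for 100000 rows of the same width. The thresholds here are placeholders and would have to be tuned by benchmarking.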
Co-authored-by: Lei Zhang <phill.zhang@gmail.com>