[GPU] Add a fused kernel groupnorm implementation (#33738)
### Details:
- Add a new OCL implementation for fsv16 group normalization
- The new implementation is selected when each group contains fewer
than 16 features (fsv = 16)
- A single fused kernel handles all stages of the reduction, which
avoids repeatedly loading shared values and lets small inputs be
reused from cache between stages
### Tickets:
- CVS-177816
---------
Co-authored-by: Roman Lyamin <Roman.Lyamin@intel.com>