[GPU] Use fp32 inner nodes in NormalizeL2Decomposition to avoid fp16 range overflow in ReduceSum (#31623)
### Description of the issue (symptom, root cause, how it was resolved)
- An fp16 range overflow occurs in the ReduceSum layer of the NormalizeL2 decomposition subgraph. It causes an accuracy failure in a customer model.
- The fix uses the decomposition with fp32 internal nodes instead of the ref kernel. This also gives slightly better performance on the target model (42 fps).
- The oneDNN reduction primitive supports fp32 src/dst, so the fp32 limitation in ReduceImplementationManager was removed.
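
The overflow can be illustrated with a minimal sketch. It assumes the usual decomposed form `x / sqrt(reduce_sum(x^2) + eps)` and emulates fp16 storage only by its range limit (values at or above 65520 round to infinity; rounding of in-range values is ignored). The helper names `store_as_fp16`, `sum_sq_fp16`, `sum_sq_fp32` and `normalize_first` are hypothetical and are not part of the plugin code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: emulates only the *range* of fp16 storage.
// The largest finite fp16 value is 65504; magnitudes >= 65520 round to infinity.
// Precision loss for in-range values is deliberately ignored in this sketch.
float store_as_fp16(float v) {
    if (std::isnan(v)) return v;
    if (std::fabs(v) >= 65520.0f)
        return std::copysign(std::numeric_limits<float>::infinity(), v);
    return v;
}

// Sum of squares where every intermediate result is stored as (emulated) fp16.
float sum_sq_fp16(const std::vector<float>& x) {
    float acc = 0.0f;
    for (float v : x) acc = store_as_fp16(acc + store_as_fp16(v * v));
    return acc;
}

// Sum of squares with fp32 intermediates, as in the fp32 inner-node variant.
float sum_sq_fp32(const std::vector<float>& x) {
    float acc = 0.0f;
    for (float v : x) acc += v * v;
    return acc;
}

// Normalizes the first element only: x[0] / sqrt(sum_sq + eps).
float normalize_first(const std::vector<float>& x, float sum_sq, float eps = 1e-12f) {
    assert(!x.empty());
    return x[0] / std::sqrt(sum_sq + eps);
}
```

With 1024 elements equal to 10.0, the sum of squares is 102400, which is outside the fp16 range. The emulated fp16 sum becomes infinity and the normalized value collapses to 0, while the fp32 sum gives 10 / 320 = 0.03125, a result that itself fits comfortably in fp16. Only the intermediate reduction needs the wider type.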
#### The code and line that caused this issue (if it is not changed directly)
- [src/plugins/intel_gpu/src/kernel_selector/cl_kernels/normalize_gpu_within_spatial_ref.cl](https://github.com/openvinotoolkit/openvino/blob/7e847ed3db46004f78ce4e2008e03cca534d2050/src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp#L853)
#### Reproduction step and snapshot (if applicable. Do not attach for customer model)
- `$ ./benchmark_app -d GPU.1 -m ~/task/blackmagic/RealWeightsIR/MusicRetimer/bt.xml -i ~/task/blackmagic/InputNpys/MusicRetimer_bt_input_0.npy`
#### Problematic graph
- Decomposition subgraph
<img width="952" height="991" alt="image"
src="https://github.com/user-attachments/assets/603e715a-302d-4a7e-92ac-0ed8a4bd8725"
/>
- Decomposition with fp32 nodes
<img width="769" height="780" alt="image"
src="https://github.com/user-attachments/assets/cdcca658-8da4-4ec2-8d90-f50b24504536"
/>
#### Checklist
- [v] Is it a proper fix? (not a workaround)
- [v] Did you include a test case for this fix, if necessary?
- [v] Did you review existing tests that can be extended to cover this scenario? Which test did you review?
### Tickets:
- 163878