[GPU] Optimize gen9 common f32 conv kernel for batch 32 large 1d input (#32364)
### Description of the issue (symptom, root cause, how it was resolved)
- A customer model containing a large 1D convolution with batch=32 is quite
slow in f32 inference mode on DG2.
- Optimized gen9_common_conv_kernel_f32 for this case.
#### The code and line that caused this issue (if it is not changed
directly)
-
src/plugins/intel_gpu/src/kernel_selector/cl_kernels/gen9_common_conv_fwd_data_f32.cl
#### Reproduction step and snapshot (if applicable; do not attach for
customer models)
- `$ benchmark_app -d GPU -m emb.xml -infer_precision f32`
#### Problematic graph
<img width="210" height="176" alt="image"
src="https://github.com/user-attachments/assets/c4c3904d-f6f7-4c71-96bf-faffa1c0af4f"
/>
#### Checklist
- [x] Is it a proper fix? (not a workaround)
- [x] Did you include a test case for this fix, if necessary?
- [x] Did you review existing tests that could be extended to cover this
scenario? Which tests did you review?
### Tickets:
- 173214