[CPU] Optimize GQA attention bias application for FP16 (#25871)
### Description
When the GQA op receives an attention bias input in FP16 on a platform
that does not natively support FP16 math, the bias has to be cast to
FP32, which requires a temporary buffer to hold the FP32 values. This
temporary buffer was being allocated and deallocated inside a loop, once
for every token processed.
The implementation has been refactored so that the allocation takes
place only once.
Phi model throughput increased by 15%.
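
For illustration, the change described above can be sketched roughly as follows. This is not the actual ONNX Runtime code: `HalfToFloat`, `AddBiasPerTokenAlloc`, and `AddBiasHoistedAlloc` are hypothetical names, FP16 values are stored as raw `uint16_t` bit patterns, and a `std::vector<float>` stands in for whatever allocator the real kernel uses. Both functions return the number of scratch allocations they performed so the difference is visible.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: convert one IEEE 754 binary16 bit pattern to float.
inline float HalfToFloat(uint16_t h) {
  const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  const uint32_t exp = (h >> 10) & 0x1Fu;
  const uint32_t mant = h & 0x3FFu;
  if (exp == 0) {
    // Zero or subnormal: value = mant * 2^-24.
    const float f = std::ldexp(static_cast<float>(mant), -24);
    return (h & 0x8000u) ? -f : f;
  }
  // Inf/NaN keep an all-ones exponent; normals are rebiased (127 - 15 = 112).
  const uint32_t bits = (exp == 31)
                            ? (sign | 0x7F800000u | (mant << 13))
                            : (sign | ((exp + 112u) << 23) | (mant << 13));
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Before (sketch): the FP32 scratch buffer is created and destroyed per token.
size_t AddBiasPerTokenAlloc(std::vector<float>& scores,
                            const std::vector<uint16_t>& bias_fp16,
                            size_t num_tokens, size_t row_len) {
  assert(scores.size() == num_tokens * row_len);
  assert(bias_fp16.size() == num_tokens * row_len);
  size_t allocations = 0;
  for (size_t t = 0; t < num_tokens; ++t) {
    std::vector<float> scratch(row_len);  // allocated on every iteration
    ++allocations;
    for (size_t i = 0; i < row_len; ++i) {
      scratch[i] = HalfToFloat(bias_fp16[t * row_len + i]);
    }
    for (size_t i = 0; i < row_len; ++i) {
      scores[t * row_len + i] += scratch[i];
    }
  }
  return allocations;
}

// After (sketch): the scratch buffer is allocated once and reused per token.
size_t AddBiasHoistedAlloc(std::vector<float>& scores,
                           const std::vector<uint16_t>& bias_fp16,
                           size_t num_tokens, size_t row_len) {
  assert(scores.size() == num_tokens * row_len);
  assert(bias_fp16.size() == num_tokens * row_len);
  std::vector<float> scratch(row_len);  // single allocation, hoisted out of the loop
  for (size_t t = 0; t < num_tokens; ++t) {
    for (size_t i = 0; i < row_len; ++i) {
      scratch[i] = HalfToFloat(bias_fp16[t * row_len + i]);
    }
    for (size_t i = 0; i < row_len; ++i) {
      scores[t * row_len + i] += scratch[i];
    }
  }
  return 1;
}
```

Both variants produce identical results; the second only removes the repeated allocation and deallocation from the per-token path.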