Ryanunderhill/beamscorer gpu (#16272)
### Description
Make BeamScorer run on the GPU instead of the CPU.
Brief overview:
- Adds a CUDA 'CudaBeamSearchScorer' implementation of IBeamScorer.
- Instead of a 'done' flag per beam, there is one single 'not done' variable that is copied to the CPU every iteration.
- Removes some of the extra CPU-side buffers and parameters that are no longer needed.
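
To illustrate the single 'not done' variable, here is a minimal host-side C++ sketch, not the code in this PR. `ScorerState`, `MakeState`, `MarkDone`, and `IsDone` are hypothetical names, and the sketch assumes the device still keeps per-entry bookkeeping internally so the counter is decremented only once per entry. The point is that the host reads one integer per iteration instead of a buffer of flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the scorer state (illustrative names only).
struct ScorerState {
  std::vector<bool> entry_done;  // assumed device-internal bookkeeping, never copied to the host
  int not_done_count;            // the single value copied to the CPU each iteration
};

ScorerState MakeState(std::size_t num_entries) {
  return ScorerState{std::vector<bool>(num_entries, false),
                     static_cast<int>(num_entries)};
}

// Marks one entry as finished. The counter drops only on the first
// transition, so repeated calls for the same entry are harmless.
void MarkDone(ScorerState& state, std::size_t index) {
  assert(index < state.entry_done.size());
  if (!state.entry_done[index]) {
    state.entry_done[index] = true;
    --state.not_done_count;
  }
}

// The host-side loop only needs this one value to decide whether to stop.
bool IsDone(const ScorerState& state) { return state.not_done_count == 0; }
```

In a real CUDA implementation the decrement would happen inside a kernel and the counter would be copied back once per iteration; that transfer is omitted here.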
Remaining future optimizations:
- CPU-copied beam indices are still used in the non-DecoderMaskedSelfAttention case. An extra kernel could be written so that PickGptPastState (called from UpdateGptFeeds) no longer needs CPU-copied beam indices.
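
As a rough idea of what such a kernel would compute, the following C++ sketch performs the gather on the host. It is an assumption about the shape of the operation, not the existing PickGptPastState logic: `GatherPastByBeam` and its flat row layout are hypothetical. Output row `i` is a copy of past row `beam_indices[i]`; in a CUDA kernel each thread would copy one element while reading the indices directly from device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical gather over a flat [num_rows x row_size] past-state buffer.
std::vector<float> GatherPastByBeam(const std::vector<float>& past,
                                    const std::vector<int>& beam_indices,
                                    std::size_t row_size) {
  assert(row_size > 0 && past.size() % row_size == 0);
  const std::size_t num_rows = past.size() / row_size;
  (void)num_rows;  // only used by the bounds check below
  std::vector<float> out(beam_indices.size() * row_size);
  for (std::size_t i = 0; i < beam_indices.size(); ++i) {
    assert(beam_indices[i] >= 0 &&
           static_cast<std::size_t>(beam_indices[i]) < num_rows);
    const std::size_t src = static_cast<std::size_t>(beam_indices[i]) * row_size;
    // Copy the selected source row into output row i.
    for (std::size_t j = 0; j < row_size; ++j) {
      out[i * row_size + j] = past[src + j];
    }
  }
  return out;
}
```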
### Motivation and Context
It's faster to keep the work on the GPU to avoid GPU->CPU->GPU copies of
data.