Speed up Llama2 CPU throughput in bench by 1.69x with IO binding (#19853)
### Description
Always set `use_io_binding=True` when using optimum.onnxruntime unless
there is a special case.
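A minimal sketch of what this amounts to when loading a model through optimum.onnxruntime; the model directory, prompt, and generation arguments are placeholders, not the actual benchmark code:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Hypothetical directory containing an exported ONNX Llama2 model.
model_dir = "llama2-7b-onnx"

tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Request IO binding explicitly instead of relying on the provider-based default.
model = ORTModelForCausalLM.from_pretrained(
    model_dir,
    provider="CPUExecutionProvider",
    use_io_binding=True,
)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```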
### Motivation and Context
By default, `ORTModel` under optimum.onnxruntime chooses the appropriate `use_io_binding` value based on the execution provider and use case:
> use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to `True` if the execution provider is CUDAExecutionProvider. For [`~onnxruntime.ORTModelForCausalLM`], defaults to `True` on CPUExecutionProvider, in all other cases defaults to `False`.
For the Llama token benchmark, using IO binding yields almost a 2x speedup, even on CPU. This is because this particular model produces a large number of outputs (>60); without IO binding, each output is copied from an ORTValue to a NumPy array, which adds significant overhead to the overall run time.
```
Evaluating Llama2 `model(inputs)` step with past_key_values
Before, w/o iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps
After, w/ iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```
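For reference, a rough sketch of the difference at the plain onnxruntime level, assuming a hypothetical single-input decoder model (`decoder_model.onnx` and the tensor names are illustrative only):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical decoder; a real Llama2 session has many more inputs and outputs.
sess = ort.InferenceSession("decoder_model.onnx", providers=["CPUExecutionProvider"])

input_ids = np.array([[1, 2, 3]], dtype=np.int64)

# Without IO binding: session.run() materializes every output as a NumPy array,
# i.e. one ORTValue -> NumPy copy per output (costly when there are >60 outputs).
outputs = sess.run(None, {"input_ids": input_ids})

# With IO binding: outputs stay as OrtValues owned by the binding, and no
# per-output NumPy conversion happens unless explicitly requested.
binding = sess.io_binding()
binding.bind_cpu_input("input_ids", input_ids)
for out in sess.get_outputs():
    binding.bind_output(out.name)
sess.run_with_iobinding(binding)
ort_values = binding.get_outputs()  # list of OrtValue, no NumPy copies yet
```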