Enable parallel computation in Clip ops (#14925)
### Description
This PR speeds up Clip operations by replacing their sequential
implementation with a parallelized one. The input data is divided into
chunks of N elements, and a thread pool processes the chunks in
parallel. The chunk size N is set to 16K, chosen from a performance
evaluation on input tensors of 10^i elements for i in [1 .. 6].
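
The chunking scheme described above can be sketched roughly as follows. This is an illustrative, standalone example and not the code in this PR: `ParallelClip` is a hypothetical helper, plain `std::thread` workers stand in for the runtime's thread pool, and 16K is assumed to mean 16 * 1024 elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper for illustration only (not the ONNX Runtime code).
// Clamps every element of `data` into [lo, hi], splitting the work into
// chunks of `chunk_size` elements that are distributed over worker threads.
inline void ParallelClip(std::vector<float>& data, float lo, float hi,
                         std::size_t chunk_size = 16 * 1024,  // assumed meaning of "16K"
                         unsigned num_threads = std::thread::hardware_concurrency()) {
  assert(chunk_size > 0 && lo <= hi);
  if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may report 0

  const std::size_t n = data.size();
  const std::size_t num_chunks = (n + chunk_size - 1) / chunk_size;

  // Clip one chunk; chunks cover disjoint index ranges, so no locking is needed.
  auto clip_chunk = [&](std::size_t c) {
    const std::size_t begin = c * chunk_size;
    const std::size_t end = std::min(begin + chunk_size, n);
    for (std::size_t i = begin; i < end; ++i) {
      data[i] = std::min(std::max(data[i], lo), hi);
    }
  };

  // Inputs that fit in a single chunk are processed on the calling thread.
  if (num_chunks <= 1) {
    if (num_chunks == 1) clip_chunk(0);
    return;
  }

  // Each worker takes every `workers`-th chunk, starting at its own index.
  const std::size_t workers = std::min<std::size_t>(num_threads, num_chunks);
  std::vector<std::thread> pool;
  pool.reserve(workers);
  for (std::size_t w = 0; w < workers; ++w) {
    pool.emplace_back([&, w] {
      for (std::size_t c = w; c < num_chunks; c += workers) clip_chunk(c);
    });
  }
  for (auto& t : pool) t.join();
}
```

In this sketch, an input no longer than one chunk never spawns a thread, which is consistent with short inputs seeing no substantial performance change.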
### Motivation and Context
The Clip operation is frequently executed in image processing models.
Because it is element-wise, its implementation is easy to parallelize
and runs faster on a multi-core machine. On long inputs (>= 100K
elements), this PR achieves a speedup of over 2x. On shorter inputs, it
does not introduce any substantial performance change.