Improve parallelization of TfIdfVectorizer, Reduce memory consumption (#18539)
### Description
TfIdfVectorizer runs in two steps: it first searches for n-grams in the input,
then weights the results. The second step was not parallelized; this PR
addresses that. Previously, two vectors of the size of the output were
allocated to compute the results. The first one, frequencies, was used as an
intermediate vector between the two steps. This vector is now broken into
multiple small vectors, one per thread, which reduces memory consumption for
batches whose number of rows exceeds the number of threads.
### Motivation and Context
Performance and memory consumption.
For one model, the improvement is +15% faster (4 cores, model size
~6 MB, batch size 100). Below is another benchmark on a machine with 32
cores, with different vocabulary sizes and batch sizes. The tested
TfIdfVectorizer only deals with unigrams and processes sequences of 10
tokens (integers).
