Speed up and reduce binary size of TfIdfVectorizer (#3197)
Speed up TfIdf.
Build a trie-like structure to quickly exclude dead ends.
Use ParallelFor() to process the rows in parallel.
Make the implementation non-template and batch the row processing.
Check for a short tail within the inner loop.
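
For illustration only, a minimal sketch of how a trie-like lookup can exclude dead ends and check for a short tail in the inner loop. This is not the code from this change: NgramNode, AddNgram and CountRow are hypothetical names, and token ids are assumed to be int64_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical trie node: children are keyed by the next token id.
struct NgramNode {
  std::unordered_map<int64_t, std::unique_ptr<NgramNode>> children;
  int id = -1;  // output slot if a stored n-gram ends here, -1 otherwise
};

// Inserts one n-gram into the trie and tags its last node with `id`.
void AddNgram(NgramNode& root, const std::vector<int64_t>& ngram, int id) {
  NgramNode* node = &root;
  for (int64_t tok : ngram) {
    auto& child = node->children[tok];
    if (!child) child = std::make_unique<NgramNode>();
    node = child.get();
  }
  node->id = id;
}

// Counts occurrences of every stored n-gram (up to max_n tokens) in one row.
std::vector<int> CountRow(const NgramNode& root,
                          const std::vector<int64_t>& row,
                          std::size_t max_n, std::size_t num_ngrams) {
  std::vector<int> counts(num_ngrams, 0);
  for (std::size_t start = 0; start < row.size(); ++start) {
    const NgramNode* node = &root;
    for (std::size_t len = 0; len < max_n; ++len) {
      // Short tail: fewer tokens remain than the longest n-gram needs.
      if (start + len >= row.size()) break;
      auto it = node->children.find(row[start + len]);
      // Dead end: no stored n-gram continues with this token.
      if (it == node->children.end()) break;
      node = it->second.get();
      if (node->id >= 0) ++counts[static_cast<std::size_t>(node->id)];
    }
  }
  return counts;
}
```

Each row is counted independently in this sketch, so rows could be handed out to worker threads without sharing mutable state.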