TreeEnsemble speed up (#17449)
### Description
This PR proposes a change that should speed up inference for the
TreeEnsemble* kernels. Previously, when traversing a decision tree, the
`TreeNodeElement` pointer would be incremented or decremented to the
appropriate child node - I assume this was because the
`truenode_inc_or_first_weight` and `falsenode_inc_or_n_weights` member
were overloaded for two purposes.
In this PR, we now assign the true branch pointer. We also initialise
`nodes_` in a pre-order traversal which means that the false branch's
position can be resolved statically and does not need to be stored.
I observe the following speed ups. The benchmarks used are derived from
those in https://github.com/siboehm/lleaves/tree/master/benchmarks and
the baseline is the main branch.
NYC Dataset
--------------
| Number of threads | Baseline | Pointer assignment | Pre-ordered
initialisation | Pointer assignment % improvement | Pre-ordered
initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 176.539 | 155.709 | 145.119 | 11.7989 | 17.7976 |
| 4 | 59.9015 | 51.9652 | 50.0884 | 13.2488 | 16.382 |
| 8 | 34.5561 | 31.3024 | 28.2535 | 9.41581 | 18.2387 |
Airline Dataset
---------------
| Number of threads | Baseline | Pointer assignment | Pre-ordered
initialisation | Pointer assignment % improvement | Pre-ordered
initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 2127.34 | 1389.7 | 920.373 | 34.6745 | 56.736 |
| 4 | 723.307 | 481.634 | 310.618 | 33.4122 | 57.0558 |
| 8 | 420.722 | 278.397 | 185.265 | 33.8286 | 55.9651 |
mtpl2 Dataset
--------------
| Number of threads | Baseline | Pointer assignment | Pre-ordered
initialisation | Pointer assignment % improvement | Pre-ordered
initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 1143.62 | 1020.04 | 998.171 | 10.8055 | 13.0988 |
| 4 | 386.153 | 339.905 | 328.061 | 11.9764 | 14.3729 |
| 8 | 225.995 | 200.665 | 199.057 | 11.2084 | 13.4408 |
These were run using an M2 Pro with 16GB of RAM. All times are in
milliseconds and averages over 10 runs with a batch size of 100,000.
### Motivation and Context
Performance improvements.