TreeEnsemble speed up (#17449)

Commit

2 years ago

TreeEnsemble speed up (#17449) ### Description This PR proposes a change that should speed up inference for the TreeEnsemble* kernels. Previously, when traversing a decision tree, the `TreeNodeElement` pointer would be incremented or decremented to the appropriate child node - I assume this was because the `truenode_inc_or_first_weight` and `falsenode_inc_or_n_weights` member were overloaded for two purposes. In this PR, we now assign the true branch pointer. We also initialise `nodes_` in a pre-order traversal which means that the false branch's position can be resolved statically and does not need to be stored. I observe the following speed ups. The benchmarks used are derived from those in https://github.com/siboehm/lleaves/tree/master/benchmarks and the baseline is the main branch. NYC Dataset -------------- | Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement | |--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:| | 1 | 176.539 | 155.709 | 145.119 | 11.7989 | 17.7976 | | 4 | 59.9015 | 51.9652 | 50.0884 | 13.2488 | 16.382 | | 8 | 34.5561 | 31.3024 | 28.2535 | 9.41581 | 18.2387 | Airline Dataset --------------- | Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement | |--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:| | 1 | 2127.34 | 1389.7 | 920.373 | 34.6745 | 56.736 | | 4 | 723.307 | 481.634 | 310.618 | 33.4122 | 57.0558 | | 8 | 420.722 | 278.397 | 185.265 | 33.8286 | 55.9651 | mtpl2 Dataset -------------- | Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement | |--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:| | 1 | 1143.62 | 1020.04 | 998.171 | 10.8055 | 13.0988 | | 4 | 386.153 | 339.905 | 328.061 | 11.9764 | 14.3729 | | 8 | 225.995 | 200.665 | 199.057 | 11.2084 | 13.4408 | These were run using an M2 Pro with 16GB of RAM. All times are in milliseconds and averages over 10 runs with a batch size of 100,000. ### Motivation and Context Performance improvements.

References

#17449 - TreeEnsemble speed up

Author

adityagoel4512

Parents

65249f42

onnxruntime db558ef9 - TreeEnsemble speed up (#17449)

onnxruntime
db558ef9 - TreeEnsemble speed up (#17449)