Speedup `CumSum` for large arrays (#22048)
### Description
This PR refactors the `CPU` kernel for the `CumSum` operator. The new
implementation strives to have as little indirection as possible.
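
For intuition, here is a minimal Python sketch of the kind of flat, indirection-free traversal the refactor aims for on a contiguous 1D input (the function name and structure are illustrative only, not the actual C++ kernel; `exclusive` and `reverse` follow the ONNX `CumSum` attribute semantics):

```python
import numpy as np

def cumsum_1d(x, exclusive=False, reverse=False):
    # Single running-sum pass over a contiguous buffer, with no
    # per-element iterator indirection. Illustrative sketch only.
    out = np.empty_like(x)
    n = len(x)
    indices = range(n - 1, -1, -1) if reverse else range(n)
    acc = 0
    for i in indices:
        if exclusive:
            out[i] = acc   # exclusive: element i sees the sum of predecessors
            acc += x[i]
        else:
            acc += x[i]    # inclusive: element i is included in its own sum
            out[i] = acc
    return out
```

The N-D case reduces to this loop applied along the cumsum axis with an appropriate stride.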
### Motivation and Context
Currently the `CumSum` operator performs very poorly on 1D
tensors (it was slower than a plain Python loop). This is caused by the
extensive use of `SliceIterator`s.
Here is a relevant snippet:
```python
import time
import ndonnx as ndx
import onnxruntime as ort
import numpy as np
import onnx
def test_cumsum(sz):
    a = ndx.array(shape=(sz,), dtype=ndx.int64)
    b = ndx.cumsum(a)
    model = ndx.build({'a': a}, {'b': b})
    onnx.save(model, "model.onnx")
    input = np.ones(sz, np.int64)
    start = time.time()
    result = ort.InferenceSession(model.SerializeToString()).run(None, {'a': input})
    end = time.time()
    return end - start

def test_cumsum_by_hand(sz):
    input = np.ones(sz, np.int64)
    start = time.time()
    answer = [0]
    for i in input:
        answer.append(answer[-1] + i)
    end = time.time()
    return end - start

print(test_cumsum(int(1e7)))
print(test_cumsum_by_hand(int(1e7)))
```
Before
```console
0.9794480800628662
0.4518160820007324
```
After
```console
0.02483987808227539
0.5496008396148682
```
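From the timings above, the refactor yields roughly a 39x speedup for `CumSum` on this 1e7-element input:

```python
# Speedup implied by the before/after timings quoted above.
before = 0.9794480800628662
after = 0.02483987808227539
speedup = before / after
print(f"{speedup:.1f}x")
```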
The `model.onnx`:
<img width="214" alt="image"
src="https://github.com/user-attachments/assets/a213d6ff-86c3-49b5-a493-ebfd97deaa41">
The flame graph:
