Add helper functions to dump 4d tensors in CPU for debugging (#21043)
Add helper functions to dump 4D tensors for debugging.
Example usage:
(1) Change DUMP_TENSOR_LEVEL from 0 to 2 in
contrib_ops/cpu/utils/debug_macros.h to enable dumping. Unless it is
enabled, the dumping code is not built into the ORT binary.
(2) Add a few lines to dump tensors, for example:
```
DUMP_CPU_TENSOR_INIT();
DUMP_CPU_TENSOR("tensor name", tensor_data, dim0, dim1, dim2, dim3);
```
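For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of how a compile-time dump level can gate a 4D tensor printer. It is not the ORT implementation: `SKETCH_DUMP_LEVEL`, `SKETCH_DUMP_TENSOR`, `Print4D`, and the output format are all hypothetical names and choices made for illustration, and the tensor is assumed to be contiguous in row-major order.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <ostream>
#include <sstream>

// Hypothetical stand-in for DUMP_TENSOR_LEVEL (assumption, not the ORT macro).
#ifndef SKETCH_DUMP_LEVEL
#define SKETCH_DUMP_LEVEL 2
#endif

#if SKETCH_DUMP_LEVEL > 0
// Print a contiguous row-major tensor of shape [dim0, dim1, dim2, dim3].
// The layout of the printed text is an illustrative choice.
template <typename T>
void Print4D(std::ostream& os, const char* name, const T* data,
             int dim0, int dim1, int dim2, int dim3) {
  assert(data != nullptr && dim0 > 0 && dim1 > 0 && dim2 > 0 && dim3 > 0);
  os << name << " [" << dim0 << "," << dim1 << "," << dim2 << "," << dim3 << "]\n";
  for (int i = 0; i < dim0; ++i) {
    for (int j = 0; j < dim1; ++j) {
      os << "[" << i << "][" << j << "]\n";
      for (int k = 0; k < dim2; ++k) {
        // Offset of the first element of row k inside the (i, j) slice.
        const std::size_t offset =
            ((static_cast<std::size_t>(i) * dim1 + j) * dim2 + k) * dim3;
        for (int l = 0; l < dim3; ++l) {
          os << (l ? ", " : "") << data[offset + l];
        }
        os << "\n";
      }
    }
  }
}
#define SKETCH_DUMP_TENSOR(os, name, data, d0, d1, d2, d3) \
  Print4D((os), (name), (data), (d0), (d1), (d2), (d3))
#else
// With the level at 0 the macro expands to nothing, so no dumping code
// is compiled into the binary.
#define SKETCH_DUMP_TENSOR(os, name, data, d0, d1, d2, d3) ((void)0)
#endif
```

In this sketch, a call such as `SKETCH_DUMP_TENSOR(std::cout, "scores", ptr, 1, 2, 3, 4);` prints the tensor when the level is nonzero and compiles to a no-op otherwise; passing a `std::ostringstream` instead of `std::cout` captures the text for inspection.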
Changes:
- [x] Add functions to dump 4D int32/int64/float/half tensors in CPU
- [x] Add functions to dump 4D int32/int64 tensors in CUDA
- [x] Change namespace (remove .transformers from the namespace) and
move files to the utils directory