perf: Use custom allocator (#2768)
This PR replaces the system allocator with a custom allocator to improve performance:
* Windows: mimalloc
* Unix: tikv-jemallocator
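
For readers unfamiliar with the mechanism, here is a minimal sketch of how a Rust binary swaps its allocator through `#[global_allocator]`. It is not the code from this PR: `CountingAlloc` and `allocation_count` are hypothetical stand-ins built only on the standard library so the example runs without extra dependencies. In the real change the static would presumably be `mimalloc::MiMalloc` on Windows and `tikv_jemallocator::Jemalloc` elsewhere, selected with `#[cfg(...)]` attributes.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in allocator: forwards every request to the system
// allocator and counts how many allocations were made.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// This attribute is what replaces the default allocator for the whole
// program. Assumption: the PR places a mimalloc or jemalloc static here.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Hypothetical helper: number of allocations seen so far.
fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}

fn main() {
    let before = allocation_count();
    let boxed = std::hint::black_box(Box::new([0u8; 1024]));
    let after = allocation_count();
    println!("allocations while boxing: {}", after - before);
    drop(boxed);
}
```

Only one `#[global_allocator]` static may exist per binary, which is why a per-platform choice has to be made with conditional compilation rather than at runtime.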
## Performance
* Linux:
  * `cpython --no-cache`: 208.8ms -> 190.5ms
  * `cpython`: 32.8ms -> 31ms
* Mac:
  * `cpython --no-cache`: 436.3ms -> 380ms
  * `cpython`: 40.9ms -> 39.6ms
* Windows:
  * `cpython --no-cache`: 367ms -> 268ms
  * `cpython`: 92.5ms -> 92.3ms
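
To compare the platforms on a common scale, the timings above can be turned into relative reductions. The sketch below does only that arithmetic. `percent_faster` is a hypothetical helper, and the labels are shorthand for the benchmark rows listed above.

```rust
// Relative reduction in wall time, in percent of the original time.
fn percent_faster(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}

fn main() {
    // (label, before in ms, after in ms), copied from the list above.
    let runs = [
        ("Linux, --no-cache", 208.8, 190.5),
        ("Linux, cached", 32.8, 31.0),
        ("Mac, --no-cache", 436.3, 380.0),
        ("Mac, cached", 40.9, 39.6),
        ("Windows, --no-cache", 367.0, 268.0),
        ("Windows, cached", 92.5, 92.3),
    ];

    for (label, before, after) in runs {
        println!("{}: {:.1}% faster", label, percent_faster(before, after));
    }
}
```

The uncached runs benefit most, with Windows showing the largest relative gain. The cached runs change by only a few percent, and the Windows cached run is effectively unchanged.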
## Size
* Linux: +5MB from 13MB -> 18MB (I need to double check this)
* Mac: +0.7MB from 8.3MB -> 9MB
* Windows: -0.16MB from 8.29MB -> 8.13MB (that's unexpected)