Optimize quantized LSTM (#8634)
* Optimize some LSTM gate computations. Remove unnecessary string constructions.
* Change GCC optimization flags for compute-bound logic in rnn_helpers.
* Improve QGemm for M=1.
* Some improvements for AVX512.
* Add a condition to limit GCC-related macros.
* Correct QGemm assembly for M=1 AVX2 optimization to pass mlas_test.
* Fix rnn_helper build issue for wasm.
* Improve the asm code according to feedback.
* Remove the custom vectorize and unroll options for GCC.
Use restrict on some functions to help GCC vectorize them correctly.
Rewrite clip_add_bias() so that GCC vectorizes it correctly.
* Improve restrict semantics for merge_lstm_gates_to_memory() by adding in_place().
Add MSC __restrict to the clip_add_bias() method so that it vectorizes correctly.
* Force a CI restart because CI is stuck on the onnxruntime-python-checks-ci-pipeline, which cannot be restarted.
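
For reviewers unfamiliar with the restrict changes listed above, here is a minimal sketch of the idea. It is not the code from this change: the name clip_add_bias_sketch, its signature, and the SKETCH_RESTRICT macro are hypothetical, and the clamp-after-add behavior is an assumption about what such a helper does. Promising the compiler that the two pointers do not alias lets it vectorize the loop without emitting runtime overlap checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical portability macro (not from the change itself): MSVC spells
// the qualifier __restrict; GCC and Clang accept __restrict__.
#if defined(_MSC_VER)
#define SKETCH_RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_RESTRICT __restrict__
#else
#define SKETCH_RESTRICT
#endif

// Hypothetical helper: out[i] = clamp(out[i] + bias[i], -clip, clip).
// The restrict qualifiers assert that bias and out never overlap, so the
// loop body is a straight-line add/max/min that a vectorizer can handle.
inline void clip_add_bias_sketch(float clip,
                                 const float* SKETCH_RESTRICT bias,
                                 float* SKETCH_RESTRICT out,
                                 std::size_t count) {
  assert(clip >= 0.0f);  // a negative clip would make the range empty
  for (std::size_t i = 0; i < count; ++i) {
    const float v = out[i] + bias[i];
    out[i] = std::min(std::max(v, -clip), clip);
  }
}
```

Callers must uphold the no-overlap promise. Passing the same buffer as both bias and out would be undefined behavior, which is presumably why an in-place variant needs separate handling.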