Longformer attention CUDA kernel memory improvements (#6646)
* Integrate memory improvements from NVIDIA
* Compute max_global_num before buffer allocation
* Update conversion script to support transformers 4.0
* Update benchmark script to create dummy inputs for different batch_size values
* Use a CUDA event wrapper to avoid a memory leak