Decoder Attention CUDA Op (#9792)
* add kernel interface
* register kernel
* add self/cross qkv projection without cache
* add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H)
* refactor ConcatPastToPresent
* DecoderQkvToContext interface
* q,k,v buffer and cache as output
* qk, pv and transctx
* fix compiler error on linux machine
* key_padding_mask
* add test_parity file (not yet runnable)
* add partial unittest
* change some attributes to inputs
* update docs with --gen_doc
* change kernel interface, add more tests
* more parity tests
* fix test
* fix typo
* temporarily remove transpose optimizer (it has a bug)
* add input shape checks
* add type/shape inference
* fix cache shape check
* fix rocm build failure
* fix rocm build error
* review comments
* review comments
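
Note on the LaunchTransQkv2 layout change: the (S,B,X,N,H) -> (X,B,N,S,H) permutation listed above can be sketched as a plain CPU reference. This is only an illustration of the index mapping, not the CUDA kernel from this change. The function name TransQkvReference is hypothetical, and the reading of the dimensions (S = sequence length, B = batch size, X = number of stacked matrices such as q/k/v, N = number of heads, H = head size) is an assumption, as is the row-major contiguous layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the layout change; NOT the actual CUDA kernel.
// Assumes row-major contiguous storage for both input and output.
// Input shape:  (S, B, X, N, H)
// Output shape: (X, B, N, S, H)
std::vector<float> TransQkvReference(const std::vector<float>& in,
                                     std::size_t S, std::size_t B,
                                     std::size_t X, std::size_t N,
                                     std::size_t H) {
  assert(in.size() == S * B * X * N * H);
  std::vector<float> out(in.size());
  for (std::size_t s = 0; s < S; ++s) {
    for (std::size_t b = 0; b < B; ++b) {
      for (std::size_t x = 0; x < X; ++x) {
        for (std::size_t n = 0; n < N; ++n) {
          for (std::size_t h = 0; h < H; ++h) {
            // Flat offset of element (s, b, x, n, h) in the input.
            const std::size_t src = (((s * B + b) * X + x) * N + n) * H + h;
            // Flat offset of the same element at (x, b, n, s, h) in the output.
            const std::size_t dst = (((x * B + b) * N + n) * S + s) * H + h;
            out[dst] = in[src];
          }
        }
      }
    }
  }
  return out;
}
```

A reference like this could be used in a unit test to compare against the device output for small shapes; the innermost H dimension stays contiguous in both layouts, so only whole head vectors move.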