Support DistilBert's Attention fusion in Optimizer (#4748)
* initial check-in
* attention fusion
* attention fusion works under LayerNormalization; still needs refinement
* EmbedLayerNormalization fusion (has problems with graph.Resolve())
* miscellaneous fixes
* update: attention fusion works, but the resulting ONNX model fails protobuf parsing
* tested via the optimizer
* add EmbedLayerNormalization fusion test
* add attention fusion test
* clean up code; needs refactoring later
* clean up code
* added Reshape fusion for DistilBert, modified attention fusion, added tests
* refactor
* small fix
* remove unnecessary lines
* fix Reshape fusion and modify attention fusion
* resolve merge conflicts
* restore
* refactor and address part of the review comments
* refactor attention
* small fix
* fix infinity comparison
* match new pattern for attention fusion
* formatting
* attention fusion no longer depends on TransposeScaleMatMul
* fix
* address review comments
* revert changes
* address review comments
* small fix
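
A minimal usage sketch (not part of this PR's diff) of how the resulting fusions can be exercised from the Python transformer optimizer and verified. The model path and the `num_heads`/`hidden_size` values are illustrative (they match `distilbert-base`), and passing `model_type="bert"` assumes DistilBert reuses the BERT fusion patterns:

```python
# Hypothetical check that the Attention/EmbedLayerNormalization fusions fire
# on a DistilBert model previously exported to ONNX at "distilbert.onnx".
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "distilbert.onnx",   # illustrative path to the exported model
    model_type="bert",   # assumption: DistilBert uses the BERT fusion patterns
    num_heads=12,
    hidden_size=768,
)

# Count fused contrib ops in the optimized graph to confirm the fusions ran.
op_counts = {}
for node in opt_model.model.graph.node:
    op_counts[node.op_type] = op_counts.get(node.op_type, 0) + 1

print("Attention nodes:", op_counts.get("Attention", 0))
print("EmbedLayerNormalization nodes:", op_counts.get("EmbedLayerNormalization", 0))

opt_model.save_model_to_file("distilbert_opt.onnx")
```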