Update SAM/SAM HQ attention implementation + fix CUDA sync issues (#39386)
* update attention implementation and improve inference speed (see the attention sketch below)
* modular sam_hq + fix integration tests on A10
* fixup
* fix after review
* apply softmax in the correct place (reflected in the sketch below)
* return attn_weights in sam/sam_hq
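
As a rough illustration of the attention, softmax, and attn_weights commits above, here is a minimal sketch of an eager attention forward in the general transformers style. The function name, shapes, and defaults are assumptions for illustration, not the exact SAM/SAM HQ code.

```python
# Hedged sketch of an eager attention forward; names and shapes are
# illustrative assumptions, not the actual SAM/SAM HQ implementation.
import torch
import torch.nn as nn


def eager_attention_forward(query, key, value, attention_mask=None, scaling=None, dropout=0.0, training=False):
    # query/key/value: (batch, num_heads, seq_len, head_dim)
    if scaling is None:
        scaling = query.shape[-1] ** -0.5
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    # Softmax over the key axis *before* the value matmul -- the ordering
    # the "softmax in correct place" commit refers to.
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=training)
    attn_output = torch.matmul(attn_weights, value)
    # Returning attn_weights alongside the output is what lets the model
    # surface attention maps to callers, per the last commit above.
    return attn_output, attn_weights


q = k = v = torch.randn(1, 8, 16, 64)
out, weights = eager_attention_forward(q, k, v)
```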
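
The "fix CUDA sync issues" part of the title generally refers to removing host-device synchronization points such as `.item()` calls or Python-level branching on tensor values. The snippet below is a generic, hypothetical illustration of that pattern, not the specific code paths touched in this PR.

```python
# Hypothetical example of a CUDA sync pitfall and its tensorized fix.
import torch


def count_positive_with_syncs(scores: torch.Tensor) -> int:
    # BAD: every .item() call drains the GPU stream so Python can read
    # the scalar, synchronizing the device once per element.
    total = 0
    for s in scores:
        if s.item() > 0:
            total += 1
    return total


def count_positive_async(scores: torch.Tensor) -> torch.Tensor:
    # BETTER: keep the reduction on-device; no synchronization happens
    # until the caller actually moves the result to the host.
    return (scores > 0).sum()
```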