Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition (#17332)
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.