cuda : performance optimizations (#1530)

* XOR hack
* Block y dimension
* Loop unrolling
* Fixed cmake LLAMA_CUDA_BY option
* Removed hipblas compatibility code
* Define GGML_CUDA_DMMV_BLOCK_Y if not defined
* Fewer iterations, more ops per iteration
* Renamed DMMV X/Y compilation options
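
The "XOR hack" item is not spelled out in this message. A minimal sketch, assuming it refers to a butterfly-style warp reduction in which each lane adds the value held by the lane whose index differs by an XOR mask (on the GPU this exchange would be done with `__shfl_xor_sync`). The code below is a host-side C++ emulation for illustration only, not the kernel from this change; the function name `butterfly_reduce` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a butterfly (XOR) reduction across the lanes of a warp.
// Assumption: on the device, one step would look roughly like
//   tmp += __shfl_xor_sync(0xffffffff, tmp, mask, 32);
// Here each vector element stands in for one lane's private `tmp` value.
std::vector<float> butterfly_reduce(std::vector<float> lanes) {
    const std::size_t n = lanes.size();
    // Every lane needs a valid XOR partner, so the width must be a power of two.
    assert(n > 0 && (n & (n - 1)) == 0);

    for (std::size_t mask = n / 2; mask > 0; mask >>= 1) {
        std::vector<float> next(n);
        for (std::size_t lane = 0; lane < n; ++lane) {
            // Each lane adds the value held by its partner lane (lane XOR mask).
            next[lane] = lanes[lane] + lanes[lane ^ mask];
        }
        lanes = next; // all lanes update "simultaneously", as in a warp shuffle
    }
    return lanes; // after log2(n) steps every lane holds the full sum
}
```

With 32 lanes the loop runs five steps (masks 16, 8, 4, 2, 1). Unlike a tree reduction that leaves the sum in lane 0 only, the butterfly pattern leaves it in every lane, so no shared memory or extra broadcast is needed before writing the result.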