llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models
* Threaded dequantizing and f16 to f32 conversion
* Clean up thread blocks with spares calculation a bit
* Use std::runtime_error exceptions.