Different GGML file formats, explained.
I was always confused by its nomenclature and couldn't find any resource on it. It's a very clever implementation to save space.
What is GGML or GGUF in the world of Large Language Models? 🚀
GGUF / GGML are file formats for quantized models.
Basically, GGUF (often expanded as "GPT-Generated Unified Format"), the successor to the GGML format, is a file format for quantized models that lets users run an LLM on the CPU while also offloading some of its layers to the GPU for a speed-up.
📌 GGML is also the name of the tensor library behind the format: a machine learning library written in C/C++ that makes it possible to run LLMs either on a CPU alone or in tandem with a GPU.
💡 GGUF (new)
💡 GGML (old)
llama.cpp has dropped support for the GGML format and now only supports GGUF.
------------
* GGUF contains all the metadata it needs in the model file (no need for other files like tokenizer_config.json) except the prompt template
* llama.cpp has a script to convert *.safetensors model files into *.gguf
* Transformers & llama.cpp support CPU, GPU and MPS (Apple Silicon) inference
Because llama.cpp is compiled C/C++, inference with GGUF models is multithreaded.
↪️ The GGML format was recently replaced by GGUF, which is designed to be extensible, so that new features shouldn’t break compatibility with existing models. It also centralizes all the metadata in one file, such as special tokens, RoPE scaling parameters, etc. In short, it answers a few historical pain points and should be future-proof.
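To make the "everything in one file" idea concrete, here is a minimal sketch that reads only the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later as I understand it (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key/value counts); the function name `read_gguf_header` is made up for this illustration and it does not parse the metadata entries themselves.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_header(path):
    """Read the fixed-size GGUF header (assumed v2+ layout, little-endian)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic was {magic!r})")
        # uint32 format version
        (version,) = struct.unpack("<I", f.read(4))
        # uint64 tensor count, uint64 metadata key/value count
        tensor_count, metadata_kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

The metadata key/value pairs (special tokens, RoPE scaling parameters, and so on) follow right after this header, which is why no side files are needed.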
----------------
📌 GGUF (GGML) vs GPTQ
▶️ GPTQ is not the same quantization format as GGUF/GGML. They are different approaches with different codebases but have borrowed ideas from each other.
▶️ GPTQ is a post-training quantization method to compress LLMs, like GPT. GPTQ compresses GPT models by reducing the number of bits needed to store each weight in the model, from 16 bits down to just 3-4 bits.
▶️ GPTQ analyzes each layer of the model separately and approximates the weights in a way that preserves the overall accuracy.
▶️ It quantizes the weights of the model layer-by-layer to 4 bits instead of 16 bits, which reduces the memory needed for the weights by about 4x.
▶️ It achieves the same latency as the fp16 model with 4x less memory usage, and is sometimes faster thanks to custom kernels, e.g. ExLlama.
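The 4x figure is just bits-per-weight arithmetic. A quick sketch, assuming a 7-billion-parameter model as an example and counting weights only (no activations, KV cache, or per-group quantization scales, so real files come out somewhat larger):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # assumed example size, e.g. a "7B" model

fp16_gb = weight_memory_gb(n_params, 16)  # 14.0 GB
int4_gb = weight_memory_gb(n_params, 4)   # 3.5 GB

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, ratio: {fp16_gb / int4_gb:.0f}x")
```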
----------------------------
▶️ There's also the bitsandbytes library, which quantizes on the fly (to 8-bit or 4-bit) and is related to QLoRA. This is also known as dynamic quantization.
▶️ And there are some other formats like AWQ (Activation-aware Weight Quantization), which is a quantization method similar to GPTQ. There are several differences between AWQ and GPTQ as methods, but the most important one is that AWQ assumes that not all weights are equally important for an LLM’s performance. For AWQ, it is best to use the vLLM package.
-------------
Using GGUF is simple with the `ctransformers` library, which provides Python bindings for Transformer models implemented in C/C++ using the GGML library.
`pip install ctransformers[cuda]`
After installation, we can navigate to the model that we want to load, e.g. “TheBloke/zephyr-7B-beta-GGUF” and choose a specific file.
Like GPTQ, these files indicate the quantization method, compression level, size of the model, etc.
Then choose accordingly, e.g. “zephyr-7b-beta.Q4_K_M.gguf” if you want a 4-bit quantization.
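Putting the steps above together, here is a rough sketch. The helper `parse_quant_tag` is hypothetical and only assumes the `<name>.Q<bits>_<variant>.gguf` naming pattern seen in the file above; `load_model` uses `AutoModelForCausalLM.from_pretrained` from `ctransformers` with its `model_file`, `model_type` and `gpu_layers` arguments. The `model_type="mistral"` value and the `gpu_layers` number are assumptions for this particular model and your hardware.

```python
import re

REPO_ID = "TheBloke/zephyr-7B-beta-GGUF"
MODEL_FILE = "zephyr-7b-beta.Q4_K_M.gguf"


def parse_quant_tag(filename):
    """Return (tag, bits) from a name like 'x.Q4_K_M.gguf', or None if no match."""
    match = re.search(r"\.(Q(\d+)_[A-Za-z0-9_]+)\.gguf$", filename)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def load_model(gpu_layers=0):
    """Download (if needed) and load the GGUF file; gpu_layers=0 keeps it on CPU."""
    # Imported lazily so the helper above works without ctransformers installed.
    from ctransformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        REPO_ID,
        model_file=MODEL_FILE,
        model_type="mistral",   # assumption: Zephyr 7B is a Mistral-based model
        gpu_layers=gpu_layers,  # number of layers to offload to the GPU
    )


# Example usage (downloads several GB on first run):
# llm = load_model(gpu_layers=50)
# print(llm("What is GGUF?", max_new_tokens=64))
```

Raising `gpu_layers` moves more of the model onto the GPU, which is the CPU-plus-GPU split described at the top of the post.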