What do these cryptic file names mean? Your guide to LLM quantization

Last week we worked with LM Studio, but a few questions came up, which is why I'd like to go into the details here today.

If you look at the download options for a model like Qwen on Hugging Face or in LM Studio, you stumble across cryptic names like Q4_K_M, Q8_0 or Q6_K. Your first instinct is: ‘If Q8 has a higher number, it must be better than Q4, right?’

Technically, Q8 is more precise than Q4, but that doesn't mean it's the better choice for you. These cryptic abbreviations are not arbitrary numbers: they tell you a story about the model's performance, size, and accuracy before you even download it.

This article deciphers these suffixes once and for all, so that you can make the right decision for your hardware and never again have to guess blindly or mechanically pick ‘bigger number = better’.

What is quantization? The simple explanation

All right, let's get into it. What do these cryptic file names mean? Imagine you have a huge, detailed painting (the unquantized model). Each color is represented with 16 or 32 bits, which takes up an enormous amount of storage space and computing power. This is the original state of an LLM, the so-called FP16 or FP32 model.

Quantization is the process of compressing this painting by reducing the number of bits per color, for example to 8 or even just 4. The image becomes smaller and can be processed faster, often with only minimal loss of quality.

Every time you send a prompt to an LLM, it processes billions of these ‘color pixels’, the so-called weights. These weights are the heart of the model. When we compress them, we need much less storage space and the calculations run many times faster. This is exactly what makes it possible to run large LLMs on normal consumer hardware, such as your laptop or PC.
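
To make this concrete, here is a minimal sketch of the basic idea: mapping FP32 weights to 4-bit integers with a single scaling factor, then back. This is a deliberate simplification for illustration, not llama.cpp's actual code:

```python
import numpy as np

# Toy stand-in for a handful of model weights.
weights = np.random.randn(8).astype(np.float32)

# One scale factor for the whole tensor; the signed 4-bit range is roughly -7..7.
scale = np.abs(weights).max() / 7
quantized = np.round(weights / scale).astype(np.int8)  # each value now fits in 4 bits
dequantized = quantized.astype(np.float32) * scale     # what the model computes with

print("original:   ", weights)
print("round trip: ", dequantized)
print("max error:  ", np.abs(weights - dequantized).max())
```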

The anatomy of a quantization abbreviation

A name like Q4_K_M may seem complicated, but it is just a string of information:

  • Q: stands for ‘quantized’.
  • The number (e.g. Q4): Specifies how many bits each weight now has. A Q4 version uses 4 bits per weight. The higher the number, the more accurate, but also the larger and slower the model.
  • K: Stands for grouped quantization, a modern and very accurate method. Instead of quantizing all weights at once, the model quantizes them in small groups (e.g. 64 weights each). A separate scaling factor is calculated for each group, resulting in significantly higher accuracy than older methods (see the sketch after this list).
  • 0 or 1: These numbers denote the older, less precise quantization schemes. They use a simpler scaling approach (a _0 variant stores one scale factor per block, a _1 variant additionally stores an offset), which is less accurate but often faster. If you have the choice, K is almost always the better option.
  • S, M, L: These letters indicate the precision and usually appear together with the K suffix. They stand for Small, Medium or Large and help distinguish variants that have the same bit count. S is the fastest but least accurate variant, L the slowest but most accurate; M is often a good compromise.

A Q4_K_M is therefore a ‘quantized model with 4 bits per weight, using the modern group-based method (K) at medium precision (M)’.
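
To illustrate why per-group scales help, here is a minimal sketch of grouped quantization. It is my own simplification; llama.cpp's actual K-quants use a more elaborate super-block layout:

```python
import numpy as np

def quantize_grouped(weights: np.ndarray, bits: int = 4, group_size: int = 64):
    # One scale per group of weights instead of one for the whole tensor.
    qmax = 2 ** (bits - 1) - 1                                # e.g. 7 for signed 4-bit
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax  # per-group scale factor
    quantized = np.round(groups / scales).astype(np.int8)
    return quantized, scales

def dequantize_grouped(quantized: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (quantized.astype(np.float32) * scales).reshape(-1)

weights = np.random.randn(256).astype(np.float32)
q, scales = quantize_grouped(weights)
error = np.abs(weights - dequantize_grouped(q, scales)).max()
print(f"max round-trip error with per-group scales: {error:.4f}")
```

Because each group gets its own scale, a single outlier weight only degrades the precision of its own group instead of the entire tensor.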

Perplexity and KL divergence: how do you measure quality?

But how exactly do you know if a quantized model is good enough?
There are two important metrics for this:

Perplexity (PPL)

Perplexity measures how well a model predicts text. The lower the PPL value, the ‘less confused’ the model is and the closer it stays to the original.

The developers of llama.cpp list PPL values for the different quantizations. There you can see, for example, that a Q8_0 model performs almost identically to the unquantized original, while a Q2_K already shows significant losses.
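
To make the metric less abstract: perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have the per-token log-probabilities from your model:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    # PPL = exp(average negative log-likelihood per token).
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy values: log-probabilities the model assigned to each actual next token.
# Values closer to 0 mean the model was more confident, i.e. less 'perplexed'.
log_probs = [-0.5, -1.2, -0.3, -2.0, -0.8]
print(f"PPL: {perplexity(log_probs):.2f}")
```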

KL Divergence

KL divergence is a more advanced metric. It measures how much the probability distribution of the quantized model differs from that of the original model.

This matters because perplexity can average errors away. A model may have a good average value but be badly wrong on rare or difficult words. KL divergence detects these ‘blind spots’. Again, the lower the value, the better.
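
A minimal sketch of the calculation for a single token position, using toy distributions (in practice the divergence is averaged over many tokens, and real tools handle zero probabilities more carefully):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    # KL(P || Q) = sum_i p_i * log(p_i / q_i)
    # p: next-token distribution of the original model,
    # q: next-token distribution of the quantized model.
    p = p + eps  # toy smoothing to avoid log(0)
    q = q + eps
    return float(np.sum(p * np.log(p / q)))

# Toy distributions over a four-token vocabulary.
original  = np.array([0.70, 0.20, 0.05, 0.05])
quantized = np.array([0.60, 0.25, 0.10, 0.05])
print(f"KL divergence: {kl_divergence(original, quantized):.4f}")
```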

If this is still too imprecise for you and you want to read more on the topic, I recommend the following authors and sources: Paul Ilvez | René Peinl | Michael Jentsch | Maarten Grootendorst


And since we are already fairly deep into the topic, here, for the sake of completeness:

Beyond PPL+KLD: Further evaluation metrics

Perplexity and KL divergence are important, but they are only part of the picture when it comes to evaluating LLMs. For specific tasks, there are other, tailor-made metrics that provide a more precise statement about the quality of the generated texts.

  • F1 score: This value is the harmonic mean of two key figures: precision (how many of the detected elements were correct?) and recall (how many of the correct elements were detected at all?). The F1 score is particularly common in tasks such as named entity recognition (NER), where specific entities such as names or places must be identified in a text (see the sketch after this list).
  • BLEU score (Bilingual Evaluation Understudy): This score is the standard in machine translation. It measures how closely the machine-generated translation matches reference translations created by human experts. The closer the generated text is to the reference texts, the higher the BLEU value.
  • ROUGE score (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a set of metrics that evaluate the overlap between the generated text and a reference text. They are often used for summarization tasks. ROUGE helps quantify how well the model filters the most important information out of a text and reproduces it in a new, concise form.
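
As a quick illustration of the F1 formula (a toy calculation, not a full NER evaluation pipeline):

```python
def f1_score(precision: float, recall: float) -> float:
    # F1 = harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy NER example: the model detected 8 entities, 6 of them correctly,
# while the text actually contains 10 entities.
precision = 6 / 8   # correct detections / all detections
recall = 6 / 10     # correct detections / all true entities
print(f"F1: {f1_score(precision, recall):.2f}")
```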

And for those of you who still haven't had enough, I'll leave this one here as well. But then we'd be drifting quite far off topic.

Making the right choice

Now that you know the terms, you can choose your models wisely.

  • Q4_K_M is recommended by many developers as the best compromise between size, speed and accuracy. It is often an excellent default choice.
  • Q5_K_M and Q5_K_S offer even more precision with only a slightly larger file size.
  • Q6_K and Q8_0 are the heavyweights of quantization. They are almost as good as the original models, but require significantly more storage space and are slower. Use them when you need the best possible accuracy and your hardware can handle it (a rough size estimate follows below).
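
As a rule of thumb for what your hardware can fit: a quantized model's file size is roughly parameter count × bits per weight ÷ 8. The effective bits-per-weight values below are my own ballpark figures, not exact GGUF numbers (real files are slightly larger because of scaling factors and metadata):

```python
def estimate_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # Rough estimate: parameters * bits per weight / 8 bytes, ignoring metadata.
    # params_billions * 1e9 * bits / 8 bytes = params_billions * bits / 8 GB.
    return params_billions * bits_per_weight / 8

# Approximate effective bits per weight, including per-group scale overhead.
quants = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

for name, bpw in quants.items():
    print(f"7B model at {name}: ~{estimate_size_gb(7, bpw):.1f} GB")
```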

TL;DR

At the end of the day, the best method is to simply try different variants. Thanks to tools like LM Studio or sites like huggingface.co, you can easily do this without being put off by the technical details. With this information, however, the pre-selection should at least be easier.