Which open weight model should we choose? MoE? Q4?

This is Hasegawa (@rarirureluis) from the Service Reliability Group (SRG) of the Media Headquarters.
SRG (Service Reliability Group) mainly provides cross-sectional support for the infrastructure of our media services, improves existing services, launches new ones, and contributes to OSS.
This article walks through technical terms such as open weights, quantization, and MoE so that readers can navigate the otherwise complex problem of model selection and choose the AI model best suited to their hardware environment.

Qwen3.5 has good performance


Qwen3.5 is the latest open weight model developed by Alibaba.
Running unsloth/Qwen3.5-35B-A3B-GGUF:UD-MXFP4_MOE in my environment (DDR5-6400 256GB + RTX 4090 on PCIe Gen4 x16), I get 143 tokens/s with thinking enabled.
Qwen3.5 boasts the highest benchmark scores among open weight models.

Considering how to use a local LLM

I use it with browser-use and with Brave Leo AI's BYOM (Bring Your Own Model) feature.
I was using Gemini 3.1 Pro for browser-use, but after switching I haven't noticed any difference in quality (personal impression). If anything, response times have gotten faster thanks to the local environment.

What exactly is "open weight"?


Recently, the term "open weight" has come up more and more in the AI community.
Many people may wonder how it differs from open source.
Open weight is a release format in which only the weights of a trained model and the minimum code required for inference are made public; the training dataset, detailed training procedure, and hyperparameters are not.
In other words, the major feature of open weight models is that companies and researchers can download the model itself and run it in a local environment.
There is no need to worry about your input data being sent to an external party, and you can also fine-tune the model to suit your own needs.

What is "quantization" and why is it necessary?


LLM parameters are usually represented as high-precision floating-point numbers such as bfloat16 or float32. However, handling these as is would require tens to hundreds of GB of memory.
This is where a technology called "quantization" comes in.
Quantization is a technique that reduces file size and memory consumption by reducing the number of bits used to represent each weight in a model.
For example, if a value normally stored in 16 bits is compressed to 4 bits, the file size shrinks to roughly one quarter by simple arithmetic.
However, compressing information does cost some precision: fewer bits mean a lighter model but potentially lower-quality answers. This "size vs. quality" trade-off is at the heart of running open weight models.
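As a rough sanity check on that arithmetic, the sketch below estimates a model's raw weight footprint at different bit widths (a back-of-the-envelope calculation; real GGUF files also store block scales and metadata, so actual sizes differ somewhat):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for n_params weights at the given bit width
    (weights only; ignores quantization scales and file metadata)."""
    return n_params * bits_per_weight / 8 / 1e9

# A hypothetical 35B-parameter model at common bit widths:
params = 35e9
for bits in (16, 8, 4, 2):
    print(f"{bits:2d}-bit: ~{model_size_gb(params, bits):.1f} GB")
```

At 4 bits the 35B model drops from ~70 GB to ~17.5 GB, which matches the "about one quarter" rule of thumb above.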

Deciphering Quantization Naming Conventions


Take UD-IQ2_XXS as an example. The meaning of each symbol is as follows:
  • UD: Unsloth Dynamic quantization, which mixes precision per layer (explained later)
  • Q / IQ: the quantization family; Q is llama.cpp's standard quantization, while IQ additionally uses an importance matrix (imatrix) during quantization
  • The digit (Q6, IQ2, ...): the approximate number of bits per weight
  • K: the k-quant scheme, which quantizes weights in blocks with per-block scales
  • XXS / S / M / XL: the size/quality variant within the same bit level; larger suffixes keep more layers at higher precision
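These conventions can be captured in a small parser. The regex and field names below are my own illustration based on common GGUF/Unsloth file names, not an official schema (names like UD-MXFP4_MOE follow a different scheme and are not covered):

```python
import re

# Illustrative pattern: optional "UD-" prefix, quant family (Q or IQ),
# bit count, and an optional variant suffix such as K, K_M, XXS, or K_XL.
QUANT_RE = re.compile(
    r"^(?P<dynamic>UD-)?(?P<family>I?Q)(?P<bits>\d)(?:_(?P<variant>[A-Z_]+))?$"
)

def parse_quant_name(name: str) -> dict:
    m = QUANT_RE.match(name)
    if not m:
        raise ValueError(f"unrecognized quantization name: {name}")
    return {
        "unsloth_dynamic": m.group("dynamic") is not None,
        "imatrix": m.group("family") == "IQ",
        "bits": int(m.group("bits")),
        "variant": m.group("variant"),
    }

print(parse_quant_name("UD-IQ2_XXS"))
# {'unsloth_dynamic': True, 'imatrix': True, 'bits': 2, 'variant': 'XXS'}
```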

Characteristics and uses of each quantization level


Below is a list of the quantization models we will be targeting this time.
Here we will use unsloth/Qwen3.5-35B-A3B-GGUF as an example.

2-bit quantization (ultra-lightweight, experimental)

name         size
UD-IQ2_XXS   9.76 GB
UD-Q2_K_XL   12.9 GB
This is the most aggressive compression, with each weight being expressed using just around 2 bits.
The advantage is that it allows large models to run even in environments with limited VRAM, but this has a significant impact on the quality of answers.
However, by combining dynamic quantization (UD), important layers can be maintained at high precision, resulting in better-than-expected quality.
In actual benchmarks, Unsloth's UD-Q2_K_XL has been reported to outperform standard 3-bit quantization (Q3_K_M) in several benchmarks.

3-bit quantization (lightweight and practical)

name         size
UD-IQ3_XXS   14.1 GB
UD-IQ3_S     15.2 GB
UD-Q3_K_M    16.7 GB
UD-Q3_K_XL   17.2 GB
This quantization level is intermediate between 2-bit and 4-bit: lighter than 4-bit while avoiding the worst of the quality loss seen at 2-bit.

4-bit quantization (balanced, most popular)

name           size
UD-MXFP4_MOE   19.5 GB
UD-Q4_K_M      19.9 GB
UD-Q4_K_XL     20.6 GB
4-bit quantization is the most commonly used level as it provides a good balance between quality and size.
In UD-Q4_K_XL, important matrices are assigned higher-precision quantization such as Q5_K, while the rest use Q4_K to keep the size down.
UD-MXFP4_MOE uses MXFP4 (Microscaling FP4), a format standardized by the OCP (Open Compute Project) and also adopted by OpenAI's gpt-oss-120b.
It runs particularly efficiently on dedicated hardware (such as the NVIDIA Blackwell generation).
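To make the block-scaling idea concrete, here is a toy version of MXFP4-style quantization: each element is rounded to the nearest FP4 (E2M1) value, and each block shares one power-of-two scale. This is a simplified sketch of the idea, not the actual OCP bit layout (which packs blocks of 32 FP4 elements with an 8-bit shared scale):

```python
import math

# Representable magnitudes of FP4 E2M1 (sign is handled separately).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_mxfp4(block):
    """Toy MXFP4-style quantization of one block: pick a shared
    power-of-two scale so the largest magnitude fits within 6.0,
    then round each element to the nearest FP4 value.
    Returns the dequantized block."""
    amax = max(abs(x) for x in block)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0)) if amax > 0 else 1.0
    out = []
    for x in block:
        mag = min(FP4_LEVELS, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(mag * scale, x))
    return out

print(quantize_block_mxfp4([0.03, -0.11, 0.25, 0.6]))
```

Note how the smallest value collapses to zero while the block's larger values survive; this is the precision cost that dynamic schemes try to manage by keeping sensitive layers at higher precision.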

5-bit quantization (high quality, moderately large capacity)

name         size
UD-Q5_K_XL   24.9 GB
Select this if you require higher quality than 4-bit.
Some argue that int4 (4-bit integer) quantization is sufficient, but for precision-sensitive tasks such as coding and logical reasoning, 5 bits or more can make a noticeable difference.

6-bit quantization (high quality)

name         size
UD-Q6_K_S    28.5 GB
UD-Q6_K_XL   30.3 GB
The quality degradation is almost imperceptible.

8-bit quantization (almost original precision)

name         size
UD-Q8_K_XL   38.7 GB
8-bit is very close to full-precision (16-bit) quality.
The increase in perplexity (a measure of how hard the model finds it to predict text) is almost zero, so it is chosen for research and verification purposes or in environments with ample VRAM.

What is MoE (Mixture of Experts)?


Along with quantization, the "MoE" architecture is also important.
MoE (Mixture of Experts) is a mechanism that has a subnetwork of multiple "experts" (specialists) within the model, and activates only some of the experts for each input token.
For example, even in a model with 120B parameters, in the case of the MoE architecture, the parameters actually used to process each token (active parameters) are only a portion of the total.
In the case of OpenAI's gpt-oss-120b, the total number of parameters is 117B, but the active parameters are about 5.1B.
The benefits of this mechanism are as follows:
  • High computational efficiency: since not all parameters are used, the computational cost of inference drops significantly.
  • Large models on realistic hardware: the actual compute per token is light relative to the total parameter count.
  • Specialization raises quality: each expert specializes in particular patterns, which improves overall quality.
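The routing step behind these benefits can be sketched in a few lines of Python. The gate scores here are just given numbers; in a real model they come from a small learned linear layer applied to each token:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, top_k=2):
    """Top-k MoE routing for one token: keep the top_k experts by
    gate probability and renormalize their weights to sum to 1."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}

# Hypothetical gate scores for 8 experts; only 2 run for this token.
weights = route_token([0.1, 2.3, -0.5, 1.7, 0.0, -1.2, 0.4, 0.9], top_k=2)
print(weights)  # experts 1 and 3, with weights summing to 1
```

Each token's hidden state is then sent only to the chosen experts' FFNs, which is how a 117B-parameter model can use only ~5.1B active parameters per token.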
There are some caveats to quantizing the MoE model.
The expert (FFN) layers account for the majority of the model's parameters, so quantizing them aggressively is the key to reducing size.
The attention layers, on the other hand, are much more sensitive to reduced precision, so they are usually kept at higher precision.
Another feature of the MoE model is that it is well suited to "offloading" to CPU RAM.
Since only a portion of the expert layers is used at each step, the parts that do not fit in VRAM can be kept in system RAM or on NVMe while the model runs.

What is Unsloth Dynamic?


Files prefixed with UD- use Unsloth's dynamic quantization.
With conventional uniform quantization, all layers are compressed to the same number of bits. Dynamic quantization instead analyzes in advance which layers have the greatest impact on the model's output, then automatically assigns high precision (e.g., 8-bit or 16-bit) to important layers and low precision (e.g., 2-bit or 3-bit) to less important ones.
This system allows you to enjoy the benefits of "higher quality for the same file size" or "smaller file size for the same quality."
Dynamic quantization is achieved by utilizing calibration data called an imatrix (Importance Matrix). Unsloth uses a proprietary calibration dataset optimized for conversation, coding, and inference tasks, which contributes to improved quality.
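Below is a toy version of that allocation step, assuming we already have per-layer importance scores (in practice these come from imatrix calibration; the scores, fractions, and bit choices here are invented for illustration):

```python
def assign_bits(importance, hi_frac=0.25, lo_frac=0.25,
                hi_bits=8, mid_bits=4, lo_bits=2):
    """Rank layers by importance; the top hi_frac get hi_bits,
    the bottom lo_frac get lo_bits, and the rest get mid_bits."""
    n = len(importance)
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    bits = [mid_bits] * n
    for rank, layer in enumerate(order):
        if rank < n * hi_frac:
            bits[layer] = hi_bits
        elif rank >= n * (1 - lo_frac):
            bits[layer] = lo_bits
    return bits

# Hypothetical importance scores for 8 layers:
imp = [0.9, 0.1, 0.4, 0.8, 0.2, 0.5, 0.3, 0.7]
print(assign_bits(imp))  # [8, 2, 4, 8, 2, 4, 4, 4]
```

The most important layers stay at 8-bit while the least important drop to 2-bit, which is why a "UD" file can be smaller than uniform quantization at comparable quality.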
Actual benchmark results also show Unsloth's UD-IQ2_XXS achieving higher task accuracy than other providers' IQ3_S, demonstrating that perplexity and KL divergence alone do not always predict quality in real-world applications.

Which quantization should I choose?


Guidelines depend on your hardware. With around 16 GB of VRAM, UD-IQ3_XXS (14.1 GB) fits comfortably; with more headroom, UD-Q4_K_XL (20.6 GB) offers the best balance of quality and size.

Summary


This article explained open weights, quantization, and MoE.
If you have 16GB or more of VRAM, we recommend UD-IQ3_XXS or UD-Q4_K_XL.
MoE models are computationally efficient and well suited to local inference.
If you are interested in SRG, please contact us here.