What MTP Actually Does
Standard autoregressive models generate one token at a time. Each token requires a full forward pass through the model. Multi-Token Prediction (MTP) changes this by predicting multiple future tokens in parallel using additional output heads trained for exactly that purpose.
The mechanism is speculative decoding, but with a critical difference: instead of running a separate small "draft" model to guess upcoming tokens, MTP uses the model's own state. The draft predictions come from the model itself, not a secondary model you have to keep in memory. This means no VRAM overhead for a second weights file, no context switching between models, and no draft model to keep matched to the target. As with any speculative decoding, the full model still verifies every drafted token, so a wrong guess costs time rather than output quality.
The reported result is a throughput increase of roughly 1.5x or better on compatible models, with some users seeing higher gains depending on prompt structure and batch size.
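To see where a figure like 1.5x can come from, here is a back-of-the-envelope sketch in Python. It assumes every drafted token is accepted independently with the same probability, which real workloads do not guarantee, and the function names and the `draft_cost_ratio` parameter are illustrative, not part of llama.cpp or any other library.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the full model.

    Assumption: each drafted token is accepted independently with
    probability accept_prob. A pass always yields at least one token,
    because the full model supplies its own token at the first mismatch
    (or one extra token when every guess is accepted).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # Geometric sum: 1 + a + a^2 + ... + a^draft_len
    return sum(accept_prob ** k for k in range(draft_len + 1))


def estimated_speedup(accept_prob: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Rough speedup over plain one-token-per-pass decoding.

    draft_cost_ratio is the (assumed) cost of producing one drafted
    token, expressed as a fraction of one full forward pass.
    """
    if draft_cost_ratio < 0.0:
        raise ValueError("draft_cost_ratio must be non-negative")
    cost_per_step = 1.0 + draft_len * draft_cost_ratio
    return expected_tokens_per_pass(accept_prob, draft_len) / cost_per_step
```

With purely illustrative inputs (60% acceptance, two drafted tokens, each costing a tenth of a full pass), `estimated_speedup(0.6, 2, 0.1)` comes out at about 1.6. The model also shows why gains vary with the prompt: predictable text raises the acceptance rate, and the speedup rises with it.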
The Reality Check: What's Actually Available
Here is where enthusiasm meets the current state of model availability.
The headline model right now is Qwen 3.6. The ggml-org team has published GGUFs with MTP heads on HuggingFace. But there is a catch: the only available variants are Qwen 3.6 27B and Qwen 3.6 35B-A3B, and the only published formats are BF16 and Q8_0. There is no Q4_K_M quant and no smaller parameter count.
If you are running on a consumer GPU — an RTX 4070 Ti SUPER with 16GB, for example — these models will not fit. Not even with offload to system RAM at usable speeds. The current MTP model catalog is enterprise-hardware territory, not the "download and run it on your gaming GPU" experience the rest of the local inference ecosystem has trained us to expect.
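The arithmetic behind "will not fit" is easy to check. The sketch below estimates the size of the weights alone from the nominal parameter count and the bits stored per weight. The `overhead_gb` allowance for KV cache and runtime buffers is a guess, not a measured value, and real GGUF files will differ somewhat from these round numbers.

```python
# Bits stored per weight for the two published formats.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,  # blocks of 32 int8 weights plus one 16-bit scale
}


def weight_size_gb(params_billion: float, fmt: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8.0


def fits_in_vram(params_billion: float, fmt: str, vram_gb: float,
                 overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM.

    overhead_gb is an illustrative allowance for KV cache and buffers.
    """
    return weight_size_gb(params_billion, fmt) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for params, name in [(27.0, "Qwen 3.6 27B"), (35.0, "Qwen 3.6 35B-A3B")]:
        for fmt in ("Q8_0", "BF16"):
            size = weight_size_gb(params, fmt)
            verdict = "fits" if fits_in_vram(params, fmt, 16.0) else "does not fit"
            print(f"{name} {fmt}: ~{size:.1f} GB, {verdict} in 16 GB")
```

By this estimate the 27B model needs about 28.7 GB at Q8_0 and 54 GB at BF16. The 35B-A3B model only activates a few billion parameters per token, but all 35B still have to be stored, so it lands at roughly 37 GB and 70 GB.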
Gemma 4 is the other family with confirmed MTP architecture, but at the time of writing, there are no published GGUFs with MTP heads for the smaller parameter variants (the 4B or 9B models). The capability exists in the architecture and in llama.cpp's backend, but the actual downloadable weights are not there yet.
This will change. The pattern with new inference features is that support lands in the engine first, then the model weights follow. But as of today, MTP is a feature you should know about and monitor, not one you can enable on most local setups.
What You Can Do Right Now
If you want a speedup today and you are on consumer hardware, standard speculative decoding is the practical path. It is not MTP, but it delivers real gains using models you can actually run.
The setup is simple: pair your main model with a tiny draft model. The draft guesses the next few tokens, and the main model either accepts them or corrects them. In llama.cpp, you enable this by passing the draft model's path with --model-draft (short form -md). LocalAI passes these through in the standard launch configuration. An example pairing:
- Main model: Qwen 3 8B Q4_K_M (~5GB)
- Draft model: Qwen 3 0.6B Q4_K_M (~0.5GB)
- Total VRAM: Under 6GB for both, with a typical 1.2x to 1.5x speedup
llama.cpp has supported this for a while. When the server is started with --model-draft, speculative decoding runs automatically.
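The accept-or-correct loop is easier to follow in code. What follows is a toy sketch of greedy draft-and-verify decoding, not llama.cpp's implementation: the "models" are plain Python functions over integer tokens, and `toy_target` and `toy_draft` are made up for illustration. A real engine scores all drafted positions in a single batched forward pass, whereas this sketch calls the target once per position for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], n_draft: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens it emits."""
    # 1. The draft model proposes n_draft tokens, one after another.
    proposed: List[int] = []
    for _ in range(n_draft):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal in order.
    emitted: List[int] = []
    for tok in proposed:
        expected = target(context + emitted)
        if tok != expected:
            emitted.append(expected)  # correct the first wrong guess and stop
            return emitted
        emitted.append(tok)

    # 3. Every guess matched, so the target contributes one extra token.
    emitted.append(target(context + emitted))
    return emitted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new: int, n_draft: int = 4) -> List[int]:
    """Generate max_new tokens after a non-empty prompt."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, n_draft))
    return out[: len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

The property to notice is that the output matches what the target would have produced on its own. The draft only changes how many target passes it takes to get there, which is why a weak draft costs speed rather than quality.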
MTP will eventually replace or supplement this approach for compatible models, because it avoids the draft model entirely. But speculative decoding with a small draft is what works today on 16GB cards.
How MTP Will Work Once Models Arrive
When smaller MTP-enabled GGUFs do land, enabling them will be straightforward:
- --spec-type mtp activates the native multi-token heads
- --spec-draft-n-max <N> sets how many draft tokens to predict per step
If you are on LocalAI, the same backend will support these flags through the standard launch configuration. No draft model needed — point at an MTP-capable GGUF and enable the speculator.
One caveat: you will need a current build. MTP support is recent. If your llama.cpp or LocalAI container is more than a few weeks old, update before trying. The feature will not exist in older binaries.
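One way to check whether a given build is recent enough is to look for the flags in the binary's help output. The sketch below does that in Python. The flag names are the ones given in this article and may differ in your build, and the default binary name `llama-server` is an assumption about your install.

```python
import subprocess
from typing import Iterable, List

# Flag names as given in this article; treat them as assumptions
# until your own build's help output confirms them.
MTP_FLAGS = ("--spec-type", "--spec-draft-n-max")


def missing_flags(help_text: str, required: Iterable[str] = MTP_FLAGS) -> List[str]:
    """Return the flags from `required` that do not appear in the help text."""
    return [flag for flag in required if flag not in help_text]


def read_help(binary: str = "llama-server") -> str:
    """Capture a binary's --help output (stdout and stderr combined).

    Raises FileNotFoundError if the binary is not on PATH.
    """
    result = subprocess.run([binary, "--help"], capture_output=True,
                            text=True, check=False)
    return result.stdout + result.stderr
```

Calling `missing_flags(read_help())` returns an empty list when both flags are present. Anything left in the list is a sign that the build predates the feature or names the option differently.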
What This Means for Local AI
MTP is significant because it breaks the "buy more GPU or wait" trap — but only once the model weights catch up. A 1.5x speedup on the same hardware is functionally equivalent to upgrading your GPU tier, except it costs nothing and takes minutes to enable.
The broader point: local inference is closing the gap with hosted APIs on speed, not just on privacy or cost. MTP is one of several recent advances — alongside better quantisation, MoE support, and context-length extensions — that make self-hosted models increasingly viable for daily work. The gap between "research paper" and docker pull is shrinking from months to weeks.
For operators running production local AI, that pace is the story. The models are not standing still, and neither are the engines that run them.
Practical Steps
- Check your hardware budget. If you have 24GB+ VRAM, watch the ggml-org HuggingFace repo for Qwen 3.6 MTP quants. If you have 16GB or less, MTP-native models are not an option yet.
- Use speculative decoding today. Pair your main model with a small draft model, such as Qwen 3 0.6B or Llama 3.2 1B. Gains are real, and hardware requirements are trivial.
- Update your engine. MTP support is already in current llama.cpp builds. LocalAI v4.1.x includes the relevant backend patches. When the weights land, you will be ready.
- Measure before and after. Use the same prompt, same temperature, same seed. Speedup varies by task. Code generation typically benefits more than open-ended chat.
- Watch the model releases. The engine support is ahead of the weights. When Gemma 4 9B MTP or Qwen 3.6 8B MTP GGUFs appear, that is when the feature becomes practical for most local operators.
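For the "measure before and after" step, a small timing harness keeps the comparison honest. The sketch below is in Python and assumes you supply a `generate` callable that sends a prompt to your server with fixed settings and returns the number of tokens produced. That callable is hypothetical and is not provided by llama.cpp or LocalAI.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str], int], prompt: str,
                      runs: int = 3,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Average throughput over several runs of the same prompt.

    `generate` is a caller-supplied function that runs one completion
    and returns how many tokens it produced.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = clock()
        total_tokens += generate(prompt)
        total_seconds += clock() - start
    if total_seconds <= 0.0:
        raise ValueError("elapsed time was zero; cannot compute throughput")
    return total_tokens / total_seconds


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tps <= 0.0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps
```

Run it once with speculation off and once with it on, using the same prompt, temperature, and seed, then pass the two results to `speedup`. Averaging over a few runs smooths out warm-up effects, and repeating the comparison per task type shows how much the gain depends on the workload.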
Dallum Brown
Writer and curator exploring the impact of technology on everyday life.