DeepSeek-V4-Flash-MTP-bf16

This repository contains Multi-Token Prediction (MTP) drafter weights split from deepseek-ai/DeepSeek-V4-Flash for use with mlx-vlm speculative decoding.

This is not a standalone chat or text-generation model. Load it as the draft model alongside a compatible DeepSeek V4 Flash target checkpoint.

Use with mlx-vlm

uv run mlx_vlm.generate \
  --model mlx-community/DeepSeek-V4-Flash-4bit \
  --draft-model mlx-community/DeepSeek-V4-Flash-MTP-bf16 \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking

For local weights:

uv run mlx_vlm.generate \
  --model /path/to/target-model \
  --draft-model /path/to/DeepSeek-V4-Flash-MTP \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking

Model Details

  • Model type: deepseek_v4_mtp
  • MTP block size: 2
  • Target architecture: DeepSeek V4 Flash
  • Precision: bf16, plus mixed 8-bit and 4-bit MLX quantized tensors as configured in config.json
  • Runtime: MLX / mlx-vlm
  • Format: Safetensors with MLX-compatible config and tokenizer files

The stored tensors include bfloat16 parameters and MLX quantized tensors as described in config.json.
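Before pointing --draft-model at a local directory, it can help to confirm that the directory looks like the format described above. The following is a minimal sketch, not part of mlx-vlm: inspect_checkpoint is a hypothetical helper, and the "quantization" key is an assumption about how the quantization settings are named in config.json.

```python
import json
from pathlib import Path


def inspect_checkpoint(path):
    """Summarize a local checkpoint directory (hypothetical helper)."""
    root = Path(path)
    config_path = root / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"no config.json found in {root}")

    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Weight files are expected to be stored as Safetensors.
    weight_files = sorted(p.name for p in root.glob("*.safetensors"))

    return {
        # Expected to read "deepseek_v4_mtp" for this repository.
        "model_type": config.get("model_type"),
        # Assumed key name; None if the config uses a different layout.
        "quantization": config.get("quantization"),
        "safetensors_files": weight_files,
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(json.dumps(inspect_checkpoint(sys.argv[1]), indent=2))
```

Run it with the local drafter path as the only argument. An empty safetensors_files list or an unexpected model_type suggests the directory is not the drafter checkpoint.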

Intended Use

Use this repository only as a speculative decoding drafter for compatible DeepSeek V4 Flash checkpoints. At each decoding step this MTP model proposes candidate tokens, and the target model verifies them before they are accepted.
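The propose-and-verify loop can be illustrated with a toy greedy version. This is only a sketch of the general technique, not the mlx-vlm implementation: toy_target and toy_draft are made-up stand-ins for the two models, and block_size=2 simply mirrors the value listed under Model Details (how mlx-vlm interprets that setting is an assumption here). A real runtime verifies the drafted tokens in a single target forward pass; the sketch calls the target once per token for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Made-up deterministic 'target model': next token from the context."""
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Made-up 'draft model': agrees with the target except on some steps."""
    token = toy_target(ctx)
    return token if len(ctx) % 3 else (token + 1) % 50


def greedy_generate(target_next: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding with the target only (reference output)."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    block_size: int = 2,
) -> List[int]:
    """Greedy speculative decoding: draft proposes, target verifies."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # Draft phase: propose block_size candidate tokens.
        draft: List[int] = []
        for _ in range(block_size):
            draft.append(draft_next(tokens + draft))

        # Verify phase: accept candidates while they match the target.
        accepted: List[int] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)
            else:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
        else:
            # Every candidate matched, so the target adds one more token.
            accepted.append(target_next(tokens + accepted))

        accepted = accepted[: max_new_tokens - produced]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[len(prompt):]


if __name__ == "__main__":
    print(speculative_generate(toy_target, toy_draft, [1, 2, 3], 10))
```

Because every accepted token is one the target would have chosen itself, the output matches target-only greedy decoding. The drafter changes only how many target steps are needed, not the result.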

Limitations

This checkpoint requires an mlx-vlm runtime that supports Qwen/DeepSeek MTP draft models. Standalone generation through generic Transformers APIs is not expected to work with this repository on its own.

Please refer to the upstream deepseek-ai/DeepSeek-V4-Flash model card and license terms for model usage constraints.

Safetensors model size: 1B params. Tensor types: F32, BF16, U8, U32.