Saw a project on GitHub: 27B on a 3090 at 129 tps, up to 207 tps - V2EX


stefwoo · 10 days ago · 3263 views

    https://github.com/Luce-Org/lucebox-hub

DFlash DDtree: Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090
First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22.

- Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×)
- 129.5 tok/s mean on the HumanEval 10-prompt bench
- 3.43× faster than autoregressive (+15% over chain speculative decoding)
- 2.8× faster than SGLang AWQ on the same hardware
- Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072)

PFlash speculative prefill on RTX 3090. In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime: just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd), plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward.
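To make the "score → select" idea concrete, here is a minimal toy sketch of that stage: given per-token importance scores from a drafter, keep the top keep_ratio fraction and merge the kept token indices into contiguous spans for the target to prefill. The real kernels compute scores from attention statistics (mean_K); here the scores are just random numbers, and `select_spans` is a hypothetical name — only the selection logic is illustrated.

```python
import random

def select_spans(scores, keep_ratio=0.05):
    """Return contiguous (start, end) spans covering the top-scoring tokens."""
    n = len(scores)
    k = max(1, int(n * keep_ratio))
    # indices of the k highest-scoring tokens, back in document order
    kept = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    # merge adjacent indices into half-open [start, end) spans
    spans = []
    start = prev = kept[0]
    for i in kept[1:]:
        if i == prev + 1:
            prev = i
        else:
            spans.append((start, prev + 1))
            start = prev = i
    spans.append((start, prev + 1))
    return spans

random.seed(0)
scores = [random.random() for _ in range(1000)]   # stand-in drafter scores
spans = select_spans(scores, keep_ratio=0.05)
kept_tokens = sum(e - s for s, e in spans)
print(kept_tokens)  # 5% of 1000 tokens -> 50
```

The target model would then run its prefill only over these spans, which is where the ~10× TTFT numbers below come from.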

- ~10.4× TTFT at 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV)
- 10.0× TTFT at 64K context: 13.5 s dflash vs 134.95 s llama.cpp
- NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85

    10 replies    2026-05-06 21:21:27 +08:00
#1 · stefwoo (OP) · 10 days ago
To squeeze the big model and the small draft model into 24 GB of VRAM together: the target is 4-bit quantized (~16 GB), the draft stays BF16 (~1.2 GB), and the KV cache is quantized.
During prefill, the small draft model races through the long prompt and picks out only the most important 5% of spans; the big model then does a sparse prefill over just that 5%, skipping the other 95% of irrelevant content.
Then comes generation: the draft model speculates several candidate tokens at once, and the big model verifies the whole tree in a single pass with tree attention, giving fast token-by-token decoding.
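As a toy illustration of that tree-verification step (my reading of the DDtree idea from this thread, not the project's actual code): the draft proposes a small tree of candidate continuations, and the target checks every branch, keeping the deepest path whose tokens all match the target's own greedy choices. Both "models" below are toy deterministic functions, and all names are hypothetical; a real implementation verifies the whole tree in one batched forward and uses probabilistic acceptance rather than greedy matching.

```python
def target_next(prefix):
    # toy target model: next token = (sum of prefix) % 10
    return sum(prefix) % 10

def verify_tree(prefix, tree):
    """tree: {token: subtree}. Return the longest accepted token path."""
    best = []
    for token, subtree in tree.items():
        if token == target_next(prefix):          # target agrees with the draft
            path = [token] + verify_tree(prefix + [token], subtree)
            if len(path) > len(best):
                best = path
    return best

# draft speculated two branches from prefix [3, 4]: 7 -> {4, 5}, and 8
prefix = [3, 4]
tree = {7: {4: {}, 5: {}}, 8: {}}
accepted = verify_tree(prefix, tree)
print(accepted)  # -> [7, 4]: branch 7 then 4 match; 8 and 5 are rejected
```

Every accepted token on the winning path costs only one target forward pass for the whole tree, which is where the speedup over plain autoregressive decoding comes from.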
#2 · strobber16 · 10 days ago
This sent me back to rewatch bycloud's video on speculative decoding. And I found I still don't understand it.
#3 · Hermitist · 10 days ago
It's all CUDA; too bad I can't test it on my MacBook Air.
#4 · beyondstars · 10 days ago
Are you sure quantization this aggressive doesn't hurt the model's actual performance?
#5 · 940i3s34v4F1HW41 (PRO) · 10 days ago via iPhone · 2
27B models are just dumb.
#6 · Ironpan · 9 days ago
From what I see, the benchmarks only compare efficiency; there's no comparison of output quality?
#7 · jinsongzhaocn · 9 days ago
A practical combo recommendation for 24 GB of VRAM:
# LLM model + embedding model, 24 GB VRAM combo config (2026-04-30)
## Qwen-9B 19252MB LLM model
    docker run -d --name vllm-qwen3.5-9b-awq-bf16-int4 --gpus all \
    -p 8100:8000 \
    -e VLLM_USE_MODELSCOPE=True \
    -v /home/tab/docs/vllm_model:/models \
    vllm/vllm-openai:v0.19.0-ubuntu2404 \
    --model /models/cyankiwi/Qwen3___5-9B-AWQ-BF16-INT4 \
    --served-model-name Qwen3-9B \
    --host 0.0.0.0 \
    --port 8000 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --max-model-len auto \
    --max-num-seqs 4 \
    --enable-prefix-caching \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --gpu-memory-utilization 0.80

## bge-m3 (1024-dim) embedding model
    docker run -d \
    --gpus all \
    --name vllm-baai-bge-m3 \
    --ipc=host \
    -p 8101:8000 \
    -v /home/tab/docs/vllm_model:/models \
    -e VLLM_USE_MODELSCOPE=True \
    vllm/vllm-openai:v0.19.0-ubuntu2404 \
    --model /models/BAAI/bge-m3 \
    --served-model-name bge-m3 \
    --gpu-memory-utilization 0.2
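Once those containers are up, both services speak the OpenAI-compatible HTTP API that vLLM exposes. A minimal stdlib-only client sketch, assuming the port (8100) and served model name (Qwen3-9B) from the docker commands above:

```python
import json
import urllib.request

def chat_payload(prompt, model="Qwen3-9B"):
    # minimal OpenAI chat-completions request body
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt, base="http://localhost:8100/v1"):
    req = urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Briefly explain speculative decoding."))
```

The embedding service on port 8101 takes the same shape of request at `/v1/embeddings` with an `"input"` field instead of `"messages"`.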
#8 · jingle · 9 days ago
In short, DFlash uses a dedicated small model to predict several tokens ahead in parallel. When the guesses are right, you get a speedup; when they're wrong (for example, in thinking mode there are several different reasoning paths to pick from, and that's exactly where DFlash tends to fall over), you don't.
DDtree's idea is then to generate along several different paths and pick the best one, so it compensates for DFlash's risk of committing to the wrong path.
Fundamentally both are parallel accelerations of the LLM's autoregressive prediction, and the gains depend strongly on the type of input: not every workload gets faster. In cases with many reasoning paths, like the one above, you burn extra compute for nothing. Personal understanding, for reference only.
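The plain (chain) speculative loop jingle describes can be sketched with toy deterministic "models": the draft proposes k tokens ahead; the target checks them in order, accepts the longest agreeing prefix, and then emits one token of its own. All function names here are hypothetical; real systems accept or reject by comparing token probabilities rather than this greedy matching.

```python
def draft_next(prefix):
    return (prefix[-1] + 1) % 10                 # toy draft: counts upward

def target_next(prefix):
    # toy target: agrees with the draft except right after a 5
    return (prefix[-1] + 1) % 10 if prefix[-1] != 5 else 0

def speculative_step(prefix, k=4):
    """One decode step: return the tokens actually emitted."""
    # 1) draft speculates k tokens ahead
    proposal, p = [], list(prefix)
    for _ in range(k):
        t = draft_next(p)
        proposal.append(t)
        p.append(t)
    # 2) target verifies; accept the longest agreeing prefix
    accepted, p = [], list(prefix)
    for t in proposal:
        if target_next(p) == t:
            accepted.append(t)
            p.append(t)
        else:
            break
    accepted.append(target_next(p))              # target's own bonus token
    return accepted

print(speculative_step([3], k=4))   # -> [4, 5, 0]: two accepted, then correction
print(speculative_step([5], k=4))   # -> [0]: draft wrong immediately
```

This is exactly why the speedup is input-dependent: when draft and target diverge early (many plausible continuations), each step emits barely one token and the draft's work is wasted.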
#9 · stefwoo (OP) · 9 days ago
@jingle Right, anything open-ended and text-heavy does badly. Problems with a single correct solution do much better. Coding is also decent.
#10 · jiaorong · 6 days ago
dflash is junk, completely useless.