Saw a project on GitHub: 27B on a 3090 at 129 tps, up to 207 tps - V2EX


stefwoo · 10 days ago · 3263 views

    https://github.com/Luce-Org/lucebox-hub

DFlash DDtree: Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090
First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22.

- Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×)
- 129.5 tok/s mean on the HumanEval 10-prompt bench
- 3.43× faster than autoregressive (+15% over chain speculative decoding)
- 2.8× faster than SGLang AWQ on the same hardware
- Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072)

PFlash speculative prefill on RTX 3090. In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime: just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd), plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward.
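To make the "score → select" idea concrete, here is a minimal toy sketch of that stage: given per-token importance scores from a drafter, keep the top keep_ratio fraction and merge the kept token indices into contiguous spans for the target to prefill. The real kernels compute scores from attention statistics (mean_K); here the scores are just random numbers, and `select_spans` is a hypothetical name — only the selection logic is illustrated.

```python
import random

def select_spans(scores, keep_ratio=0.05):
    """Return contiguous (start, end) spans covering the top-scoring tokens."""
    n = len(scores)
    k = max(1, int(n * keep_ratio))
    # indices of the k highest-scoring tokens, back in document order
    kept = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    # merge adjacent indices into half-open [start, end) spans
    spans = []
    start = prev = kept[0]
    for i in kept[1:]:
        if i == prev + 1:
            prev = i
        else:
            spans.append((start, prev + 1))
            start = prev = i
    spans.append((start, prev + 1))
    return spans

random.seed(0)
scores = [random.random() for _ in range(1000)]   # stand-in drafter scores
spans = select_spans(scores, keep_ratio=0.05)
kept_tokens = sum(e - s for s, e in spans)
print(kept_tokens)  # 5% of 1000 tokens -> 50
```

The target model would then run its prefill only over these spans, which is where the ~10× TTFT numbers below come from.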

- ~10.4× TTFT at 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV)
- 10.0× TTFT at 64K context: 13.5 s dflash vs 134.95 s llama.cpp
- NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85

    10 replies    2026-05-06 21:21:27 +08:00
#1 · stefwoo (OP) · 10 days ago
To squeeze the big model and the small draft model into 24 GB of VRAM together: the target is 4-bit quantized (~16 GB), the draft stays BF16 (~1.2 GB), and the KV cache is quantized.
During prefill, the small draft model races through the long prompt and picks out only the most important 5% of spans; the big model then does a sparse prefill over just that 5%, skipping the other 95% of irrelevant content.
Then comes generation: the draft model speculates several candidate tokens at once, and the big model verifies the whole tree in a single pass with tree attention, giving fast token-by-token decoding.
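As a toy illustration of that tree-verification step (my reading of the DDtree idea from this thread, not the project's actual code): the draft proposes a small tree of candidate continuations, and the target checks every branch, keeping the deepest path whose tokens all match the target's own greedy choices. Both "models" below are toy deterministic functions, and all names are hypothetical; a real implementation verifies the whole tree in one batched forward and uses probabilistic acceptance rather than greedy matching.

```python
def target_next(prefix):
    # toy target model: next token = (sum of prefix) % 10
    return sum(prefix) % 10

def verify_tree(prefix, tree):
    """tree: {token: subtree}. Return the longest accepted token path."""
    best = []
    for token, subtree in tree.items():
        if token == target_next(prefix):          # target agrees with the draft
            path = [token] + verify_tree(prefix + [token], subtree)
            if len(path) > len(best):
                best = path
    return best

# draft speculated two branches from prefix [3, 4]: 7 -> {4, 5}, and 8
prefix = [3, 4]
tree = {7: {4: {}, 5: {}}, 8: {}}
accepted = verify_tree(prefix, tree)
print(accepted)  # -> [7, 4]: branch 7 then 4 match; 8 and 5 are rejected
```

Every accepted token on the winning path costs only one target forward pass for the whole tree, which is where the speedup over plain autoregressive decoding comes from.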
#2 · strobber16 · 10 days ago
This sent me back to rewatch bycloud's video on speculative decoding. And I found I still don't understand it.
#3 · Hermitist · 10 days ago
It's all CUDA; too bad I can't test it on my MacBook Air.
#4 · beyondstars · 10 days ago
Are you sure quantization this aggressive doesn't hurt the model's actual performance?
#5 · 940i3s34v4F1HW41 (PRO) · 10 days ago via iPhone · 2
27B models are just dumb.
#6 · Ironpan · 9 days ago
From what I see, the benchmarks only compare efficiency; there's no comparison of output quality?
#7 · jinsongzhaocn · 9 days ago
A practical combo recommendation for 24 GB of VRAM:
# LLM model + embedding model, 24 GB VRAM combo config (2026-04-30)
## Qwen-9B 19252MB LLM model
    docker run -d --name vllm-qwen3.5-9b-awq-bf16-int4 --gpus all \
    -p 8100:8000 \
    -e VLLM_USE_MODELSCOPE=True \
    -v /home/tab/docs/vllm_model:/models \
    vllm/vllm-openai:v0.19.0-ubuntu2404 \
    --model /models/cyankiwi/Qwen3___5-9B-AWQ-BF16-INT4 \
    --served-model-name Qwen3-9B \
    --host 0.0.0.0 \
    --port 8000 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --max-model-len auto \
    --max-num-seqs 4 \
    --enable-prefix-caching \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --gpu-memory-utilization 0.80

## bge-m3 (1024-dim) embedding model
    docker run -d \
    --gpus all \
    --name vllm-baai-bge-m3 \
    --ipc=host \
    -p 8101:8000 \
    -v /home/tab/docs/vllm_model:/models \
    -e VLLM_USE_MODELSCOPE=True \
    vllm/vllm-openai:v0.19.0-ubuntu2404 \
    --model /models/BAAI/bge-m3 \
    --served-model-name bge-m3 \
    --gpu-memory-utilization 0.2
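Once those containers are up, both services speak the OpenAI-compatible HTTP API that vLLM exposes. A minimal stdlib-only client sketch, assuming the port (8100) and served model name (Qwen3-9B) from the docker commands above:

```python
import json
import urllib.request

def chat_payload(prompt, model="Qwen3-9B"):
    # minimal OpenAI chat-completions request body
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt, base="http://localhost:8100/v1"):
    req = urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Briefly explain speculative decoding."))
```

The embedding service on port 8101 takes the same shape of request at `/v1/embeddings` with an `"input"` field instead of `"messages"`.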
#8 · jingle · 9 days ago
In short, DFlash uses a dedicated small model to predict several tokens ahead in parallel. When the guesses are right, you get a speedup; when they're wrong (for example, in thinking mode there are several different reasoning paths to pick from, and that's exactly where DFlash tends to fall over), you don't.
DDtree's idea is then to generate along several different paths and pick the best one, so it compensates for DFlash's risk of committing to the wrong path.
Fundamentally both are parallel accelerations of the LLM's autoregressive prediction, and the gains depend strongly on the type of input: not every workload gets faster. In cases with many reasoning paths, like the one above, you burn extra compute for nothing. Personal understanding, for reference only.
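The plain (chain) speculative loop jingle describes can be sketched with toy deterministic "models": the draft proposes k tokens ahead; the target checks them in order, accepts the longest agreeing prefix, and then emits one token of its own. All function names here are hypothetical; real systems accept or reject by comparing token probabilities rather than this greedy matching.

```python
def draft_next(prefix):
    return (prefix[-1] + 1) % 10                 # toy draft: counts upward

def target_next(prefix):
    # toy target: agrees with the draft except right after a 5
    return (prefix[-1] + 1) % 10 if prefix[-1] != 5 else 0

def speculative_step(prefix, k=4):
    """One decode step: return the tokens actually emitted."""
    # 1) draft speculates k tokens ahead
    proposal, p = [], list(prefix)
    for _ in range(k):
        t = draft_next(p)
        proposal.append(t)
        p.append(t)
    # 2) target verifies; accept the longest agreeing prefix
    accepted, p = [], list(prefix)
    for t in proposal:
        if target_next(p) == t:
            accepted.append(t)
            p.append(t)
        else:
            break
    accepted.append(target_next(p))              # target's own bonus token
    return accepted

print(speculative_step([3], k=4))   # -> [4, 5, 0]: two accepted, then correction
print(speculative_step([5], k=4))   # -> [0]: draft wrong immediately
```

This is exactly why the speedup is input-dependent: when draft and target diverge early (many plausible continuations), each step emits barely one token and the draft's work is wasted.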
#9 · stefwoo (OP) · 9 days ago
@jingle Right, anything open-ended and text-heavy does badly. Problems with a single correct solution do much better. Coding is also decent.
#10 · jiaorong · 6 days ago
dflash is junk, completely useless.