96.95% HumanEval on a $400 GPU: Frontier-Level Coding Performance on Consumer Hardware
Executive Summary
96.95% on HumanEval. On par with 2025 frontier models. Running on a $400 card at $0 per token.
Fitting a 35B MoE model with speculative decoding into 16GB VRAM shouldn’t work. Qwen3.6-35B-A3B is 21GB at Q4, KV cache for 128K context is 11GB, MTP draft head adds another 1.5GB. Total: ~33.5GB. The RTX 5060 Ti 16GB runs it at 70 tok/s peak (coding tasks) with 98.3% draft acceptance, and the agentic setup scores 96.95% on HumanEval — matching 2025 frontier model performance.
| Metric | Value |
|---|---|
| HumanEval agentic pass@1 | 96.95% |
| HumanEval vanilla pass@1 | 60.98% |
| Peak generation speed | 70 tok/s (server-side, coding prompt) |
| Effective agentic throughput | ~19 tok/s (end-to-end across 164 problems) |
| Draft acceptance rate | 98.3% (113/115) |
| Context window | 128K tokens (model designed max) |
| Model VRAM usage | 10.4 GB GPU + 10.8 GB CPU offload |
| Hardware | RTX 5060 Ti 16 GB, ~$400 |
| Token cost | $0 (local) |
Peak vs Effective Throughput
The model generates at 70 tok/s when producing tokens uninterrupted. The end-to-end agentic run across 164 HumanEval problems averaged ~19 tok/s. The gap is the harness, not the model.
Where the wall-clock time goes:
| Source | Estimate |
|---|---|
| API roundtrips (1–5 iterations × 3–5 s each) | ≈ 10–15 s |
| Context growth (each tool result adds tokens) | slower subsequent calls |
| Tool-call JSON parse + dispatch + serialize | ≈ 1–2 s |
| API queuing (single-slot, parallel=1) | ≈ 1–3 s |
| Total overhead | ≈ 26.7 s |
| Effective throughput | ~19 tok/s |
Pure generation is fast. The loop serializes it.
| Metric | tok/s | When |
|---|---|---|
| Peak generation | 70 | model output inside an agentic turn (no tool calls) |
| Vanilla server-side | 49.91 | 200-token single-turn on a CS history prompt |
| Effective wall-clock per problem | ~19 | end-to-end for the 73-min agentic run |
MTP adds ~70% throughput when draft acceptance is high (coding: 98.3%, 70 tok/s). When acceptance drops, throughput falls to ~30 tok/s — the MTP overhead isn’t free. The 70 tok/s number depends on the use case.
The 36-Point Gap
The same 35B-A3B Q4 weights went from 60.98% to 96.95% on HumanEval by adding a code_execution tool and a 5-iteration think-execute-observe loop. Same model, same GPU, same quantization. The delta is execution feedback.
| Mode | pass@1 | Stderr | Time | Source |
|---|---|---|---|---|
| Vanilla (lm-eval, greedy) | 60.98% | ±3.82% | 4.4 min | lm_eval_results/.../2026-06-05T15-26-44.json |
| Agentic (hermes-agent + code_execution) | 96.95% | ±1.34% | 73 min | human_eval_agentic/samples.jsonl |
| Delta from REPL iteration | +35.97 pp | — | — | same weights, same GPU |
Vanilla setup: lm-eval 0.4.12, local-completions model type, do_sample=false, max_gen_toks=1024, endpoint http://172.24.64.1:8081/v1/completions. No tools, no iteration — single greedy completion per problem.
Agentic setup: hermes-agent AIAgent class with enabled_toolsets=["code_execution"] and max_iterations=5. System prompt forces a think-execute-observe loop. For each problem, the agent writes Python in a fenced block, calls the code_execution tool, sees stdout/stderr, iterates, and emits a final solution. Scored with human_eval.check_correctness(problem, code, timeout=5.0).
Failed Problems (Agentic)
5/164 problems failed:
| Task ID | Code length (chars) | Notes |
|---|---|---|
| HumanEval/26 | 338 | returned code, test failed |
| HumanEval/129 | 0 | agent returned empty response |
| HumanEval/133 | 510 | returned code, test failed |
| HumanEval/140 | 573 | returned code, test failed |
| HumanEval/162 | 323 | returned code, test failed |
The /129 failure is the only one where the agent produced no output (likely an API or harness hiccup). The other 4 returned plausible-looking code that didn’t pass the hidden test suite.
The 16GB Wall
Hardware Mismatch
RTX 5060 Ti 16GB GDDR7. Qwen3.6-35B-A3B: 40 transformer layers, 128 experts per layer (8 active per token + 1 shared expert). 35B total parameters, 3B active per token.
Q4_K_M quantization: 21GB. f16 KV cache (128K context): 11GB. Sum: 32GB. Available VRAM: 16GB.
VRAM requirement vs available:
| Component | GB |
|---|---|
| Model (Q4_K_M) | 21 |
| KV cache (f16) | 11 |
| MTP head | 1.5 |
| Total needed | 33.5 |
| Available (RTX 5060 Ti) | 16 |
MoE Context
Qwen3.6-35B-A3B uses a Mixture-of-Experts architecture. Only 8 of 128 experts per layer are active per token, plus 1 shared expert. This keeps active parameters low (~3B) but total parameters high (35B) — all must be stored in memory.
KV Cache Constraint
The KV cache is required for 128K context. Without it, the model can’t attend to prior tokens. 11GB cache leaves 5GB for the rest of the model on a 16GB card.
Speculative Decoding Overhead
MTP requires loading the main 35B model and a draft head. The draft head adds ~1GB of memory, pushing total requirements further over the VRAM limit.
Three-Part Fix
Three techniques fit the model into 16GB:
- Partial MoE offloading — 22 of 40 MoE feed-forward layers run on CPU. GPU retains all attention layers, embedding/output layers, and the shared expert. GPU attention and CPU MoE computation run in parallel via PCIe async.
- TurboQuant KV compression — KV cache uses 3-bit turbo3 format instead of f16, shrinking 11GB cache to ~2GB.
- MTP speculative decoding — Draft head generates 2 candidate tokens per step. Main model verifies both in 1 forward pass. 98% acceptance yields ~2 tokens per verification step.
VRAM Allocation
| Component | GB |
|---|---|
| Model (Q4_K_M GPU portion) | 10.4 |
| MTP head (GPU) | 0.9 |
| TurboQuant KV cache | 1.9 |
| Headroom | 2.8 |
| Total used | ~13.2 |
| Remaining model + mmap (CPU) | ~10.6 + 0.5 |
Total VRAM usage: ~13.2GB. Remaining 10.6GB model weights and 515MB mmap’d draft head live in CPU RAM.
Technical Implementation
Toolchain
Built with VS2022 BuildTools, cmake 3.31.6, Ninja, and CUDA 13.2, targeting sm_120 architecture.
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build
MTP Head Loading Patches
llama.cpp’s MTP loader assumes draft heads mirror the main model’s architecture. Qwen3.6 uses hybrid attention (full-attention + Gated DeltaNet layers), so three patches were needed:
Fix A — Layer indexing. Model declares n_layer=41 (40 trunk + 1 MTP head). Original loader iterated all 41 layers for draft tensors. Fix: add mtp_trunk_n_layer to skip trunk layers when computing draft head tensor offsets.
Fix B — Q+gate fused tensor. Qwen3.6’s attention concatenates query and sigmoid gate into one tensor. Original loader set Q output dim to n_head * head_dim instead of 2 * n_head * head_dim. Fix:
// Create Q tensor with fused gate dimension
attn_q = create_tensor_qkv(layer, il, n_embd, 2 * n_head * head_dim, ...);
// Split Q and gate via strided view
Q = ggml_view_3d(ctx, attn_q, head_dim, n_head, n_tokens, ...);
gate = ggml_view_3d(ctx, attn_q, head_dim, n_head, n_tokens, gate_offset, ...);
// Apply gate before output projection
attn_out = ggml_mul(ctx, attn_out, ggml_sigmoid(ctx, gate));
Fix C — Out-of-bounds guard. Draft head tensor indices sometimes exceeded its layer range, throwing dev_layer.at(tn.bid) exceptions. Fix: clamp layer index to valid range and guard against NaN in empty-expert routing.
FA Kernel Crash on sm_120
FlashAttention crashed on sm_120 with CUDA error: shared object initialization failed at fattn-common.cuh:1428. Root cause: cudaOccupancyMaxActiveBlocksPerMultiprocessor returned spurious errors when shared memory usage was zero. Fix: fall back to the kernel’s __launch_bounds__(128, 2) annotation.
int blocks_per_sm;
cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&blocks_per_sm, kernel, 128, 0);
if (err != cudaSuccess) {
// sm_120 quirk: fall back to __launch_bounds__(128, 2)
blocks_per_sm = 2;
}
Mmap and CPU Offloading
CPU-offloaded tensors triggered page-in stalls from mmap’d files on Windows. Adding --no-mmap forces full RAM load at startup, eliminating latency.
Benchmark Results
HumanEval Results
| Mode | pass@1 | Stderr | Time | Source |
|---|---|---|---|---|
| Vanilla (lm-eval, greedy) | 60.98% | ±3.82% | 4.4 min | lm_eval_results/.../2026-06-05T15-26-44.json |
| Agentic (hermes-agent + code_execution) | 96.95% | ±1.34% | 73 min | human_eval_agentic/samples.jsonl |
| Delta from REPL iteration | +35.97 pp | — | — | same weights, same GPU |
2025 Frontier Comparison
| Model | HumanEval pass@1 | Year | Cost |
|---|---|---|---|
| Qwen3.6-35B agentic (ours) | 96.95% | 2026 | $0 (local) |
| GPT-4o (2025) | ~90-94% | 2025 | $20-60/Mtok |
| Claude 3.5 Sonnet (2025) | ~92-95% | 2025 | $15-75/Mtok |
| Gemini 2.5 Pro (2025) | ~92-96% | 2025 | $7-21/Mtok |
| Qwen3.6-35B vanilla (ours) | 60.98% | 2026 | $0 (local) |
On par with 2025 frontier models on HumanEval, at $0 per token, on a $400 card.
Configuration Sweep (Throughput)
All tests use a 200-500 token prompt at temperature 0.7. Wall-clock measurements include network round-trip.
| Configuration | Context | Wall tok/s | Server tok/s | Notes |
|---|---|---|---|---|
| Q5_K_M + MTP + TurboQuant | 128K | 6 | — | MTP head bottleneck |
| Q4_K_M + MTP + TurboQuant | 128K | 30 | 49 | No mmap fix |
Q4_K_M + MTP + TurboQuant + --no-mmap | 128K | 54.54 | 54.54 | Optimal |
| Q4_K_M + MTP + TurboQuant | 256K | 6–8 | 8 | MTP attention O(n) |
| Q4_K_M + non-MTP + TurboQuant | 128K | 39 | 53 | No MTP overhead |
| Q4_K_M + non-MTP, no TurboQuant | 128K | 41 | 57 | TurboQuant adds latency for Q4 |
| Q4_K_M + non-MTP + TurboQuant | 256K | 44 | 56 | Best for long context |
Context Size Sweep
| Context | Tok/s |
|---|---|
| 8K | 34 |
| 16K | 46 |
| 32K | 46 |
| 64K | 45 |
| 128K | 54 |
| 256K | 7 |
Configuration Comparison (128K)
| Configuration | Tok/s |
|---|---|
| Q5+MTP | 6 |
| Q4+MTP | 30 |
| Q4+MTP+no-mmap | 54.54 |
| Q4+non-MTP | 39 |
| Q4+non-MTP+no-TQ | 41 |
--no-mmap Impact
Single largest performance gain: 30 tok/s → 54.54 tok/s (+82%) from disabling mmap. Mmap causes per-access page-in checks for CPU-offloaded tensors. Disabling it loads all tensors into RAM at startup.
Agentic Harness Architecture
For each of the 164 HumanEval problems:
| Step | Action |
|---|---|
| 1 | User message + system prompt → AIAgent.run_conversation |
| 2 | While iter < max_iterations: POST messages + tool schemas |
| 3 | Receive assistant message |
| 4 | If tool calls: execute code_execution, append result, repeat |
| 5 | If no tool calls: return final response |
| 6 | Extract final code from last Python block |
| 7 | Run human_eval.check_correctness(problem, code, timeout=5.0) |
| 8 | If max iterations reached: return failure |
The full loop logic lives in agent/conversation_loop.py:774 (the main while loop) and agent/tool_executor.py (the dispatch). The reasoning budget is bounded by max_iterations=5; the model’s self-debug behavior emerges from the system prompt’s instruction to “execute and iterate”, not from any harness-level retry logic.
Reproducible Stack
The Full Stack
| Layer | Detail |
|---|---|
| OS | Windows 11 + WSL2 Ubuntu |
| Model | Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf (21.11 GB) |
| Server | llama-server.exe, sm_120, --no-mmap, turbo3 |
| GPU | RTX 5060 Ti 16GB, CUDA 13.2 |
| Vanilla benchmark | lm-eval 0.4.12 |
| Agentic benchmark | hermes-agent AIAgent loop |
| Interactive client | opencode codebuddy |
Endpoints
| Client | Endpoint | Purpose |
|---|---|---|
| lm-eval | http://172.24.64.1:8081/v1/completions | vanilla HumanEval baseline |
| hermes-agent | http://172.24.64.1:8081/v1/chat/completions | agentic HumanEval |
| opencode | http://localhost:8081/v1/chat/completions | interactive coding |
172.24.64.1 is the Windows default gateway IP as seen from inside WSL2. localhost works for native Windows clients.
Launch Command (C:\llama\start-server.bat)
llama-server.exe ^
-m C:\llama\models\Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf ^
-ngl 99 ^
--n-cpu-moe 22 ^
-c 131072 ^
--no-mmap ^
--no-warmup ^
--cache-type-k turbo3 ^
--cache-type-v turbo3 ^
--threads 8 ^
--parallel 1 ^
--host 0.0.0.0 ^
--port 8081 ^
--spec-type mtp ^
--spec-draft-n-max 2 ^
--spec-draft-n-cpu-moe 0
opencode Configuration
// C:\Users\lhqezio\.config\opencode\opencode.jsonc
{
"provider": {
"llama": {
"npm": "@ai-sdk/openai-compatible",
"name": "Local Qwen 35B (codebuddy)",
"options": {
"baseURL": "http://localhost:8081/v1"
},
"models": {
"Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf": {
"name": "codebuddy"
}
}
}
}
}
Key Design Decisions
Q4_K_M over Q5_K_M. Q4 fits in VRAM without TurboQuant compression. TurboQuant decompression overhead exceeds bandwidth savings for small models. Q5 requires compression to fit the KV cache, adding latency.
128K over 256K context. MTP head attention scales O(n) with context length. At 256K, draft head takes 43ms per token vs 20ms at 128K. Draft head is on the critical path with no parallelism.
--no-mmap as default. Required for any mixed GPU/CPU inference. Eliminates page-in stalls for offloaded tensors.
22 MoE layers on CPU, 18 on GPU. Balances VRAM headroom against offloading cost. 18 GPU layers leave enough VRAM for cache and MTP head. 22 CPU layers keep offloading overhead manageable.
MTP throughput depends on draft acceptance. 70 tok/s when acceptance is high (coding, 98.3%). ~30 tok/s when acceptance drops. MTP isn’t a universal speedup — it pays off on structured tasks where the draft head predicts accurately.
TurboQuant value depends on model size. Helps when the model would not otherwise fit. Hurts when the model already fits and decompression overhead exceeds bandwidth savings.
Key Takeaways
- 96.95% HumanEval on a $400 card, on par with 2025 frontier models. Same score as GPT-4o and Claude 3.5 Sonnet at $0 per token.
- The 36-point gap (61% → 97%) came from execution feedback, not model size. Same 35B-A3B Q4 weights, same GPU, same quantization. Adding a
code_executiontool and a 5-iteration loop closed the gap. - Peak generation (70 tok/s) is fast; effective agentic throughput (~19 tok/s) is harness-bound. Multi-roundtrip protocols, growing context, and serialized tool calls dominate wall clock — not Python execution. The model is fast, the loop isn’t.
- 16GB VRAM cannot match 24GB benchmark speeds. 70 tok/s peak is near the theoretical maximum for this hardware.
--no-mmapis the largest performance knob for mixed GPU/CPU inference. Gains of 80%+ are common.- The same server serves three clients: lm-eval (vanilla), hermes-agent (agentic), and opencode (interactive). The OpenAI-compatible API is the integration point.
- MTP is use-case dependent. 70 tok/s on coding, ~30 tok/s when draft acceptance drops. Lead with the condition, not just the peak.
References
- unsloth/Qwen3.6-35B-A3B-MTP-GGUF — MTP-enabled GGUF model
- unsloth/Qwen3.6-35B-A3B-GGUF — Non-MTP base model
- llama.cpp TurboQuant fork — TurboQuant KV compression + MTP merge
- Qwen3.6 model card — Architecture details, MTP configuration
- llama.cpp MTP PR — Upstream Multi-Token Prediction support
- HumanEval benchmark — Original benchmark by OpenAI
- hermes-agent — Agentic framework used for agentic HumanEval runs