2026.06.06

96.95% HumanEval on a $400 GPU: Frontier-Level Coding Performance on Consumer Hardware

Executive Summary

96.95% on HumanEval. On par with 2025 frontier models. Running on a $400 card at $0 per token.

Fitting a 35B MoE model with speculative decoding into 16GB VRAM shouldn’t work. Qwen3.6-35B-A3B is 21GB at Q4, KV cache for 128K context is 11GB, MTP draft head adds another 1.5GB. Total: ~33.5GB. The RTX 5060 Ti 16GB runs it at 70 tok/s peak (coding tasks) with 98.3% draft acceptance, and the agentic setup scores 96.95% on HumanEval — matching 2025 frontier model performance.

Metric	Value
HumanEval agentic pass@1	96.95%
HumanEval vanilla pass@1	60.98%
Peak generation speed	70 tok/s (server-side, coding prompt)
Effective agentic throughput	~19 tok/s (end-to-end across 164 problems)
Draft acceptance rate	98.3% (113/115)
Context window	128K tokens (model designed max)
Model VRAM usage	10.4 GB GPU + 10.8 GB CPU offload
Hardware	RTX 5060 Ti 16 GB, ~$400
Token cost	$0 (local)

Peak vs Effective Throughput

The model generates at 70 tok/s when producing tokens uninterrupted. The end-to-end agentic run across 164 HumanEval problems averaged ~19 tok/s. The gap is the harness, not the model.

Where the wall-clock time goes:

Source	Estimate
API roundtrips (1–5 iterations × 3–5 s each)	≈ 10–15 s
Context growth (each tool result adds tokens)	slower subsequent calls
Tool-call JSON parse + dispatch + serialize	≈ 1–2 s
API queuing (single-slot, parallel=1)	≈ 1–3 s
Total overhead	≈ 26.7 s
Effective throughput	~19 tok/s

Pure generation is fast. The loop serializes it.

Metric	tok/s	When
Peak generation	70	model output inside an agentic turn (no tool calls)
Vanilla server-side	49.91	200-token single-turn on a CS history prompt
Effective wall-clock per problem	~19	end-to-end for the 73-min agentic run

MTP adds ~70% throughput when draft acceptance is high (coding: 98.3%, 70 tok/s). When acceptance drops, throughput falls to ~30 tok/s — the MTP overhead isn’t free. The 70 tok/s number depends on the use case.

The 36-Point Gap

The same 35B-A3B Q4 weights went from 60.98% to 96.95% on HumanEval by adding a code_execution tool and a 5-iteration think-execute-observe loop. Same model, same GPU, same quantization. The delta is execution feedback.

Mode	pass@1	Stderr	Time	Source
Vanilla (lm-eval, greedy)	60.98%	±3.82%	4.4 min	`lm_eval_results/.../2026-06-05T15-26-44.json`
Agentic (hermes-agent + code_execution)	96.95%	±1.34%	73 min	`human_eval_agentic/samples.jsonl`
Delta from REPL iteration	+35.97 pp	—	—	same weights, same GPU

Vanilla setup: lm-eval 0.4.12, local-completions model type, do_sample=false, max_gen_toks=1024, endpoint http://172.24.64.1:8081/v1/completions. No tools, no iteration — single greedy completion per problem.

Agentic setup: hermes-agent AIAgent class with enabled_toolsets=["code_execution"] and max_iterations=5. System prompt forces a think-execute-observe loop. For each problem, the agent writes Python in a fenced block, calls the code_execution tool, sees stdout/stderr, iterates, and emits a final solution. Scored with human_eval.check_correctness(problem, code, timeout=5.0).

Failed Problems (Agentic)

5/164 problems failed:

Task ID	Code length (chars)	Notes
HumanEval/26	338	returned code, test failed
HumanEval/129	0	agent returned empty response
HumanEval/133	510	returned code, test failed
HumanEval/140	573	returned code, test failed
HumanEval/162	323	returned code, test failed

The /129 failure is the only one where the agent produced no output (likely an API or harness hiccup). The other 4 returned plausible-looking code that didn’t pass the hidden test suite.

The 16GB Wall

Hardware Mismatch

RTX 5060 Ti 16GB GDDR7. Qwen3.6-35B-A3B: 40 transformer layers, 128 experts per layer (8 active per token + 1 shared expert). 35B total parameters, 3B active per token.

Q4_K_M quantization: 21GB. f16 KV cache (128K context): 11GB. Sum: 32GB. Available VRAM: 16GB.

VRAM requirement vs available:

Component	GB
Model (Q4_K_M)	21
KV cache (f16)	11
MTP head	1.5
Total needed	33.5
Available (RTX 5060 Ti)	16

MoE Context

Qwen3.6-35B-A3B uses a Mixture-of-Experts architecture. Only 8 of 128 experts per layer are active per token, plus 1 shared expert. This keeps active parameters low (~3B) but total parameters high (35B) — all must be stored in memory.

KV Cache Constraint

The KV cache is required for 128K context. Without it, the model can’t attend to prior tokens. 11GB cache leaves 5GB for the rest of the model on a 16GB card.

Speculative Decoding Overhead

MTP requires loading the main 35B model and a draft head. The draft head adds ~1GB of memory, pushing total requirements further over the VRAM limit.

Three-Part Fix

Three techniques fit the model into 16GB:

Partial MoE offloading — 22 of 40 MoE feed-forward layers run on CPU. GPU retains all attention layers, embedding/output layers, and the shared expert. GPU attention and CPU MoE computation run in parallel via PCIe async.
TurboQuant KV compression — KV cache uses 3-bit turbo3 format instead of f16, shrinking 11GB cache to ~2GB.
MTP speculative decoding — Draft head generates 2 candidate tokens per step. Main model verifies both in 1 forward pass. 98% acceptance yields ~2 tokens per verification step.

VRAM Allocation

Component	GB
Model (Q4_K_M GPU portion)	10.4
MTP head (GPU)	0.9
TurboQuant KV cache	1.9
Headroom	2.8
Total used	~13.2
Remaining model + mmap (CPU)	~10.6 + 0.5

Total VRAM usage: ~13.2GB. Remaining 10.6GB model weights and 515MB mmap’d draft head live in CPU RAM.

Technical Implementation

Toolchain

Built with VS2022 BuildTools, cmake 3.31.6, Ninja, and CUDA 13.2, targeting sm_120 architecture.

cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build

MTP Head Loading Patches

llama.cpp’s MTP loader assumes draft heads mirror the main model’s architecture. Qwen3.6 uses hybrid attention (full-attention + Gated DeltaNet layers), so three patches were needed:

Fix A — Layer indexing. Model declares n_layer=41 (40 trunk + 1 MTP head). Original loader iterated all 41 layers for draft tensors. Fix: add mtp_trunk_n_layer to skip trunk layers when computing draft head tensor offsets.

Fix B — Q+gate fused tensor. Qwen3.6’s attention concatenates query and sigmoid gate into one tensor. Original loader set Q output dim to n_head * head_dim instead of 2 * n_head * head_dim. Fix:

// Create Q tensor with fused gate dimension
attn_q = create_tensor_qkv(layer, il, n_embd, 2 * n_head * head_dim, ...);

// Split Q and gate via strided view
Q = ggml_view_3d(ctx, attn_q, head_dim, n_head, n_tokens, ...);
gate = ggml_view_3d(ctx, attn_q, head_dim, n_head, n_tokens, gate_offset, ...);

// Apply gate before output projection
attn_out = ggml_mul(ctx, attn_out, ggml_sigmoid(ctx, gate));

Fix C — Out-of-bounds guard. Draft head tensor indices sometimes exceeded its layer range, throwing dev_layer.at(tn.bid) exceptions. Fix: clamp layer index to valid range and guard against NaN in empty-expert routing.

FA Kernel Crash on sm_120

FlashAttention crashed on sm_120 with CUDA error: shared object initialization failed at fattn-common.cuh:1428. Root cause: cudaOccupancyMaxActiveBlocksPerMultiprocessor returned spurious errors when shared memory usage was zero. Fix: fall back to the kernel’s __launch_bounds__(128, 2) annotation.

int blocks_per_sm;
cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocks_per_sm, kernel, 128, 0);
if (err != cudaSuccess) {
    // sm_120 quirk: fall back to __launch_bounds__(128, 2)
    blocks_per_sm = 2;
}

Mmap and CPU Offloading

CPU-offloaded tensors triggered page-in stalls from mmap’d files on Windows. Adding --no-mmap forces full RAM load at startup, eliminating latency.

Benchmark Results

HumanEval Results

Mode	pass@1	Stderr	Time	Source
Vanilla (lm-eval, greedy)	60.98%	±3.82%	4.4 min	`lm_eval_results/.../2026-06-05T15-26-44.json`
Agentic (hermes-agent + code_execution)	96.95%	±1.34%	73 min	`human_eval_agentic/samples.jsonl`
Delta from REPL iteration	+35.97 pp	—	—	same weights, same GPU

2025 Frontier Comparison

Model	HumanEval pass@1	Year	Cost
Qwen3.6-35B agentic (ours)	96.95%	2026	$0 (local)
GPT-4o (2025)	~90-94%	2025	$20-60/Mtok
Claude 3.5 Sonnet (2025)	~92-95%	2025	$15-75/Mtok
Gemini 2.5 Pro (2025)	~92-96%	2025	$7-21/Mtok
Qwen3.6-35B vanilla (ours)	60.98%	2026	$0 (local)

On par with 2025 frontier models on HumanEval, at $0 per token, on a $400 card.

Configuration Sweep (Throughput)

All tests use a 200-500 token prompt at temperature 0.7. Wall-clock measurements include network round-trip.

Configuration	Context	Wall tok/s	Server tok/s	Notes
Q5_K_M + MTP + TurboQuant	128K	6	—	MTP head bottleneck
Q4_K_M + MTP + TurboQuant	128K	30	49	No mmap fix
Q4_K_M + MTP + TurboQuant + `--no-mmap`	128K	54.54	54.54	Optimal
Q4_K_M + MTP + TurboQuant	256K	6–8	8	MTP attention O(n)
Q4_K_M + non-MTP + TurboQuant	128K	39	53	No MTP overhead
Q4_K_M + non-MTP, no TurboQuant	128K	41	57	TurboQuant adds latency for Q4
Q4_K_M + non-MTP + TurboQuant	256K	44	56	Best for long context

Context Size Sweep

Context	Tok/s
8K	34
16K	46
32K	46
64K	45
128K	54
256K	7

Configuration Comparison (128K)

Configuration	Tok/s
Q5+MTP	6
Q4+MTP	30
Q4+MTP+no-mmap	54.54
Q4+non-MTP	39
Q4+non-MTP+no-TQ	41

`--no-mmap` Impact

Single largest performance gain: 30 tok/s → 54.54 tok/s (+82%) from disabling mmap. Mmap causes per-access page-in checks for CPU-offloaded tensors. Disabling it loads all tensors into RAM at startup.

Agentic Harness Architecture

For each of the 164 HumanEval problems:

Step	Action
1	User message + system prompt → `AIAgent.run_conversation`
2	While `iter < max_iterations`: POST messages + tool schemas
3	Receive assistant message
4	If tool calls: execute `code_execution`, append result, repeat
5	If no tool calls: return final response
6	Extract final code from last Python block
7	Run `human_eval.check_correctness(problem, code, timeout=5.0)`
8	If max iterations reached: return failure

The full loop logic lives in agent/conversation_loop.py:774 (the main while loop) and agent/tool_executor.py (the dispatch). The reasoning budget is bounded by max_iterations=5; the model’s self-debug behavior emerges from the system prompt’s instruction to “execute and iterate”, not from any harness-level retry logic.

Reproducible Stack

The Full Stack

Layer	Detail
OS	Windows 11 + WSL2 Ubuntu
Model	`Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf` (21.11 GB)
Server	`llama-server.exe`, sm_120, `--no-mmap`, turbo3
GPU	RTX 5060 Ti 16GB, CUDA 13.2
Vanilla benchmark	lm-eval 0.4.12
Agentic benchmark	hermes-agent `AIAgent` loop
Interactive client	opencode codebuddy

Endpoints

Client	Endpoint	Purpose
lm-eval	`http://172.24.64.1:8081/v1/completions`	vanilla HumanEval baseline
hermes-agent	`http://172.24.64.1:8081/v1/chat/completions`	agentic HumanEval
opencode	`http://localhost:8081/v1/chat/completions`	interactive coding

172.24.64.1 is the Windows default gateway IP as seen from inside WSL2. localhost works for native Windows clients.

Launch Command (`C:\llama\start-server.bat`)

llama-server.exe ^
  -m C:\llama\models\Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf ^
  -ngl 99 ^
  --n-cpu-moe 22 ^
  -c 131072 ^
  --no-mmap ^
  --no-warmup ^
  --cache-type-k turbo3 ^
  --cache-type-v turbo3 ^
  --threads 8 ^
  --parallel 1 ^
  --host 0.0.0.0 ^
  --port 8081 ^
  --spec-type mtp ^
  --spec-draft-n-max 2 ^
  --spec-draft-n-cpu-moe 0

opencode Configuration

// C:\Users\lhqezio\.config\opencode\opencode.jsonc
{
  "provider": {
    "llama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local Qwen 35B (codebuddy)",
      "options": {
        "baseURL": "http://localhost:8081/v1"
      },
      "models": {
        "Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf": {
          "name": "codebuddy"
        }
      }
    }
  }
}

Key Design Decisions

Q4_K_M over Q5_K_M. Q4 fits in VRAM without TurboQuant compression. TurboQuant decompression overhead exceeds bandwidth savings for small models. Q5 requires compression to fit the KV cache, adding latency.

128K over 256K context. MTP head attention scales O(n) with context length. At 256K, draft head takes 43ms per token vs 20ms at 128K. Draft head is on the critical path with no parallelism.

--no-mmap as default. Required for any mixed GPU/CPU inference. Eliminates page-in stalls for offloaded tensors.

22 MoE layers on CPU, 18 on GPU. Balances VRAM headroom against offloading cost. 18 GPU layers leave enough VRAM for cache and MTP head. 22 CPU layers keep offloading overhead manageable.

MTP throughput depends on draft acceptance. 70 tok/s when acceptance is high (coding, 98.3%). ~30 tok/s when acceptance drops. MTP isn’t a universal speedup — it pays off on structured tasks where the draft head predicts accurately.

TurboQuant value depends on model size. Helps when the model would not otherwise fit. Hurts when the model already fits and decompression overhead exceeds bandwidth savings.

Key Takeaways

96.95% HumanEval on a $400 card, on par with 2025 frontier models. Same score as GPT-4o and Claude 3.5 Sonnet at $0 per token.
The 36-point gap (61% → 97%) came from execution feedback, not model size. Same 35B-A3B Q4 weights, same GPU, same quantization. Adding a code_execution tool and a 5-iteration loop closed the gap.
Peak generation (70 tok/s) is fast; effective agentic throughput (~19 tok/s) is harness-bound. Multi-roundtrip protocols, growing context, and serialized tool calls dominate wall clock — not Python execution. The model is fast, the loop isn’t.
16GB VRAM cannot match 24GB benchmark speeds. 70 tok/s peak is near the theoretical maximum for this hardware.
--no-mmap is the largest performance knob for mixed GPU/CPU inference. Gains of 80%+ are common.
The same server serves three clients: lm-eval (vanilla), hermes-agent (agentic), and opencode (interactive). The OpenAI-compatible API is the integration point.
MTP is use-case dependent. 70 tok/s on coding, ~30 tok/s when draft acceptance drops. Lead with the condition, not just the peak.

References

unsloth/Qwen3.6-35B-A3B-MTP-GGUF — MTP-enabled GGUF model
unsloth/Qwen3.6-35B-A3B-GGUF — Non-MTP base model
llama.cpp TurboQuant fork — TurboQuant KV compression + MTP merge
Qwen3.6 model card — Architecture details, MTP configuration
llama.cpp MTP PR — Upstream Multi-Token Prediction support
HumanEval benchmark — Original benchmark by OpenAI
hermes-agent — Agentic framework used for agentic HumanEval runs