96.95% HumanEval on a $400 GPU: Frontier-Level Coding Performance on Consumer Hardware

Executive Summary

96.95% on HumanEval. On par with 2025 frontier models. Running on a $400 card at $0 per token.

Fitting a 35B MoE model with speculative decoding into 16GB VRAM shouldn’t work. Qwen3.6-35B-A3B is 21GB at Q4, KV cache for 128K context is 11GB, MTP draft head adds another 1.5GB. Total: ~33.5GB. The RTX 5060 Ti 16GB runs it at 70 tok/s peak (coding tasks) with 98.3% draft acceptance, and the agentic setup scores 96.95% on HumanEval — matching 2025 frontier model performance.

MetricValue
HumanEval agentic pass@196.95%
HumanEval vanilla pass@160.98%
Peak generation speed70 tok/s (server-side, coding prompt)
Effective agentic throughput~19 tok/s (end-to-end across 164 problems)
Draft acceptance rate98.3% (113/115)
Context window128K tokens (model designed max)
Model VRAM usage10.4 GB GPU + 10.8 GB CPU offload
HardwareRTX 5060 Ti 16 GB, ~$400
Token cost$0 (local)

Peak vs Effective Throughput

The model generates at 70 tok/s when producing tokens uninterrupted. The end-to-end agentic run across 164 HumanEval problems averaged ~19 tok/s. The gap is the harness, not the model.

Where the wall-clock time goes:

SourceEstimate
API roundtrips (1–5 iterations × 3–5 s each)≈ 10–15 s
Context growth (each tool result adds tokens)slower subsequent calls
Tool-call JSON parse + dispatch + serialize≈ 1–2 s
API queuing (single-slot, parallel=1)≈ 1–3 s
Total overhead≈ 26.7 s
Effective throughput~19 tok/s

Pure generation is fast. The loop serializes it.

Metrictok/sWhen
Peak generation70model output inside an agentic turn (no tool calls)
Vanilla server-side49.91200-token single-turn on a CS history prompt
Effective wall-clock per problem~19end-to-end for the 73-min agentic run

MTP adds ~70% throughput when draft acceptance is high (coding: 98.3%, 70 tok/s). When acceptance drops, throughput falls to ~30 tok/s — the MTP overhead isn’t free. The 70 tok/s number depends on the use case.


The 36-Point Gap

The same 35B-A3B Q4 weights went from 60.98% to 96.95% on HumanEval by adding a code_execution tool and a 5-iteration think-execute-observe loop. Same model, same GPU, same quantization. The delta is execution feedback.

Modepass@1StderrTimeSource
Vanilla (lm-eval, greedy)60.98%±3.82%4.4 minlm_eval_results/.../2026-06-05T15-26-44.json
Agentic (hermes-agent + code_execution)96.95%±1.34%73 minhuman_eval_agentic/samples.jsonl
Delta from REPL iteration+35.97 ppsame weights, same GPU

Vanilla setup: lm-eval 0.4.12, local-completions model type, do_sample=false, max_gen_toks=1024, endpoint http://172.24.64.1:8081/v1/completions. No tools, no iteration — single greedy completion per problem.

Agentic setup: hermes-agent AIAgent class with enabled_toolsets=["code_execution"] and max_iterations=5. System prompt forces a think-execute-observe loop. For each problem, the agent writes Python in a fenced block, calls the code_execution tool, sees stdout/stderr, iterates, and emits a final solution. Scored with human_eval.check_correctness(problem, code, timeout=5.0).

Failed Problems (Agentic)

5/164 problems failed:

Task IDCode length (chars)Notes
HumanEval/26338returned code, test failed
HumanEval/1290agent returned empty response
HumanEval/133510returned code, test failed
HumanEval/140573returned code, test failed
HumanEval/162323returned code, test failed

The /129 failure is the only one where the agent produced no output (likely an API or harness hiccup). The other 4 returned plausible-looking code that didn’t pass the hidden test suite.


The 16GB Wall

Hardware Mismatch

RTX 5060 Ti 16GB GDDR7. Qwen3.6-35B-A3B: 40 transformer layers, 128 experts per layer (8 active per token + 1 shared expert). 35B total parameters, 3B active per token.

Q4_K_M quantization: 21GB. f16 KV cache (128K context): 11GB. Sum: 32GB. Available VRAM: 16GB.

VRAM requirement vs available:

ComponentGB
Model (Q4_K_M)21
KV cache (f16)11
MTP head1.5
Total needed33.5
Available (RTX 5060 Ti)16

MoE Context

Qwen3.6-35B-A3B uses a Mixture-of-Experts architecture. Only 8 of 128 experts per layer are active per token, plus 1 shared expert. This keeps active parameters low (~3B) but total parameters high (35B) — all must be stored in memory.

KV Cache Constraint

The KV cache is required for 128K context. Without it, the model can’t attend to prior tokens. 11GB cache leaves 5GB for the rest of the model on a 16GB card.

Speculative Decoding Overhead

MTP requires loading the main 35B model and a draft head. The draft head adds ~1GB of memory, pushing total requirements further over the VRAM limit.


Three-Part Fix

Three techniques fit the model into 16GB:

  1. Partial MoE offloading — 22 of 40 MoE feed-forward layers run on CPU. GPU retains all attention layers, embedding/output layers, and the shared expert. GPU attention and CPU MoE computation run in parallel via PCIe async.
  2. TurboQuant KV compression — KV cache uses 3-bit turbo3 format instead of f16, shrinking 11GB cache to ~2GB.
  3. MTP speculative decoding — Draft head generates 2 candidate tokens per step. Main model verifies both in 1 forward pass. 98% acceptance yields ~2 tokens per verification step.

VRAM Allocation

ComponentGB
Model (Q4_K_M GPU portion)10.4
MTP head (GPU)0.9
TurboQuant KV cache1.9
Headroom2.8
Total used~13.2
Remaining model + mmap (CPU)~10.6 + 0.5

Total VRAM usage: ~13.2GB. Remaining 10.6GB model weights and 515MB mmap’d draft head live in CPU RAM.


Technical Implementation

Toolchain

Built with VS2022 BuildTools, cmake 3.31.6, Ninja, and CUDA 13.2, targeting sm_120 architecture.

cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build

MTP Head Loading Patches

llama.cpp’s MTP loader assumes draft heads mirror the main model’s architecture. Qwen3.6 uses hybrid attention (full-attention + Gated DeltaNet layers), so three patches were needed:

Fix A — Layer indexing. Model declares n_layer=41 (40 trunk + 1 MTP head). Original loader iterated all 41 layers for draft tensors. Fix: add mtp_trunk_n_layer to skip trunk layers when computing draft head tensor offsets.

Fix B — Q+gate fused tensor. Qwen3.6’s attention concatenates query and sigmoid gate into one tensor. Original loader set Q output dim to n_head * head_dim instead of 2 * n_head * head_dim. Fix:

// Create Q tensor with fused gate dimension
attn_q = create_tensor_qkv(layer, il, n_embd, 2 * n_head * head_dim, ...);

// Split Q and gate via strided view
Q = ggml_view_3d(ctx, attn_q, head_dim, n_head, n_tokens, ...);
gate = ggml_view_3d(ctx, attn_q, head_dim, n_head, n_tokens, gate_offset, ...);

// Apply gate before output projection
attn_out = ggml_mul(ctx, attn_out, ggml_sigmoid(ctx, gate));

Fix C — Out-of-bounds guard. Draft head tensor indices sometimes exceeded its layer range, throwing dev_layer.at(tn.bid) exceptions. Fix: clamp layer index to valid range and guard against NaN in empty-expert routing.

FA Kernel Crash on sm_120

FlashAttention crashed on sm_120 with CUDA error: shared object initialization failed at fattn-common.cuh:1428. Root cause: cudaOccupancyMaxActiveBlocksPerMultiprocessor returned spurious errors when shared memory usage was zero. Fix: fall back to the kernel’s __launch_bounds__(128, 2) annotation.

int blocks_per_sm;
cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocks_per_sm, kernel, 128, 0);
if (err != cudaSuccess) {
    // sm_120 quirk: fall back to __launch_bounds__(128, 2)
    blocks_per_sm = 2;
}

Mmap and CPU Offloading

CPU-offloaded tensors triggered page-in stalls from mmap’d files on Windows. Adding --no-mmap forces full RAM load at startup, eliminating latency.


Benchmark Results

HumanEval Results

Modepass@1StderrTimeSource
Vanilla (lm-eval, greedy)60.98%±3.82%4.4 minlm_eval_results/.../2026-06-05T15-26-44.json
Agentic (hermes-agent + code_execution)96.95%±1.34%73 minhuman_eval_agentic/samples.jsonl
Delta from REPL iteration+35.97 ppsame weights, same GPU

2025 Frontier Comparison

ModelHumanEval pass@1YearCost
Qwen3.6-35B agentic (ours)96.95%2026$0 (local)
GPT-4o (2025)~90-94%2025$20-60/Mtok
Claude 3.5 Sonnet (2025)~92-95%2025$15-75/Mtok
Gemini 2.5 Pro (2025)~92-96%2025$7-21/Mtok
Qwen3.6-35B vanilla (ours)60.98%2026$0 (local)

On par with 2025 frontier models on HumanEval, at $0 per token, on a $400 card.

Configuration Sweep (Throughput)

All tests use a 200-500 token prompt at temperature 0.7. Wall-clock measurements include network round-trip.

ConfigurationContextWall tok/sServer tok/sNotes
Q5_K_M + MTP + TurboQuant128K6MTP head bottleneck
Q4_K_M + MTP + TurboQuant128K3049No mmap fix
Q4_K_M + MTP + TurboQuant + --no-mmap128K54.5454.54Optimal
Q4_K_M + MTP + TurboQuant256K6–88MTP attention O(n)
Q4_K_M + non-MTP + TurboQuant128K3953No MTP overhead
Q4_K_M + non-MTP, no TurboQuant128K4157TurboQuant adds latency for Q4
Q4_K_M + non-MTP + TurboQuant256K4456Best for long context

Context Size Sweep

ContextTok/s
8K34
16K46
32K46
64K45
128K54
256K7

Configuration Comparison (128K)

ConfigurationTok/s
Q5+MTP6
Q4+MTP30
Q4+MTP+no-mmap54.54
Q4+non-MTP39
Q4+non-MTP+no-TQ41

--no-mmap Impact

Single largest performance gain: 30 tok/s → 54.54 tok/s (+82%) from disabling mmap. Mmap causes per-access page-in checks for CPU-offloaded tensors. Disabling it loads all tensors into RAM at startup.


Agentic Harness Architecture

For each of the 164 HumanEval problems:

StepAction
1User message + system prompt → AIAgent.run_conversation
2While iter < max_iterations: POST messages + tool schemas
3Receive assistant message
4If tool calls: execute code_execution, append result, repeat
5If no tool calls: return final response
6Extract final code from last Python block
7Run human_eval.check_correctness(problem, code, timeout=5.0)
8If max iterations reached: return failure

The full loop logic lives in agent/conversation_loop.py:774 (the main while loop) and agent/tool_executor.py (the dispatch). The reasoning budget is bounded by max_iterations=5; the model’s self-debug behavior emerges from the system prompt’s instruction to “execute and iterate”, not from any harness-level retry logic.


Reproducible Stack

The Full Stack

LayerDetail
OSWindows 11 + WSL2 Ubuntu
ModelQwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf (21.11 GB)
Serverllama-server.exe, sm_120, --no-mmap, turbo3
GPURTX 5060 Ti 16GB, CUDA 13.2
Vanilla benchmarklm-eval 0.4.12
Agentic benchmarkhermes-agent AIAgent loop
Interactive clientopencode codebuddy

Endpoints

ClientEndpointPurpose
lm-evalhttp://172.24.64.1:8081/v1/completionsvanilla HumanEval baseline
hermes-agenthttp://172.24.64.1:8081/v1/chat/completionsagentic HumanEval
opencodehttp://localhost:8081/v1/chat/completionsinteractive coding

172.24.64.1 is the Windows default gateway IP as seen from inside WSL2. localhost works for native Windows clients.

Launch Command (C:\llama\start-server.bat)

llama-server.exe ^
  -m C:\llama\models\Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf ^
  -ngl 99 ^
  --n-cpu-moe 22 ^
  -c 131072 ^
  --no-mmap ^
  --no-warmup ^
  --cache-type-k turbo3 ^
  --cache-type-v turbo3 ^
  --threads 8 ^
  --parallel 1 ^
  --host 0.0.0.0 ^
  --port 8081 ^
  --spec-type mtp ^
  --spec-draft-n-max 2 ^
  --spec-draft-n-cpu-moe 0

opencode Configuration

// C:\Users\lhqezio\.config\opencode\opencode.jsonc
{
  "provider": {
    "llama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local Qwen 35B (codebuddy)",
      "options": {
        "baseURL": "http://localhost:8081/v1"
      },
      "models": {
        "Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf": {
          "name": "codebuddy"
        }
      }
    }
  }
}

Key Design Decisions

Q4_K_M over Q5_K_M. Q4 fits in VRAM without TurboQuant compression. TurboQuant decompression overhead exceeds bandwidth savings for small models. Q5 requires compression to fit the KV cache, adding latency.

128K over 256K context. MTP head attention scales O(n) with context length. At 256K, draft head takes 43ms per token vs 20ms at 128K. Draft head is on the critical path with no parallelism.

--no-mmap as default. Required for any mixed GPU/CPU inference. Eliminates page-in stalls for offloaded tensors.

22 MoE layers on CPU, 18 on GPU. Balances VRAM headroom against offloading cost. 18 GPU layers leave enough VRAM for cache and MTP head. 22 CPU layers keep offloading overhead manageable.

MTP throughput depends on draft acceptance. 70 tok/s when acceptance is high (coding, 98.3%). ~30 tok/s when acceptance drops. MTP isn’t a universal speedup — it pays off on structured tasks where the draft head predicts accurately.

TurboQuant value depends on model size. Helps when the model would not otherwise fit. Hurts when the model already fits and decompression overhead exceeds bandwidth savings.


Key Takeaways

  • 96.95% HumanEval on a $400 card, on par with 2025 frontier models. Same score as GPT-4o and Claude 3.5 Sonnet at $0 per token.
  • The 36-point gap (61% → 97%) came from execution feedback, not model size. Same 35B-A3B Q4 weights, same GPU, same quantization. Adding a code_execution tool and a 5-iteration loop closed the gap.
  • Peak generation (70 tok/s) is fast; effective agentic throughput (~19 tok/s) is harness-bound. Multi-roundtrip protocols, growing context, and serialized tool calls dominate wall clock — not Python execution. The model is fast, the loop isn’t.
  • 16GB VRAM cannot match 24GB benchmark speeds. 70 tok/s peak is near the theoretical maximum for this hardware.
  • --no-mmap is the largest performance knob for mixed GPU/CPU inference. Gains of 80%+ are common.
  • The same server serves three clients: lm-eval (vanilla), hermes-agent (agentic), and opencode (interactive). The OpenAI-compatible API is the integration point.
  • MTP is use-case dependent. 70 tok/s on coding, ~30 tok/s when draft acceptance drops. Lead with the condition, not just the peak.

References