FME V18.3 — Multi-Model Benchmark Report

Benchmark highlights

90.0%

Best strict accuracy

gpt-4o · claude-opus · claude-sonnet

6.12

Best inclination error

gpt-4o (lower = better)

6 / 9

Zero parse errors

claude, qwen, mimo, gemma, llama

52%

Lowest score

llama-3-8b — insufficient for task

Strict accuracy comparison — all 50 articles (true basis)

Frontier

Mid

Flash

Small

90% target

gpt-4o

90%

90.0%

incl: 6.12

claude-opus-4.6

90.0%

incl: 7.58

claude-sonnet-4.5

90.0%

incl: 8.08

qwen-2.5

88.0%

incl: 11.50

mimo-v2-flash

84.0%

incl: 6.72

deepseek-v3

82.0%

incl: 7.72

gemma4

82.0%

incl: 7.72

minimax-m2.5

60.0%

incl: 7.15

llama-3-8b

52.0%

incl: 13.50

Complete scorecard

⚠ Articles #35, #43, #48 have confirmed ground truth labelling errors — no model can correctly classify these. The true dataset ceiling is 47/50 = 94%. "Excl. GT Errors" column shows accuracy on the 47 correctly labelled articles.

Model	Tier	Reported %	True / 50	Excl. GT Errors	Incl. Error	Parse Errors	Tech. FP	JSON Clean
gpt-4o	Frontier	91.8%	90.0%	95.7%	6.12	1	3	—
claude-opus-4.6	Frontier	90.0%	90.0%	95.7%	7.58	0	4	✓
claude-sonnet-4.5	Frontier	88.0%	90.0%	93.6%	8.08	0	4	✓
qwen-2.5	Mid	88.0%	88.0%	93.6%	11.50	0	4	✓
mimo-v2-flash	Flash	86.0%	84.0%	89.4%	6.72	0	7	✓
deepseek-v3	Mid	85.1%	82.0%	87.2%	7.72	3	4	—
gemma4	Mid	82.0%	82.0%	85.1%	7.72	0	9	✓
minimax-m2.5	Mid	75.0%	60.0%	63.8%	7.15	10	6	✗
llama-3-8b	Small	52.0%	52.0%	55.3%	13.50	0	4	✓

Key findings

🧠 Frontier tier locks in 90%

All three frontier models — gpt-4o, claude-opus-4.6, claude-sonnet-4.5 — hit exactly 90.0% on the true all-50 basis. The prompt is now strong enough that frontier instruction-following is the minimum requirement for target accuracy. Below frontier tier, accuracy falls to 82–88%.

📐 Inclination error splits the field

gpt-4o (6.12) and mimo-v2-flash (6.72) lead on inclination accuracy — 40–45% better than qwen-2.5 (11.50) and llama (13.50). This is the metric users feel most directly. High classification accuracy with poor inclination calibration degrades the real-world experience.

⚡ Flash tier surprise

mimo-v2-flash at 84% is the standout value finding — only 6pp behind the frontier tier at a fraction of the cost and 3–5× the speed. It also has the second-best inclination error (6.72). For cost-sensitive or high-volume use cases, mimo is the clear choice within its tier.

🔴 Technocratic gravity persists

Every model over-predicts Technocratic Governance. Ground truth has 9 Technocratic articles. False positive counts: gemma4: 9, mimo: 7, minimax: 6, claude-opus/sonnet/qwen: 4, gpt-4o/deepseek/llama: 3–4. Articles #43, #48, #49 are wrong in all 9 models — 3 are confirmed GT errors, 1 is genuinely ambiguous.

🚨 minimax-m2.5 reliability issue

minimax-m2.5 reported 75% but true score is only 60% — 10 parse errors inflated the reported metric. The app excluded errored articles from the denominator. With 10/50 articles failing to parse, this model cannot be relied upon for production use regardless of its classification quality on the 40 that did parse.

📊 Dataset ceiling confirmed

Articles #43, #48, #49 are wrong in all 9 models without exception. #43 and #48 are confirmed GT labelling errors. #49 is genuinely ambiguous. The true dataset ceiling is 47/50 = 94%. No prompt change will recover these 3 articles — only a dataset correction will.

Recommended next steps

Correct the 3 GT labelling errors

Articles #35, #43, and #48 have been wrong across all 9 models and 18 prompt versions. Correcting these labels immediately raises the measurable ceiling from 94% to 100% and gives a clean benchmark. This is the highest-leverage action available — takes minutes, not days.

Write V18.4 — fix Technocratic gravity

Every model over-predicts Technocratic. A targeted prompt addition distinguishing "Decentralised vs Technocratic" (structural preference vs expert authority) and "Authoritarian vs Technocratic" (power vs expertise) should recover 2–3 articles across all models, pushing frontier tier to 92–94%.

Fix qwen-2.5 inclination calibration

qwen-2.5 matches frontier classification accuracy (88%) but has the worst inclination error among viable models (11.50 vs 6.12 for gpt-4o). A simple calibration fix in the prompt's inclination anchors — specifically for centre-spectrum ideologies — could close most of this gap without affecting classification.

Choose production model — gpt-4o vs claude-opus

Both hit 90.0% with identical classification accuracy. Decision factors: gpt-4o has slightly better inclination (6.12 vs 7.58) but 1 parse error. claude-opus-4.6 has perfect JSON compliance, zero parse errors, and likely better performance on longer real-world articles. Recommend a real-article benchmark (20–30 published pieces) to decide.

Deploy mimo-v2-flash for batch / cost-sensitive use

84% accuracy at flash-tier cost and speed, with perfect JSON compliance and 6.72 inclination error. If V18.4 partially fixes Technocratic gravity for mimo, it may reach 88%+. Already viable for high-volume classification where cost matters more than the last 6 percentage points.

Replace synthetic dataset with real articles

The 50-article benchmark uses single-sentence synthetic texts written around single ideologies. Real published articles are multi-paragraph, mixed-signal, and deliberately moderate — exactly the challenge described at the start of this project. A real-article test set will expose failure modes invisible in synthetic data and give true production confidence.

Overall verdict

The V18.3 prompt architecture is production-ready for frontier models. Three models independently hit 90% on the same prompt — confirming the architecture is sound and the accuracy gap is now a model selection decision, not a prompt engineering problem.

The single most impactful action remaining is correcting 3 ground truth labels — not writing more prompt versions. After that, V18.4 targeting Technocratic gravity should push the frontier tier to 92–94%, which with GT corrections becomes 95–96% against the true dataset.

For production: claude-opus-4.6 is the recommended primary model — 90% accuracy, zero parse errors, perfect JSON compliance, and likely stronger on the longer real-world articles your users will actually submit. gpt-4o is a viable alternative with slightly better inclination calibration. mimo-v2-flash is ready for cost-sensitive batch processing at 84%.

V18.3 Prompt — Multi-Model Benchmark Report

Correct the 3 GT labelling errors

Write V18.4 — fix Technocratic gravity

Fix qwen-2.5 inclination calibration

Choose production model — gpt-4o vs claude-opus

Deploy mimo-v2-flash for batch / cost-sensitive use

Replace synthetic dataset with real articles

Overall verdict