FME V18.4 — Multi-Model Benchmark Report

Benchmark highlights

Best strict accuracy: 86.0% (mimo-v2-flash · $0.020)

gpt-4o regression vs V18.3: −10pp (90% → 80% · Technocratic over-correction)

Cheapest viable model: $0.020 (mimo-v2-flash · 7 min · 0 parse errors)

mimo-v2-pro parse errors: 18 (true score 52%, not viable)

Strict accuracy — all 50 articles (true basis)

[Bar chart: true strict accuracy per model, plotted against the 90% target line. Tier legend: Flash, Frontier (Qwen), Mid (DeepSeek), Mid (Qwen), Frontier (OpenAI), Frontier (Claude).]

Model	True accuracy	Cost	Time
mimo-v2-flash	86.0%	$0.020	7 min
qwen3-235b	84.0%	$0.029	33 min
deepseek-v3.2	82.0%	$0.064	17 min
qwen2.5-72b	82.0%	$0.033	16 min
gpt-4o	80.0%	$0.504	4 min
qwen3.6-plus	74.0%	$0.182	55 min
claude-haiku-4.5	72.0%	$0.400	10 min
mimo-v2-pro ⚠	52.0% (true)	$0.110	11 min

Complete scorecard

⚠ The V18.4 prompt over-corrected Technocratic gravity, causing regressions vs V18.3 on gpt-4o (−10pp) and qwen2.5-72b (−6.0pp on the true basis, 88.0% → 82.0%; the −4.3pp figure compares against its reported 83.7%). The new BLOCK rules pushed misclassifications from Technocratic into Neoliberal Capitalism rather than removing them. Articles #31 and #38 (Democratic Socialism) are now wrong in all 8 models. See Key findings below.

mimo-v2-pro reported 83.9% but had 18 parse errors — the app excluded them from the denominator. True score is 52%. This model is not viable for production use.
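The gap between a reported and a true figure comes down to the denominator. The following is a minimal sketch of the two calculations, assuming 26 correct answers for mimo-v2-pro (inferred from 52% of 50, not stated in the report) and hypothetical function names. The app's exact exclusion logic is not described here, so the lenient figure is illustrative and is not expected to reproduce the reported 83.9% exactly.

```python
def strict_accuracy(correct: int, total: int) -> float:
    """Accuracy over every article; parse errors count as wrong."""
    return 100.0 * correct / total


def lenient_accuracy(correct: int, total: int, excluded: int) -> float:
    """Accuracy after dropping `excluded` unparseable outputs from the denominator."""
    scored = total - excluded
    if scored <= 0:
        raise ValueError("no scored articles left")
    return 100.0 * correct / scored


# mimo-v2-pro: 52% of 50 articles implies 26 correct (an inference).
print(strict_accuracy(26, 50))        # true basis, all 50 articles
print(lenient_accuracy(26, 50, 18))   # 18 parse errors dropped from the denominator
```

Reporting the strict figure alongside the parse-error count makes an inflated lenient score easy to spot.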

Model	Tier	Reported	True /50	Excl. GT errors	Inclination err.	Parse err.	Technocratic FP	Cost / batch	Time	$/Accuracy pt	vs V18.3
mimo-v2-flash	Flash	86.0%	86.0%	91.5%	8.58	0	3	$0.020	7 min	$0.023	+2.0pp
qwen3-235b	Frontier	84.0%	84.0%	89.4%	7.80	0	3	$0.029	33 min	$0.035	new
deepseek-v3.2	Mid	82.0%	82.0%	87.2%	13.70	2	3	$0.064	17 min	$0.078	0.0pp
qwen2.5-72b	Mid	83.7%	82.0%	87.2%	8.57	2	3	$0.033	16 min	$0.040	−6.0pp (−4.3pp vs reported)
gpt-4o	Frontier	80.0%	80.0%	85.1%	8.08	0	3	$0.504	4 min	$0.630	−10.0pp ⚠
qwen3.6-plus	Mid	81.3%	74.0%	78.7%	7.48	11	1	$0.182	55 min	$0.246	new
claude-haiku-4.5	Frontier	76.0%	72.0%	74.5%	9.90	5	3	$0.400	10 min	$0.556	new
mimo-v2-pro ⚠	Mid	83.9%	52.0%	55.3%	7.77	18 ⚠	1	$0.110	11 min	$0.212	new
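The $/Accuracy pt column is consistent with cost per batch divided by true accuracy expressed as a fraction. That reading is inferred from the figures rather than taken from a stated definition. A short sketch that recomputes the column from the Cost and True /50 values (the function name is hypothetical):

```python
def cost_per_accuracy(cost_usd: float, true_accuracy_pct: float) -> float:
    """Batch cost divided by true accuracy as a fraction, rounded to 3 places."""
    return round(cost_usd / (true_accuracy_pct / 100.0), 3)


# (cost per batch in USD, true accuracy in %) copied from the scorecard
scorecard = {
    "mimo-v2-flash": (0.020, 86.0),
    "qwen3-235b": (0.029, 84.0),
    "deepseek-v3.2": (0.064, 82.0),
    "qwen2.5-72b": (0.033, 82.0),
    "gpt-4o": (0.504, 80.0),
    "qwen3.6-plus": (0.182, 74.0),
    "claude-haiku-4.5": (0.400, 72.0),
    "mimo-v2-pro": (0.110, 52.0),
}

for model, (cost, acc) in scorecard.items():
    print(f"{model:18s} ${cost_per_accuracy(cost, acc):.3f}")
```

Using the true accuracy rather than the reported one keeps models with many parse errors from looking cheaper per unit of accuracy than they are.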

V18.3 → V18.4 regression analysis (models tested in both)

Model	V18.3 True	V18.4 True	Delta	V18.3 Incl	V18.4 Incl	Incl Delta	Root cause
gpt-4o	90.0%	80.0%	−10.0pp	6.12	8.08	+1.96	Technocratic BLOCK rules over-fired → 3 new DemSoc→Neoliberal errors, 2 new Technocratic misses
qwen2.5-72b	88.0%	82.0%	−6.0pp	8.50	8.57	+0.07	Same BLOCK over-correction. Inclination calibration section had minimal effect. (A −4.3pp figure results from comparing against the reported 83.7% rather than the true 82.0%.)
mimo-v2-flash	84.0%	86.0%	+2.0pp	6.72	8.58	+1.86	Classification improved +2pp (Technocratic fixes helped mimo). Inclination slightly worse.
deepseek-v3	82.0%	82.0%	0.0pp	7.72	13.70	+5.98 ⚠	Classification unchanged. Inclination severely degraded — calibration section backfired on DeepSeek.
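The deltas in this table can be recomputed directly from the True and Incl columns. A minimal sketch, with a hypothetical helper name and the values copied from the rows above. On the true basis qwen2.5-72b comes out at −6.0pp, whereas a −4.3pp figure matches 88.0% minus its reported 83.7%.

```python
# (V18.3 true %, V18.4 true %, V18.3 inclination error, V18.4 inclination error)
results = {
    "gpt-4o": (90.0, 80.0, 6.12, 8.08),
    "qwen2.5-72b": (88.0, 82.0, 8.50, 8.57),
    "mimo-v2-flash": (84.0, 86.0, 6.72, 8.58),
    "deepseek-v3": (82.0, 82.0, 7.72, 13.70),
}


def deltas(row: tuple[float, float, float, float]) -> tuple[float, float]:
    """Return (accuracy delta in pp, inclination-error delta), V18.4 minus V18.3."""
    acc_old, acc_new, incl_old, incl_new = row
    return round(acc_new - acc_old, 1), round(incl_new - incl_old, 2)


for model, row in results.items():
    acc_delta, incl_delta = deltas(row)
    print(f"{model:14s} accuracy {acc_delta:+.1f}pp  inclination {incl_delta:+.2f}")
```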

Cost vs accuracy — value analysis

Rank	Model	Accuracy	Cost / batch	Time	$/accuracy pt	Verdict	Notes
1	mimo-v2-flash	86.0%	$0.020	7 min	$0.023	Best value	Best accuracy at lowest cost and fast turnaround
2	qwen3-235b	84.0%	$0.029	33 min	$0.035	Good	Excellent value: slow but cheap, zero parse errors
3	qwen2.5-72b	82.0%	$0.033	16 min	$0.040	Good	Primary target model; regressed on V18.4 but fixable
4	deepseek-v3.2	82.0%	$0.064	17 min	$0.078	Caution	Good classification, but inclination error badly degraded (+6 pts)
5	qwen3.6-plus	74.0%	$0.182	55 min	$0.246	Poor	Slowest, expensive, 11 Ambiguous outputs; not viable
6	mimo-v2-pro	52.0%	$0.110	11 min	$0.212	Disqualified	18 parse errors; reported 83.9% is misleading
7	claude-haiku-4.5	72.0%	$0.400	10 min	$0.556	Poor	Expensive, 5 Ambiguous outputs, below V18.3 expectations
8	gpt-4o	80.0%	$0.504	4 min	$0.630	Regressed	Was 90% on V18.3; the V18.4 prompt broke it. Fastest but most expensive.

Key findings

🚨 V18.4 prompt over-corrected — caused new failures

The Technocratic Gravity Correction section added aggressive BLOCK rules that now fire on Democratic Socialism articles containing economic language. Articles #31 and #38 are now wrong in all 8 models — they were correct in V18.3 for most models. The DemSoc→Neoliberal confusion has replaced the Technocratic gravity problem.

⚡ mimo-v2-flash is the clear winner

86% accuracy at $0.020 per batch — the best cost-per-accuracy-point of any model tested ($0.023). Zero parse errors, 7-minute turnaround, and +2pp improvement over V18.3. The Technocratic fixes that hurt frontier models actually helped mimo. Primary target model for cost-sensitive production use.

⚠ mimo-v2-pro is a reliability trap

Reports 83.9% but true score is 52% — 18 parse errors excluded from denominator. The reported metric is misleading at this error rate. mimo-v2-flash outperforms mimo-v2-pro at 1/5 the cost and with perfect reliability.

📊 Inclination calibration backfired on DeepSeek

The Qwen-specific inclination calibration section in V18.4 caused DeepSeek's inclination error to jump from 7.72 → 13.70 (+6 points). The calibration instructions were applied by DeepSeek but in the wrong direction. Model-specific calibration sections must be isolated — or moved to model-specific prompt variants rather than a shared prompt.
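One way to keep calibration text away from models it was not written for is to assemble the prompt per model. The sketch below is only an illustration of that idea: the placeholder strings, the `CALIBRATION` mapping, and `build_prompt` are all hypothetical and do not reflect how the V18.4 prompt is actually stored or assembled.

```python
BASE_PROMPT = "<shared classification instructions>"  # placeholder text

# Hypothetical calibration sections keyed by model-family prefix.
CALIBRATION = {
    "qwen": "<Qwen-specific inclination calibration>",
}


def build_prompt(model: str) -> str:
    """Return the shared prompt plus calibration only for the matching family."""
    sections = [BASE_PROMPT]
    for prefix, section in CALIBRATION.items():
        if model.lower().startswith(prefix):
            sections.append(section)
    return "\n\n".join(sections)


print(build_prompt("qwen2.5-72b"))    # shared prompt + Qwen calibration
print(build_prompt("deepseek-v3.2"))  # shared prompt only
```

With this layout a calibration change for one family cannot alter the prompt any other model receives.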

Universal failures — wrong in all 8 models

#	Article	Ground Truth	Most common prediction	Status	Action
31	Public debates continue around welfare policies, centered on fairness and economic sustainability.	Democratic Socialism	Neoliberal Capitalism (4×) · Technocratic (2×)	New in V18.4	V18.4 BLOCK rule over-fired. Fix DemSoc prompt.
38	Economic inequality remains a major concern, with differing opinions on redistribution strategies.	Democratic Socialism	Neoliberal Capitalism (6×)	New in V18.4	"Differing opinions" triggers neutral → Neoliberal. Fix needed.
43	Policy discussions increasingly focus on balancing individual freedom with collective responsibility.	Libertarianism	Technocratic Governance (6×)	GT Error	Confirmed GT labelling error — correct the label.
48	Experts highlight the importance of evidence-based policy, while acknowledging limitations.	Nationalist Conservatism	Technocratic Governance (7×)	GT Error	Confirmed GT labelling error — correct the label.
49	Public opinion is divided on the role of government in economic and social life.	Democratic Socialism	Technocratic (4×) · Neoliberal (1×)	Ambiguous	Neutral observation — GT label may need review.
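A list like the one above can be derived mechanically from per-model predictions. The following is a minimal sketch under assumed data shapes (a ground-truth mapping and one prediction mapping per model); the function name and the toy data are hypothetical, not the benchmark's real outputs.

```python
def universal_failures(
    ground_truth: dict[int, str],
    predictions: dict[str, dict[int, str]],
) -> list[int]:
    """Return ids of articles that no model labelled correctly.

    A missing prediction (for example a parse error) counts as wrong.
    """
    return sorted(
        article_id
        for article_id, label in ground_truth.items()
        if all(preds.get(article_id) != label for preds in predictions.values())
    )


# Toy data for illustration only.
ground_truth = {
    1: "Libertarianism",
    2: "Democratic Socialism",
    3: "Technocratic Governance",
}
predictions = {
    "model-a": {1: "Libertarianism", 2: "Neoliberal Capitalism", 3: "Technocratic Governance"},
    "model-b": {1: "Technocratic Governance", 2: "Neoliberal Capitalism"},
}

print(universal_failures(ground_truth, predictions))
```

Articles that surface here across several prompt versions are the natural candidates for a ground-truth review.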

Recommended next steps

🔴 Priority 1 — Write V18.5 (revert over-correction)

The V18.4 DemSoc BLOCK rules are too aggressive. Specifically: the rule treating "differing opinions on redistribution" as neutral observation fires correctly in some cases but incorrectly blocks DemSoc labelling for articles where redistribution is the subject. V18.5 must soften the DemSoc BLOCK — "welfare or redistribution as the topic of debate" should score DemSoc moderately (50–65), not block it entirely. Also remove or weaken the Qwen-specific calibration section — replace with a universal inclination table that works across all models.

🟡 Priority 2 — Correct GT labels #43 and #48

These two articles (#43: "balancing freedom with collective responsibility", labelled Libertarianism; #48: "evidence-based policy experts", labelled Nationalist Conservatism) are wrong in all 8 models across all 18 prompt versions. Correcting the labels is worth up to 2 additional correct answers per run (+4pp on the 50-article set) for every model whose prediction matches the corrected label, with no prompt changes required. Do this before running V18.5.

🟢 Priority 3 — Lock in mimo-v2-flash as primary

With V18.4's 86% (up from 84% on V18.3, and trending upward as the prompt matures), mimo-v2-flash is the optimal production model for cost and reliability. At $0.020 per 50-article batch with zero parse errors, it outperforms every alternative on value. Qwen2.5-72b remains the secondary target: once V18.5 fixes the regression, it should return to 88%+.

Overall verdict

V18.4 is a step backward for the larger models, particularly gpt-4o (−10pp) and qwen2.5-72b (−6.0pp on the true basis; −4.3pp against its reported 83.7%). The cause is over-aggressive Technocratic BLOCK rules, which redirected misclassifications from Technocratic into Neoliberal Capitalism rather than eliminating them. The inclination calibration section also backfired on DeepSeek (inclination error up 5.98, from 7.72 to 13.70).

The positive news: mimo-v2-flash improved +2pp to 86% and remains the standout value model, with the best accuracy, the lowest cost ($0.020), a fast turnaround (7 min, second only to gpt-4o's 4 min), and zero parse errors. The Technocratic fixes that hurt larger models actually helped mimo.

The path forward is clear: V18.5 must soften the DemSoc BLOCK (articles about welfare/redistribution debates should score DemSoc, not neutral), remove model-specific calibration from the shared prompt, and correct the 2 confirmed GT errors. With those fixes, mimo-v2-flash should reach 88%+ and qwen2.5-72b should recover to 90%+ — hitting the production target.
