Rhetoric Audit · FME Benchmarking Suite

V18.4 Prompt — Multi-Model Benchmark Report

8 models tested · 50 articles · 10 ideology classes · Temperature 0 · April 10, 2026
⚠ V18.4 regressed on frontier models vs V18.3 · mimo-v2-flash leads at 86% (best value) · 3 confirmed GT errors in dataset · mimo-v2-pro: 18 parse errors (disqualified)
Benchmark highlights
Best strict accuracy: 86.0% (mimo-v2-flash · $0.020)
gpt-4o regression vs V18.3: −10pp (90% → 80% · Technocratic over-correction)
Cheapest viable model: $0.020 (mimo-v2-flash · 7 min · 0 parse errors)
mimo-v2-pro parse errors: 18 (true score 52%, not viable)
Strict accuracy — all 50 articles (true basis)
[Bar chart: strict accuracy per model against the 90% target line. Legend: Flash, Frontier (Qwen), Mid (DeepSeek), Mid (Qwen), Frontier (OpenAI), Frontier (Claude)]

| Model | Strict accuracy | Cost | Time |
|---|---|---|---|
| mimo-v2-flash | 86.0% | $0.020 | 7 min |
| qwen3-235b | 84.0% | $0.029 | 33 min |
| deepseek-v3.2 | 82.0% | $0.064 | 17 min |
| qwen2.5-72b | 82.0% | $0.033 | 16 min |
| gpt-4o | 80.0% | $0.504 | 4 min |
| qwen3.6-plus | 74.0% | $0.182 | 55 min |
| claude-haiku-4.5 | 72.0% | $0.400 | 10 min |
| mimo-v2-pro ⚠ | 52.0% (true) | $0.110 | 11 min |
Complete scorecard
⚠ The V18.4 prompt over-corrected Technocratic gravity, causing regressions on gpt-4o (−10pp) and qwen2.5-72b (−4.3pp) vs V18.3. The new BLOCK rules moved misclassifications from Technocratic into Neoliberal Capitalism instead of removing them. Articles #31 and #38 (Democratic Socialism) are now wrong in all 8 models. See "Key findings" below.
mimo-v2-pro reported 83.9% but had 18 parse errors — the app excluded them from the denominator. True score is 52%. This model is not viable for production use.
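| Model | Tier | Reported | True /50 | Excl. GT | Incl. Err | Parse Err | Tech FP | Cost | Time | $/Accuracy pt | vs V18.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mimo-v2-flash | Flash | 86.0% | 86.0% | 91.5% | 8.58 | 0 | 3 | $0.020 | 7 min | $0.023 | +2.0pp |
| qwen3-235b | Frontier | 84.0% | 84.0% | 89.4% | 7.80 | 0 | 3 | $0.029 | 33 min | $0.035 | new |
| deepseek-v3.2 | Mid | 82.0% | 82.0% | 87.2% | 13.70 | 2 | 3 | $0.064 | 17 min | $0.078 | 0.0pp |
| qwen2.5-72b | Mid | 83.7% | 82.0% | 87.2% | 8.57 | 2 | 3 | $0.033 | 16 min | $0.040 | −4.3pp |
| gpt-4o | Frontier | 80.0% | 80.0% | 85.1% | 8.08 | 0 | 3 | $0.504 | 4 min | $0.630 | −10.0pp ⚠ |
| qwen3.6-plus | Mid | 81.3% | 74.0% | 78.7% | 7.48 | 11 | 1 | $0.182 | 55 min | $0.246 | new |
| claude-haiku-4.5 | Frontier | 76.0% | 72.0% | 74.5% | 9.90 | 5 | 3 | $0.400 | 10 min | $0.556 | new |
| mimo-v2-pro ⚠ | Mid | 83.9% | 52.0% | 55.3% | 7.77 | 18 ⚠ | 1 | $0.110 | 11 min | $0.212 | new |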
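The Reported, True /50, and Excl. GT columns differ only in their denominators. The sketch below shows that arithmetic under stated assumptions: each model's run reduces to a count of correct articles, a count of outputs dropped from scoring, and the number of articles with confirmed ground-truth errors. The function and parameter names are illustrative and are not taken from the benchmarking app.

```python
def accuracy_views(correct, total, parse_errors=0, gt_errors=0):
    """Return (reported, true, excl_gt) accuracy in percent, rounded to 1 dp.

    reported: unparseable outputs are dropped from the denominator
    true:     every article counts, so an unparseable output is a miss
    excl_gt:  articles with confirmed ground-truth errors are removed;
              assumes none of them was counted as correct
    """
    reported = 100 * correct / (total - parse_errors)
    true = 100 * correct / total
    excl_gt = 100 * correct / (total - gt_errors)
    return round(reported, 1), round(true, 1), round(excl_gt, 1)


# mimo-v2-flash: 86.0% of 50 articles is 43 correct, with no parse errors
print(accuracy_views(43, 50, parse_errors=0, gt_errors=3))  # (86.0, 86.0, 91.5)

# Hypothetical model (not from the benchmark): 30 correct, 10 outputs dropped
print(accuracy_views(30, 50, parse_errors=10))  # (75.0, 60.0, 60.0)
```

The hypothetical second call shows how dropping failed outputs from the denominator can inflate a reported score by 15 points while the true score stays at 60%.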
V18.3 → V18.4 regression analysis (models tested in both)
| Model | V18.3 True | V18.4 True | Delta | V18.3 Incl | V18.4 Incl | Incl Delta | Root cause |
|---|---|---|---|---|---|---|---|
| gpt-4o | 90.0% | 80.0% | −10.0pp | 6.12 | 8.08 | +1.96 | Technocratic BLOCK rules over-fired → 3 new DemSoc→Neoliberal errors, 2 new Technocratic misses |
| qwen2.5-72b | 88.0% | 82.0% | −4.3pp (against the reported 83.7%; −6.0pp on the true basis) | 8.50 | 8.57 | +0.07 | Same BLOCK over-correction. Inclination calibration section had minimal effect. |
| mimo-v2-flash | 84.0% | 86.0% | +2.0pp | 6.72 | 8.58 | +1.86 | Classification improved +2pp (Technocratic fixes helped mimo). Inclination slightly worse. |
| deepseek-v3.2 | 82.0% | 82.0% | 0.0pp | 7.72 | 13.70 | +5.98 ⚠ | Classification unchanged. Inclination severely degraded: the calibration section backfired on DeepSeek. |
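The deltas in this table are simple differences in percentage points, so they can be rechecked directly. The sketch below uses the true scores listed above; `delta_pp` is an illustrative helper name, and `deepseek-v3.2` is keyed by the name used in the scorecard. It also shows that the −4.3pp quoted for qwen2.5-72b corresponds to its reported score (83.7%) rather than its true score (82.0%).

```python
def delta_pp(before, after):
    """Change between two percentages, in percentage points (1 dp)."""
    return round(after - before, 1)


# True scores from the regression table: (V18.3, V18.4)
true_scores = {
    "gpt-4o": (90.0, 80.0),
    "qwen2.5-72b": (88.0, 82.0),
    "mimo-v2-flash": (84.0, 86.0),
    "deepseek-v3.2": (82.0, 82.0),
}

for model, (v18_3, v18_4) in true_scores.items():
    print(f"{model:15s} {delta_pp(v18_3, v18_4):+.1f}pp")

# qwen2.5-72b measured against its reported V18.4 score instead
print(delta_pp(88.0, 83.7))  # -4.3
```

Comparing true against true on both sides keeps the deltas consistent across models, since parse-error exclusions can differ from run to run.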
Cost vs accuracy — value analysis
| Rank | Model | Accuracy | Cost / batch | Time | $/accuracy pt | Verdict |
|---|---|---|---|---|---|---|
| 1 | mimo-v2-flash | 86.0% | $0.020 | 7 min | $0.023 | Best value: best accuracy at lowest cost and fast turnaround |
| 2 | qwen3-235b | 84.0% | $0.029 | 33 min | $0.035 | Good: excellent value, slow but cheap, zero parse errors |
| 3 | qwen2.5-72b | 82.0% | $0.033 | 16 min | $0.040 | Good: primary target model, regressed on V18.4 but fixable |
| 4 | deepseek-v3.2 | 82.0% | $0.064 | 17 min | $0.078 | Caution: good classification, but inclination error badly degraded (+6pts) |
| 5 | qwen3.6-plus | 74.0% | $0.182 | 55 min | $0.246 | Poor: slowest, expensive, 11 Ambiguous outputs, not viable |
| 6 | mimo-v2-pro | 52.0% | $0.110 | 11 min | $0.212 | Disqualified: 18 parse errors; reported 83.9% is misleading |
| 7 | claude-haiku-4.5 | 72.0% | $0.400 | 10 min | $0.556 | Poor: expensive, 5 Ambiguous outputs, below V18.3 expectations |
| 8 | gpt-4o | 80.0% | $0.504 | 4 min | $0.630 | Regressed: was 90% on V18.3; the V18.4 prompt broke it. Fastest but most expensive. |
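The $/accuracy pt figures are consistent with batch cost divided by true accuracy expressed as a fraction (for example $0.020 / 0.86 ≈ $0.023). The sketch below reproduces the column from the table's own numbers; the formula is inferred from those values rather than quoted from the benchmarking tool, and `cost_per_point` is an illustrative name.

```python
def cost_per_point(cost_usd, accuracy_pct):
    """Batch cost divided by accuracy as a fraction, rounded to 3 dp."""
    return round(cost_usd / (accuracy_pct / 100), 3)


# (model, true accuracy %, cost per 50-article batch in USD) from the table
results = [
    ("mimo-v2-flash", 86.0, 0.020),
    ("qwen3-235b", 84.0, 0.029),
    ("qwen2.5-72b", 82.0, 0.033),
    ("deepseek-v3.2", 82.0, 0.064),
    ("qwen3.6-plus", 74.0, 0.182),
    ("mimo-v2-pro", 52.0, 0.110),
    ("claude-haiku-4.5", 72.0, 0.400),
    ("gpt-4o", 80.0, 0.504),
]

# Sort purely by the value metric, cheapest per point first
by_value = sorted(results, key=lambda r: cost_per_point(r[2], r[1]))
for model, accuracy, cost in by_value:
    print(f"{model:17s} ${cost_per_point(cost, accuracy):.3f}")
```

A sort on this metric alone places mimo-v2-pro ($0.212) ahead of qwen3.6-plus ($0.246), so the metric by itself does not capture reliability; parse errors have to be checked separately.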
Key findings

🚨 V18.4 prompt over-corrected — caused new failures

The Technocratic Gravity Correction section added aggressive BLOCK rules that now fire on Democratic Socialism articles containing economic language. Articles #31 and #38 are now wrong in all 8 models — they were correct in V18.3 for most models. The DemSoc→Neoliberal confusion has replaced the Technocratic gravity problem.

⚡ mimo-v2-flash is the clear winner

86% accuracy at $0.020 per batch — the best cost-per-accuracy-point of any model tested ($0.023). Zero parse errors, 7-minute turnaround, and +2pp improvement over V18.3. The Technocratic fixes that hurt frontier models actually helped mimo. Primary target model for cost-sensitive production use.

⚠ mimo-v2-pro is a reliability trap

mimo-v2-pro reports 83.9%, but its true score is 52%: 18 parse errors were excluded from the denominator. At that error rate the reported metric is misleading. mimo-v2-flash outperforms mimo-v2-pro at under one-fifth of the cost ($0.020 vs $0.110 per batch) and with zero parse errors.

📊 Inclination calibration backfired on DeepSeek

The Qwen-specific inclination calibration section in V18.4 caused DeepSeek's inclination error to jump from 7.72 → 13.70 (+5.98 points). DeepSeek applied the calibration instructions, but in the wrong direction. Model-specific calibration sections must be isolated from the shared prompt, ideally by moving them into model-specific prompt variants.
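One way to isolate calibration is to assemble each model's prompt from a shared base plus only the section written for its family. The sketch below is a minimal illustration under assumptions: the prompt texts are placeholders (the real V18.4 prompt is not reproduced here), and `build_prompt`, `SHARED_PROMPT`, and the prefix-matching scheme are hypothetical names and choices, not part of the existing suite.

```python
# Placeholder text standing in for the shared classification prompt.
SHARED_PROMPT = "SHARED CLASSIFICATION RULES (placeholder)"

# Hypothetical calibration sections, keyed by model-name prefix.
CALIBRATION_BY_PREFIX = {
    "qwen": "QWEN INCLINATION CALIBRATION (placeholder)",
}


def build_prompt(model_name, shared=SHARED_PROMPT, calibration=CALIBRATION_BY_PREFIX):
    """Return the shared prompt plus only the calibration for this model family."""
    sections = [shared]
    for prefix, section in calibration.items():
        if model_name.startswith(prefix):
            sections.append(section)
    return "\n\n".join(sections)


# qwen2.5-72b receives the Qwen section; deepseek-v3.2 sees the shared rules only
print(build_prompt("qwen2.5-72b"))
print(build_prompt("deepseek-v3.2"))
```

With this layout a calibration change for one family cannot alter the prompt seen by any other model, which makes a regression like DeepSeek's easier to attribute.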

Universal failures — wrong in all 8 models
| # | Article | Ground Truth | Most common prediction | Status | Action |
|---|---|---|---|---|---|
| 31 | Public debates continue around welfare policies, centered on fairness and economic sustainability. | Democratic Socialism | Neoliberal Capitalism (4×) · Technocratic (2×) | New in V18.4 | V18.4 BLOCK rule over-fired. Fix DemSoc prompt. |
| 38 | Economic inequality remains a major concern, with differing opinions on redistribution strategies. | Democratic Socialism | Neoliberal Capitalism (6×) | New in V18.4 | "Differing opinions" triggers neutral → Neoliberal. Fix needed. |
| 43 | Policy discussions increasingly focus on balancing individual freedom with collective responsibility. | Libertarianism | Technocratic Governance (6×) | GT Error | Confirmed GT labelling error; correct the label. |
| 48 | Experts highlight the importance of evidence-based policy, while acknowledging limitations. | Nationalist Conservatism | Technocratic Governance (7×) | GT Error | Confirmed GT labelling error; correct the label. |
| 49 | Public opinion is divided on the role of government in economic and social life. | Democratic Socialism | Technocratic (4×) · Neoliberal (1×) | Ambiguous | Neutral observation; GT label may need review. |
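A table like this can be derived mechanically from per-model predictions. The sketch below assumes predictions are stored as a mapping from model name to `{article_id: label}`; that layout, the function name, and the toy data are all illustrative rather than the suite's actual format or results. A missing prediction (for example a parse error) is treated as wrong.

```python
from collections import Counter


def universal_failures(ground_truth, predictions):
    """Map each article that every model got wrong to its most common wrong label.

    Returns {article_id: (label, count)}; the value is None when no model
    produced any label for that article.
    """
    failures = {}
    for article_id, label in ground_truth.items():
        guesses = [preds.get(article_id) for preds in predictions.values()]
        if all(guess != label for guess in guesses):
            # Ignore missing outputs when tallying the consensus label
            tally = Counter(g for g in guesses if g is not None)
            failures[article_id] = tally.most_common(1)[0] if tally else None
    return failures


# Toy data for three hypothetical models; article 7 and its label are invented
ground_truth = {
    31: "Democratic Socialism",
    43: "Libertarianism",
    7: "Technocratic Governance",
}
predictions = {
    "model_a": {31: "Neoliberal Capitalism", 43: "Technocratic Governance", 7: "Technocratic Governance"},
    "model_b": {31: "Neoliberal Capitalism", 43: "Technocratic Governance", 7: "Libertarianism"},
    "model_c": {31: "Technocratic Governance", 43: "Technocratic Governance"},
}

print(universal_failures(ground_truth, predictions))
# {31: ('Neoliberal Capitalism', 2), 43: ('Technocratic Governance', 3)}
```

When every model agrees on the same wrong label, as with the toy article 43, the ground-truth label is a natural candidate for review; a split vote points more toward a prompt problem.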
Recommended next steps

🔴 Priority 1 — Write V18.5 (revert over-correction)

The V18.4 DemSoc BLOCK rules are too aggressive. Specifically: the rule treating "differing opinions on redistribution" as neutral observation fires correctly in some cases but incorrectly blocks DemSoc labelling for articles where redistribution is the subject. V18.5 must soften the DemSoc BLOCK — "welfare or redistribution as the topic of debate" should score DemSoc moderately (50–65), not block it entirely. Also remove or weaken the Qwen-specific calibration section — replace with a universal inclination table that works across all models.

🟡 Priority 2 — Correct GT labels #43 and #48

These two articles (#43: "balancing freedom with collective responsibility", labelled Libertarian; #48: "evidence-based policy experts", labelled Nationalist Conservative) are wrong in all 8 models across all 18 prompt versions. Correcting them gives every future benchmark run up to two more correct articles (4pp on the 50-article basis) wherever the model predicts the corrected label. Do this before running V18.5.

🟢 Priority 3 — Lock in mimo-v2-flash as primary

At 86% on V18.4 (up from 84% on V18.3), mimo-v2-flash is the optimal production model for cost and reliability. At $0.020 per 50-article batch with zero parse errors, it outperforms every alternative on value. qwen2.5-72b remains the secondary target; once V18.5 fixes the regression, it should return to 88%+.

Overall verdict

V18.4 is a step backward for gpt-4o (−10pp) and qwen2.5-72b (−4.3pp). The cause is over-aggressive Technocratic BLOCK rules, which redirected misclassifications from Technocratic into Neoliberal Capitalism rather than eliminating them. The inclination calibration section also backfired on DeepSeek, raising its inclination error by 5.98 points.

The positive news: mimo-v2-flash improved +2pp to 86% and remains the standout value model — best accuracy, lowest cost ($0.020), fastest reliable turnaround (7 min), zero parse errors. The Technocratic fixes that hurt larger models actually helped mimo.

The path forward is clear: V18.5 must soften the DemSoc BLOCK (articles about welfare/redistribution debates should score DemSoc, not neutral), remove model-specific calibration from the shared prompt, and correct the 2 confirmed GT errors. With those fixes, mimo-v2-flash should reach 88%+ and qwen2.5-72b should recover to 90%+ — hitting the production target.