| Model | Tier | Reported | True /50 | Excl. GT | Incl. Err | Parse Err | Tech FP | Cost | Time | $/Accuracy pt | vs V18.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mimo-v2-flash | Flash | 86.0% | 86.0% | 91.5% | 8.58 | 0 | 3 | $0.020 | 7 min | $0.023 | +2.0pp |
| qwen3-235b | Frontier | 84.0% | 84.0% | 89.4% | 7.80 | 0 | 3 | $0.029 | 33 min | $0.035 | new |
| deepseek-v3.2 | Mid | 82.0% | 82.0% | 87.2% | 13.70 | 2 | 3 | $0.064 | 17 min | $0.078 | 0.0pp |
| qwen2.5-72b | Mid | 83.7% | 82.0% | 87.2% | 8.57 | 2 | 3 | $0.033 | 16 min | $0.040 | −6.0pp |
| gpt-4o | Frontier | 80.0% | 80.0% | 85.1% | 8.08 | 0 | 3 | $0.504 | 4 min | $0.630 | −10.0pp ⚠ |
| qwen3.6-plus | Mid | 81.3% | 74.0% | 78.7% | 7.48 | 11 | 1 | $0.182 | 55 min | $0.246 | new |
| claude-haiku-4.5 | Frontier | 76.0% | 72.0% | 74.5% | 9.90 | 5 | 3 | $0.400 | 10 min | $0.556 | new |
| mimo-v2-pro ⚠ | Mid | 83.9% | 52.0% | 55.3% | 7.77 | 18 ⚠ | 1 | $0.110 | 11 min | $0.212 | new |

| Model | V18.3 True | V18.4 True | Delta | V18.3 Incl | V18.4 Incl | Incl Delta | Root cause |
|---|---|---|---|---|---|---|---|
| gpt-4o | 90.0% | 80.0% | −10.0pp | 6.12 | 8.08 | +1.96 | Technocratic BLOCK rules over-fired → 3 new DemSoc→Neoliberal errors, 2 new Technocratic misses |
| qwen2.5-72b | 88.0% | 82.0% | −6.0pp | 8.50 | 8.57 | +0.07 | Same BLOCK over-correction. Inclination calibration section had minimal effect. |
| mimo-v2-flash | 84.0% | 86.0% | +2.0pp | 6.72 | 8.58 | +1.86 | Classification improved +2pp (Technocratic fixes helped mimo). Inclination slightly worse. |
| deepseek-v3 | 82.0% | 82.0% | 0.0pp | 7.72 | 13.70 | +5.98 ⚠ | Classification unchanged. Inclination severely degraded — calibration section backfired on DeepSeek. |
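
The Delta and Incl Delta columns are plain differences between the V18.4 and V18.3 figures. The following is a minimal sketch that recomputes them from the True and Incl columns of the four rows above; the `rows` structure and the `deltas` helper are illustrative names, not part of the benchmark tooling.

```python
# (model, V18.3 true %, V18.4 true %, V18.3 incl. error, V18.4 incl. error)
rows = [
    ("gpt-4o", 90.0, 80.0, 6.12, 8.08),
    ("qwen2.5-72b", 88.0, 82.0, 8.50, 8.57),
    ("mimo-v2-flash", 84.0, 86.0, 6.72, 8.58),
    ("deepseek-v3", 82.0, 82.0, 7.72, 13.70),
]


def deltas(row):
    """Return (model, accuracy delta in pp, inclination-error delta)."""
    model, true_old, true_new, incl_old, incl_new = row
    return model, round(true_new - true_old, 1), round(incl_new - incl_old, 2)


results = [deltas(r) for r in rows]
for model, acc_delta, incl_delta in results:
    print(f"{model:15s} accuracy {acc_delta:+.1f}pp  inclination error {incl_delta:+.2f}")
```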

| Rank | Model | Accuracy | Cost / batch | Time | $/accuracy pt | Verdict |
|---|---|---|---|---|---|---|
| 1 | mimo-v2-flash | 86.0% | $0.020 | 7 min | $0.023 | Best value: best accuracy at lowest cost, with fast turnaround |
| 2 | qwen3-235b | 84.0% | $0.029 | 33 min | $0.035 | Good: excellent value; slow but cheap, zero parse errors |
| 3 | qwen2.5-72b | 82.0% | $0.033 | 16 min | $0.040 | Good: primary target model; regressed on V18.4 but fixable |
| 4 | deepseek-v3.2 | 82.0% | $0.064 | 17 min | $0.078 | Caution: good classification, but inclination error badly degraded (+5.98) |
| 5 | qwen3.6-plus | 74.0% | $0.182 | 55 min | $0.246 | Poor: slowest, expensive, 11 Ambiguous outputs; not viable |
| 6 | mimo-v2-pro | 52.0% | $0.110 | 11 min | $0.212 | Disqualified: 18 parse errors; reported 83.9% is misleading |
| 7 | claude-haiku-4.5 | 72.0% | $0.400 | 10 min | $0.556 | Poor: expensive, 5 Ambiguous outputs, below V18.3 expectations |
| 8 | gpt-4o | 80.0% | $0.504 | 4 min | $0.630 | Regressed: was 90% on V18.3; the V18.4 prompt broke it. Fastest but most expensive. |
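
The `$/accuracy pt` figures are consistent with dividing the batch cost by true accuracy expressed as a fraction (for example, $0.020 / 0.86 ≈ $0.023). Under that reading the value is the projected cost of a fully correct batch rather than the cost of one percentage point. The sketch below works under that assumption; the function name and the `batches` list are illustrative. Sorting purely by this ratio would place mimo-v2-pro ahead of qwen3.6-plus, whereas the table ranks it lower, in line with its Disqualified verdict.

```python
def cost_per_accuracy(cost_usd: float, true_accuracy_pct: float) -> float:
    """Batch cost divided by true accuracy as a fraction (assumed definition)."""
    if true_accuracy_pct <= 0:
        raise ValueError("accuracy must be positive")
    return cost_usd / (true_accuracy_pct / 100.0)


# (model, cost per 50-article batch in USD, true accuracy in percent)
batches = [
    ("mimo-v2-flash", 0.020, 86.0),
    ("qwen3-235b", 0.029, 84.0),
    ("qwen2.5-72b", 0.033, 82.0),
    ("deepseek-v3.2", 0.064, 82.0),
    ("qwen3.6-plus", 0.182, 74.0),
    ("mimo-v2-pro", 0.110, 52.0),
    ("claude-haiku-4.5", 0.400, 72.0),
    ("gpt-4o", 0.504, 80.0),
]

# Cheapest cost per unit of accuracy first.
ranked = sorted(batches, key=lambda b: cost_per_accuracy(b[1], b[2]))
for model, cost, acc in ranked:
    print(f"{model:18s} ${cost_per_accuracy(cost, acc):.3f}")
```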

The Technocratic Gravity Correction section added aggressive BLOCK rules that now fire on Democratic Socialism articles containing economic language. Articles #31 and #38 are now misclassified by all 8 models, although most models labelled them correctly under V18.3. In effect, DemSoc→Neoliberal confusion has replaced the Technocratic gravity problem.
mimo-v2-flash reaches 86% accuracy at $0.020 per batch, the best cost-per-accuracy-point of any model tested ($0.023). It has zero parse errors, a 7-minute turnaround, and a +2pp improvement over V18.3. The Technocratic fixes that hurt gpt-4o and qwen2.5-72b actually helped mimo-v2-flash. It is the primary target model for cost-sensitive production use.
mimo-v2-pro reports 83.9%, but its true score is 52%: its 18 parse errors are excluded from the reported denominator. At this error rate the reported metric is misleading. mimo-v2-flash outperforms mimo-v2-pro at roughly one fifth of the cost ($0.020 vs $0.110) and with zero parse errors.
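
To illustrate why dropping parse errors from the denominator inflates a score, here is a minimal sketch with hypothetical counts. It assumes the reported score is correct answers divided by parsed answers, and the true score is correct answers divided by all articles. The exact exclusion rule used by the benchmark harness is not stated here, so the output is not expected to reproduce the table's figures.

```python
def reported_accuracy(correct: int, total: int, parse_errors: int) -> float:
    """Accuracy in percent over parsed outputs only (assumed definition)."""
    parsed = total - parse_errors
    if parsed <= 0:
        raise ValueError("no parsed outputs to score")
    return 100.0 * correct / parsed


def true_accuracy(correct: int, total: int) -> float:
    """Accuracy in percent over every article; a parse error counts as a miss."""
    return 100.0 * correct / total


# Hypothetical run: 40 articles, 10 unparseable outputs, 24 correct labels.
print(f"reported: {reported_accuracy(24, 40, 10):.1f}%")
print(f"true:     {true_accuracy(24, 40):.1f}%")
```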
The Qwen-specific inclination calibration section in V18.4 caused DeepSeek's inclination error to jump from 7.72 to 13.70 (+5.98). DeepSeek applied the calibration instructions, but in the wrong direction. Model-specific calibration sections must be isolated from other models, ideally by moving them into model-specific prompt variants rather than keeping them in a shared prompt.
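
One way to keep calibration text away from models it was not written for is to assemble each prompt from a shared base plus an optional per-family section. The sketch below uses placeholder prompt strings, a hypothetical `CALIBRATION` mapping keyed by model-name prefix, and a hypothetical `build_prompt` helper; none of these come from the actual V18.4 prompt.

```python
# Placeholder for the shared instructions every model receives.
SHARED_PROMPT = "Classify the article's ideology and score its inclination."

# Hypothetical per-family calibration sections, keyed by model-name prefix.
CALIBRATION = {
    "qwen": "Calibration note for Qwen models: <placeholder text>",
}


def build_prompt(model: str) -> str:
    """Return the shared prompt, plus a calibration section only for matching families."""
    name = model.lower()
    for prefix, section in CALIBRATION.items():
        if name.startswith(prefix):
            return f"{SHARED_PROMPT}\n\n{section}"
    return SHARED_PROMPT


print(build_prompt("qwen2.5-72b"))
print(build_prompt("deepseek-v3.2"))
```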

| # | Article | Ground Truth | Most common prediction | Status | Action |
|---|---|---|---|---|---|
| 31 | Public debates continue around welfare policies, centered on fairness and economic sustainability. | Democratic Socialism | Neoliberal Capitalism (4×) · Technocratic (2×) | New in V18.4 | V18.4 BLOCK rule over-fired. Fix DemSoc prompt. |
| 38 | Economic inequality remains a major concern, with differing opinions on redistribution strategies. | Democratic Socialism | Neoliberal Capitalism (6×) | New in V18.4 | "Differing opinions" triggers neutral → Neoliberal. Fix needed. |
| 43 | Policy discussions increasingly focus on balancing individual freedom with collective responsibility. | Libertarianism | Technocratic Governance (6×) | GT Error | Confirmed GT labelling error — correct the label. |
| 48 | Experts highlight the importance of evidence-based policy, while acknowledging limitations. | Nationalist Conservatism | Technocratic Governance (7×) | GT Error | Confirmed GT labelling error — correct the label. |
| 49 | Public opinion is divided on the role of government in economic and social life. | Democratic Socialism | Technocratic (4×) · Neoliberal (1×) | Ambiguous | Neutral observation — GT label may need review. |

The V18.4 DemSoc BLOCK rules are too aggressive. Specifically, the rule that treats "differing opinions on redistribution" as a neutral observation fires correctly in some cases but wrongly blocks the DemSoc label for articles whose subject is redistribution itself. V18.5 must soften the DemSoc BLOCK: "welfare or redistribution as the topic of debate" should score DemSoc moderately (50–65) rather than being blocked entirely. V18.5 should also remove or weaken the Qwen-specific calibration section and replace it with a universal inclination table that works across all models.
Two articles carry confirmed ground-truth labelling errors: #43 ("balancing individual freedom with collective responsibility", labelled Libertarianism) and #48 ("evidence-based policy", labelled Nationalist Conservatism). Both are wrong in all 8 models across all 18 prompt versions. Correcting the labels adds 2 free accuracy points to every future benchmark run, so do this before running V18.5.
With V18.4's 86% (and trending upward as the prompt matures), mimo-v2-flash is the optimal production model for cost and reliability. At $0.020 per 50-article batch with zero parse errors, it outperforms every alternative on value. qwen2.5-72b remains the secondary target: once V18.5 fixes the regression, it should return to 88%+.
V18.4 is a step backward for gpt-4o (−10.0pp) and qwen2.5-72b (−6.0pp, from 88.0% to 82.0% true accuracy). The cause is over-aggressive Technocratic BLOCK rules, which redirected misclassifications from Technocratic into Neoliberal Capitalism rather than eliminating them. The inclination calibration section also backfired on DeepSeek, raising its inclination error by 5.98.
The positive news: mimo-v2-flash improved +2pp to 86% and remains the standout value model — best accuracy, lowest cost ($0.020), fastest reliable turnaround (7 min), zero parse errors. The Technocratic fixes that hurt larger models actually helped mimo.
The path forward is clear: V18.5 must soften the DemSoc BLOCK (articles about welfare/redistribution debates should score DemSoc, not neutral), remove model-specific calibration from the shared prompt, and correct the 2 confirmed GT errors. With those fixes, mimo-v2-flash should reach 88%+ and qwen2.5-72b should recover to 90%+ — hitting the production target.