| Model | Tier | Reported Acc. | True Acc. (/50) | Acc. Excl. GT Errors | Inclination Error | Parse Errors | Technocratic FPs | JSON Clean |
|---|---|---|---|---|---|---|---|---|
| gpt-4o | Frontier | 91.8% | 90.0% | 95.7% | 6.12 | 1 | 3 | — |
| claude-opus-4.6 | Frontier | 90.0% | 90.0% | 95.7% | 7.58 | 0 | 4 | ✓ |
| claude-sonnet-4.5 | Frontier | 88.0% | 90.0% | 93.6% | 8.08 | 0 | 4 | ✓ |
| qwen-2.5 | Mid | 88.0% | 88.0% | 93.6% | 11.50 | 0 | 4 | ✓ |
| mimo-v2-flash | Flash | 86.0% | 84.0% | 89.4% | 6.72 | 0 | 7 | ✓ |
| deepseek-v3 | Mid | 85.1% | 82.0% | 87.2% | 7.72 | 3 | 4 | — |
| gemma4 | Mid | 82.0% | 82.0% | 85.1% | 7.72 | 0 | 9 | ✓ |
| minimax-m2.5 | Mid | 75.0% | 60.0% | 63.8% | 7.15 | 10 | 6 | ✗ |
| llama-3-8b | Small | 52.0% | 52.0% | 55.3% | 13.50 | 0 | 4 | ✓ |
Articles #35, #43, and #48 have been misclassified by all 9 models across all 18 prompt versions, which strongly suggests the ground-truth labels themselves are wrong. Correcting these three labels immediately raises the measurable ceiling from 94% (47/50) to 100% and yields a clean benchmark. This is the highest-leverage action available: it takes minutes, not days.
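The arithmetic behind that claim, as a minimal sketch (the 45-correct figure is gpt-4o's row from the table; helper names are illustrative):

```python
# Effect of correcting the 3 mislabelled ground-truth articles (#35, #43, #48).
TOTAL = 50
GT_ERRORS = 3  # articles #35, #43, #48

ceiling_before = (TOTAL - GT_ERRORS) / TOTAL  # 47/50 = 94.0%
ceiling_after = TOTAL / TOTAL                 # 50/50 = 100.0%

def acc_excl_gt_errors(correct_on_valid: int) -> float:
    """Accuracy over the 47 articles whose labels are verified correct
    (the 'Acc. Excl. GT Errors' column in the table above)."""
    return correct_on_valid / (TOTAL - GT_ERRORS)

print(f"ceiling: {ceiling_before:.1%} -> {ceiling_after:.1%}")
print(f"gpt-4o:  {acc_excl_gt_errors(45):.1%}")  # 45/47 = 95.7%
```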
Every model over-predicts the Technocratic label (see the Technocratic FPs column). A targeted prompt addition distinguishing "Decentralised vs Technocratic" (structural preference vs expert authority) and "Authoritarian vs Technocratic" (power vs expertise) should recover 2–3 articles across all models, pushing the frontier tier to 92–94%.
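A sketch of what that addition could look like; the rule wording below is illustrative, not the actual V18.4 text:

```python
# Hypothetical disambiguation rules for the V18.4 prompt. The exact wording
# is illustrative; only the two contrasts come from the analysis above.
TECHNOCRATIC_DISAMBIGUATION = """
Before labelling an article Technocratic, check these two contrasts:
- Decentralised vs Technocratic: does the article argue for a structural
  preference (distributing power away from the centre) or for deferring to
  expert authority? A structural preference alone is Decentralised.
- Authoritarian vs Technocratic: is the appeal to concentrated power as
  such, or to expertise? Power without an expertise claim is Authoritarian.
Only label Technocratic when deference to expert authority is the central claim.
"""

def build_prompt(base_prompt: str) -> str:
    """Append the disambiguation rules to the existing classification prompt."""
    return base_prompt + "\n" + TECHNOCRATIC_DISAMBIGUATION
```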
qwen-2.5 matches frontier-tier classification accuracy (88% reported) but has the worst inclination error among viable models (11.50, vs 6.12 for gpt-4o). A simple calibration fix in the prompt's inclination anchors, specifically for centre-spectrum ideologies, could close most of this gap without affecting classification accuracy.
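A sketch of the calibration target, assuming inclination is scored on a 0–100 scale with a centre anchor at 50 and that the reported figure is mean absolute error (all assumptions; the real scale and metric are defined by the prompt and harness):

```python
# Assumed: 0-100 inclination scale, centre anchor at 50, MAE as the metric.
CENTRE_ANCHOR = 50  # hypothetical anchor for centre-spectrum ideologies

def inclination_mae(predicted: list[float], expected: list[float]) -> float:
    """Mean absolute inclination error, as reported per model in the table."""
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(predicted)

def shrink_to_anchor(pred: float, weight: float = 0.3) -> float:
    """Post-hoc alternative to a prompt fix: pull centre-spectrum predictions
    toward the anchor. qwen-2.5's 11.50 error suggests systematic overshoot."""
    return (1 - weight) * pred + weight * CENTRE_ANCHOR
```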
gpt-4o and claude-opus-4.6 both hit 90.0% with identical classification accuracy. Decision factors: gpt-4o has slightly better inclination error (6.12 vs 7.58) but one parse error; claude-opus-4.6 has perfect JSON compliance, zero parse errors, and likely better performance on longer real-world articles. Recommend a real-article benchmark (20–30 published pieces) to decide.
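Parse errors of this kind are often recoverable with a tolerant parser. A minimal sketch, assuming the harness receives raw response text; the fallback strategy is illustrative, not the harness's actual behaviour:

```python
import json
import re

def parse_classification(raw: str) -> dict | None:
    """Parse a model's JSON response, tolerating markdown fences and
    surrounding prose; returns None on a hard failure (what the table
    counts in the 'Parse Errors' column)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback: extract the first {...} span from fenced or prosey output.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```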
mimo-v2-flash delivers 84% accuracy at flash-tier cost and speed, with perfect JSON compliance and a 6.72 inclination error. If V18.4 partially fixes Technocratic gravity for mimo, it may reach 88%+. It is already viable for high-volume classification where cost matters more than the last 6 percentage points.
The 50-article benchmark uses single-sentence synthetic texts, each written around a single ideology. Real published articles are multi-paragraph, mixed-signal, and deliberately moderate: exactly the challenge described at the start of this project. A real-article test set will expose failure modes that are invisible in synthetic data and give true production confidence.
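A minimal sketch of what that harness could look like; the file layout, `classify` signature, and field names are assumptions:

```python
# Sketch of the proposed real-article benchmark (20-30 published pieces).
# The directory layout, record fields, and classify() interface are assumed.
import json
from pathlib import Path

def run_real_article_benchmark(article_dir: Path, classify) -> float:
    """Classify each published article and score against hand-checked labels."""
    correct = total = 0
    for path in sorted(article_dir.glob("*.json")):
        record = json.loads(path.read_text())  # {"text": ..., "label": ...}
        prediction = classify(record["text"])  # any of the benchmarked models
        correct += prediction == record["label"]
        total += 1
    return correct / total
```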
The V18.3 prompt architecture is production-ready for frontier models. Three models independently hit 90% on the same prompt — confirming the architecture is sound and the accuracy gap is now a model selection decision, not a prompt engineering problem.
The single most impactful action remaining is correcting the 3 ground-truth labels, not writing more prompt versions. After that, a V18.4 prompt targeting Technocratic gravity should push the frontier tier to 92–94%, which becomes 95–96% once scored against the corrected dataset.
For production: claude-opus-4.6 is the recommended primary model, with 90% accuracy, zero parse errors, perfect JSON compliance, and likely stronger performance on the longer real-world articles your users will actually submit. gpt-4o is a viable alternative with slightly better inclination calibration. mimo-v2-flash is ready for cost-sensitive batch processing at 84%.
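As a sketch, the recommendation could be captured as a simple tiering config; the model roles mirror the table, but the routing rule itself is illustrative, not a shipped configuration:

```python
# Hypothetical production tiering based on the recommendations above.
MODEL_TIERS = {
    "primary": "claude-opus-4.6",  # 90% acc., zero parse errors, clean JSON
    "fallback": "gpt-4o",          # 90% acc., better inclination calibration
    "batch": "mimo-v2-flash",      # 84% acc. at flash-tier cost and speed
}

def select_model(is_batch_job: bool, primary_unavailable: bool = False) -> str:
    """Pick a model per the tiering above."""
    if is_batch_job:
        return MODEL_TIERS["batch"]
    return MODEL_TIERS["fallback" if primary_unavailable else "primary"]
```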