TIMETOACT LLM Benchmarks June 2026
After a longer break, we are back with a new edition of the TIMETOACT LLM Benchmarks for enterprise workloads. A lot has changed since the previous benchmark runs, and we are excited to share our latest insights with you!
The market now includes new top-tier models from OpenAI, Anthropic, Google, DeepSeek, Alibaba/Qwen and other providers. Reasoning models have become part of the mainstream discussion, and local or locally deployable models have improved significantly.
Highlights at a Glance
- GPT o1 pro (manual) remains the overall leader with a score of 97, but its lead is shrinking
- Qwen3.7 Max is the breakthrough of the year: score 95, on par with the strongest OpenAI models
- Cost efficiency is a new competitive factor: DeepSeek V4 Flash, for example, scores 88 for only €0.09
- Local models are becoming practical: several models score above 80 without cloud dependency
- Reasoning remains the hardest discipline: this is where the top still clearly separates from the rest
- Model strategy beats model choice: the future lies in the targeted use of several models per task type
LLM Benchmarks: 160 Models Compared
In this benchmark, we evaluated 160 models across practical enterprise-oriented capabilities: code generation and engineering tasks, CRM and product catalogue scenarios, work with large documents and knowledge bases, integration with external APIs and services, marketing assistance, and reasoning within a provided context. The final score aggregates performance across all categories. Cost and speed are shown separately as practical decision factors, but they are not included in the final score.
| Model | Code+Eng | CRM | Docs | Integrate | Marketing | Reason | Final | Cost | Speed |
|---|---|---|---|---|---|---|---|---|---|
| 1. GPT o1 pro (manual) ☁️ | 100 | 100 | 97 | 100 | 95 | 87 | 97 | 0.20 € | 1.00 rps |
| 2. Qwen3.7 Max ⚠️ | 100 | 100 | 97 | 100 | 88 | 87 | 95 | 2.94 € | 0.10 rps |
| 3. GPT-5.5 ☁️ | 100 | 100 | 94 | 100 | 88 | 87 | 95 | 3.59 € | 0.43 rps |
| 4. GPT-5.5 Pro ☁️ | 100 | 100 | 93 | 100 | 88 | 87 | 95 | 37.10 € | 0.09 rps |
| 5. GPT-5.4 Pro ☁️ | 100 | 97 | 93 | 100 | 88 | 90 | 94 | 34.69 € | 0.06 rps |
| 6. ChatGPT Chat Latest ☁️ | 100 | 97 | 93 | 98 | 88 | 85 | 93 | 1.93 € | 0.74 rps |
| 7. GPT o1-preview v1/2024-09-12 ☁️ | 95 | 92 | 94 | 95 | 88 | 87 | 92 | 52.32 € | 0.08 rps |
| 8. GPT o1 v1/2024-12-17 ☁️ | 100 | 95 | 94 | 91 | 82 | 83 | 91 | 30.63 € | 0.17 rps |
| 9. Claude Opus 4.7 ☁️ | 100 | 97 | 97 | 98 | 82 | 70 | 91 | 2.05 € | 0.43 rps |
| 10. Google Gemini 3.1 Pro Preview ☁️ | 90 | 96 | 93 | 98 | 82 | 78 | 90 | 0.54 € | 0.44 rps |
| 11. GPT o1-mini v1/2024-09-12 ☁️ | 93 | 96 | 94 | 83 | 82 | 87 | 89 | 8.15 € | 0.16 rps |
| 12. GPT-5.4 ☁️ | 86 | 97 | 100 | 100 | 82 | 71 | 89 | 0.74 € | 0.81 rps |
| 13. GPT-4o v3/2024-11-20 ☁️ | 86 | 97 | 94 | 95 | 88 | 72 | 89 | 0.63 € | 1.14 rps |
| 14. DeepSeek V4 Pro ⚠️ | 100 | 86 | 97 | 77 | 82 | 88 | 88 | 0.35 € | 0.20 rps |
| 15. GPT-4o v1/2024-05-13 ☁️ | 90 | 96 | 100 | 92 | 78 | 74 | 88 | 1.21 € | 1.44 rps |
| 16. Claude Opus 4.8 ☁️ | 82 | 100 | 93 | 98 | 82 | 73 | 88 | 1.92 € | 0.62 rps |
| 17. Google Gemini 1.5 Pro v2 ☁️ | 86 | 97 | 94 | 99 | 78 | 74 | 88 | 1.00 € | 1.18 rps |
| 18. DeepSeek V4 Flash ⚠️ | 100 | 89 | 89 | 77 | 88 | 83 | 88 | 0.09 € | 0.27 rps |
| 19. X-AI Grok 2 v2/1212 ⚠️ | 66 | 95 | 97 | 97 | 88 | 78 | 87 | 0.58 € | 0.99 rps |
| 20. Google Gemini 3 Flash Preview ☁️ | 90 | 91 | 90 | 100 | 88 | 60 | 86 | 0.13 € | 0.51 rps |
| 21. Mistral Medium 3.5 ☁️ | 90 | 89 | 100 | 86 | 88 | 59 | 85 | 0.44 € | 0.37 rps |
| 22. GPT-4 Turbo v5/2024-04-09 ☁️ | 86 | 99 | 98 | 96 | 88 | 43 | 85 | 2.45 € | 0.84 rps |
| 23. Google Gemini 2.0 Flash Exp ☁️ | 63 | 96 | 100 | 100 | 82 | 62 | 84 | 0.03 € | 0.85 rps |
| 24. Google Gemini Exp 1121 ☁️ | 70 | 97 | 97 | 95 | 72 | 72 | 84 | 0.89 € | 0.49 rps |
| 25. Google Gemini 3.5 Flash ☁️ | 86 | 83 | 84 | 100 | 88 | 60 | 83 | 0.41 € | 0.63 rps |
| 26. Google Gemini 3.1 Flash-Lite ☁️ | 77 | 92 | 93 | 86 | 82 | 68 | 83 | 0.07 € | 0.59 rps |
| 27. Qwen3.6 27B ⚠️ | 86 | 97 | 100 | 73 | 82 | 61 | 83 | 0.11 € | 0.82 rps |
| 28. GPT-4o v2/2024-08-06 ☁️ | 90 | 84 | 97 | 86 | 82 | 59 | 83 | 0.63 € | 1.49 rps |
| 29. Google Gemini 1.5 Pro 0801 ☁️ | 84 | 92 | 79 | 100 | 70 | 74 | 83 | 0.90 € | 0.83 rps |
| 30. GPT-4.1 ☁️ | 86 | 83 | 97 | 79 | 82 | 69 | 83 | 0.50 € | 0.84 rps |
| 31. Gemma 4 31B IT ⚠️ | 86 | 91 | 90 | 87 | 88 | 55 | 83 | 0.03 € | 0.27 rps |
| 32. Qwen 2.5 72B Instruct ⚠️ | 79 | 92 | 94 | 97 | 71 | 59 | 82 | 0.10 € | 0.66 rps |
| 33. GLM 5.1 ⚠️ | 77 | 97 | 97 | 75 | 82 | 63 | 82 | 0.23 € | 1.68 rps |
| 34. Gemma 4 26B A4B IT ⚠️ | 86 | 94 | 100 | 73 | 82 | 56 | 82 | 0.02 € | 0.55 rps |
| 35. Nous Llama 3.1 405B Hermes 3🦙 | 68 | 93 | 89 | 98 | 88 | 53 | 81 | 0.54 € | 0.49 rps |
| 36. MiniMax M2.7 ⚠️ | 97 | 87 | 89 | 75 | 69 | 69 | 81 | 0.45 € | 0.61 rps |
| 37. Claude 3.5 Sonnet v2 ☁️ | 82 | 97 | 93 | 84 | 71 | 57 | 81 | 0.95 € | 0.09 rps |
| 38. GPT-4 v1/0314 ☁️ | 90 | 88 | 98 | 73 | 88 | 45 | 80 | 7.04 € | 1.31 rps |
| 39. GPT-5.4 Mini ☁️ | 90 | 94 | 81 | 73 | 82 | 59 | 80 | 0.22 € | 1.02 rps |
| 40. X-AI Grok 2 v1/1012 ⚠️ | 63 | 93 | 87 | 90 | 88 | 58 | 80 | 1.03 € | 0.31 rps |
| 41. GPT-4 v2/0613 ☁️ | 90 | 83 | 95 | 73 | 88 | 45 | 79 | 7.04 € | 2.16 rps |
| 42. Qwen3.6 35B A3B ⚠️ | 66 | 89 | 95 | 84 | 82 | 55 | 78 | 0.05 € | 0.57 rps |
| 43. DeepSeek v3 671B ⚠️ | 62 | 95 | 97 | 85 | 75 | 55 | 78 | 0.03 € | 0.49 rps |
| 44. GPT-4o Mini ☁️ | 63 | 87 | 80 | 73 | 100 | 65 | 78 | 0.04 € | 1.46 rps |
| 45. Claude 3.5 Sonnet v1 ☁️ | 72 | 83 | 89 | 87 | 80 | 58 | 78 | 0.94 € | 0.09 rps |
| 46. GPT-4.1 Mini ☁️ | 66 | 97 | 97 | 75 | 82 | 51 | 78 | 0.10 € | 0.79 rps |
| 47. Nous Llama 3.1 70B Hermes 3🦙 | 74 | 97 | 87 | 82 | 88 | 39 | 78 | 0.03 € | 0.42 rps |
| 48. Claude Opus 4.6 ☁️ | 95 | 100 | 100 | 81 | 38 | 52 | 78 | 1.59 € | 0.40 rps |
| 49. Claude 3 Opus ☁️ | 69 | 88 | 100 | 74 | 76 | 58 | 77 | 4.69 € | 0.41 rps |
| 50. Meta Llama 3.1 405B Instruct🦙 | 81 | 93 | 92 | 75 | 75 | 48 | 77 | 2.39 € | 1.16 rps |
| 51. GPT-4 Turbo v4/0125-preview ☁️ | 66 | 97 | 100 | 83 | 75 | 43 | 77 | 2.45 € | 0.84 rps |
| 52. Google LearnLM 1.5 Pro Experimental ⚠️ | 48 | 97 | 85 | 96 | 64 | 72 | 77 | 0.31 € | 0.83 rps |
| 53. GPT-4 Turbo v3/1106-preview ☁️ | 66 | 75 | 98 | 73 | 88 | 60 | 76 | 2.46 € | 0.68 rps |
| 54. MiniMax M3 ⚠️ | 68 | 91 | 94 | 77 | 88 | 41 | 76 | 0.14 € | 0.39 rps |
| 55. Google Gemini Exp 1206 ☁️ | 52 | 100 | 85 | 77 | 75 | 69 | 76 | 0.88 € | 0.16 rps |
| 56. Qwen 2.5 32B Coder Instruct ⚠️ | 43 | 94 | 98 | 98 | 76 | 46 | 76 | 0.05 € | 0.82 rps |
| 57. Grok 4.3 ⚠️ | 84 | 90 | 82 | 80 | 75 | 43 | 76 | 1.69 € | 0.17 rps |
| 58. Kimi K2.6 ⚠️ | 86 | 86 | 90 | 54 | 76 | 61 | 76 | 0.21 € | 0.94 rps |
| 59. DeepSeek v2.5 236B ⚠️ | 57 | 80 | 91 | 80 | 88 | 57 | 75 | 0.03 € | 0.42 rps |
| 60. Meta Llama 3.1 70B Instruct f16🦙 | 74 | 89 | 90 | 75 | 75 | 48 | 75 | 1.79 € | 0.90 rps |
| 61. Google Gemini 1.5 Flash v2 ☁️ | 64 | 96 | 89 | 76 | 81 | 44 | 75 | 0.06 € | 2.01 rps |
| 62. Claude Opus 4.5 ☁️ | 70 | 88 | 92 | 77 | 82 | 40 | 75 | 1.56 € | 0.43 rps |
| 63. Google Gemini 1.5 Pro 0409 ☁️ | 68 | 97 | 96 | 80 | 75 | 26 | 74 | 0.95 € | 0.59 rps |
| 64. Meta Llama 3 70B Instruct🦙 | 81 | 83 | 84 | 67 | 81 | 45 | 73 | 0.06 € | 0.85 rps |
| 65. GPT-3.5 v2/0613 ☁️ | 68 | 81 | 73 | 87 | 81 | 50 | 73 | 0.34 € | 1.46 rps |
| 66. Amazon Nova Lite ⚠️ | 67 | 78 | 74 | 94 | 62 | 62 | 73 | 0.02 € | 2.19 rps |
| 67. Claude Sonnet 4.5 ☁️ | 72 | 91 | 90 | 73 | 75 | 34 | 72 | 0.97 € | 0.42 rps |
| 68. Mistral Large 123B v2/2407 ☁️ | 68 | 79 | 68 | 75 | 75 | 70 | 72 | 0.57 € | 1.02 rps |
| 69. Google Gemini Flash 1.5 8B ☁️ | 70 | 93 | 78 | 67 | 76 | 48 | 72 | 0.01 € | 1.19 rps |
| 70. Google Gemini 1.5 Pro 0514 ☁️ | 73 | 96 | 79 | 100 | 25 | 60 | 72 | 1.07 € | 0.92 rps |
| 71. Google Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 76 | 72 | 52 | 72 | 0.06 € | 1.77 rps |
| 72. Google Gemini 1.0 Pro ☁️ | 66 | 86 | 83 | 79 | 88 | 28 | 71 | 0.37 € | 1.36 rps |
| 73. Meta Llama 3.2 90B Vision🦙 | 74 | 84 | 87 | 77 | 71 | 32 | 71 | 0.23 € | 1.10 rps |
| 74. Claude Sonnet 4.6 ☁️ | 90 | 92 | 90 | 73 | 38 | 43 | 71 | 0.95 € | 0.50 rps |
| 75. GPT-3.5 v3/1106 ☁️ | 68 | 70 | 71 | 81 | 78 | 58 | 71 | 0.24 € | 2.33 rps |
| 76. Claude 3.5 Haiku ☁️ | 52 | 80 | 72 | 75 | 75 | 68 | 70 | 0.32 € | 1.24 rps |
| 77. Meta Llama 3.3 70B Instruct🦙 | 74 | 78 | 74 | 77 | 71 | 46 | 70 | 0.10 € | 0.71 rps |
| 78. Claude Haiku 4.5 ☁️ | 63 | 80 | 91 | 73 | 75 | 37 | 70 | 0.32 € | 0.76 rps |
| 79. GPT-3.5 v4/0125 ☁️ | 63 | 87 | 71 | 77 | 78 | 43 | 70 | 0.12 € | 1.43 rps |
| 80. Cohere Command R+ ☁️ | 63 | 80 | 76 | 72 | 70 | 58 | 70 | 0.83 € | 1.90 rps |
| 81. Mistral Large 123B v3/2411 ☁️ | 68 | 75 | 64 | 76 | 82 | 51 | 70 | 0.56 € | 0.66 rps |
| 82. Qwen1.5 32B Chat f16 ⚠️ | 70 | 90 | 82 | 76 | 78 | 20 | 69 | 0.97 € | 1.66 rps |
| 83. Gemma 2 27B IT ⚠️ | 61 | 72 | 87 | 74 | 89 | 32 | 69 | 0.07 € | 0.90 rps |
| 84. GPT-4.1 Nano ☁️ | 68 | 84 | 77 | 66 | 78 | 41 | 69 | 0.03 € | 0.79 rps |
| 85. Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 68 | 87 | 67 | 74 | 88 | 25 | 68 | 0.32 € | 3.39 rps |
| 86. Meta Llama 3 8B Instruct f16🦙 | 79 | 62 | 68 | 70 | 80 | 41 | 67 | 0.32 € | 3.33 rps |
| 87. Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 63 | 67 | 84 | 58 | 81 | 46 | 67 | 0.21 € | 5.09 rps |
| 88. NVIDIA Nemotron 3 Super 120B A12B ⚠️ | 48 | 84 | 69 | 68 | 88 | 42 | 67 | 0.03 € | 0.97 rps |
| 89. GPT-5.4 Nano ☁️ | 59 | 90 | 76 | 71 | 82 | 21 | 67 | 0.06 € | 1.06 rps |
| 90. GPT-3.5-instruct 0914 ☁️ | 47 | 92 | 69 | 69 | 88 | 33 | 66 | 0.35 € | 2.15 rps |
| 91. Amazon Nova Pro ⚠️ | 64 | 78 | 82 | 79 | 52 | 41 | 66 | 0.22 € | 1.34 rps |
| 92. GPT-3.5 v1/0301 ☁️ | 55 | 82 | 69 | 81 | 82 | 26 | 66 | 0.35 € | 4.12 rps |
| 93. Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 76 | 51 | 76 | 65 | 88 | 38 | 66 | 0.28 € | 3.79 rps |
| 94. Mistral 7B OpenChat-3.5 v1 f16 ✅ | 58 | 72 | 72 | 71 | 88 | 33 | 66 | 0.49 € | 2.20 rps |
| 95. Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 63 | 73 | 72 | 66 | 88 | 30 | 65 | 0.32 € | 3.40 rps |
| 96. Qwen 2.5 7B Instruct ⚠️ | 48 | 77 | 80 | 68 | 69 | 47 | 65 | 0.07 € | 1.25 rps |
| 97. Starling 7B-alpha f16 ⚠️ | 58 | 66 | 67 | 73 | 88 | 34 | 64 | 0.58 € | 1.85 rps |
| 98. Mistral Nemo 12B v1/2407 ☁️ | 54 | 58 | 51 | 99 | 75 | 49 | 64 | 0.03 € | 1.22 rps |
| 99. Meta Llama 3.2 11B Vision🦙 | 70 | 71 | 65 | 70 | 71 | 36 | 64 | 0.04 € | 1.49 rps |
| 100. Nous Llama 3 8B Hermes 2 Theta🦙 | 61 | 73 | 74 | 74 | 85 | 16 | 64 | 0.05 € | 0.55 rps |
| 101. Claude 3 Haiku ☁️ | 64 | 69 | 64 | 75 | 75 | 35 | 64 | 0.08 € | 0.52 rps |
| 102. Yi 1.5 34B Chat f16 ⚠️ | 47 | 78 | 70 | 74 | 86 | 26 | 64 | 1.18 € | 1.37 rps |
| 103. Gemma 3n E4B IT ⚠️ | 32 | 73 | 73 | 74 | 71 | 54 | 63 | 0.02 € | 0.44 rps |
| 104. Liquid: LFM 40B MoE ⚠️ | 72 | 69 | 65 | 63 | 82 | 24 | 63 | 0.00 € | 1.45 rps |
| 105. Meta Llama 3.1 8B Instruct f16🦙 | 57 | 74 | 62 | 74 | 74 | 32 | 62 | 0.45 € | 2.41 rps |
| 106. Gemma 3 4B IT ⚠️ | 48 | 84 | 60 | 72 | 69 | 37 | 62 | 0.01 € | 0.54 rps |
| 107. Qwen2 7B Instruct f32 ⚠️ | 50 | 81 | 81 | 61 | 66 | 31 | 62 | 0.46 € | 2.36 rps |
| 108. Claude 3 Sonnet ☁️ | 72 | 41 | 74 | 74 | 78 | 28 | 61 | 0.95 € | 0.85 rps |
| 109. MistralAI Ministral 3B 2512 ✅ | 53 | 59 | 69 | 71 | 81 | 32 | 61 | 0.02 € | 0.54 rps |
| 110. Mistral Small v3/2409 ☁️ | 43 | 75 | 71 | 74 | 75 | 26 | 61 | 0.06 € | 0.81 rps |
| 111. NVIDIA Nemotron 3 Nano Omni 30B A3B Reasoning ⚠️ | 50 | 77 | 64 | 32 | 100 | 41 | 61 | 0.00 € | 1.06 rps |
| 112. Mistral Pixtral 12B ✅ | 53 | 69 | 73 | 63 | 64 | 40 | 60 | 0.03 € | 0.83 rps |
| 113. Mixtral 8x22B API (Instruct) ☁️ | 53 | 62 | 62 | 97 | 75 | 7 | 59 | 0.17 € | 3.12 rps |
| 114. Anthropic Claude Instant v1.2 ☁️ | 58 | 75 | 65 | 77 | 65 | 16 | 59 | 2.10 € | 1.49 rps |
| 115. Codestral Mamba 7B v1 ✅ | 53 | 66 | 51 | 97 | 71 | 17 | 59 | 0.30 € | 2.82 rps |
| 116. Inflection 3 Productivity ⚠️ | 46 | 59 | 39 | 70 | 79 | 61 | 59 | 0.92 € | 0.17 rps |
| 117. Anthropic Claude v2.0 ☁️ | 63 | 52 | 55 | 67 | 84 | 34 | 59 | 2.19 € | 0.40 rps |
| 118. Cohere Command R ☁️ | 45 | 66 | 57 | 74 | 84 | 27 | 59 | 0.13 € | 2.50 rps |
| 119. Amazon Nova Micro ⚠️ | 58 | 68 | 64 | 71 | 59 | 31 | 59 | 0.01 € | 2.41 rps |
| 120. Qwen1.5 7B Chat f16 ⚠️ | 56 | 81 | 60 | 56 | 60 | 36 | 58 | 0.29 € | 3.76 rps |
| 121. Mistral Large v1/2402 ☁️ | 37 | 49 | 70 | 83 | 84 | 25 | 58 | 0.58 € | 2.11 rps |
| 122. Microsoft WizardLM 2 8x22B ⚠️ | 48 | 76 | 79 | 59 | 62 | 22 | 58 | 0.13 € | 0.70 rps |
| 123. Qwen1.5 14B Chat f16 ⚠️ | 50 | 58 | 51 | 72 | 84 | 22 | 56 | 0.36 € | 3.03 rps |
| 124. MistralAI Ministral 8B ✅ | 56 | 55 | 41 | 82 | 68 | 30 | 55 | 0.02 € | 1.02 rps |
| 125. Anthropic Claude v2.1 ☁️ | 29 | 58 | 59 | 78 | 75 | 32 | 55 | 2.25 € | 0.35 rps |
| 126. Mistral 7B OpenOrca f16 ☁️ | 54 | 57 | 76 | 36 | 78 | 27 | 55 | 0.41 € | 2.65 rps |
| 127. MistralAI Ministral 3B ✅ | 50 | 48 | 39 | 89 | 60 | 41 | 54 | 0.01 € | 1.02 rps |
| 128. Cohere Command R 7B 12/2024 ☁️ | 48 | 68 | 63 | 65 | 55 | 25 | 54 | 0.01 € | 1.45 rps |
| 129. Llama2 13B Vicuna-1.5 f16🦙 | 50 | 37 | 55 | 62 | 82 | 37 | 54 | 0.99 € | 1.09 rps |
| 130. Mistral 7B Instruct v0.1 f16 ☁️ | 34 | 71 | 69 | 63 | 62 | 23 | 54 | 0.75 € | 1.43 rps |
| 131. Meta Llama 3.2 3B🦙 | 52 | 71 | 66 | 71 | 44 | 14 | 53 | 0.01 € | 1.25 rps |
| 132. Google Recurrent Gemma 9B IT f16 ⚠️ | 58 | 27 | 71 | 64 | 56 | 23 | 50 | 0.89 € | 1.21 rps |
| 133. Codestral 22B v1 ✅ | 38 | 47 | 44 | 84 | 66 | 13 | 49 | 0.06 € | 4.03 rps |
| 134. Qwen: QwQ 32B Preview ⚠️ | 43 | 32 | 74 | 52 | 48 | 40 | 48 | 0.05 € | 0.63 rps |
| 135. Nous Llama2 13B Hermes f16🦙 | 50 | 24 | 37 | 75 | 60 | 42 | 48 | 1.00 € | 1.07 rps |
| 136. IBM Granite 34B Code Instruct f16 ☁️ | 63 | 49 | 34 | 67 | 57 | 7 | 46 | 1.07 € | 1.51 rps |
| 137. Meta Llama 3.2 1B🦙 | 32 | 40 | 33 | 53 | 68 | 51 | 46 | 0.02 € | 1.69 rps |
| 138. Mistral Small v2/2402 ☁️ | 33 | 42 | 45 | 88 | 56 | 8 | 46 | 0.06 € | 3.21 rps |
| 139. Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 63 | 65 | 56 | 8 | 45 | 0.06 € | 2.21 rps |
| 140. DBRX 132B Instruct ⚠️ | 43 | 39 | 43 | 74 | 59 | 10 | 45 | 0.26 € | 1.31 rps |
| 141. NVIDIA Llama 3.1 Nemotron 70B Instruct🦙 | 68 | 54 | 25 | 72 | 28 | 21 | 45 | 0.09 € | 0.53 rps |
| 142. Mistral Medium v1/2312 ☁️ | 41 | 43 | 44 | 59 | 62 | 12 | 44 | 0.81 € | 0.35 rps |
| 143. Microsoft WizardLM 2 7B ⚠️ | 53 | 34 | 42 | 66 | 53 | 13 | 43 | 0.02 € | 0.89 rps |
| 144. Llama2 13B Puffin f16🦙 | 37 | 15 | 44 | 67 | 56 | 39 | 43 | 4.70 € | 0.23 rps |
| 145. Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 22 | 47 | 59 | 53 | 62 | 8 | 42 | 0.05 € | 2.39 rps |
| 146. Gemma 2 9B IT ⚠️ | 45 | 25 | 47 | 36 | 68 | 13 | 39 | 0.02 € | 0.88 rps |
| 147. Meta Llama2 13B chat f16🦙 | 22 | 38 | 17 | 65 | 75 | 6 | 37 | 0.75 € | 1.44 rps |
| 148. Mistral 7B Zephyr-β f16 ✅ | 37 | 34 | 46 | 62 | 29 | 4 | 35 | 0.46 € | 2.34 rps |
| 149. Meta Llama2 7B chat f16🦙 | 22 | 33 | 20 | 62 | 50 | 18 | 34 | 0.56 € | 1.93 rps |
| 150. Mistral 7B Notus-v1 f16 ⚠️ | 10 | 54 | 25 | 60 | 48 | 4 | 33 | 0.75 € | 1.43 rps |
| 151. Orca 2 13B f16 ⚠️ | 18 | 22 | 32 | 29 | 67 | 20 | 31 | 0.95 € | 1.14 rps |
| 152. Mistral 7B Instruct v0.2 f16 ☁️ | 11 | 30 | 54 | 25 | 58 | 8 | 31 | 0.96 € | 1.12 rps |
| 153. Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 48 | 63 | 52 | 12 | 31 | 0.87 € | 1.23 rps |
| 154. Google Gemma 2B IT f16 ⚠️ | 33 | 28 | 16 | 47 | 15 | 20 | 27 | 0.30 € | 3.54 rps |
| 155. Microsoft Phi 3 Medium 4K Instruct 14B f16 ⚠️ | 5 | 34 | 30 | 32 | 47 | 8 | 26 | 0.82 € | 1.32 rps |
| 156. Orca 2 7B f16 ⚠️ | 22 | 0 | 26 | 26 | 52 | 4 | 22 | 0.78 € | 1.38 rps |
| 157. Microsoft Phi 4 Mini Instruct ⚠️ | 10 | 2 | 12 | 25 | 22 | 4 | 13 | 0.02 € | 0.58 rps |
| 158. Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 6 | 62 | 0 | 11 | 0.99 € | 1.08 rps |
| 159. Meta Llama2 7B f16🦙 | 0 | 5 | 22 | 3 | 28 | 2 | 10 | 0.95 € | 1.13 rps |
| 160. Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 17 | 0 | 8 | 10 | 1.41 € | 0.76 rps |
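The article does not state how the Final column is computed. As a rough sanity check, and assuming a plain unweighted mean over the six category columns (an assumption, not the published method), a few rows from the table can be reproduced to within one point. The table shows rounded category values, so small deviations are to be expected. The names `ROWS` and `unweighted_final` are illustrative.

```python
from statistics import mean

# Category scores copied from the table above, in column order:
# Code+Eng, CRM, Docs, Integrate, Marketing, Reason.
# The second element is the published Final score for comparison.
ROWS = {
    "GPT o1 pro (manual)": ([100, 100, 97, 100, 95, 87], 97),
    "Qwen3.7 Max": ([100, 100, 97, 100, 88, 87], 95),
    "GPT-5.4 Pro": ([100, 97, 93, 100, 88, 90], 94),
    "DeepSeek V4 Flash": ([100, 89, 89, 77, 88, 83], 88),
}


def unweighted_final(categories: list[int]) -> float:
    """Assumed aggregation: plain mean of the six category scores."""
    return mean(categories)


for name, (categories, published) in ROWS.items():
    estimate = unweighted_final(categories)
    print(f"{name}: estimated {estimate:.1f}, published {published}")
```

If the real aggregation uses weights or unrounded category values, the estimates will drift slightly, which is consistent with the small differences seen here.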
The main conclusion is clear: the top model is still on top, but the gap has become much smaller.
GPT o1 pro (manual)
GPT o1 pro (manual) remains the overall leader with a final score of 97. However, the next group is now only two points behind: Qwen3.7 Max, GPT-5.5 and GPT-5.5 Pro all reach a final score of 95. This is an important shift: the market no longer looks like a race with one isolated leader. Several models now operate at a level where the choice depends less on raw benchmark position and more on cost, latency, deployment model, privacy requirements and integration strategy.
Qwen3.7 Max Is the Breakthrough Result
The most striking result in this benchmark is Qwen3.7 Max reaching second place.
With a final score of 95, Qwen3.7 Max performs at the same level as the strongest OpenAI models directly below the leader. It reaches top or near-top scores in several key enterprise categories, including Code+Eng, CRM, Docs, Integrate and Reason.
This is a strong signal for the market. Until recently, many non-frontier or locally oriented models were discussed mainly as "good enough" alternatives for selected use cases. Qwen3.7 Max changes that perception. It shows that models outside the usual Western frontier-model narrative can compete at the very top of enterprise benchmarks.
For companies, this changes the question. It is no longer enough to ask: "Which model is the strongest?" The better question is now: "Which model provides the right quality, at the right cost, with the right deployment and compliance profile for this specific workload?"
OpenAI Still Dominates the Top of the Table
At the same time, OpenAI remains exceptionally strong. OpenAI models occupy seven of the top ten positions in the benchmark, including the overall top model.
This matters because enterprise adoption is rarely about a single isolated task. Companies need models that perform consistently across coding, document processing, structured business data, API integration, reasoning and communication tasks. In this benchmark, OpenAI's portfolio remains very strong across that full spectrum.
However, the results also show that a more expensive model is not automatically the best business choice. GPT-5.5 and GPT-5.5 Pro both achieve a final score of 95, but their estimated costs differ roughly tenfold (€3.59 versus €37.10). GPT-5.4 Pro reaches a very strong final score of 94 and has the highest Reason score in the table, but at €34.69 it is also one of the more expensive options. Meanwhile, ChatGPT Chat Latest reaches 93 at €1.93 and 0.74 rps, which makes it a strong balanced model for scenarios where quality, speed and practical usability all matter.
This is exactly why benchmarking only raw quality is not enough. In real projects, model selection must include quality, price, speed and operational constraints.
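One way to make this trade-off concrete is a Pareto filter: keep only the models that no other model matches or beats on score, cost and speed at the same time. The sketch below uses the four OpenAI rows discussed above, with values copied from the table. The `Model` type and the function names are illustrative and not part of the benchmark tooling.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    final: int        # Final column
    cost_eur: float   # Cost column
    speed_rps: float  # Speed column


MODELS = [
    Model("GPT-5.5", 95, 3.59, 0.43),
    Model("GPT-5.5 Pro", 95, 37.10, 0.09),
    Model("GPT-5.4 Pro", 94, 34.69, 0.06),
    Model("ChatGPT Chat Latest", 93, 1.93, 0.74),
]


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.final >= b.final
        and a.cost_eur <= b.cost_eur
        and a.speed_rps >= b.speed_rps
    )
    strictly_better = (
        a.final > b.final
        or a.cost_eur < b.cost_eur
        or a.speed_rps > b.speed_rps
    )
    return no_worse and strictly_better


def pareto_front(models: list[Model]) -> list[Model]:
    """Keep the models that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


for model in pareto_front(MODELS):
    print(model.name)
```

On these four rows, only GPT-5.5 and ChatGPT Chat Latest remain, because GPT-5.5 matches or beats both Pro variants on all three columns. Filtering on the Reason column instead of Final would keep GPT-5.4 Pro, so the choice of axes should follow the workload.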
Cost Has Become a Strategic Factor Again
One of the most interesting findings is that high-quality models are now available at very different price points.
Several models close to the top of the benchmark are far less expensive than the premium frontier options. Google Gemini 3.1 Pro Preview reaches a final score of 90 with an estimated cost of €0.54. GPT-4o v3/2024-11-20 reaches 89 at €0.63. GPT-5.4 reaches 89 at €0.74. DeepSeek V4 Flash is especially notable, with a final score of 88 and an estimated cost of only €0.09.
That does not mean the cheapest model is always the best. But it does mean that companies can now design more efficient AI architectures. Instead of relying on one universal model for everything, they can use a portfolio of models:
- a frontier model for difficult, high-risk or high-value tasks;
- a strong but cheaper model for high-volume workloads;
- a local or locally deployable model for privacy-sensitive or infrastructure-sensitive scenarios;
- and smaller specialized models for routing, extraction, classification or preprocessing.
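The portfolio idea in the list above can be sketched as a small routing function. The assignment of models to tiers and the routing rules are hypothetical examples, chosen only to illustrate the pattern. A real setup would derive both from tests on the company's own workloads.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"        # difficult, high-risk or high-value tasks
    HIGH_VOLUME = "high_volume"  # strong but cheaper model
    LOCAL = "local"              # privacy- or infrastructure-sensitive work
    UTILITY = "utility"          # routing, extraction, classification, preprocessing


# Hypothetical mapping of models from the table to tiers.
PORTFOLIO = {
    Tier.FRONTIER: "GPT-5.5",
    Tier.HIGH_VOLUME: "DeepSeek V4 Flash",
    Tier.LOCAL: "Qwen3.6 27B",
    Tier.UTILITY: "Gemma 4 26B A4B IT",
}

UTILITY_TASKS = {"routing", "extraction", "classification", "preprocessing"}


def choose_tier(task_type: str, sensitive_data: bool, high_stakes: bool) -> Tier:
    """Illustrative rule order: data sensitivity first, then stakes, then task type."""
    if sensitive_data:
        return Tier.LOCAL
    if high_stakes:
        return Tier.FRONTIER
    if task_type in UTILITY_TASKS:
        return Tier.UTILITY
    return Tier.HIGH_VOLUME


def route(task_type: str, sensitive_data: bool = False, high_stakes: bool = False) -> str:
    """Return the name of the model that should handle the task."""
    return PORTFOLIO[choose_tier(task_type, sensitive_data, high_stakes)]


print(route("classification"))
print(route("contract_review", high_stakes=True))
print(route("summarization", sensitive_data=True))
```

Checking data sensitivity before anything else is a deliberate choice in this sketch: a privacy requirement is treated as a hard constraint, while quality and cost are trade-offs.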
Local and Locally Deployable Models Are Becoming Practical
Another positive trend is the improvement of models that can be run locally or closer to a company's own infrastructure.
The benchmark includes several non-cloud or locally oriented models with final scores above 80, including Qwen3.6 27B, Gemma 4 31B IT, Qwen 2.5 72B Instruct, GLM 5.1, Gemma 4 26B A4B IT and Nous Llama 3.1 405B Hermes 3.
This is important for organizations with strict requirements around data sovereignty, security, latency, infrastructure control or predictable cost. Local models are no longer just an experimental option. For selected enterprise workflows, they are becoming a realistic part of the architecture.
The most promising use cases are not necessarily full replacement of frontier models. Local models can already be highly valuable for classification, extraction, internal assistants, document preprocessing, workflow automation, retrieval pipelines and low-latency backend tasks.
Some Categories Are Saturating, but Reasoning Still Separates the Best Models
In several categories, the upper part of the table is already very crowded. Multiple models achieve scores of 100 in Code+Eng, CRM, Docs or Integrate. This suggests that many enterprise capabilities are becoming broadly available across providers.
Reasoning remains more difficult.
The highest Reason score in this benchmark is 90, achieved by GPT-5.4 Pro. Many otherwise strong models perform very well in coding, document processing or integration tasks, but score noticeably lower in reasoning. This distinction is important. A model can be good at producing code or extracting structured information while still struggling with multi-step logic, edge cases, business rules or complex decision-making inside a constrained context.
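The gap can be quantified per model as the Reason score minus the mean of the other five categories. The sketch below applies this to three rows from the table. The metric is an illustrative choice and not a published benchmark figure.

```python
from statistics import mean

# Column order: Code+Eng, CRM, Docs, Integrate, Marketing, Reason.
SCORES = {
    "GPT-5.4 Pro": [100, 97, 93, 100, 88, 90],
    "Claude Opus 4.7": [100, 97, 97, 98, 82, 70],
    "GPT-4 Turbo v5/2024-04-09": [86, 99, 98, 96, 88, 43],
}


def reasoning_gap(scores: list[int]) -> float:
    """Reason score minus the mean of the five other category scores."""
    *others, reason = scores
    return reason - mean(others)


for name, scores in SCORES.items():
    print(f"{name}: {reasoning_gap(scores):+.1f}")
```

A value close to zero means reasoning keeps pace with the other capabilities. A strongly negative value flags a model that looks strong overall but may struggle with multi-step logic.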
For enterprise adoption, this is a key lesson: generic public leaderboards are useful, but they are not enough. Companies need to test models on workloads that look like their own processes: internal documents, product data, APIs, CRM systems, compliance rules, SAP, Salesforce, ServiceNow, knowledge bases and agentic workflows.
What This Means for Companies
The main practical takeaway is that LLM selection has become an architectural decision.
In 2026, choosing a model is no longer just about selecting the highest score on a leaderboard. A serious enterprise AI architecture must consider:
- model quality on the specific workload;
- cost at realistic token volumes;
- speed and latency;
- availability of local or private deployment;
- integration with the existing cloud stack;
- reasoning quality;
- reliability of structured outputs;
- data protection and compliance requirements;
- and the ability to route tasks between several models.
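Several of these criteria can be combined into a simple decision helper: apply hard constraints first, then rank the remaining candidates by a weighted score. The sketch below is a minimal illustration under stated assumptions. The weights are hypothetical, the Final score stands in for workload-specific quality, and the `local_deployable` flags follow the article's list of locally oriented models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float          # ideally measured on your own workload; Final score here
    cost_eur: float         # Cost column
    speed_rps: float        # Speed column
    local_deployable: bool  # assumption based on the article's list of local models


# Hypothetical weights; they should reflect the priorities of the actual project.
WEIGHTS = {"quality": 0.60, "cost": 0.25, "speed": 0.15}

CANDIDATES = [
    Candidate("GPT-5.5", 95, 3.59, 0.43, False),
    Candidate("Google Gemini 3.1 Pro Preview", 90, 0.54, 0.44, False),
    Candidate("Qwen3.6 27B", 83, 0.11, 0.82, True),
    Candidate("Gemma 4 31B IT", 83, 0.03, 0.27, True),
]


def scaled(value: float, values: list[float], higher_is_better: bool = True) -> float:
    """Min-max scale a value to [0, 1] within the current candidate pool."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 1.0
    fraction = (value - lo) / (hi - lo)
    return fraction if higher_is_better else 1.0 - fraction


def rank(
    candidates: list[Candidate],
    require_local: bool = False,
    max_cost_eur: Optional[float] = None,
) -> list[tuple[str, float]]:
    """Apply hard constraints, then sort the remaining models by weighted score."""
    pool = [
        c for c in candidates
        if (c.local_deployable or not require_local)
        and (max_cost_eur is None or c.cost_eur <= max_cost_eur)
    ]
    qualities = [c.quality for c in pool]
    costs = [c.cost_eur for c in pool]
    speeds = [c.speed_rps for c in pool]
    scored = [
        (
            c.name,
            WEIGHTS["quality"] * scaled(c.quality, qualities)
            + WEIGHTS["cost"] * scaled(c.cost_eur, costs, higher_is_better=False)
            + WEIGHTS["speed"] * scaled(c.speed_rps, speeds),
        )
        for c in pool
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(rank(CANDIDATES)[0][0])
print(rank(CANDIDATES, require_local=True)[0][0])
```

Because scaling happens inside the filtered pool, the ranking changes with the constraints. Without constraints GPT-5.5 ranks first, while with `require_local=True` the two local models tie on quality and the cheaper Gemma 4 31B IT moves ahead.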
For some workloads, the best choice will still be a top OpenAI model. For others, Gemini, Claude, Qwen or DeepSeek may be more attractive. For privacy-sensitive or cost-sensitive workflows, a local model may be the better architectural fit.
Increasingly, the best answer is not one model. The best answer is a model strategy.
Conclusion
This benchmark shows how quickly the LLM market has matured. The overall leader is still strong, but the distance to the next models has become much smaller. OpenAI continues to dominate the top positions, Qwen3.7 Max delivers the most impressive breakthrough, Gemini and Claude remain strong enterprise contenders, DeepSeek shows excellent cost efficiency, and local models are becoming increasingly practical.
For businesses, this is good news. Competition is increasing. Quality is improving. Costs are becoming more flexible. Deployment options are expanding.
The next phase of enterprise AI will not be defined only by who uses the newest or most powerful model. It will be defined by who can benchmark, select, combine and integrate the right models into real business processes.