LLM Benchmarks Summer 2025

Who is dominating the field of large language models? Our experts from the AI Strategy & Research Hub present the latest results and highlight the key trends companies should keep an eye on right now.

Many have asked - when are the LLM benchmarks coming back online on a regular schedule? Here we are, with a lot of new material to catch up on:

  • Sharing the secrets - Schema-Guided Reasoning

  • OpenAI GPT-5 Releases are a Big Deal

  • A structural problem with the GPT-5 release

  • Grok-4 shares the top place

  • Gemini 2.5 Pro

  • Qwen-3 is still quite popular

  • DeepSeek - incremental improvements

  • Enterprise Reasoning Challenge (ERCr3)

Benchmarks

Model Scores Summer 2025

| ID | # | Model | bi | compliance | code | reason | Score | Err | Features |
|---|---|---|---|---|---|---|---|---|---|
| 78 | 1 | openai/gpt-5-2025-08-07 | 54% | 70% | 100% | 77% | 79.4% | | SO, Reason |
| 84 | 2 | x-ai/grok-4 | 54% | 70% | 100% | 77% | 79.4% | | SO, Reason |
| 18 | 3 | openai/o3-mini-2025-01-31 | 45% | 70% | 100% | 74% | 76.7% | | SO, Reason |
| 49 | 4 | openai/o4-mini-2025-04-16 | 45% | 70% | 100% | 74% | 76.7% | | SO, Reason |
| 80 | 5 | openai/gpt-5-mini-2025-08-07 | 54% | 70% | 93% | 74% | 76.7% | | SO, Reason |
| 76 | 6 | openai/gpt-oss-120b | 54% | 67% | 92% | 72% | 75.0% | | Open |
| 68 | 7 | x-ai/grok-3-mini | 54% | 62% | 97% | 71% | 74.0% | 3 | |
| 72 | 8 | google/gemini-2.5-pro-preview-06-05 | 45% | 70% | 100% | 71% | 73.9% | | Reason |
| 51 | 9 | google/gemini-2.5-flash-preview:thinking | 45% | 57% | 100% | 68% | 71.2% | 1 | Reason |
| 37 | 10 | google/gemini-2.5-pro-preview-03-25 | 45% | 70% | 93% | 68% | 71.1% | | Reason |
| 52 | 11 | qwen/qwen3-32b | 54% | 40% | 96% | 68% | 71.1% | 1 | Reason, Open |
| 29 | 12 | anthropic/claude-3.7-sonnet:thinking | 54% | 32% | 100% | 67% | 70.4% | 1 | Reason |
| 7 | 13 | openai/o1-2024-12-17 | 45% | 70% | 84% | 67% | 70.0% | | SO, Reason |
| 62 | 14 | deepseek/deepseek-r1-0528 | 45% | 62% | 93% | 66% | 68.9% | | SO, Reason, Open |
| 46 | 15 | openai/gpt-4.1-2025-04-14 | 45% | 70% | 77% | 67% | 67.2% | | SO |
| 79 | 16 | openai/gpt-5-nano-2025-08-07 | 36% | 67% | 90% | 63% | 66.7% | | SO, Reason |
| 4 | 17 | deepseek/deepseek-r1 | 27% | 64% | 100% | 63% | 66.1% | | SO, Reason, Open |
| 77 | 18 | openai/gpt-oss-20b | 36% | 70% | 88% | 63% | 66.1% | | Open |
| 61 | 19 | anthropic/claude-opus-4 | 45% | 47% | 92% | 62% | 65.7% | | Reason |
| 53 | 20 | qwen/qwen3-30b-a3b | 45% | 37% | 96% | 61% | 65.0% | | Reason, Open |
| 65 | 21 | anthropic/claude-sonnet-4 | 45% | 67% | 78% | 61% | 64.4% | | |
| 82 | 22 | anthropic/claude-opus-4.1 | 45% | 47% | 81% | 59% | 63.2% | | Reason |
| 54 | 23 | qwen/qwen3-235b-a22b | 36% | 45% | 100% | 59% | 62.8% | | Reason, Open |
| 86 | 24 | qwen/qwen3-235b-a22b-2507 | 45% | 60% | 72% | 62% | 62.8% | | SO, Open |
| 47 | 25 | openai/gpt-4.1-mini-2025-04-14 | 36% | 80% | 63% | 60% | 61.1% | | SO |
| 12 | 26 | deepseek/deepseek-r1-distill-llama-70b | 36% | 32% | 96% | 56% | 60.0% | 4 | Open |
| 36 | 27 | deepseek/deepseek-chat-v3-0324 | 45% | 60% | 70% | 55% | 59.6% | | Reason, Open |
| 50 | 28 | google/gemini-2.5-flash-preview | 45% | 60% | 70% | 58% | 59.4% | | |
| 85 | 29 | x-ai/grok-3 | 36% | 65% | 69% | 55% | 59.3% | | SO, Reason |
| 87 | 30 | deepseek/deepseek-chat-v3.1 | 36% | 62% | 68% | 57% | 58.2% | | SO, Open |
| 64 | 31 | deepseek/deepseek-r1-0528-qwen3-8b | 27% | 62% | 82% | 52% | 56.7% | 2 | Reason, Open |
| 28 | 32 | anthropic/claude-3.7-sonnet | 45% | 47% | 65% | 55% | 56.5% | | |
| 55 | 33 | qwen/qwen3-14b | 27% | 15% | 100% | 52% | 56.1% | | Reason, Open |
| 1 | 34 | openai/gpt-4o-2024-11-20 | 36% | 55% | 62% | 55% | 53.6% | | SO |
| 30 | 35 | openai/gpt-4.5-preview-2025-02-27 | 45% | 47% | 62% | 53% | 51.9% | | SO |
| 23 | 36 | deepseek-v3 | 36% | 47% | 58% | 49% | 50.6% | 1 | SO, Open |
| 9 | 37 | openai/gpt-4o-2024-08-06 | 18% | 62% | 63% | 52% | 50.5% | | SO |
| 58 | 38 | mistralai/mistral-medium-3 | 36% | 35% | 70% | 45% | 49.9% | | SO, Reason |
| 11 | 39 | microsoft/phi-4 | 36% | 62% | 57% | 48% | 49.7% | 3 | Open |
| 39 | 40 | meta-llama/llama-4-maverick | 27% | 42% | 70% | 44% | 49.1% | | SO, Open |
| 83 | 41 | mistralai/mistral-medium-3.1 | 36% | 27% | 69% | 45% | 47.5% | | SO, Reason |
| 67 | 42 | x-ai/grok-3 | 54% | 30% | 53% | 45% | 47.2% | | |
| 19 | 43 | qwen/qwen-max | 45% | 45% | 45% | 50% | 46.3% | 1 | |
| 71 | 44 | mistralai/magistral-medium-2506:thinking | 45% | 52% | 49% | 44% | 46.1% | 1 | SO, Reason |
| 66 | 45 | google/gemini-2.5-flash-lite-preview-06-17 | 27% | 12% | 82% | 43% | 45.6% | 12 | |
| 33 | 46 | google/gemma-3-27b-it | 27% | 27% | 70% | 43% | 45.0% | 2 | Open |
| 10 | 47 | anthropic/claude-3.5-sonnet | 36% | 32% | 57% | 44% | 43.6% | | |
| 27 | 48 | meta-llama/llama-3.1-70b-instruct | 36% | 50% | 44% | 43% | 42.6% | | SO, Open |
| 13 | 49 | meta-llama/llama-3.3-70b-instruct | 27% | 50% | 48% | 41% | 40.8% | | SO, Open |
| 22 | 50 | google/gemini-2.0-flash-001 | 27% | 24% | 57% | 38% | 40.7% | | |
| 32 | 51 | qwen/qwq-32b | 36% | 52% | 41% | 37% | 40.0% | 3 | SO, Reason, Open |
| 8 | 52 | qwen/qwen-2.5-72b-instruct | 27% | 30% | 47% | 39% | 39.2% | | SO, Open |
| 35 | 53 | mistralai/mistral-small-3.1-24b-instruct | 36% | 42% | 41% | 39% | 39.2% | | SO, Open |
| 48 | 54 | openai/gpt-4.1-nano-2025-04-14 | 9% | 32% | 64% | 32% | 37.7% | | SO |
| 31 | 55 | qwen/qwen2.5-32b-instruct | 27% | 20% | 53% | 36% | 36.6% | | Open |
| 17 | 56 | qwen/qwen-2.5-coder-32b-instruct | 18% | 35% | 54% | 39% | 36.5% | | SO, Open |
| 14 | 57 | meta-llama/llama-3.1-405b-instruct | 18% | 55% | 40% | 38% | 35.5% | | SO, Open |
| 41 | 58 | google/gemma-3-12b-it | 9% | 17% | 61% | 30% | 33.4% | | Open |
| 20 | 59 | qwen/qwen-plus | 18% | 25% | 40% | 31% | 31.7% | 1 | |
| 42 | 60 | google/gemma-3-12b-it-qat-q4_0-gguf | 18% | 47% | 34% | 24% | 30.6% | | SO, Open |
| 73 | 61 | moonshotai/kimi-k2 | 27% | 27% | 32% | 30% | 30.6% | 3 | SO, Open |
| 25 | 62 | mistralai/mixtral-8x22b-instruct | 9% | 27% | 47% | 28% | 29.2% | | SO, Open |
| 5 | 63 | openai/gpt-4o-mini-2024-07-18 | 9% | 32% | 41% | 30% | 28.4% | | SO |
| 15 | 64 | mistral/mistral-small-24b-instruct-2501 | 27% | 22% | 33% | 30% | 27.8% | | SO, Open |
| 21 | 65 | qwen/qwen-turbo | 0% | 15% | 41% | 20% | 21.9% | 2 | |
| 16 | 66 | deepseek/deepseek-r1-distill-qwen-32b | 9% | 22% | 29% | 17% | 21.2% | 2 | SO, Open |
| 70 | 67 | mistralai/magistral-small-2506 | 27% | 25% | 10% | 20% | 18.8% | 21 | SO, Open |
| 38 | 68 | meta-llama/llama-4-scout | 9% | 25% | 22% | 16% | 18.0% | | SO, Open |
| 40 | 69 | mistral/ministral-8b | 18% | 0% | 20% | 13% | 14.8% | 1 | SO, Open |
| 24 | 70 | meta-llama/llama-3.2-3b-instruct | 0% | 17% | 16% | 11% | 10.6% | 2 | SO, Open |
| 69 | 71 | sentientagi/dobby-mini-unhinged-plus-llama-3.1-8b | 9% | 10% | 10% | 11% | 10.6% | 11 | SO, Open |
| 26 | 72 | mistralai/mistral-large-2411 | 0% | 0% | 0% | 0% | 0.0% | 36 | SO, Open |
| 59 | 73 | ByteDance-Seed/Seed-Coder-8B-Reasoning | 0% | 0% | 0% | 0% | 0.0% | 36 | SO, Reason, Open |
| | | Averages | 33% | 44% | 64% | 47% | | | |

Schema-Guided Reasoning (SGR)

We finally have a term for the Custom Chain-of-Thought approach (Structured Output CoT, or SO CoT) that we’ve been using heavily in various projects.

This approach was initially extracted from the successful cases in our AI portfolio and further refined by AI R&D work in the community (including successful submissions in Enterprise RAG Challenges).

In fact, all evals from our Reasoning LLM Benchmark v2 (starting from January 2025) leverage specialised SGR schemas to drive the reasoning.

You can read more about SGR here or check out the publicly shared demo. The demo shows how to use SGR to build a business assistant capable of:

  • planning and reasoning while using an inexpensive non-reasoning model

  • calling tools to help manage customers in a fictional company that sells AGI courses (in the demo we simulate tools to create invoices, send emails and pull customer data)

  • creating additional rules and memories for itself

All of that is done in 160 lines of Python code without any AI frameworks or built-in tool calling - just the OpenAI SDK and Pydantic.
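To give a flavour of the pattern, here is a minimal sketch of SGR with the OpenAI SDK and Pydantic. It is not the demo itself: the schema fields, the model name and the task are illustrative.

```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, Field


# Schema-Guided Reasoning: the schema itself is the reasoning plan.
# The model has to fill the fields in this exact order, so it reasons
# about facts and rules before it commits to an action.
class NextStep(BaseModel):
    # illustrative fields - not the exact schema from the SGR demo
    relevant_facts: list[str] = Field(description="Facts from the request that matter")
    applicable_rules: list[str] = Field(description="Business rules that apply here")
    reasoning: str = Field(description="Step-by-step reasoning over facts and rules")
    action: Literal["create_invoice", "send_email", "ask_clarification"]
    response_to_user: str


client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any model with Structured Outputs support will do
    messages=[
        {"role": "system", "content": "You are an assistant for a company selling AGI courses."},
        {"role": "user", "content": "Please invoice jane@example.com for the 'AGI 101' course."},
    ],
    response_format=NextStep,  # constrained decoding keeps the JSON schema-compliant
)

step = completion.choices[0].message.parsed
print(step.action, "-", step.response_to_user)
```

Roughly speaking, the demo runs steps like this in a loop and executes the selected action against its simulated tools.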

Creating and orchestrating agents for business tasks is a topic we are quite interested in, and we are keen to push the state of the art in this field further. We’ll have one more interesting announcement along these lines later in this report, but for now let’s check out which models are best at handling business tasks with Schema-Guided Reasoning.

OpenAI GPT-5 Releases are a Big Deal

Let’s start with the obvious big wins. OpenAI has released a range of models recently:

gpt-5 from OpenAI is currently the TOP-1 model on our leaderboard!

gpt-5 is the smartest model, but it is also quite large, slow and expensive - overkill for day-to-day business automation tasks at scale. For those we have smaller models like gpt-5-mini. And this is where the interesting things start.

gpt-5-mini currently holds 5th place on the leaderboard when running under Schema-Guided Reasoning. It is a reasonably priced and capable model - very well balanced.

In addition to the API models, OpenAI has also released two open-weights models that you can freely download and run on your own hardware.

The most curious part is that the gpt-oss-120b model looks very similar to gpt-5-mini on our benchmark. It is as if the two models were almost the same.

Either way, this is the first time in a long while that a TOP-5 model has been shared publicly for free use.

Likewise, the gpt-5-nano model scored 16th place on our leaderboard, and the gpt-oss-20b model had very similar results, taking 18th place.

ℹ️ While the models are called gpt-oss, they are not exactly Open Source models, but rather Open Weights models. This means that one can download and use these models freely, but the original training data and pipelines are not shared.

gpt-oss models leverage a Mixture-of-Experts (MoE) architecture, where only a small part of the model is used to generate each new token. This makes these models really fast.
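To illustrate why only a fraction of the weights is active for each token, here is a toy sketch of top-k expert routing in plain Python. It only demonstrates the idea; it has nothing to do with the actual gpt-oss implementation.

```python
import math
import random

NUM_EXPERTS = 32  # total experts in the layer
TOP_K = 4         # experts actually evaluated per token

def router(token: list[float]) -> list[float]:
    # stand-in for the learned gating network: one score per expert
    rng = random.Random(hash(tuple(token)))
    return [rng.random() for _ in range(NUM_EXPERTS)]

def expert(idx: int, token: list[float]) -> list[float]:
    # stand-in for an expert feed-forward block (the expensive part)
    return [math.tanh(x + 0.01 * idx) for x in token]

def moe_layer(token: list[float]) -> list[float]:
    scores = router(token)
    top = sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]
    total = sum(scores[i] for i in top)
    out = [0.0] * len(token)
    for i in top:  # only TOP_K of NUM_EXPERTS experts do any work for this token
        weight = scores[i] / total
        for d, value in enumerate(expert(i, token)):
            out[d] += weight * value
    return out

print(moe_layer([0.1, -0.2, 0.3]))
```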

You can run the gpt-oss-120B model on a single H100 GPU (it requires a modern GPU with 80 GB of VRAM), while gpt-oss-20B requires a modern GPU with 16 GB of VRAM (like an RTX 5090).

Now, the 'modern GPU' requirement normally means such models would not be natively supported by older GPUs like the 4090 or A100. However, there's another unusual catch.

Thanks to the MoE architecture of gpt-oss, you can also run these local models at lower speeds with surprisingly little VRAM as well.

For example, 120B can run fairly well (10-30 tokens per second) on older cards with just 5-8 GB of VRAM. In that case the attention part of the model stays on the GPU, while all the experts reside in ordinary system RAM (you would need 64 GB for that). You can read more about configuring llama.cpp to run these models on Reddit (discussion).

ℹ️ TL;DR: there is a new --cpu-moe switch that supports MoE offloading to CPU (not yet supported in Ollama).
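For reference, here is a rough sketch of launching llama-server with MoE offloading from Python. The --cpu-moe switch is the one mentioned above; the model file name and the remaining flags are common llama.cpp server options and are assumptions - check your own build.

```python
import subprocess

# Start llama-server in the foreground: keep non-expert layers on the GPU (-ngl 999)
# while --cpu-moe pushes the MoE expert weights into system RAM.
# The GGUF file name is hypothetical - point it at your own download.
subprocess.run([
    "llama-server",
    "-m", "gpt-oss-120b-Q4_K_M.gguf",  # hypothetical model file
    "-ngl", "999",                     # offload everything that is not an expert
    "--cpu-moe",                       # new switch: experts stay in system RAM
    "-c", "8192",                      # context size
    "--port", "8080",
], check=True)
```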

A structural problem with the GPT-5 release

There is one problem with the GPT-5 release. It uses a completely new response format for defining conversations, called OpenAI Harmony. This format doesn’t play well with Structured Outputs as of yet. This means that the OpenAI API occasionally fails to fulfill its promise to always return JSON that complies with the provided schema.

We encountered failures with the OpenAI SDK, where gpt-5, gpt-5-mini and gpt-5-nano returned responses that were incompatible with the provided schema. Here is the Gist that reliably reproduces the problem with all GPT-5 models: SGR triggers Harmony parsing bug with GPT-5 models. We reported it to OpenAI directly, and also shared it with the OpenAI Community with a repro, so you can check the full details yourself.

Not only does the problem go away when switching from gpt-5 models back to gpt-4o, but the responses also get much faster. There are two possible reasons for that:

  • gpt-4o doesn’t use the more complicated Harmony response format

  • gpt-4o is not a reasoning model. GPT-5 models like to think under the hood before responding, while gpt-4o just answers.

Issues with the new gpt-5 models don’t stop at the API; they also affect the gpt-oss-120B and gpt-oss-20B models. None of the public LLM providers yet offer access to these models with working Structured Outputs. Even Ollama struggles with the new format (see ticket).

ℹ️ How did we get GPT-5 models to work reliably in our benchmark? We didn’t. We simulated working constrained decoding by discarding every response with an invalid schema until a valid one was produced.
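In code, that workaround boils down to a validate-and-retry loop, roughly like the sketch below (illustrative, not our actual benchmark harness; the schema is a placeholder):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class NextStep(BaseModel):
    # placeholder schema - the benchmark drives its own SGR schemas
    reasoning: str
    answer: str

def parse_with_retries(messages: list[dict], model: str = "gpt-5-mini",
                       max_attempts: int = 5) -> NextStep:
    """Simulate constrained decoding: throw away schema-invalid responses and retry."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            completion = client.beta.chat.completions.parse(
                model=model,
                messages=messages,
                response_format=NextStep,
            )
            parsed = completion.choices[0].message.parsed
            if parsed is not None:  # schema-compliant response - keep it
                return parsed
        except Exception as err:    # response did not match the schema - drop it and retry
            last_error = err
    raise RuntimeError(f"No schema-compliant response after {max_attempts} attempts") from last_error
```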

We are pretty sure that the integration issues will be resolved soon enough. Then we will have a great local model that is smart, fast and can be led to reason within SGR, further boosting its capabilities.

Grok-4 shares the top place with OpenAI GPT-5

Grok models historically scored low on our benchmarks. However, Grok-4 suddenly jumped to the top of the leaderboard, getting scores similar to GPT-5 (medium reasoning effort).

The primary gotcha with Grok-4 is that it can get quite expensive and slow. Here is an example of one request from our benchmark, where it took 50 seconds just to start receiving a response.

Gemini 2.5 Pro

Gemini 2.5 Pro is currently one of the best general-purpose models to use in business automation tasks. It features a large context (that it can actually work with), can handle multiple modalities and is quite cheap.

The only problem is that Google LLMs still don’t have proper Structured Outputs (comparable to the capabilities of Mistral, OpenAI, Fireworks, Cerebras, Grok, or any local deployment). They feature only a limited subset that can be a pain to work with.

Anthropic models have been mediocre at best in the past months. The highest they reached was TOP-12 with a rather expensive claude-3.7-sonnet in thinking mode. Anthropic, however, doesn’t support Structured Outputs via constrained decoding, which makes integration with their LLMs rather unreliable.

Qwen-3 models are still popular

New Qwen-3 models gained popularity immediately after the initial release at the end of April and are still praised for their quality. In fact, qwen-3-32B holds 11th place on our leaderboard, right above claude-3.7-sonnet:thinking.

When one of the community members decided to check what the smallest possible LLM is that can drive AI agents in business scenarios, they managed to port the SGR Demo to run on top of Qwen-3-4B via a local llama.cpp deployment.

ℹ️ The concrete version used in this sample is a quantised build of Qwen3-4B-Instruct-2507: Qwen3-4B-Instruct-2507-Q8_0

This is worth two comments:

  • Teams currently choose Qwen-3 models when they need the smallest capable model.

  • Obviously, in real-world business cases running something larger would give some safety margin (e.g. gpt-oss-20B or qwen-3-32B). However, it is just crazy that such a small model can still make sense in fairly complex scenarios.

Here is the source code that shows how to upgrade the classic SGR demo to work with Qwen-3-4B. It includes three major changes:

  • Removes the OpenAI SDK and composes raw requests to the API exposed by llama.cpp

  • Extends the prompt by spelling out business rules in more detail

  • Adds one more reasoning field at the beginning of the SGR Cascade in NextStep

These changes are enough to make even Qwen-3-4B start making sense in a task that requires multi-step reasoning and agentic behaviour.
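Here is a rough sketch of the first and third change. The field names are illustrative, and the request shape assumes a recent llama-server build that accepts a JSON schema via response_format:

```python
import requests
from pydantic import BaseModel, Field

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"  # local llama.cpp server

class NextStep(BaseModel):
    # extra reasoning field added up front to help the small model,
    # followed by the rest of the (illustrative) SGR cascade
    situation_recap: str = Field(description="Restate the request and the current state")
    plan: list[str]
    tool: str
    tool_arguments: dict

def next_step(messages: list[dict]) -> NextStep:
    payload = {
        "messages": messages,
        "temperature": 0.2,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "next_step", "schema": NextStep.model_json_schema()},
        },
    }
    reply = requests.post(LLAMA_SERVER, json=payload, timeout=120)  # raw request, no OpenAI SDK
    reply.raise_for_status()
    content = reply.json()["choices"][0]["message"]["content"]
    return NextStep.model_validate_json(content)
```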

DeepSeek - incremental improvements

DeepSeek models were a big hit when they first came out. However, multiple better models have come out since then, pushing them down the benchmark ladder.

At the time of writing, the initial deepseek-r1 version has fallen to 17th place, getting the same score in SGR-driven mode as the gpt-oss-20B model. deepseek-r1-0528 was just an incremental improvement, bringing it to its current 14th place.

However, not many people would be interested in running a 671B model when there is a much better 120B model available. Qwen3-32B is smaller and also better.

The recently released DeepSeek Chat v3.1 didn’t perform much better on our SGR benchmark either.

Enterprise Reasoning Challenge (ERCr3)

As you can see, there is a wide variety of capable models showing up. As soon as the community figures out how to reliably use Structured Outputs with gpt-oss models and their Harmony response format, we will get into a very interesting situation:

  • there are LLMs that handle business tasks with SGR really well (within the TOP-20)

  • you can freely download and use them

  • even on rather non-demanding hardware

This is a big deal, but what about pushing the state of the art in enterprise automation even further together and finding patterns to do even more with less?

If you have followed our previous work, you know that we do this by running massive crowdsourced experiments together with a community of enthusiasts and independent teams (e.g. see Enterprise RAG Challenge round 2).

We are planning to run the 3rd round of our Enterprise Challenges this Fall/Winter. This time we will focus on business automation with agents via APIs.

The objective of the challenge for teams will be to write an agent that gets human requests like "redo last [email protected] invoice: use 3x discount of [email protected]". It will then need to use the available (simulated) APIs to find its way and carry out the operation properly.

We will provide participants with the simulated APIs that their agents can call to accomplish the tasks. However, it will be the agent’s job to figure out which APIs to call, and in which order, to get the task done.

Most of the time, the full solution will not be known in advance - a request to some API has to provide the missing piece of the puzzle first. So it will be the agent’s job to reason through the task and use the proper tools to get the job done.

ℹ️ Implementation-wise, the solution doesn’t have to be “an agent”. It could be a multi-agent system, an orchestrator, or a single prompt with MCP plugins - whatever solves the problem. And we will compare the performance of these radically different approaches within the same setup.
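To make the setup more tangible, here is a deliberately tiny sketch of one possible agent loop. Everything in it - the endpoint names, the step schema, the step limit - is invented for illustration; the actual challenge will publish its own API specification.

```python
from typing import Callable, Literal

import requests
from pydantic import BaseModel

SIMULATED_API = "http://localhost:9000"  # hypothetical simulated-API host

class AgentStep(BaseModel):
    # SGR-style step: reason first, then pick exactly one simulated API to call
    reasoning: str
    api: Literal["find_customer", "list_invoices", "create_invoice", "done"]
    arguments: dict

def run_agent(request_text: str, decide: Callable[[list[dict]], AgentStep]) -> list[dict]:
    """Let the LLM (via `decide`) pick API calls until it declares the task done."""
    history: list[dict] = [{"role": "user", "content": request_text}]
    observations: list[dict] = []
    for _ in range(10):  # hard cap on the number of steps
        step = decide(history)
        if step.api == "done":
            break
        result = requests.post(f"{SIMULATED_API}/{step.api}", json=step.arguments, timeout=30).json()
        observations.append(result)
        history.append({"role": "assistant", "content": step.model_dump_json()})
        history.append({"role": "user", "content": f"API result: {result}"})
    return observations
```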

Similar to the spirit of the previous ERC competitions, we will open source and share as much as possible, including:

  • source code of the simulation runtime

  • source code of the task generator

  • all submissions

  • analysis results and reports

Just like before, we will also hold a public test run before the main event, in order to give everybody a chance to practice and test their agents.

Similar to ERCr1 and ERCr2, we are planning to have multiple leaderboards, including one for local models. This time, local models have a fair chance of competing against even the best models.

Stay tuned for the updates!
