The best language models for digital products in June 2024

The TIMETOACT GROUP LLM Benchmarks highlight the most powerful AI language models for digital product development. Discover which large language models performed best in June 2024.

Based on real benchmark data from our own software products, we evaluated how well different LLMs handle specific challenges. We examined categories such as document processing, CRM integration, external integration, marketing support, and code generation.

The highlights of the month:

 

LLM Benchmarks | June 2024

Our benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama2 license

A more detailed explanation of the respective categories can be found below the table.

Model Code CRM Docs Integrate Marketing Reason Final Cost Speed
GPT-4o ☁️ 85 95 100 90 82 75 88 1.24 € 1.49 rps
GPT-4 Turbo v5/2024-04-09 ☁️ 80 99 98 93 88 45 84 2.51 € 0.83 rps
Claude 3.5 Sonnet ☁️ 67 83 89 78 80 59 76 0.97 € 0.09 rps
GPT-4 v1/0314 ☁️ 80 88 98 52 88 50 76 7.19 € 1.26 rps
GPT-4 Turbo v4/0125-preview ☁️ 60 97 100 71 75 45 75 2.51 € 0.82 rps
GPT-4 v2/0613 ☁️ 80 83 95 52 88 50 74 7.19 € 2.07 rps
Claude 3 Opus ☁️ 64 88 100 53 76 59 73 4.83 € 0.41 rps
GPT-4 Turbo v3/1106-preview ☁️ 60 75 98 52 88 62 72 2.52 € 0.68 rps
Gemini Pro 1.5 0514 ☁️ 67 96 75 100 25 62 71 2.06 € 0.91 rps
Gemini Pro 1.5 0409 ☁️ 62 97 96 63 75 28 70 1.89 € 0.58 rps
GPT-3.5 v2/0613 ☁️ 62 81 73 75 81 48 70 0.35 € 1.39 rps
GPT-3.5 v3/1106 ☁️ 62 70 71 63 78 59 67 0.24 € 2.29 rps
GPT-3.5 v4/0125 ☁️ 58 87 71 60 78 47 67 0.13 € 1.41 rps
Gemini 1.5 Flash 0514 ☁️ 32 97 100 56 72 41 66 0.10 € 1.76 rps
Gemini Pro 1.0 ☁️ 55 86 83 60 88 26 66 0.10 € 1.35 rps
Cohere Command R+ ☁️ 58 80 76 49 70 59 65 0.85 € 1.88 rps
Qwen1.5 32B Chat f16 ⚠️ 64 90 82 56 78 15 64 1.02 € 1.61 rps
GPT-3.5-instruct 0914 ☁️ 44 92 69 60 88 32 64 0.36 € 2.12 rps
Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ 62 67 84 33 81 48 63 0.22 € 4.91 rps
Meta Llama 3 8B Instruct f16 🦙 74 62 68 49 80 42 63 0.35 € 3.16 rps
GPT-3.5 v1/0301 ☁️ 49 82 69 67 82 24 62 0.36 € 3.93 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ 56 87 67 52 88 23 62 0.33 € 3.28 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ 58 73 72 45 88 28 61 0.33 € 3.27 rps
Llama 3 8B OpenChat-3.6 20240522 f16 ✅ 64 51 76 45 88 39 60 0.30 € 3.62 rps
Starling 7B-alpha f16 ⚠️ 51 66 67 52 88 36 60 0.61 € 1.80 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ 46 72 72 49 88 31 60 0.51 € 2.14 rps
Yi 1.5 34B Chat f16 ⚠️ 44 78 70 52 86 28 60 1.28 € 1.28 rps
Claude 3 Haiku ☁️ 59 69 64 55 75 33 59 0.08 € 0.53 rps
Mixtral 8x22B API (Instruct) ☁️ 47 62 62 94 75 7 58 0.18 € 3.01 rps
Claude 3 Sonnet ☁️ 67 41 74 52 78 30 57 0.97 € 0.85 rps
Qwen2 7B Instruct f32 ⚠️ 44 81 81 39 66 29 57 0.47 € 2.30 rps
Mistral Large v1/2402 ☁️ 33 49 70 75 84 25 56 2.19 € 2.04 rps
Anthropic Claude Instant v1.2 ☁️ 51 75 65 59 65 14 55 2.15 € 1.47 rps
Anthropic Claude v2.0 ☁️ 57 52 55 45 84 35 55 2.24 € 0.40 rps
Cohere Command R ☁️ 39 66 57 55 84 26 54 0.13 € 2.47 rps
Qwen1.5 7B Chat f16 ⚠️ 51 81 60 34 60 36 54 0.30 € 3.62 rps
Anthropic Claude v2.1 ☁️ 36 58 59 60 75 33 53 2.31 € 0.35 rps
Qwen1.5 14B Chat f16 ⚠️ 44 58 51 49 84 17 51 0.38 € 2.90 rps
Meta Llama 3 70B Instruct b8 🦙 46 72 53 29 82 18 50 7.32 € 0.22 rps
Mistral 7B OpenOrca f16 ☁️ 42 57 76 21 78 26 50 0.43 € 2.55 rps
Mistral 7B Instruct v0.1 f16 ☁️ 31 71 69 44 62 21 50 0.79 € 1.39 rps
Llama2 13B Vicuna-1.5 f16 🦙 36 37 53 39 82 38 48 1.02 € 1.07 rps
Codestral v1 ⚠️ 33 47 43 71 66 13 45 0.31 € 3.98 rps
Google Recurrent Gemma 9B IT f16 ⚠️ 46 27 71 45 56 25 45 0.93 € 1.18 rps
Mistral Small v1/2312 (Mixtral) ☁️ 10 67 65 51 56 8 43 0.19 € 2.17 rps
Llama2 13B Hermes f16 🦙 38 24 30 61 60 43 43 1.03 € 1.06 rps
Mistral Small v2/2402 ☁️ 27 42 36 82 56 8 42 0.19 € 3.14 rps
Llama2 13B Hermes b8 🦙 32 25 29 61 60 43 42 4.94 € 0.22 rps
Mistral Medium v1/2312 ☁️ 36 43 27 59 62 12 40 0.83 € 0.35 rps
IBM Granite 34B Code Instruct f16 ☁️ 52 49 30 44 57 5 40 1.12 € 1.46 rps
Llama2 13B Puffin f16 🦙 37 15 38 48 56 41 39 4.89 € 0.22 rps
Llama2 13B Puffin b8 🦙 37 14 37 46 56 39 38 8.65 € 0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ 13 47 57 40 59 8 37 0.05 € 2.30 rps
Llama2 13B chat f16 🦙 15 38 17 45 75 8 33 0.76 € 1.43 rps
Llama2 13B chat b8 🦙 15 38 15 45 75 6 32 3.35 € 0.33 rps
Mistral 7B Notus-v1 f16 ⚠️ 16 54 25 41 48 4 31 0.80 € 1.37 rps
Mistral 7B Zephyr-β f16 ✅ 28 34 46 44 29 4 31 0.51 € 2.14 rps
Llama2 7B chat f16 🦙 20 33 20 42 50 20 31 0.59 € 1.86 rps
Orca 2 13B f16 ⚠️ 15 22 32 22 67 19 29 0.99 € 1.11 rps
Mistral 7B Instruct v0.2 f16 ☁️ 7 30 50 13 58 8 28 1.00 € 1.10 rps
Microsoft Phi 3 Mini 4K Instruct f16 ⚠️ 36 35 31 1 50 6 27 0.87 € 1.26 rps
Mistral 7B v0.1 f16 ☁️ 0 9 42 42 52 12 26 0.93 € 1.17 rps
Microsoft Phi 3 Medium 4K Instruct f16 ⚠️ 12 34 30 13 47 8 24 0.85 € 1.28 rps
Google Gemma 2B IT f16 ⚠️ 20 28 14 39 15 20 23 0.32 € 3.44 rps
Orca 2 7B f16 ⚠️ 13 0 24 18 52 4 19 0.81 € 1.34 rps
Google Gemma 7B IT f16 ⚠️ 0 0 0 9 62 0 12 1.03 € 1.06 rps
Llama2 7B f16 🦙 0 5 18 3 28 2 9 1.01 € 1.08 rps
Yi 1.5 9B Chat f16 ⚠️ 0 4 29 8 0 8 8 1.46 € 0.75 rps
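
The article does not define how the Final column is computed. The published rows are, however, consistent with an unweighted mean of the six category scores, rounded to the nearest integer. The short check below is a sketch under that assumption; the function name `final_score` is ours, and the scores are copied from three rows of the table above.

```python
# Category scores copied from the table above, in the order
# (Code, CRM, Docs, Integrate, Marketing, Reason), plus the published Final score.
ROWS = {
    "GPT-4o": ((85, 95, 100, 90, 82, 75), 88),
    "GPT-4 Turbo v5/2024-04-09": ((80, 99, 98, 93, 88, 45), 84),
    "Claude 3.5 Sonnet": ((67, 83, 89, 78, 80, 59), 76),
}


def final_score(categories):
    """Assumed aggregation: unweighted mean, rounded to the nearest integer."""
    return round(sum(categories) / len(categories))


for model, (categories, published) in ROWS.items():
    # Print the recomputed score next to the published one for comparison.
    print(f"{model}: recomputed={final_score(categories)} published={published}")
```

If the assumption holds, a single weak category (for example Reason) pulls the final score down as much as any other, which is worth keeping in mind when comparing models for a specific use case.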

The benchmark categories in detail

Here is exactly what each category of the LLM leaderboard measures:

Docs

How well can the model work with large documents and knowledge bases?

CRM

How well does the model support work with product catalogs and marketplaces?

Integrate

Can the model easily interact with external APIs, services and plugins?

Marketing

How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

Reason

How well can the model reason and draw conclusions in a given context?

Code

Can the model generate code and help with programming?

Cost

The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead.
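
To make the cost logic concrete, here is a minimal sketch of both estimates. The exact formula used in the benchmark is not published in this article, so the function names, parameters, and all numbers below are illustrative assumptions, not the benchmark's actual method or real prices.

```python
def cloud_cost(input_tokens, output_tokens, eur_per_1k_input, eur_per_1k_output):
    """Cloud model: cost follows the provider's per-token pricing (assumed per 1k tokens)."""
    return (input_tokens / 1000) * eur_per_1k_input + (output_tokens / 1000) * eur_per_1k_output


def on_prem_cost(requests, rps, gpu_count, gpu_hourly_eur, overhead=0.0):
    """On-premises model: GPU rental time needed for the workload, plus overhead.

    requests        number of requests in the workload
    rps             measured speed in requests per second (no batching)
    gpu_count       GPUs required to host the model
    gpu_hourly_eur  rental price per GPU per hour
    overhead        operational overhead as a fraction (0.5 means +50%)
    """
    if rps <= 0:
        raise ValueError("rps must be positive")
    hours = requests / rps / 3600
    return hours * gpu_count * gpu_hourly_eur * (1 + overhead)


# Illustrative numbers only: 7200 requests at 2 rps take one hour on one GPU.
print(on_prem_cost(requests=7200, rps=2.0, gpu_count=1, gpu_hourly_eur=3.0, overhead=0.5))
print(cloud_cost(input_tokens=10_000, output_tokens=2_000,
                 eur_per_1k_input=0.005, eur_per_1k_output=0.015))
```

Under this model, a slower local model costs more for the same workload, because it occupies the rented GPUs for longer.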

Speed

The "Speed" column indicates the estimated speed of the model in requests per second (without batching). The higher the speed, the better.
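
One simple way to obtain such a number is to send requests strictly one after another and divide the request count by the elapsed time. The sketch below assumes exactly that; `measure_rps` and `fake_model` are hypothetical names, and the stand-in model just sleeps instead of calling a real LLM.

```python
import time


def measure_rps(send_request, prompts):
    """Send prompts one at a time (no batching) and return requests per second."""
    start = time.perf_counter()
    for prompt in prompts:
        send_request(prompt)
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed if elapsed > 0 else float("inf")


def fake_model(prompt):
    """Stand-in for a real model call; pretends each request takes about 10 ms."""
    time.sleep(0.01)
    return prompt.upper()


print(f"{measure_rps(fake_model, ['a', 'b', 'c']):.1f} rps")
```

Servers that batch concurrent requests can reach a much higher aggregate throughput than this sequential figure suggests.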

Deeper insights

Claude 3.5 Sonnet - Anthropic did it again

Remember how Anthropic made a big quality improvement in their models in March?

 

They have just done it again by releasing Claude 3.5 Sonnet. This mid-range model is not only more powerful than the top-of-the-range Opus model, but also about five times cheaper.

Improved performance with Claude 3.5 Sonnet

Claude 3.5 Sonnet follows instructions better and matches the reasoning capabilities of Anthropic's top model, Claude 3 Opus, so this is a huge improvement.

NEW: ARTIFACTS FOR A BETTER USER EXPERIENCE

Anthropic's product line has one more big improvement, though. It is called Artifacts, and it is not about LLM capability at all, but about user experience and LLM integration.

ARTIFACTS: WORKING EFFICIENTLY WITH DOCUMENTS AND CODE

The idea of Artifacts is this: when you are working on a document or a piece of code, the Claude web chat pulls it into a convenient separate window. The document then becomes an entity of its own, not just a snippet that is repeated in the chat. Artifacts are versioned, so you can iterate on them properly.

This may seem like a small feature, but together with Claude 3.5 Sonnet, it becomes a huge productivity boost that makes it worthwhile to use Claude Chat instead of ChatGPT when working with documents and code snippets.

Small, efficient models are getting better and better

Last month we tested several local LLMs. There were some pleasant surprises:

The first was Google Gemma 7B Instruct. This Google model is often criticized for being too restricted and limited.

However, the OpenChat 3.5 fine-tuning of this model reveals its true capabilities and places this 7B model above the first version of GPT-3.5.

GPT-3.5 is rumored to have had somewhere between 20B and 175B parameters, yet this small 7B model (which can run on a laptop) manages to outperform it! The rate of progress is impressive.

In fact, the only local LLM that performs better than this model (in our benchmarks) is Alibaba's Qwen1.5-32B model. However, that model has a non-standard license and requires more than four times as many resources to run.

As you can see from the picture, there are already many 7B models with performance comparable to early versions of GPT-3.5. Given current trends, progress is unlikely to stop there.

Poorer performing models

Not all local models performed so well in our benchmark. Here are some that performed poorly (mostly because they couldn't follow even basic instructions accurately):

- Yi 1.5 34B Chat

- Google Recurrent Gemma 9B IT

- Microsoft Phi 3 Mini/Medium

- Google Gemma 2B/7B

Apple Privacy Model and Confidential Computing

In its latest announcement, Apple has started to introduce more AI features to its ecosystem. One of the most interesting aspects was the concept of Private Cloud Compute.

Essentially, the iPhone will use a small and efficient LLM to process all incoming requests. This LLM is not very powerful; it is roughly comparable to modern 7B models. However, it is fast and processes requests securely, on the device itself.

It becomes particularly interesting when the LLM-controlled system recognizes that it needs more computing power to process the request.

In this case, it has two options:

  • It can ask the user for permission to send the specific request to OpenAI GPT.

  • It can securely forward the request to a private cloud compute managed by Apple.

 

What is private cloud compute?

It is a protected Apple data center that uses Apple's own chips to host powerful large language models. The setup gives strong guarantees that your personal requests will be handled securely and that nobody, not even Apple, will see the questions and answers.

This is done through a combination of special hardware, encryption, secured VM images and mutual attestation between the software and hardware. The goal is to make breaking this setup very hard and expensive, even for Apple or governments.

Apple is all about consumer electronics. Is there anything comparable for companies?

Yes, there is. It's called confidential computing. The concept has been around for some time (see the Confidential Computing Consortium), but has only recently been properly applied to GPUs by Nvidia. Nvidia introduced it in the Hopper architecture (H100 GPUs) and almost completely eliminated the performance penalty in the Blackwell architecture.

The concept is the same as Apple's PCC:

  • data is encrypted in transit and at rest

  • data is decrypted only for the duration of the computation

  • hardware and software are designed to make it practically impossible (that is, really hard and expensive) to look at the data while it is decrypted.

Major cloud providers are already testing VMs with confidential GPU computing (e.g. Microsoft Azure with H100 since 2023, Google Cloud with H100 since 2024).

This approach is interesting because it offers a third option to companies that need to build a secure LLM-driven system:

Option | Guarantees | Upfront investment | Operating costs
OpenAI from Microsoft | Medium - not everyone likes sending data to third parties, but many already use MS Office | None | High - we pay per request
Our own data center with GPUs | Very high - data remains within our security perimeter | Huge - GPUs are expensive, lead times are also long | Low
Renting confidential GPU compute | High - there are many guarantees that our data is protected from everyone else | Low - we can pay as we go | High - we pay per rental period

Just like with hybrid clouds (they were a big thing in the past, but are the norm these days), we can mix and match these options for a cost-effective and secure solution, just like Apple does with PCC. For example:

  • Have a small local deployment that runs cost-effective 7B models on our own hardware. It handles requests locally by default.

  • If a user request needs a more powerful LLM and doesn't involve critical information, route it to Azure OpenAI.

  • If a user request is both sensitive and requires a lot of GPU compute, route it to confidential compute in the cloud.
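
The three rules above can be sketched as a small routing function. This is a minimal illustration, not a production design: the `Request` fields and the `Backend` names are hypothetical, and how a system decides that a request is sensitive or needs a more powerful model is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    LOCAL_7B = "local 7B model on our own hardware"
    AZURE_OPENAI = "Azure OpenAI"
    CONFIDENTIAL_CLOUD = "confidential compute in the cloud"


@dataclass
class Request:
    text: str
    needs_powerful_model: bool  # hypothetical flag, e.g. set by a classifier
    is_sensitive: bool          # hypothetical flag, e.g. set by a data policy


def route(request: Request) -> Backend:
    """Pick a backend following the three rules described above."""
    if not request.needs_powerful_model:
        # Default: the small local deployment handles the request.
        return Backend.LOCAL_7B
    if request.is_sensitive:
        # Sensitive and compute-heavy: rented confidential GPUs.
        return Backend.CONFIDENTIAL_CLOUD
    # Compute-heavy but not critical: a regular cloud API is acceptable.
    return Backend.AZURE_OPENAI


print(route(Request("Summarize this contract", needs_powerful_model=True, is_sensitive=True)).value)
```

Keeping the decision in one function makes it easy to add a fourth target later, such as powerful local GPUs for a steady confidential workload.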

Ultimately, if the powerful-and-confidential workload is steady enough, it might make sense to add a few local and powerful GPUs to handle it. During the peaks we can still rent confidential compute in the cloud.

With an H100 setup, you can expect high performance even from a single GPU if you use the right software and optimization profile. For example, switching the backend from vLLM to TensorRT in an Nvidia NIM setup can yield 20-50% more throughput with Llama 3 8B at fp16. Since the H100 hardware also supports fp8 quantization, switching from fp16 to fp8 can add a further 10-30%. NB: Performance gains depend on the overall context size, batch size and nature of the workload.
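
As a rough back-of-the-envelope check, the two ranges can be combined. The sketch below assumes the gains are independent and multiply, which the figures above do not guarantee; `combined_gain` is a hypothetical helper.

```python
def combined_gain(*gains):
    """Combine relative gains multiplicatively, e.g. +20% then +10% gives +32%."""
    factor = 1.0
    for gain in gains:
        factor *= 1.0 + gain
    return factor - 1.0


# Lower and upper ends of the ranges quoted above:
# backend switch (20-50%) followed by fp16 -> fp8 (10-30%).
low = combined_gain(0.20, 0.10)
high = combined_gain(0.50, 0.30)
print(f"combined throughput gain: {low:.0%} to {high:.0%}")
```

Under this assumption the combined gain would land somewhere between roughly 32% and 95%, so it is worth measuring on your own workload before sizing hardware.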

LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!

Transform your digital projects with the best AI language models!

Discover the transformative power of the best LLM and revolutionize your digital products with AI! Stay future-oriented, increase efficiency and secure a clear competitive advantage. We support you in taking your business value to the next level.

Martin Warnung
Sales Consultant TIMETOACT GROUP Österreich GmbH +43 664 881 788 80