The best large language models for digital products in april 2024

The TIMETOACT GROUP LLM Benchmarks highlight the most powerful AI language models for digital product development. Discover which large language models performed best in april 2024.

Based on real benchmark data from our own software products, we evaluated the performance of different LLM models in addressing specific challenges. We examined specific categories such as document processing, CRM integration, external integration, marketing support, and code generation.  

Highlights:

LLM Benchmarks | April 2024

Our benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama2 license

A more detailed explanation of the respective categories can be found below the table.

Model Code Crm Docs Integrate Marketing Reason Final cost Speed
GPT-4 Turbo v5/2024-04-09 ☁️ 80 99 98 93 88 45 84 2.51 € 0.83 rps
GPT-4 v1/0314 ☁️ 80 88 98 52 88 50 76 7.19 € 1.26 rps
GPT-4 Turbo v4/0125-preview ☁️ 60 97 100 71 75 45 75 2.51 € 0.82 rps
GPT-4 v2/0613 ☁️ 80 83 95 52 88 50 74 7.19 € 2.07 rps
Claude 3 Opus ☁️ 64 88 100 53 76 59 73 4.83 € 0.41 rps
GPT-4 Turbo v3/1106-preview ☁️ 60 75 98 52 88 62 72 2.52 € 0.68 rps
Gemini Pro 1.5 ☁️ 62 97 96 63 75 28 70 1.89 € 0.58 rps
GPT-3.5 v2/0613 ☁️ 62 79 73 75 81 48 70 0.35 € 1.39 rps
GPT-3.5 v3/1106 ☁️ 62 68 71 63 78 59 67 0.24 € 2.29 rps
GPT-3.5 v4/0125 ☁️ 58 85 71 60 78 47 66 0.13 € 1.41 rps
Gemini Pro 1.0 ☁️ 55 86 83 60 88 26 66 0.10 € 1.35 rps
Cohere Command R+ ☁️ 58 77 76 49 70 59 65 0.85 € 1.88 rps
GPT-3.5-instruct 0914 ☁️ 44 90 69 60 88 32 64 0.36 € 2.12 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ 56 86 67 52 88 26 62 0.37 € 2.99 rps
Meta Llama 3 8B Instruct f16🦙 74 60 68 49 80 42 62 0.35 € 3.16 rps
GPT-3.5 v1/0301 ☁️ 49 75 69 67 82 24 61 0.36 € 3.93 rps
Starling 7B-alpha f16 ⚠️ 51 66 67 52 88 36 60 0.61 € 1.80 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ 46 72 72 49 88 31 60 0.51 € 2.14 rps
Claude 3 Haiku ☁️ 59 69 64 55 75 33 59 0.08 € 0.53 rps
Mixtral 8x22B API (Instruct) ☁️ 47 62 62 94 75 7 58 0.18 € 3.01 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ 51 74 72 41 75 31 57 0.36 € 3.05 rps
Claude 3 Sonnet ☁️ 67 41 74 52 78 30 57 0.97 € 0.85 rps
Mistral Large v1/2402 ☁️ 33 49 70 75 84 25 56 2.19 € 2.04 rps
Anthropic Claude Instant v1.2 ☁️ 51 75 65 59 65 14 55 2.15 € 1.47 rps
Anthropic Claude v2.0 ☁️ 57 52 55 45 84 35 55 2.24 € 0.40 rps
Cohere Command R ☁️ 39 63 57 55 84 26 54 0.13 € 2.47 rps
Anthropic Claude v2.1 ☁️ 36 58 59 60 75 33 53 2.31 € 0.35 rps
Meta Llama 3 70B Instruct b8🦙 46 72 53 29 82 18 50 7.32 € 0.22 rps
Mistral 7B OpenOrca f16 ☁️ 42 57 76 21 78 26 50 0.43 € 2.55 rps
Mistral 7B Instruct v0.1 f16 ☁️ 31 70 69 44 62 21 50 0.79 € 1.39 rps
Llama2 13B Vicuna-1.5 f16🦙 36 37 53 39 82 38 48 1.02 € 1.07 rps
Llama2 13B Hermes f16🦙 38 23 30 61 60 43 42 1.03 € 1.06 rps
Llama2 13B Hermes b8🦙 32 24 29 61 60 43 42 4.94 € 0.22 rps
Mistral Small v1/2312 (Mixtral) ☁️ 10 58 65 51 56 8 41 0.19 € 2.17 rps
Mistral Small v2/2402 ☁️ 27 35 36 82 56 8 41 0.19 € 3.14 rps
Llama2 13B Puffin f16🦙 37 12 38 48 56 41 39 4.89 € 0.22 rps
Mistral Medium v1/2312 ☁️ 36 30 27 59 62 12 38 0.83 € 0.35 rps
Llama2 13B Puffin b8🦙 37 9 37 46 56 39 37 8.65 € 0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ 13 39 57 40 59 8 36 0.05 € 2.30 rps
Llama2 13B chat f16🦙 15 38 17 45 75 8 33 0.76 € 1.43 rps
Llama2 13B chat b8🦙 15 38 15 45 75 6 32 3.35 € 0.33 rps
Mistral 7B Zephyr-β f16 ✅ 28 34 46 44 29 4 31 0.51 € 2.14 rps
Llama2 7B chat f16🦙 20 33 20 42 50 20 31 0.59 € 1.86 rps
Mistral 7B Notus-v1 f16 ⚠️ 16 43 25 41 48 4 30 0.80 € 1.37 rps
Orca 2 13B f16 ⚠️ 15 22 32 22 67 19 29 0.99 € 1.11 rps
Mistral 7B Instruct v0.2 f16 ☁️ 7 21 50 13 58 8 26 1.00 € 1.10 rps
Mistral 7B f16 ☁️ 0 4 42 42 52 12 25 0.93 € 1.17 rps
Orca 2 7B f16 ⚠️ 13 0 24 18 52 4 19 0.81 € 1.34 rps
Llama2 7B f16🦙 0 2 18 3 28 2 9 1.01 € 1.08 rps

The benchmark categories in detail

Here's exactly what we're looking at with the different categories of LLM Leaderboards

Docs

How well can the model work with large documents and knowledge bases?

CRM

How well does the model support work with product catalogs and marketplaces?

Integrate

Can the model easily interact with external APIs, services and plugins?

Marketing

How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

Reason

How well can the model reason and draw conclusions in a given context?

Code

Can the model generate code and help with programming?

Cost

The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead.

Speed

The "Speed" column indicates the estimated speed of the model in requests per second (without batching). The higher the speed, the better.

Docs

Deeper insights

Google Gemini Pro 1.5

The newer version, Gemini 1.5 Pro shows significantly better performance than Gemini 1.0 Pro in the last month. It almost reaches the performance of the GPT-4 Turbo.

This model performs particularly well on tasks related to working on documents and information. It also performs almost perfectly on CRM-related tasks. However, complex reasoning tasks are below the level of GPT-3.5.

Gemini Pro 1.5 is about 20 times more expensive than Pro 1.0 on our workloads, which is to be expected given the quality level of GPT-4.

Both models are now available in Google Vertex AI, finally making them usable for enterprise customers in the EU.

Command R models from Cohere

Cohere AI specializes in enterprise-oriented LLMs. They have the Command R family of models - LLMs designed for document-oriented tasks: "Command R" and "Command R Plus".

These models are available both as API SaaS and as downloadable models on Hugging Face. Downloadable models are published for non-commercial purposes.

The Command-R model is roughly comparable to the Anthropic Claude models of the first two generations, but significantly cheaper. Nevertheless, there are better models in this price category, such as Gemini Pro 1.0 and Claude 3 Haiku.

The Command R+ is a significantly better model with capabilities in the GPT-3.5 range, but at two to three times the price.

OpenAI reaches another milestone with new ChatGPT-4 Turbo

OpenAI has released the new GPT-4 Turbo model with the version number 2023-04-09. This is outstanding for two reasons.

  • First, OpenAI has finally used sensible version numbers. It only took one year of progress.

  • Second, this model beats all the other models on our LLM benchmarks. It takes the top place with a substantial gap to the second place.

This score jump comes from nearly perfect stores in CRM and Docs categories. Plus OpenAI has finally fixed the instruction following problem with few-shots, that was causing Integrate category to be so low.

GPT-4 Turbo 2023-04-09 is currently our default recommendation for the new LLM-driven projects that need the best performing LLM to get started.

Llama 3 70B and 8B

Meta has just released new models in its third generation. We have tested instruct versions of 70B and 8B for the usability in LLM-driven products.

Llama 3 70B had a bumpy start - the upload to HuggingFace had bugs with tokens in chat template processing. Once these were fixed, the model started working better, on the level of old generations of Anthropic Claude v2.

Note that we tested the b8 quantized model to properly fit 2xA100 80GB SMX cards. There is a possibility that f16 might give slightly better results.

Llama 3 8B Instruct performed much better on the benchmarks, pushing forward state of the art that is made available by Meta. This model has surprisingly good overall scores and a good “Reason” capability. There is a strong chance that product-oriented fine-tune of Llama3 8B Instruct would be able to push this model to the TOP-10.

Long-term trends

Now let's look at the bigger picture: Where is the industry heading with all this?

More cost-effective & more powerful models

Firstly, models are generally getting better and more affordable. This is the general trend that Sam Altman recently outlined in his interview.

Further long-term LLM trends

  • New functional capabilities of LLMs

    LLMs gain new functional capabilities that are not even covered in this benchmark: Function calls, multimodality, data grounding. The latest version of LLM Under the Hood expands on this theme.
     

  • Experiments with new LLM architectures

    Companies also get bold and try to experiment with new LLM architectures outside the classical transformers' architecture. Mixture of Experts was popularised by Mistral, although many believe that GPT also uses it. Recurrent Neural Networks also experience a comeback as a way to solve context size limitations. For example: RWKV Language Model, Recurrent Gemma from Google Deep Mind (Griffin architecture).

Powerful models with low computing power

What is interesting about these models - they show decent capabilities, while requiring substantially less compute. For instance, we got a report of 0.4B version of RWKV running on a low-end Android phone with a tolerable speed (CPU-only inference).

 

Where are we heading with all this?

DEMOCRATIZATION OF AI

Expect the models to continue to get better, cheaper and more powerful. Sam Altman calls this the "democratization of AI". This applies to both cloud models and locally available models.

If you are in the process of building an LLM-driven system, expect that by the time the system is delivered, the underlying LLM will be much more powerful. In fact, you can take this into account and build a long-term strategy around it.

ADAPTABLE SYSTEMS

You can do that, for example, by designing LLM-driven systems to be transparent, auditable and capable of continuously adapting to the changing context.

Transform your digital projects with the best AI language models!

Discover the transformative power of the best LLM and revolutionize your digital products with AI! Stay future-oriented, increase efficiency and secure a clear competitive advantage. We support you in taking your business value to the next level.

* required

We use the data you send us only for contacting you in connection with your request. You can find all further information in our privacy policy.

Solve captcha, please!

captcha image
Martin Warnung
Sales Consultant TIMETOACT GROUP Österreich GmbH +43 664 881 788 80
Wissen 4/30/24

LLM-Benchmarks April 2024

This LLM Leaderboard from april 2024 helps to find the best Large Language Model for digital product development.

Wissen 7/30/24

LLM-Benchmarks July 2024

This LLM Leaderboard from July 2024 helps to find the best Large Language Model for digital product development.

Wissen 5/30/24

LLM-Benchmarks May 2024

This LLM Leaderboard from may 2024 helps to find the best Large Language Model for digital product development.

Wissen 6/30/24

LLM-Benchmarks June 2024

This LLM Leaderboard from june 2024 helps to find the best Large Language Model for digital product development.

Wissen 6/30/24

LLM-Benchmarks June 2024

This LLM Leaderboard from june 2024 helps to find the best Large Language Model for digital product development.

Wissen 5/30/24

LLM-Benchmarks May 2024

This LLM Leaderboard from may 2024 helps to find the best Large Language Model for digital product development.

Wissen 7/30/24

LLM-Benchmarks July 2024

This LLM Leaderboard from July 2024 helps to find the best Large Language Model for digital product development.

Blog

LLM Benchmarks April 2025

Ranking the best performing large language models for digital product development.

Blog 11/12/24

ChatGPT & Co: LLM Benchmarks for October

Find out which large language models outperformed in the October 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 12/4/24

ChatGPT & Co: LLM Benchmarks for November

Find out which large language models outperformed in the November 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 1/7/25

ChatGPT & Co: LLM Benchmarks for December

Find out which large language models outperformed in the December 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 10/1/24

ChatGPT & Co: LLM Benchmarks for September

Find out which large language models outperformed in the September 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 2/3/25

ChatGPT & Co: LLM Benchmarks for January

Find out which large language models outperformed in the January 2025 benchmarks. Stay informed on the latest AI developments and performance metrics.

Insights 3/17/25

LLM Benchmarks: February 2025

Discover the latest insights from our independent LLM benchmarks for February 2025. Find out which large language models performed best.

Blog

LLM Benchmarks Summer 2025

Ranking the best performing large language models for digital product development.

Blog 11/12/24

ChatGPT & Co: LLM Benchmarks for October

Find out which large language models outperformed in the October 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 1/7/25

ChatGPT & Co: LLM Benchmarks for December

Find out which large language models outperformed in the December 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 12/4/24

ChatGPT & Co: LLM Benchmarks for November

Find out which large language models outperformed in the November 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 10/1/24

ChatGPT & Co: LLM Benchmarks for September

Find out which large language models outperformed in the September 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Wissen 8/30/24

LLM-Benchmarks August 2024

Instead of our general LLM benchmarks, we present the first benchmark of different AI architectures in August.