LLM Performance Series: Batching

Starting from the September TIMETOACT GROUP Austria LLM Benchmark, we put special emphasis on enterprise workloads. These include the types of LLM tasks that are common in business digitalisation at scale:

  • generation of marketing texts that present a specific product to a given audience within the JTBD (Jobs to be Done) framework;

  • information retrieval and Q&A systems;

  • CRM automation;

  • automated lead generation.

While many customers are happy with using ChatGPT from OpenAI or Microsoft Azure, some are still interested in running models locally, completely under their control. This is what the benchmarks are for! They help to track the improvement of SOTA (state-of-the-art) models and pick the best one for the task at hand.

While cloud models are usually managed (pay-as-you-go), running models locally requires a different investment scheme. Companies either rent GPUs per hour or buy them and install them in their own servers.

Each option involves a different type of investment and comes with a different return on investment (ROI).

GPUs aren’t exactly cheap or easy to get these days. So we want our customers to make the best use of their investments.

This means that while evaluating models for different projects, we need to take into account not only their accuracy on specific tasks, but also their performance and cost. To help with that, we have added two corresponding columns: cost and speed.

Cost is a relative number that estimates how much money it would cost to run the entire benchmark on a given configuration. Since our benchmarks represent a mix of real business workloads, this figure helps to compare different models.

For SaaS models (like OpenAI) we use the price per token (prompt + completion). For local models, we estimate how much it would cost to rent a sufficiently large GPU from a major cloud vendor for the duration of the benchmark.
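As a rough sketch, the two estimates boil down to very simple formulas. The function names and any rates you plug in below are illustrative assumptions, not our actual billing data:

    # Rough shape of the estimates behind the "cost" column.
    # Function names and rates are illustrative assumptions only.

    def saas_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_price_per_token: float, completion_price_per_token: float) -> float:
        # SaaS models bill per token for both the prompt and the completion.
        return (prompt_tokens * prompt_price_per_token
                + completion_tokens * completion_price_per_token)

    def local_cost(benchmark_hours: float, gpu_hourly_rate: float) -> float:
        # Local models: rent a sufficiently large GPU for the duration of the benchmark run.
        return benchmark_hours * gpu_hourly_rate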

Speed represents the number of requests per second that we can get from a model while running benchmarks in single-inference mode (a batch size of 1).

But can we get better results? Indeed, there are a number of performance optimisations that can further boost the performance and even the quality of these models. The first one is batching.

Performance optimisation: batching

GPU batching is based on the fact that a GPU is a very special piece of hardware. In certain cases, it doesn't care much whether it needs to process 1 request or 10, as long as there is enough memory. The computation will take roughly the same amount of time.

In other words, if we have 10 requests, we could run them on the same GPU in roughly the same amount of time it takes to run one.

To illustrate the concept, we took a recently released Llama 2 model, Nous Hermes 13B. It is open for commercial use and strikes a good balance between accuracy and cost.
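Concretely, batched completion with this model via the HuggingFace transformers library (the setup we use below) looks roughly like this; the checkpoint id, prompts and generation settings are placeholders for illustration, not the exact benchmark code:

    # Minimal sketch of batched text completion with HuggingFace transformers.
    # The checkpoint id and prompts are placeholders for illustration.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "NousResearch/Nous-Hermes-Llama2-13b"  # assumed checkpoint id
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # enable padding for a batch
    tokenizer.padding_side = "left"            # pad on the left for decoder-only models

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

    prompts = ["The quick brown fox"] * 10  # 10 requests served as a single batch
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

    # Each decoding step runs one forward pass that serves the whole batch at once.
    outputs = model.generate(**inputs, max_new_tokens=20)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)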

The workload involved generating 20 more tokens in a text completion. The model was run on an Nvidia A100 80GB PCIe using the HuggingFace transformers library. We tested batch sizes from 4 to 400 in steps of 4. Here are the results:

As you can see, adding more requests doesn't result in a proportional increase in processing time. This leads to ever-growing throughput, measured in tokens per second (TPS), until the number hits a ceiling of around 2,500 tokens per second.

If we had stayed at a batch size of 1, we would never have made it past 200 tokens per second.
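For reference, throughput here is derived in the straightforward way: total tokens generated for the batch divided by wall-clock generation time. A hypothetical helper along these lines:

    # Hypothetical helper showing how throughput (tokens per second) is derived.
    import time

    def measure_tps(model, inputs, batch_size: int, new_tokens: int = 20) -> float:
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=new_tokens)
        elapsed = time.perf_counter() - start
        # Every sequence in the batch yields `new_tokens` tokens.
        return batch_size * new_tokens / elapsed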

Note: GPU memory consumption jumps back and forth because memory gets released only under memory pressure. Look at the lower peaks to see the absolute minimal consumption.

Doesn't this mean that the highest batch size is always the best? Not necessarily. So far we have fixed the number of tokens the LLM generates at 20. In some tasks, like generation of marketing texts, we would like to get more.

As we generate longer texts, GPU memory requirements tend to grow, while the speed decreases. This comes from the fact that after we generate one token, we need to pass the entire text back to the model to generate the next one. Rinse and repeat.
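Schematically, the generation loop looks like the sketch below (simplified: real runtimes cache attention keys and values so earlier tokens are not fully recomputed, but the sequence still grows by one token per step):

    # Simplified greedy decoding loop: each new token requires another forward pass
    # over the (growing) sequence. Real implementations use a KV cache to avoid
    # recomputing earlier positions, but memory still grows with sequence length.
    import torch

    @torch.no_grad()
    def greedy_generate(model, input_ids: torch.Tensor, new_tokens: int = 20) -> torch.Tensor:
        for _ in range(new_tokens):
            logits = model(input_ids).logits              # forward pass over the whole batch
            next_token = logits[:, -1, :].argmax(dim=-1)  # most likely next token per sequence
            input_ids = torch.cat([input_ids, next_token[:, None]], dim=-1)
        return input_ids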

Let's see how our generation capabilities change as we prompt the model for more tokens.

The chart below shows that prompting for more tokens increases computation time. However, we can lower that time a bit by making our batch sizes smaller:

How does this translate into the efficiency of the models? Here is another chart that compares the throughput (tokens per second) of the same experiments:

The best throughput comes with a large batch size. However, as the completion length increases (more iterations are required), we need to lower the batch size in order to stay within a 10-second time budget for the entire completion.
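In practice this becomes a simple selection rule: measure completion times per batch size, then pick the largest batch that still fits the budget. The numbers below are made up purely for illustration:

    # Hypothetical selection rule: pick the largest batch size whose measured
    # end-to-end completion time still fits within the latency budget.
    def pick_batch_size(timings: dict[int, float], budget_s: float = 10.0) -> int:
        feasible = [bs for bs, t in timings.items() if t <= budget_s]
        return max(feasible) if feasible else 1

    # Illustrative (made-up) measurements: batch size -> completion time in seconds.
    timings = {32: 4.1, 64: 6.8, 128: 9.5, 256: 14.2}
    print(pick_batch_size(timings))  # -> 128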

Depending on the expected completion length in our solutions, we can optimise for the highest throughput and the lowest latency, leading to better customer satisfaction and better ROI. Charts like the ones above help with such tasks.

But can we get better performance? There are still many other knobs to tune, for example quantisation, speculative decoding, and different inference runtimes. Each comes with its own trade-offs. Stay tuned!
