Overview
About Japeto’s healthcare chatbots
We have experience building healthcare chatbots. One of our projects, Pat, created in partnership with Positive East, is a sexual health chatbot that answers questions from visitors to NHS sexual health services. Behind the scenes, Pat uses a type of artificial intelligence called natural language processing, specifically AWS Lex.
We have had good results with Lex, but there has been an explosion of large language models since we first worked on Pat. As part of our work to create the best conversational AI solutions, we wanted to benchmark the different large language models out there.
The ABOVE grant
The ABOVE grant was awarded to us to explore a MedTech project.
Arise has launched the ABOVE grant funding programme, offering up to £5,000 to SMEs, business owners, and entrepreneurs in the HealthTech and MedTech sectors based in Essex or considering relocating there. The initiative aims to stimulate and support the ecosystem by fostering collaboration, creating jobs, and facilitating business-academic projects. The programme seeks to create an innovation culture, connect businesses to support providers like the NHS, and encourage the development of viable new start-ups.
ABOVE grant award
We were awarded the ABOVE grant to test how well different large language models performed. We wanted to put a range of models to the test, including open-source models as well as some of the big new proprietary ones.
To make it a fair competition, we created a data set of anonymised chat messages and recorded the expected best results.
We would train 14 different chatbots, each powered by a different model, run the same set of chat messages through each one, and then check how accurate each model was.
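In rough terms, the benchmark boils down to a loop like the sketch below; every function name and the data format here are illustrative, not our actual tooling.

```python
# Illustrative sketch of the benchmarking loop; all names here are hypothetical.
import json

def load_test_set(path):
    """Anonymised chat messages paired with the expected best response."""
    with open(path) as f:
        return json.load(f)  # e.g. [{"message": "...", "expected": "..."}, ...]

def matches_expected(reply, expected):
    """Placeholder scoring rule; the real comparison would be more nuanced."""
    return expected.strip().lower() in reply.strip().lower()

def evaluate_model(ask_model, test_set):
    """Run every test message through one chatbot and return its accuracy."""
    correct = sum(
        matches_expected(ask_model(case["message"]), case["expected"])
        for case in test_set
    )
    return correct / len(test_set)
```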
An added challenge of the project was to run open source models locally wherever possible. Could we get good results on our own rig?
We’re going to go into this project over the next four blogs, covering our experience building a computer, training these models, and comparing the data, and finally the results. Stay tuned!
Running large language models in-house
Or, in other words: creating Blue Fairy
Before we get into the hardware build, there are a few things you should know about running LLMs locally:
Large Language Models really are LARGE
The clue is in the name! The largest model we planned on running was Meta’s Llama 3 70B model. The general rule of thumb is that a model’s memory requirement in bytes is 2x its number of parameters, because unquantised models usually store each parameter as a 16-bit (2-byte) float.
Llama 3 70B, with 70 billion parameters, therefore requires around 140GB. You can run quantised versions of the model, which represent parameters at lower precision at some cost in accuracy. However, you’ll still need around 70GB of memory at 8-bit and 40GB at 4-bit.
This is a challenge because consumer graphics cards have a maximum of 24GB of memory.
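As a back-of-the-envelope check, here’s a small Python sketch of the memory rule of thumb above. It counts weights only, so it ignores the KV cache and other runtime overhead that pushes real usage a little higher.

```python
def estimate_weight_memory_gb(params_billions, bits_per_param=16):
    """Rule of thumb: weight memory ~= parameter count x bytes per parameter."""
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param  # billions of bytes ~= GB

# Llama 3 70B at different precisions
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{estimate_weight_memory_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB (closer to 40 GB in practice),
# all well beyond a single 24 GB consumer card.
```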
LLMs can be split between multiple graphics cards and the CPU
Some of the most common ways of running open-source LLMs let you split a model between the video memory of multiple graphics cards and your system’s RAM. Llama.cpp, for example, lets you specify how many layers of the LLM run on GPUs, with the remaining layers running in main memory.
This means that you don’t need current generation data centre cards with 80GB memory to run capable LLMs – you can run them on a combination of smaller devices.
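As a rough illustration, this is what layer offloading looks like through the llama-cpp-python bindings (one of several front-ends to llama.cpp). The GGUF file name, layer count, and split ratios below are placeholders you would tune to your own cards.

```python
# A minimal sketch of GPU layer offloading with llama-cpp-python.
# The model file, layer count, and tensor split below are illustrative only.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical quantised model file
    n_gpu_layers=60,               # offload 60 layers to VRAM; the rest run from system RAM
    tensor_split=[1.0, 1.0, 1.0],  # spread the offloaded layers roughly evenly across 3 GPUs
    n_ctx=4096,                    # context window size
)

out = llm("Q: What services does a sexual health clinic offer?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```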
LLMs run much faster on GPUs than on CPUs
When LLMs generate text, they run a large number of matrix operations in parallel. GPUs are designed for graphics processing and excel at exactly this kind of work: they have a very large number of cores that can carry out many operations simultaneously. CPUs, by comparison, have fewer cores and can do fewer operations in parallel. GPUs also typically have much higher memory bandwidth, meaning model data can be moved to the compute cores faster than on a CPU.
This means that the entire model should be run on GPUs rather than CPUs wherever possible.
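To get a feel for the gap, here’s a quick (and very unscientific) timing comparison of the kind of matrix multiplication LLMs spend their time on. It assumes PyTorch with CUDA support is installed, which isn’t part of the build below, and the exact numbers vary widely between machines.

```python
# Rough CPU-vs-GPU comparison of a large matrix multiplication.
import time
import torch

x = torch.randn(4096, 4096)

def avg_matmul_seconds(device, runs=10):
    a = x.to(device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        a @ a
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    return (time.perf_counter() - start) / runs

print(f"CPU: {avg_matmul_seconds('cpu'):.4f} s per 4096x4096 matmul")
if torch.cuda.is_available():
    print(f"GPU: {avg_matmul_seconds('cuda'):.4f} s per 4096x4096 matmul")
```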
The hardware build
We built our LLM server using mostly off-the-shelf components. We were looking for something easy to build and maintain, able to run 70 billion parameter models in GPU memory, but with enough main memory to run larger models.
Components
Our build budget was £3,000.
The list of components we used is as follows:
- ASUS ROG Strix B650-E motherboard
- 128 GB Corsair Vengeance CL30 DDR5 RAM (4x32GB)
- NZXT H9 Mid-Tower case
- AMD Ryzen 9 7950X3D processor
- NZXT C1200 1200 watt PSU
- Noctua NH-D15 CPU cooler
- 2x WD_BLACK SN770 NVMe drives
- EVGA GeForce RTX 3090 graphics card
- 2x Nvidia Tesla P40 data centre graphics cards
Core hardware build
We wanted good performance even on models that were too large to fit into GPU memory, so at the core of the hardware build was an AMD Ryzen 9 7950X3D processor, one of the best consumer CPUs available at the time of the build. We chose a beefy Noctua NH-D15 CPU cooler.
We also bought 4x32GB of Corsair Vengeance CL30 DDR5 6000MHz RAM, for a total of 128GB of system memory. Combined with the 72GB of video memory, this gave us around 200GB in total, which allowed us to run very large models such as Mixtral 8x22B. Memory bandwidth is often a bottleneck in LLM inference, so we chose high-performance DDR5 memory.
We chose the ASUS ROG Strix B650-E motherboard because it was the most economical consumer motherboard that could support an AM5 CPU, DDR5 memory, and three graphics cards.
We topped the build off with 4TB storage (again, Large Language Models are LARGE!) in the form of 2 WD_BLACK SN770 NVMes, and an NZXT H9 Mid-Tower case capable of holding the cooler and all the GPUs.
Graphics cards
The graphics cards made up the bulk of the hardware build. We added specialist graphics cards that can run large language models, two Nvidia Tesla P40s, alongside a higher-end consumer graphics card, an RTX 3090.
Tradeoffs
We built this server to run batch tasks that use LLMs to generate text. We chose our components for their ability to run very large models at the given hardware cost; we didn’t optimise for speed. It’s important to bear in mind that if you are looking to build a machine for real-time inference (like chatbots), this build might be too slow for your needs. With the most capable model we ran (Mixtral 8x7B), we got around 20-30 tokens per second of throughput.
We also used some quite old GPUs in this build – Tesla P40s. Newer versions of CUDA – Nvidia’s library for using GPUs for computational tasks – are not compatible with P40s. This means that some ways of running LLMs, and some newer features of libraries, were not available to us.
If we were building a machine designed for real-time inference, we would likely have chosen to use newer cards at a higher budget than this build.
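If you’re weighing up older cards like this, it’s worth checking what CUDA compute capability they report before settling on an inference stack. A quick check is sketched below; it assumes PyTorch with CUDA support is installed, which isn’t part of the build above.

```python
# List each GPU and its CUDA compute capability. Pascal-era Tesla P40s report 6.1,
# which is why some newer GPU kernels and library features don't target them.
import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")
```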
Cooling the P40s
The 3090 is a consumer graphics card and has an inbuilt cooling fan. The P40s, however, don’t – they are made for servers and have no active cooling hardware of their own. They can run very hot and do require active cooling, so we had to get creative here.
We 3D printed adapters using this model. This adapter fits straight onto the card, and lets you directly attach a 40mm server fan to blow air straight through the card. If you don’t have access to a 3D printer, you can search eBay for a P40 fan to buy similar adapters.
The end result
We were able to put together a computer that could run large language models.