LLMs: our future overlords are hungry and thirsty

generative AI   microservice architecture  

Since early this year, the news around generative AI technologies, such as ChatGPT, has been never-ending. Some have even suggested that humanity's existence is at stake. While there's a massive amount of hype, there's also a lot of potential, as shown by tools such as ChatGPT and Copilot. Consequently, I decided to explore generative AI from the perspective of an enterprise software architect. This article is the first in a series about generative AI - specifically Large Language Models (LLMs) - and the microservice architecture.

A large language model is a function…

Large Language Models (LLMs) are a generative AI technology for natural language processing. Simply put, an LLM is a function that takes a sequence of words as input - the prompt - and returns a sequence of words that’s the most likely completion of the prompt.

$ python3
Python 3.11.4 ..
>>> from langchain.llms import Ollama
>>> llm = Ollama(model="llama2")
llm("who is Oppenheimer")
' J. Robert Oppenheimer was an American theoretical physicist and professor who made significant contributions to...

Not particularly threatening, right?

…that is implemented by a neural network…

An LLM is implemented by a neural network. The details are quite complicated. But the basic idea is that the neural network is trained on a large amount of text to predict the next word (or more accurately a token, which is an encoded word fragment) given an input sequence of words. The entire completion is constructed by iteratively predicting the next token given the input sequence and the previous predicted tokens.
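
To make that loop concrete, here's a minimal sketch of greedy autoregressive decoding. It assumes the Hugging Face transformers library and the small GPT-2 model, which I've chosen purely for illustration because it runs on a laptop; it's not the model from the example above.

# A minimal sketch of autoregressive (greedy) decoding with a small open model.
# GPT-2 is an illustrative choice, small enough to run without a GPU.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "J. Robert Oppenheimer was"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Repeatedly predict the most likely next token and append it to the sequence.
# This loop is what "completing the prompt" actually means.
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits          # a score for every token in the vocabulary
    next_token = logits[:, -1, :].argmax(dim=-1)  # pick the most likely next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))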

… with numerous NLP use cases

LLMs have numerous use cases, including text generation, summarization, rewriting, classification, entity extraction, and semantic search. To learn more about LLM use cases see Large Language Models and Where to Use Them: Part 1 and Hugging Face Transformers.
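
With the Ollama-based llm object from the earlier snippet, many of these use cases are simply different prompts. The prompts below are my own illustrative examples, not a prescribed API:

# Illustrative only: the same llm object as before, driven by different prompts
# for different NLP tasks.
from langchain.llms import Ollama

llm = Ollama(model="llama2")

article = "..."  # some input text

summary = llm(f"Summarize the following text in one sentence:\n{article}")
sentiment = llm(f"Classify the sentiment of this text as positive, negative or neutral:\n{article}")
entities = llm(f"List the people and organizations mentioned in this text:\n{article}")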

LLMs are hungry

LLMs are rather large, with billions of parameters, which are the weights of the neural network's neurons and connections. For example, the GPT-3 LLM has 175 billion parameters. And Meta's LLaMA is a collection of language models ranging from 7B to 65B parameters. Consequently, LLMs involve lots of math and are hungry for computational resources, specifically expensive GPUs.

The resource requirements depend on the phase of LLM’s lifecycle. The three phases of a LLM’s life cycle are:

  • training - creating the LLM from scratch
  • fine tuning - tailoring the LLM to specific tasks
  • inferencing - performing tasks with the LLM

Training and inferencing both require a lot of compute resources, while fine-tuning requires considerably less. Let's look at each phase in turn.

Training

Training an LLM from scratch is an extremely computationally intensive task. For example, training the GPT-3 LLM required 355 years of GPU time, at an estimated cost of $4.6 million. Training is so costly because it requires lots of GPUs with large amounts of memory, which are expensive. For example, an AWS EC2 p5.48xlarge instance, which has 8 GPUs each with 80GB of memory, costs $98.32 per hour. Consequently, most organizations will use a 3rd party, pretrained LLM.
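
As a rough sanity check on those numbers, here's the back-of-the-envelope arithmetic. The roughly $1.50 per GPU-hour rate is an assumption I've made for illustration, not a published price:

# Back-of-the-envelope estimate of the GPT-3 training cost cited above.
# The $1.50/GPU-hour rate is an assumed discounted cloud price, for illustration only.
gpu_years = 355
cost_per_gpu_hour = 1.50
gpu_hours = gpu_years * 365 * 24                     # ~3.1 million GPU-hours
print(f"${gpu_hours * cost_per_gpu_hour:,.0f}")      # ~ $4.7 million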

Fine-tuning

Fine-tuning an LLM, which tailors it to a specific task by adjusting its parameters, is much less computationally intensive than training. As a result, it's fairly inexpensive, but it still requires GPUs.
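
One reason fine-tuning is relatively cheap is that parameter-efficient techniques such as LoRA adjust only a small fraction of the parameters. Here's a sketch, assuming the Hugging Face peft library and GPT-2 again; the hyperparameters are illustrative and the data preparation and training loop are omitted:

# A sketch of parameter-efficient fine-tuning (LoRA) using the Hugging Face peft library.
# GPT-2 and the hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["c_attn"],   # GPT-2's attention projection layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only a tiny fraction of the parameters are trainable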

Inferencing

Inferencing with an LLM, which is using the LLM to perform tasks, is less computationally intensive than training, but typically still requires GPUs. Moreover, the GPUs must have sufficient memory to hold all of the LLM's billions of parameters. By default, each parameter is 16 bits, so 2 bytes per parameter are required. Sometimes, however, quantization can be applied to use fewer bits per parameter, although there's a trade-off between accuracy and memory usage. Machines with GPUs that can run LLMs are more expensive than machines without GPUs. For example, an AWS g5.48xlarge instance, which has 8 GPUs each with 24GB of memory, costs $16/hour. A comparable non-GPU instance, such as an m7i.48xlarge, costs $9.6768/hour.
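
To make the memory arithmetic concrete, here's a quick calculation for 7B- and 70B-parameter models at 16-bit and quantized 4-bit precision. It only counts the weights themselves; activations, the KV cache and other overhead are ignored:

# Rough GPU memory needed just to hold the model weights.
GIB = 1024 ** 3

def weight_memory_gib(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / GIB

for params in (7e9, 70e9):
    for bits in (16, 4):
        print(f"{params / 1e9:.0f}B params @ {bits}-bit: "
              f"{weight_memory_gib(params, bits):.0f} GiB")
# 7B @ 16-bit is ~13 GiB, 70B @ 16-bit is ~130 GiB - i.e. multiple 24GB or 80GB GPUs.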

LLMs are also thirsty

Since LLMs are computationally expensive, they are also thirsty for water. Lots of computation requires a lot of electricity, which requires a lot of water to cool the data centers. For example, studies estimate that a ChatGPT conversation consumes as much as 500ml of water. So much for the environment!

LLMs and the microservice architecture

Let’s imagine that you want to add an LLM to your enterprise Java application. It’s quite likely that you will want to deploy the LLM inferencing code as a separate service for the following two reasons: efficient utilization of expensive GPU resources and the need to use a different technology stack. Let’s look at each reason in turn.

Efficient utilization of expensive GPU resources

There are two ways to run an LLM: self-hosted or SaaS. If you are self-hosting an LLM, then you are running software that has very distinctive resource requirements. LLMs must run on more expensive machines that have GPUs. Therefore, in order to utilize those resources efficiently, you will need to resolve the dark energy force, segregate by characteristics, and deploy your LLM as a separate service running on specialized infrastructure. You typically wouldn't want to run non-GPU services on the same machines as your LLM services, since that could result in over-provisioning of GPUs.

Separate technology stack

The second reason to run your LLM-related code as a separate service is that it will likely use Python instead of Java. While you might run the LLM using a Java technology stack, it appears that Python has a much richer ecosystem. Furthermore, even if you are using a SaaS-based LLM, you will often write Python code to interact with the LLM. For example, the prompt tuning/engineering code that tailors an 'off the shelf' LLM to a specific task is often written in Python. Consequently, in order to resolve the dark energy force, multiple technology stacks, you will need to deploy your Python-based LLM logic as a separate service.
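
For example, the Python-based LLM logic might be packaged as a small HTTP service that the Java application invokes. The following is a sketch, assuming FastAPI and the Ollama-based model from earlier; the /summarize endpoint and the prompt template are illustrative choices, not part of an actual design:

# llm_service.py - a sketch of deploying the Python LLM logic as a separate service
# that the Java application calls over HTTP. FastAPI, the endpoint and the prompt
# template are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain.llms import Ollama

app = FastAPI()
llm = Ollama(model="llama2")

class SummarizeRequest(BaseModel):
    text: str

@app.post("/summarize")
def summarize(request: SummarizeRequest):
    # The prompt engineering lives here, in Python, close to the LLM
    prompt = f"Summarize the following text in one sentence:\n{request.text}"
    return {"summary": llm(prompt)}

# Run with: uvicorn llm_service:app --port 8000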

What’s next?

In future articles, I'll explore the topic of LLMs and microservices in more detail.


