There are a lot of studies and demos that show how large language models (LLMs) can perform impressive tasks. While there is no one-size-fits-all approach, we’ve tried to create a set of guidelines that will help you better steer your way around all the innovation and confusion surrounding LLMs.
This post was written by Ben Dickson, a seasoned engineer, tech blogger, and mentor at our AI/ML Simulator for Product Managers.
I use the following three-stage framework when considering if and how to use LLMs in a product. It helps me define the problem, choose the right models, craft effective prompts, and keep the process efficient when moving into production.
Stage I: Prepare
In this stage, the goal is to get a clear sense of what you want to accomplish and to identify the best place to start.
Define the task: With all the publicity surrounding LLMs, it is easy to think of them as general problem-solvers that can take any complex task and come up with a solution. But if you want good results, you should pick one specific task and formulate it as an input-output problem that fits one of the known categories (classification, regression, question-answering, summarization, translation, text generation, etc.).
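As a sketch of what "formulate it as an input-output problem" can look like in practice, here is a hypothetical support-ticket triage task framed as classification. The label set and example text are assumptions for illustration, not part of any real product:

```python
# Hypothetical task: triage incoming support tickets.
# Framed as classification: input is free-form ticket text,
# output is exactly one label from a fixed set.
LABELS = ["billing", "bug_report", "feature_request", "other"]

def make_example(ticket_text: str, label: str) -> dict:
    """Package one labeled example in input-output form."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return {"input": ticket_text, "output": label}

example = make_example("I was charged twice this month.", "billing")
```

Pinning down the output space this explicitly (rather than asking for open-ended "solutions") is what makes the task measurable in the later stages.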
Choose a benchmark that is closely related to the problem you want to solve. This will help you determine good prompting techniques and models. For example, HellaSwag is a good benchmark for commonsense reasoning, while MMLU gives a good impression of how different LLMs perform across a broad range of knowledge and language tasks. This guide from Confident AI is a good overview of different LLM benchmarks.
Create a basic test set: Create at least five examples that are representative of the problem you want to solve. The examples should be written manually and drawn directly from your product or industry. You can use the benchmark examples as a guide for how to format your own.
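A minimal test set might look like the following. This is a hypothetical sentiment task with made-up review text; the point is the format, one input paired with one expected output, mirroring how benchmark examples are structured:

```python
# Hand-written test set for a hypothetical review-sentiment task.
# Each example pairs an input with the expected output, like a benchmark row.
TEST_SET = [
    {"input": "The checkout flow is fast and painless.", "expected": "positive"},
    {"input": "App crashes every time I open settings.", "expected": "negative"},
    {"input": "Delivery took three weeks longer than promised.", "expected": "negative"},
    {"input": "Support resolved my issue within minutes.", "expected": "positive"},
    {"input": "The update changed the layout; not sure how I feel yet.", "expected": "neutral"},
]
```

Five examples will not give you statistically meaningful scores, but they are enough to catch obviously broken prompts and to compare candidate models side by side.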
Choose a model: Look at LLM leaderboards and shortlist up to three models that perform best on the benchmark related to your task.
Create a basic prompt template: Create a prompt for your test set. Use very simple prompting techniques to get a feel for the baseline performance of each model. A basic prompt usually includes the role, the instructions, and the problem.
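The three parts (role, instructions, problem) can be sketched as a plain template string. This continues the hypothetical sentiment task above; the wording and the `build_prompt` helper are illustrative, and the actual model call is left out:

```python
# Basic prompt template with three parts: role, instructions, problem.
TEMPLATE = """You are a customer-support analyst.

Classify the sentiment of the review below as positive, negative, or neutral.
Answer with a single word.

Review: {review}
Sentiment:"""

def build_prompt(review: str) -> str:
    """Fill the template with one test-set input."""
    return TEMPLATE.format(review=review)

prompt = build_prompt("The checkout flow is fast and painless.")
```

Keeping the template in one place makes it easy to run the same test set through each shortlisted model and compare baseline results before trying more elaborate prompting techniques.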