How to choose the right LLM for your needs

With so many LLMs on the market, selecting the right one can be challenging. Organizations must compare factors such as model size, accuracy, agent functionality, language support and benchmark performance, and weigh practical considerations such as cost, scalability, inference speed and compatibility with existing infrastructure.

When choosing an LLM, it's essential to assess both the various model aspects and the use cases it is intended to address.

Evaluating models holistically creates a clearer picture of their overall effectiveness. For example, some models offer advanced capabilities, such as multimodal inputs, function calling or fine-tuning, but those features might come with trade-offs in terms of availability or infrastructure demands.

Key aspects to consider when deciding on an LLM include model performance across various benchmarks, context window size, unique features and infrastructure requirements.

When GPT-4 was released in March 2023, OpenAI boasted of the model's strong performance on benchmarks such as MMLU, TruthfulQA and HellaSwag. Other LLM vendors similarly reference benchmark performance when rolling out new models or updates. But what do these benchmarks really mean?

Among these benchmarks and others like them, MMLU is the most widely used to measure an LLM's overall performance. Although MMLU offers a good indicator of a model's quality, it doesn't cover every aspect of reasoning and knowledge. To get a well-rounded view of an LLM's performance, it's important to evaluate models on multiple benchmarks to see how they perform across different tasks and domains.
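One practical approach is to compare candidate models across several benchmarks at once rather than relying on a single score. The sketch below illustrates the idea in Python; the model names and scores are placeholders, not published results.

```python
# Illustrative only: compare models across multiple benchmarks, not just one.
# Model names and scores below are placeholders, not real published results.
benchmark_scores = {
    "model_a": {"MMLU": 86.4, "TruthfulQA": 59.0, "HellaSwag": 95.3},
    "model_b": {"MMLU": 82.1, "TruthfulQA": 62.5, "HellaSwag": 89.7},
}

for model, scores in benchmark_scores.items():
    average = sum(scores.values()) / len(scores)
    print(f"{model}: average {average:.1f} across {len(scores)} benchmarks")
```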

Another factor to consider when evaluating an LLM is its context window: the amount of input it can process at one time. Different LLMs have different context windows -- measured in tokens, which represent small chunks of text -- and vendors are constantly upgrading context window size to stay competitive.
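To make token counting concrete, the sketch below uses OpenAI's open source tiktoken library. Tokenizers differ between model families, so treat the count as an estimate for one family rather than a universal figure.

```python
# Sketch: estimating token counts with tiktoken (pip install tiktoken).
import tiktoken

# cl100k_base is the tokenizer used by GPT-4-era OpenAI models; other
# vendors' models use different tokenizers and produce different counts.
encoding = tiktoken.get_encoding("cl100k_base")
text = "Different LLMs have different context windows, measured in tokens."
tokens = encoding.encode(text)
print(f"{len(tokens)} tokens for {len(text.split())} words")
```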

For example, Anthropic's Claude 2.1 was released in November 2023 with a context window of 200,000 tokens, or roughly 150,000 words. Despite this increase in capacity over previous versions, users noted that Claude's performance tended to decline when handling large amounts of information. This suggests that a larger context window doesn't necessarily translate to better processing quality.

While performance benchmarks and context window size cover some LLM capabilities, organizations also must evaluate other model features, such as language capabilities, multimodality, fine-tuning, availability and other specific characteristics that align with their needs.

Take Google's Gemini 1.5 as an example. The table below breaks down some of its main features.

While Gemini 1.5 has some impressive properties -- including being the only model capable of handling up to 2 million tokens as of publication time -- it's only available as a cloud service through Google. This could be a drawback for organizations that use another cloud provider, want to host LLMs on their infrastructure or need to run LLMs on a small device.

Fortunately, a wide range of LLMs supports on-premises deployment. For example, Meta's Llama 3 series of models offers a variety of model sizes and functionalities, providing more flexibility for organizations with specific infrastructure requirements.
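As a rough sketch of what on-premises deployment can look like, the snippet below loads a Llama 3 model through the Hugging Face transformers library. It assumes the transformers, torch and accelerate packages are installed, that Meta's license for the model has been accepted on the Hugging Face Hub, and that a GPU with sufficient VRAM is available.

```python
# Sketch: serving a Llama 3 model locally via Hugging Face transformers.
# Requires: pip install transformers torch accelerate, plus model access
# granted on the Hugging Face Hub (Meta's license must be accepted).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # 8B variant; others exist
    device_map="auto",  # spread weights across available GPU(s)
)

result = generator(
    "List two benefits of hosting an LLM on premises.",
    max_new_tokens=100,
)
print(result[0]["generated_text"])
```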

Another essential factor to evaluate when choosing an LLM is its infrastructure requirements.

Larger models with more parameters need more GPU VRAM to run effectively on an organization's infrastructure. A general rule of thumb is to double the number of parameters (in billions) to estimate the GPU VRAM a model requires in gigabytes, because weights stored at 16-bit precision take two bytes per parameter. For example, a model with 1 billion parameters would require approximately 2 GB of GPU VRAM to function effectively.
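That rule of thumb is easy to express in code. The sketch below estimates weight memory only; real deployments also need headroom for activations and the KV cache, so treat these figures as lower bounds.

```python
# Rule of thumb from above: at 16-bit precision each parameter takes 2 bytes,
# so weight memory in GB is roughly 2x the parameter count in billions.
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough weight-only VRAM estimate; excludes activations and KV cache."""
    return params_billions * bytes_per_param

for size in (1, 8, 70):
    print(f"{size}B parameters -> ~{estimate_vram_gb(size):.0f} GB of VRAM")
```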

As an example, the table below shows the features, capabilities and GPU requirements of several Llama models.

When considering GPU requirements, an organization's choice of LLM will depend heavily on its intended use case. For instance, if the goal is to run an LLM application with vision features on a standard end-user device, Llama 3.2 11B could be a good fit, as it supports vision tasks while requiring only moderate memory. However, if the application is intended for mobile devices, Llama 3.2 1B might be more suitable thanks to its lower memory needs, which enable it to run on smaller devices.
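One way to operationalize this kind of matching is a simple filter over candidate models, as in the hypothetical sketch below. The VRAM figures follow the doubling rule of thumb above, and the entries are illustrative rather than official specifications.

```python
# Hypothetical helper: filter candidate models by device VRAM budget and
# required features. Entries are illustrative, not official specifications.
candidates = [
    {"name": "Llama 3.2 1B", "vram_gb": 2, "vision": False},
    {"name": "Llama 3.2 11B", "vram_gb": 22, "vision": True},
    {"name": "Llama 3.1 70B", "vram_gb": 140, "vision": False},
]

def pick_models(budget_gb: float, needs_vision: bool) -> list[str]:
    return [
        m["name"]
        for m in candidates
        if m["vram_gb"] <= budget_gb and (m["vision"] or not needs_vision)
    ]

print(pick_models(budget_gb=24, needs_vision=True))   # workstation-class GPU
print(pick_models(budget_gb=4, needs_vision=False))   # mobile-class device
```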
