Generative AI models are notoriously unpredictable, producing different outputs from the same input. This variability makes traditional software testing methods insufficient for AI systems. To overcome this challenge, developers can employ evaluations (evals) - structured tests that measure a model's accuracy, quality, and reliability despite that variability.
What Are Evaluations?
Evals are designed to test an AI system despite its nondeterministic nature, and they help you improve an LLM-based application as you iterate on prompts, models, and fine-tuning. When developing evals for your use case, you can leverage industry benchmarks like MMLU or Hugging Face's leaderboards, standard numerical scores like ROUGE or BERTScore, or implement specific tests to measure your application's performance.
Types of Evaluations
The term "evals" can refer to several different things:
- Industry benchmarks for comparing models in isolation
- Standard numerical scores you can use as you design evals for your use case
- Specific tests you implement to measure your LLM application's performance
This guide focuses on designing your own evals - a crucial step in LLM application development.
How to Read Evaluations
When reviewing eval results, you'll often see numerical scores between 0 and 1. However, there's more to evals than just scores. Combine metrics with human judgment to ensure you're answering the right questions and improving your AI system.
Evaluation Tips
To create effective evals:
- Adopt an eval-driven development approach: Evaluate early and often, writing scoped tests at every stage
- Design task-specific evals: Make tests reflect model capability in real-world distributions
- Log everything: Record as you develop so you can mine your logs for good eval cases
- Automate when possible: Structure evaluations to allow for automated scoring
- It's a journey, not a destination: Evaluation is a continuous process
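The "automate when possible" tip above can be made concrete with a couple of deterministic graders. This is a minimal sketch assuming your application returns plain strings; the function names and scoring conventions here are illustrative, not part of any eval framework's API:

```python
# Simple automated graders that return a score between 0.0 and 1.0.
# These are illustrative sketches, not a standard eval-library API.

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 only when the normalized output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def keyword_coverage(output: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the output (case-insensitive)."""
    if not required_keywords:
        return 1.0
    hits = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    return hits / len(required_keywords)
```

Graders like these run in CI on every change, which keeps "evaluate early and often" cheap; fuzzier criteria (tone, coherence) still need model-based or human scoring.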
Anti-Patterns
Avoid these common pitfalls:
- Overly generic metrics: Relying solely on academic metrics like perplexity or BLEU score
- Biased design: Creating eval datasets that don't faithfully reproduce production traffic patterns
- Vibe-based evals: Using "it seems like it's working" as an evaluation strategy, or waiting until you ship before implementing any evals
- Ignoring human feedback: Not calibrating your automated metrics against human evals
Design Your Evaluation Process
To create a comprehensive eval workflow:
- Define eval objective: Determine the success criteria for the eval
- Collect dataset: Gather data that helps evaluate against your objective, considering synthetic eval data, domain-specific eval data, purchased eval data, human-curated eval data, production data, and historical data
- Define eval metrics: Establish how you'll check that the success criteria are met
- Run and compare evals: Iterate and improve model performance for your task or system
- Continuously evaluate: Set up continuous evaluation (CE) to run evals on every change, monitor your app to identify new cases of nondeterminism, and grow the eval set over time
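The workflow above can be sketched as a small eval loop: a labeled dataset, the system under test, a grader, and a pass threshold. Everything here (the dataset shape, the `grade` signature, the threshold) is an illustrative assumption, not a real API:

```python
# A minimal eval loop: run the system under test on each case, score each
# output with a grader, and compare the mean score to a pass threshold.

def run_eval(dataset, generate, grade, threshold=0.8):
    """Evaluate `generate` over `dataset` using `grade`; report aggregate results."""
    if not dataset:
        raise ValueError("eval dataset is empty")
    scores = [grade(generate(case["input"]), case["expected"]) for case in dataset]
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "passed": mean >= threshold, "n": len(scores)}

# Usage with a trivial stand-in for the model under test:
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
result = run_eval(
    dataset,
    generate=lambda x: {"2+2": "4", "capital of France": "Paris"}[x],
    grade=lambda out, exp: 1.0 if out == exp else 0.0,
)
```

In practice `generate` would call your LLM application and `grade` would be one of your eval metrics; the loop structure stays the same as the eval set grows.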
Let's walk through a few examples:
Example: Summarizing Transcripts
To test an LLM-based application's ability to summarize transcripts, your eval design might involve:
- Defining the eval objective: Generated summaries should match reference summaries in relevance and accuracy
- Collecting dataset: Using a mix of production data (collected from user feedback on generated summaries) and datasets created by domain experts (writers)
- Defining eval metrics: Achieving a ROUGE-L score of at least 0.40 and a G-Eval coherence score of at least 80% on a held-out set of 1,000 transcript-summary reference pairs
- Running and comparing evals: Using the Evals API to create and run evals in the OpenAI dashboard
- Continuously evaluating: Setting up CE to run evals on every change, monitoring your app to identify new cases of nondeterminism, and growing the eval set over time
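To make the ROUGE-L metric in the example above less abstract, here is a rough sketch of ROUGE-L F1 computed from the longest common subsequence of tokens. Real evals typically use a library such as rouge-score; this simplified version skips stemming and multi-reference handling:

```python
# Simplified ROUGE-L: score a candidate summary against one reference
# via the longest common subsequence (LCS) of whitespace-split tokens.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 over LCS-based precision and recall; 0.0 when there is no overlap."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A threshold like "ROUGE-L of at least 0.40" then becomes a simple assertion over scores like these, averaged across the held-out reference set.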
LLMs excel at discriminating between options. Therefore, evaluations should focus on tasks like pairwise comparisons, classification, or scoring against specific criteria instead of open-ended generation. Aligning evaluation methods with LLMs' strengths in comparison leads to more reliable assessments of LLM outputs or model comparisons.
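Framing evaluation as a pairwise comparison, as described above, can be sketched as two small pieces: building a judge prompt and parsing the verdict. The prompt wording and the strict A/B reply format are assumptions; the actual judge call is omitted so any chat-completion client can be plugged in:

```python
# Sketch of an LLM-as-judge pairwise comparison. The judge model call is
# intentionally left out; these helpers only build the prompt and parse
# the reply, and the A/B format is an assumed convention.

def build_pairwise_prompt(task: str, output_a: str, output_b: str) -> str:
    """Assemble a judge prompt asking which of two responses better completes the task."""
    return (
        f"Task: {task}\n\n"
        f"Response A:\n{output_a}\n\n"
        f"Response B:\n{output_b}\n\n"
        "Which response better completes the task? Reply with exactly 'A' or 'B'."
    )

def parse_verdict(judge_reply: str) -> str:
    """Map the judge's raw reply onto 'A', 'B', or 'invalid'."""
    verdict = judge_reply.strip().upper()
    return verdict if verdict in ("A", "B") else "invalid"
```

Constraining the judge to a fixed label set makes verdicts machine-checkable; replies that don't parse can be retried or flagged for human review, which also supports calibrating the judge against human preferences.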
Example: Q&A over Docs
To test an LLM-based application's ability to do Q&A over docs, your eval design might involve:
- Defining the eval objective: The model should provide precise answers, recall context as needed to reason through user prompts, and provide an answer that satisfies the user's need
- Collecting dataset: Using a mix of production data (collected from users' satisfaction with answers provided to their questions), hard-coded correct answers to questions created by domain experts, and historical data from logs
- Defining eval metrics: Context recall of at least 0.85, context precision of over 0.7, and 70+% positively rated answers
- Running and comparing evals: Using the Evals API to create and run evals in the OpenAI dashboard
- Continuously evaluating: Setting up CE to run evals on every change, monitoring your app to identify new cases of nondeterminism, and growing the eval set over time
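The context recall and precision targets in the example above can be sketched at the chunk level: given the chunks the retriever returned and the gold chunks a correct answer needs, compute how much of the gold set was retrieved and how much of the retrieved set was needed. Chunk IDs as comparable strings are an assumption; production metrics often judge at the statement level instead:

```python
# Chunk-level retrieval metrics for a Q&A-over-docs eval. Sets of chunk IDs
# are an assumed representation, not a standard schema.

def context_recall(retrieved: set[str], gold: set[str]) -> float:
    """Share of gold chunks that were actually retrieved."""
    return len(retrieved & gold) / len(gold) if gold else 1.0

def context_precision(retrieved: set[str], gold: set[str]) -> float:
    """Share of retrieved chunks that were actually needed."""
    return len(retrieved & gold) / len(retrieved) if retrieved else 0.0
```

Thresholds like "recall of at least 0.85" then apply to these scores averaged over the eval set; low precision with high recall typically signals an over-eager retriever padding the context.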
When creating an eval dataset, consider using o3 and GPT-4.1 to help you generate a diverse set of test data across various scenarios. Ensure your test data includes typical cases, edge cases, and adversarial cases, and have human expert labelers review and correct the generated cases.
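One way to act on the advice above is to tag each eval case with its category and check coverage before relying on the dataset. The `case_type` field and category names are an assumed labeling convention, not a standard schema:

```python
# Check that an eval dataset covers typical, edge, and adversarial cases.
# The "case_type" field is an assumed convention for this sketch.

REQUIRED_TYPES = {"typical", "edge", "adversarial"}

def missing_case_types(dataset: list[dict]) -> set[str]:
    """Return the required case types the dataset does not cover yet."""
    present = {case.get("case_type") for case in dataset}
    return REQUIRED_TYPES - present
```

Running a check like this as part of dataset curation makes coverage gaps visible early, before a skewed eval set starts rewarding the wrong behavior.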
Identifying Where You Need Evaluations
As your architecture grows more complex, it's essential to identify where nondeterminism enters your system - that's where you'll want to implement evals. Here are four common architecture patterns:
- Single-turn model interactions
- Multi-turn model interactions
- Conditional generation
- Generative text completion
Read about each architecture pattern below to identify where nondeterminism enters your system, and where you'll need to implement evals.
Note: This article is for informational purposes only. It's not a substitute for professional advice or guidance on building LLM-based applications.