Features
- 7 Built-in Metrics: Answer relevancy, hallucination, toxicity, faithfulness, completeness, coherence, bias
- G-Eval Methodology: Chain-of-thought prompting for accurate scoring
- Dashboard Evals: Run evals on production traces directly from the UI—no SDK needed
- Model Comparison: Test multiple models on the same dataset
- Custom Pipeline Support: Evaluate outputs from your own RAG or multi-agent systems
- Auto-Upload: Results automatically sync to your dashboard
- Fallom Datasets: Use datasets stored in Fallom or create them locally
Dashboard Evals (No Code Required)
If you’re already logging traces to Fallom, you can run evaluations directly from the dashboard without writing any code. Navigate to Evals Store → Evals to create evaluation configs that automatically sample and evaluate your production traces.
Creating an Eval Config
- Go to Evals Store → Evals in your dashboard
- Click New Config
- Configure your evaluation:
- Name: A descriptive name for your eval config
- Sample Rate: Percentage of traces to evaluate (0.01% to 100%)
- Judge Model: The LLM to use as the evaluator (e.g., openai/gpt-4o-mini)
- Filter by Tags: Only evaluate traces with specific tags (optional)
- Filter by Models: Only evaluate traces from specific models (optional)
- Metrics: Select which metrics to run (answer relevancy, hallucination, etc.)
Running Evaluations
Once you’ve created a config, click Run Now to start an evaluation. Fallom will:
- Sample recent traces from the last 15 minutes matching your filters
- Queue them for evaluation using your selected judge model
- Score each trace against your chosen metrics
- Display results with per-metric scores and aggregated statistics
Viewing Results
Each evaluation run shows:
- Sample Count: How many traces were evaluated
- Scores: Average, min, and max scores for each metric
- Individual Results: Click on a run to see detailed scores for each trace
- Regression Detection: Automatic alerts when quality drops compared to previous runs
Dashboard evals require traces to be logged to Fallom first. Make sure you have tracing set up before running dashboard evals.
Quick Start
Using the SDK
- Python
- TypeScript
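A minimal end-to-end run might look like the sketch below. The `fallom.evals` module path is an assumption; the `evaluate()` parameters and `LLMTestCase` fields follow the API reference later on this page.

```python
# Minimal evaluation sketch -- the fallom.evals import path is an assumption;
# parameter names follow the API reference below.
from fallom.evals import LLMTestCase, evaluate

test_cases = [
    LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        expected_output="Paris",
    ),
]

# Scores each test case with the chosen metrics and auto-uploads the results
# to your Fallom dashboard.
results = evaluate(
    test_cases=test_cases,
    metrics=["answer_relevancy", "faithfulness"],
    name="quickstart-run",
)
```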
Environment Variables
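The SDK needs credentials for Fallom (so results can auto-upload) and for the judge model provider. The variable names in this sketch are assumptions; confirm the exact names in your dashboard’s setup instructions.

```python
# Assumed environment variables -- verify the exact names in your Fallom
# dashboard's setup instructions:
#   FALLOM_API_KEY      authenticates the SDK so results auto-upload
#   OPENROUTER_API_KEY  lets the default judge model run via OpenRouter
import os

assert os.environ.get("FALLOM_API_KEY"), "set FALLOM_API_KEY before running evals"
assert os.environ.get("OPENROUTER_API_KEY"), "set OPENROUTER_API_KEY for the judge model"
```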
Available Metrics
| Metric | Description |
|---|---|
| answer_relevancy | Does the response directly address the user’s question? |
| hallucination | Does the response contain fabricated information? |
| toxicity | Does the response contain harmful or offensive content? |
| faithfulness | Is the response factually accurate and consistent? |
| completeness | Does the response fully address all aspects of the request? |
| coherence | Is the response logically structured and easy to follow? |
| bias | Does the response contain unfair or prejudiced content? |
Custom Metrics
Create your own evaluation metrics with custom criteria and evaluation steps:
- Python
- TypeScript
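Here is a sketch of a custom metric. Only `custom_metric()` itself comes from the metric helpers table below; the `criteria` and `evaluation_steps` parameter names are assumptions based on the description above.

```python
# Custom metric sketch -- the criteria/evaluation_steps parameter names are
# assumptions; custom_metric() is listed under Metric Helpers below.
from fallom.evals import LLMTestCase, custom_metric, evaluate

conciseness = custom_metric(
    name="conciseness",
    criteria="The response answers the question without unnecessary detail.",
    evaluation_steps=[
        "Check that the response addresses the question directly.",
        "Penalize filler, repetition, and unrelated tangents.",
    ],
)

results = evaluate(
    test_cases=[
        LLMTestCase(input="What is 2 + 2?", actual_output="2 + 2 equals 4."),
    ],
    metrics=["answer_relevancy", conciseness],  # built-ins and custom metrics can mix
)
```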
Using Datasets from Fallom
Instead of creating datasets locally, you can use datasets stored in Fallom. Just pass the dataset key:
- Python
- TypeScript
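For example (the dataset key below is a placeholder from your own workspace, and the import path is assumed):

```python
from fallom.evals import evaluate

# "support-questions-v1" is a placeholder Fallom dataset key.
results = evaluate(
    dataset="support-questions-v1",
    metrics=["answer_relevancy", "hallucination"],
)
```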
Evaluating Custom Pipelines
If you have a complex LLM pipeline (RAG, multi-agent, custom routing), use EvaluationDataset to run your own pipeline and evaluate the outputs:
- Python
- TypeScript
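A rough sketch follows. The `EvaluationDataset` constructor and its `goldens`, `add_test_case()`, and `test_cases` members are assumptions; `Golden`, `LLMTestCase`, and `evaluate()` follow the API reference below.

```python
# Custom-pipeline sketch. EvaluationDataset's constructor, .goldens,
# .add_test_case(), and .test_cases are assumptions -- verify against the SDK.
from fallom.evals import EvaluationDataset, Golden, LLMTestCase, evaluate

def my_rag_pipeline(question: str) -> tuple[str, list[str]]:
    # Replace with your real RAG / multi-agent pipeline.
    return "stub answer", ["stub retrieved document"]

dataset = EvaluationDataset(
    goldens=[
        Golden(input="How do I reset my password?"),
        Golden(input="Which plans include SSO?"),
    ]
)

for golden in dataset.goldens:
    answer, retrieved_docs = my_rag_pipeline(golden.input)
    dataset.add_test_case(
        LLMTestCase(input=golden.input, actual_output=answer, context=retrieved_docs)
    )

results = evaluate(
    test_cases=dataset.test_cases,
    metrics=["faithfulness", "answer_relevancy"],
)
```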
Auto-Generate Test Cases
For simpler pipelines, use generate_test_cases() to automatically run all inputs through your pipeline:
- Python
- TypeScript
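A sketch, assuming `generate_test_cases()` accepts a callable that maps a golden’s input to your pipeline’s output string:

```python
# generate_test_cases() sketch -- the callable signature (input string ->
# output string) is an assumption.
from fallom.evals import EvaluationDataset, Golden, evaluate

def my_pipeline(user_input: str) -> str:
    # Replace with a call into your own pipeline.
    return f"Stub answer to: {user_input}"

dataset = EvaluationDataset(goldens=[Golden(input="Summarize our refund policy.")])
dataset.generate_test_cases(my_pipeline)   # runs every golden input through my_pipeline

results = evaluate(test_cases=dataset.test_cases)
```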
Model Comparison
Compare how different models perform on the same dataset:
- Python
- TypeScript
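For example (the model IDs and dataset key are placeholders; the import path is assumed, while the parameters follow the compare_models() reference below):

```python
from fallom.evals import compare_models

comparison = compare_models(
    dataset="support-questions-v1",        # placeholder Fallom dataset key
    models=["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"],
    metrics=["answer_relevancy", "completeness"],
    include_production=True,               # also score the outputs already in the dataset
)
```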
Custom & Fine-Tuned Models
You can include your own fine-tuned or self-hosted models in comparisons:
- Python
- TypeScript
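A sketch using the model helpers from the API reference; the helper parameter names (`model`, `api_key`, `base_url`) are assumptions, and the model IDs are placeholders.

```python
# Sketch: helper function names come from the Model Helpers table below;
# their parameter names are assumptions.
import os

from fallom.evals import compare_models, create_custom_model, create_openai_model

fine_tuned = create_openai_model(
    model="ft:gpt-4o-mini:acme::abc123",     # placeholder fine-tune ID
    api_key=os.environ["OPENAI_API_KEY"],
)

self_hosted = create_custom_model(
    model="llama-3-8b-instruct",             # placeholder model name
    base_url="http://localhost:8000/v1",     # any OpenAI-compatible endpoint
    api_key="not-needed",
)

comparison = compare_models(
    dataset="support-questions-v1",
    models=["openai/gpt-4o-mini", fine_tuned, self_hosted],
)
```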
Custom Judge Model
By default, evaluations use openai/gpt-4o-mini via OpenRouter as the judge. You can specify a different judge:
- Python
- TypeScript
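For example (the judge model ID is only an illustration, and the import path is assumed):

```python
from fallom.evals import LLMTestCase, evaluate

results = evaluate(
    test_cases=[LLMTestCase(input="Hi", actual_output="Hello! How can I help?")],
    judge_model="anthropic/claude-3.5-sonnet",   # any judge model ID your account can route to
)
```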
API Reference
evaluate()
Evaluate outputs against metrics. Use either dataset or test_cases.
- Python
- TypeScript
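Both call forms, sketched with placeholder values (module path assumed):

```python
from fallom.evals import LLMTestCase, evaluate   # module path assumed

# Form 1: pass a Fallom dataset key.
evaluate(dataset="my-dataset-key", metrics=["answer_relevancy"])

# Form 2: pass test cases you built yourself.
cases = [LLMTestCase(input="Hi", actual_output="Hello! How can I help?")]
evaluate(test_cases=cases, metrics=["answer_relevancy"])
```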
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset | list[DatasetItem] or str | — | Dataset items or Fallom dataset key |
| test_cases | list[LLMTestCase] | — | Test cases from EvaluationDataset |
| metrics | list[str or CustomMetric] | all built-in | Metrics to run (built-in or custom) |
| judge_model | str | "openai/gpt-4o-mini" | Model to use as judge |
| name | str | auto-generated | Name for this evaluation run |
| description | str | None | Optional description |
| verbose | bool | True | Print progress |
Either dataset or test_cases must be provided.
compare_models() / compareModels()
Compare multiple models on the same dataset.
- Python
- TypeScript
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset | list[DatasetItem] or str | required | Dataset items or Fallom dataset key |
| models | list[str or Model] | required | Models to compare |
| metrics | list[str or CustomMetric] | all built-in | Metrics to run (built-in or custom) |
| judge_model | str | "openai/gpt-4o-mini" | Model to use as judge |
| include_production | bool | True | Include original outputs |
| name | str | auto-generated | Name for this comparison run |
Model Helpers
| Function | Description |
|---|---|
| create_openai_model() / createOpenAIModel() | Create model for fine-tuned OpenAI or Azure OpenAI |
| create_custom_model() / createCustomModel() | Create model for any OpenAI-compatible endpoint |
| create_model_from_callable() / createModelFromCallable() | Create model from custom function |
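A sketch of `create_model_from_callable()`; the expected callable signature and the `name`/`fn` parameter names are assumptions.

```python
# create_model_from_callable() sketch -- the callable signature (input string
# -> output string) and the name/fn parameter names are assumptions.
from fallom.evals import compare_models, create_model_from_callable

def call_my_stack(user_input: str) -> str:
    # Replace with your own routing / post-processing logic.
    return f"Stub routed answer to: {user_input}"

my_stack = create_model_from_callable(name="my-routing-stack", fn=call_my_stack)

comparison = compare_models(
    dataset="support-questions-v1",   # placeholder dataset key
    models=[my_stack],
)
```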
Metric Helpers
| Function / Class | Description |
|---|---|
| custom_metric() / customMetric() | Create a custom evaluation metric with G-Eval |
| CustomMetric | Class/interface for defining custom metrics |
EvaluationDataset
A class for managing datasets and test cases when using your own LLM pipeline.
- Python
- TypeScript
LLMTestCase
A test case for evaluation containing the input and actual output from your LLM.
- Python
- TypeScript
| Field | Type | Description |
|---|---|---|
| input | str | The user input/query |
| actual_output | str | Output from your LLM pipeline |
| expected_output | str (optional) | Expected output for comparison |
| system_message | str (optional) | System prompt used |
| context | list[str] (optional) | Retrieved docs for RAG eval |
| metadata | dict (optional) | Additional metadata |
Golden
A golden record from a dataset containing an input and, optionally, an expected output.
- Python
- TypeScript
| Field | Type | Description |
|---|---|---|
| input | str | The user input/query |
| expected_output | str (optional) | The expected/golden output |
| system_message | str (optional) | System prompt |
| context | list[str] (optional) | Context documents |
| metadata | dict (optional) | Additional metadata |

