Fallom Evals lets you evaluate your LLM outputs using G-Eval with an LLM-as-judge approach. You can run evaluations locally using our SDK, or directly from the Fallom dashboard on your production traces—no code required.

Features

  • 7 Built-in Metrics: Answer relevancy, hallucination, toxicity, faithfulness, completeness, coherence, bias
  • G-Eval Methodology: Chain-of-thought prompting for accurate scoring
  • Dashboard Evals: Run evals on production traces directly from the UI—no SDK needed
  • Model Comparison: Test multiple models on the same dataset
  • Custom Pipeline Support: Evaluate outputs from your own RAG or multi-agent systems
  • Auto-Upload: Results automatically sync to your dashboard
  • Fallom Datasets: Use datasets stored in Fallom or create them locally

Dashboard Evals (No Code Required)

If you’re already logging traces to Fallom, you can run evaluations directly from the dashboard without writing any code. Navigate to Evals Store → Evals to create evaluation configs that automatically sample and evaluate your production traces.

Creating an Eval Config

  1. Go to Evals Store → Evals in your dashboard
  2. Click New Config
  3. Configure your evaluation (the fields are sketched after this list):
    • Name: A descriptive name for your eval config
    • Sample Rate: Percentage of traces to evaluate (0.01% to 100%)
    • Judge Model: The LLM to use as the evaluator (e.g., openai/gpt-4o-mini)
    • Filter by Tags: Only evaluate traces with specific tags (optional)
    • Filter by Models: Only evaluate traces from specific models (optional)
    • Metrics: Select which metrics to run (answer relevancy, hallucination, etc.)
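
Taken together, an eval config is just a small bundle of settings. The sketch below is illustrative only (dashboard configs are created in the UI, not through the SDK); the keys mirror the fields above and the values are placeholders.

# Illustrative only: the same fields you set in the dashboard, written as a
# plain dict so the shape of a config is concrete.
eval_config = {
    "name": "prod-chat-quality",           # descriptive name
    "sample_rate": 5.0,                    # evaluate 5% of matching traces
    "judge_model": "openai/gpt-4o-mini",   # LLM used as the evaluator
    "filter_tags": ["customer-support"],   # optional: only traces with these tags
    "filter_models": ["openai/gpt-4o"],    # optional: only traces from these models
    "metrics": ["answer_relevancy", "hallucination", "toxicity"],
}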

Running Evaluations

Once you’ve created a config, click Run Now to start an evaluation. Fallom will:
  1. Sample recent traces from the last 15 minutes matching your filters (sampling behavior is sketched after this list)
  2. Queue them for evaluation using your selected judge model
  3. Score each trace against your chosen metrics
  4. Display results with per-metric scores and aggregated statistics
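
The sample rate is applied per trace: at 5%, each matching trace has roughly a one-in-twenty chance of being queued. A minimal sketch of that behavior, for illustration only (the real sampling happens server-side in Fallom):

import random

def should_evaluate(sample_rate_percent: float) -> bool:
    # Each matching trace is picked independently with probability sample_rate_percent / 100
    return random.random() * 100 < sample_rate_percent

picked = sum(should_evaluate(5.0) for _ in range(10_000))
print(f"sampled ~{picked / 100:.1f}% of 10,000 traces")  # close to 5%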

Viewing Results

Each evaluation run shows:
  • Sample Count: How many traces were evaluated
  • Scores: Average, min, and max scores for each metric
  • Individual Results: Click on a run to see detailed scores for each trace
  • Regression Detection: Automatic alerts when quality drops compared to previous runs (sketched below)
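
Regression detection conceptually compares a run's aggregate scores against the previous run. A rough sketch of the idea (the exact thresholding Fallom uses is not documented here; the 0.1 drop is an assumed value for illustration):

def is_regression(previous_avg: float, current_avg: float, max_drop: float = 0.1) -> bool:
    # Flag the run if the average metric score fell by more than max_drop
    return (previous_avg - current_avg) > max_drop

print(is_regression(previous_avg=0.88, current_avg=0.71))  # True: the average dropped by 0.17
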
Note: Dashboard evals require traces to be logged to Fallom first. Make sure you have tracing set up before running dashboard evals.

Quick Start

No code required? If you’re already logging traces to Fallom, you can skip the SDK setup and run evals directly from the dashboard.

Using the SDK

from fallom import evals

# Initialize (enables auto-upload to dashboard)
evals.init(api_key="your-fallom-api-key")

# Create a dataset
dataset = [
    evals.DatasetItem(
        input="What is the capital of France?",
        output="The capital of France is Paris.",
        system_message="You are a helpful assistant."
    ),
]

# Run evaluation - results auto-upload!
results = evals.evaluate(
    dataset=dataset,
    metrics=["answer_relevancy", "faithfulness", "completeness"]
)

Environment Variables

# Required
FALLOM_API_KEY=your-fallom-api-key     # For uploading results & fetching datasets
OPENROUTER_API_KEY=your-openrouter-key # For judge model & model comparison

Available Metrics

  • answer_relevancy: Does the response directly address the user’s question?
  • hallucination: Does the response contain fabricated information?
  • toxicity: Does the response contain harmful or offensive content?
  • faithfulness: Is the response factually accurate and consistent?
  • completeness: Does the response fully address all aspects of the request?
  • coherence: Is the response logically structured and easy to follow?
  • bias: Does the response contain unfair or prejudiced content?
All metrics return a score from 0.0 to 1.0, where higher is better (except hallucination, toxicity, and bias, where higher means more problematic content detected).
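
If you post-process scores yourself, keep that direction in mind. A small helper sketch; the 0.7 pass threshold is an assumption for illustration, not a Fallom default:

INVERTED_METRICS = {"hallucination", "toxicity", "bias"}  # higher = more problematic

def passes(metric: str, score: float, threshold: float = 0.7) -> bool:
    if metric in INVERTED_METRICS:
        return score <= 1.0 - threshold  # e.g. hallucination 0.20 passes
    return score >= threshold            # e.g. answer_relevancy 0.85 passes

print(passes("answer_relevancy", 0.85))  # True
print(passes("hallucination", 0.85))     # False: heavy fabrication detected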

Custom Metrics

Create your own evaluation metrics with custom criteria and evaluation steps:
from fallom import evals

evals.init()

# Define a custom metric
brand_metric = evals.CustomMetric(
    name="brand_alignment",
    criteria="Brand Alignment - Does the response follow brand voice guidelines?",
    steps=[
        "Check if the tone is professional yet friendly",
        "Verify no competitor brands are mentioned",
        "Ensure the response uses approved terminology",
        "Check for appropriate emoji usage (none in formal contexts)"
    ]
)

# Use with built-in metrics
results = evals.evaluate(
    dataset=dataset,
    metrics=["answer_relevancy", brand_metric]
)

Custom metrics use the same G-Eval methodology as built-in metrics - the LLM judge follows your steps and provides reasoning and a score.

Using Datasets from Fallom

Instead of creating datasets locally, you can use datasets stored in Fallom. Just pass the dataset key:
from fallom import evals

evals.init()

# Just pass the dataset key - it auto-fetches from Fallom!
results = evals.evaluate(
    dataset="my-dataset-key",  # Dataset key from Fallom
    metrics=["answer_relevancy", "faithfulness"]
)

Evaluating Custom Pipelines

If you have a complex LLM pipeline (RAG, multi-agent, custom routing), use EvaluationDataset to run your own pipeline and evaluate the outputs:
from fallom import evals

evals.init(api_key="your-fallom-api-key")

# Pull a dataset from Fallom
dataset = evals.EvaluationDataset()
dataset.pull("customer-support-qa")

# Run each input through YOUR pipeline
for golden in dataset.goldens:
    # Your custom pipeline (RAG, agent routing, etc.) returns the answer
    # plus any retrieved documents
    actual_output, retrieved_docs = my_rag_pipeline(golden.input)

    dataset.add_test_case(evals.LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        context=retrieved_docs  # Optional: for faithfulness eval
    ))

# Evaluate your outputs
results = evals.evaluate(
    test_cases=dataset.test_cases,
    metrics=["answer_relevancy", "faithfulness", "completeness"]
)

Auto-Generate Test Cases

For simpler pipelines, use generate_test_cases() to automatically run all inputs through your pipeline:
from fallom import evals

evals.init()

def my_pipeline(messages):
    """Your pipeline function"""
    user_query = messages[-1]["content"]
    response = my_rag_app(user_query)
    return {"content": response.text}

dataset = evals.EvaluationDataset()
dataset.pull("my-dataset")
dataset.generate_test_cases(my_pipeline)  # Runs all inputs

results = evals.evaluate(
    test_cases=dataset.test_cases,
    metrics=["answer_relevancy", "faithfulness"]
)

Model Comparison

Compare how different models perform on the same dataset:
from fallom import evals

evals.init()

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=["anthropic/claude-3.5-sonnet", "openai/gpt-4o", "google/gemini-2.0-flash"],
    metrics=["answer_relevancy", "faithfulness"],
    name="Model Comparison Q4 2024"
)

# Results show scores for each model:
# - production (your original outputs)
# - anthropic/claude-3.5-sonnet
# - openai/gpt-4o
# - google/gemini-2.0-flash

Custom & Fine-Tuned Models

You can include your own fine-tuned or self-hosted models in comparisons:
from fallom import evals

evals.init()

# Fine-tuned OpenAI model
fine_tuned = evals.create_openai_model(
    "ft:gpt-4o-2024-08-06:my-org::abc123",
    name="my-fine-tuned"
)

# Self-hosted model (vLLM, Ollama, etc.)
my_llama = evals.create_custom_model(
    name="my-llama-70b",
    endpoint="http://localhost:8000/v1/chat/completions",
    model_value="meta-llama/Llama-3.1-70B-Instruct"
)

# Compare all together
comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=[fine_tuned, my_llama, "openai/gpt-4o"],
    metrics=["answer_relevancy", "faithfulness"]
)

Custom Judge Model

By default, evaluations use openai/gpt-4o-mini via OpenRouter as the judge. You can specify a different judge:
results = evals.evaluate(
    dataset="my-dataset-key",
    metrics=["answer_relevancy"],
    judge_model="anthropic/claude-3.5-sonnet"  # Use Claude as judge
)

API Reference

evaluate()

Evaluate outputs against metrics. Use either dataset or test_cases.
  • dataset (list[DatasetItem] or str): Dataset items or Fallom dataset key
  • test_cases (list[LLMTestCase]): Test cases from EvaluationDataset
  • metrics (list[str | CustomMetric], default: all built-in): Metrics to run (built-in or custom)
  • judge_model (str, default: "openai/gpt-4o-mini"): Model to use as judge
  • name (str, default: auto-generated): Name for this evaluation run
  • description (str, default: None): Optional description
  • verbose (bool, default: True): Print progress

Either dataset or test_cases must be provided.
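
A call that exercises the optional parameters (values are placeholders):

results = evals.evaluate(
    dataset="my-dataset-key",
    metrics=["answer_relevancy", "coherence"],
    judge_model="openai/gpt-4o-mini",
    name="support-bot-nightly",
    description="Nightly eval of the support bot dataset",
    verbose=False,
)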

compare_models() / compareModels()

Compare multiple models on the same dataset.
  • dataset (list[DatasetItem] or str, required): Dataset items or Fallom dataset key
  • models (list[str or Model], required): Models to compare
  • metrics (list[str | CustomMetric], default: all built-in): Metrics to run (built-in or custom)
  • judge_model (str, default: "openai/gpt-4o-mini"): Model to use as judge
  • include_production (bool, default: True): Include original outputs
  • name (str, default: auto-generated): Name for this comparison run
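
For example, to score candidate models without re-evaluating your original production outputs (values are placeholders):

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=["openai/gpt-4o", "anthropic/claude-3.5-sonnet"],
    metrics=["answer_relevancy"],
    include_production=False,  # skip the original production outputs
)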

Model Helpers

  • create_openai_model() / createOpenAIModel(): Create a model for fine-tuned OpenAI or Azure OpenAI
  • create_custom_model() / createCustomModel(): Create a model for any OpenAI-compatible endpoint
  • create_model_from_callable() / createModelFromCallable(): Create a model from a custom function
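
create_model_from_callable() has no example elsewhere on this page, so the sketch below is an assumption: it guesses that the wrapped callable follows the same messages-in, {"content": ...}-out contract used by generate_test_cases() above, and the argument names are hypothetical. Check the SDK reference for the exact signature.

# Hypothetical sketch only: argument names and order are assumptions.
def call_my_model(messages):
    user_query = messages[-1]["content"]
    answer = my_in_house_model.generate(user_query)  # my_in_house_model is a placeholder
    return {"content": answer}

in_house = evals.create_model_from_callable("in-house-model", call_my_model)

# The wrapped model could then be compared like any other:
comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=[in_house, "openai/gpt-4o"],
    metrics=["answer_relevancy"],
)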

Metric Helpers

  • custom_metric() / customMetric(): Create a custom evaluation metric with G-Eval
  • CustomMetric: Class/interface for defining custom metrics

EvaluationDataset

A class for managing datasets and test cases when using your own LLM pipeline.
dataset = evals.EvaluationDataset()

# Methods
dataset.pull(alias, version=None)     # Pull dataset from Fallom
dataset.add_golden(golden)            # Add a golden record
dataset.add_test_case(test_case)      # Add a test case
dataset.generate_test_cases(llm_app)  # Auto-generate test cases
dataset.clear_test_cases()            # Clear all test cases

# Properties
dataset.goldens      # List of Golden records
dataset.test_cases   # List of LLMTestCase records
dataset.dataset_key  # Fallom dataset key (if pulled)

LLMTestCase

A test case for evaluation containing input and actual output from your LLM.
  • input (str): The user input/query
  • actual_output (str): Output from your LLM pipeline
  • expected_output (str, optional): Expected output for comparison
  • system_message (str, optional): System prompt used
  • context (list[str], optional): Retrieved docs for RAG eval
  • metadata (dict, optional): Additional metadata
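
A fully populated test case, tying the fields above together (values are placeholders):

test_case = evals.LLMTestCase(
    input="How do I reset my password?",
    actual_output="Go to Settings > Security and click 'Reset password'.",
    expected_output="Explain the reset flow under Settings > Security.",  # optional
    system_message="You are a support assistant.",                        # optional
    context=["Help Center: Resetting your password..."],                  # optional, for RAG eval
    metadata={"pipeline_version": "2024-11-01"},                          # optional
)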

Golden

A golden record from a dataset containing input and optionally expected output.
  • input (str): The user input/query
  • expected_output (str, optional): The expected/golden output
  • system_message (str, optional): System prompt
  • context (list[str], optional): Context documents
  • metadata (dict, optional): Additional metadata
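
To build a dataset locally from golden records, something like the sketch below should work; it assumes Golden is exposed as evals.Golden (only the field names are confirmed by the table above).

# Assumes evals.Golden exists with these keyword fields; values are placeholders.
dataset = evals.EvaluationDataset()
dataset.add_golden(evals.Golden(
    input="What plans do you offer?",
    expected_output="We offer Free, Pro, and Enterprise plans.",  # optional
    system_message="You are a sales assistant.",                  # optional
))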

Next Steps