Fallom Evals lets you evaluate your LLM outputs using G-Eval with an LLM-as-judge approach. Run evaluations locally in your environment, and results are automatically uploaded to your Fallom dashboard.

Features

  • 5 Built-in Metrics: Answer relevancy, hallucination, toxicity, faithfulness, completeness
  • G-Eval Methodology: Chain-of-thought prompting for accurate scoring
  • Model Comparison: Test multiple models on the same dataset
  • Auto-Upload: Results automatically sync to your dashboard
  • Fallom Datasets: Use datasets stored in Fallom or create them locally

Quick Start

from fallom import evals

# Initialize (enables auto-upload to dashboard)
evals.init(api_key="your-fallom-api-key")

# Create a dataset
dataset = [
    evals.DatasetItem(
        input="What is the capital of France?",
        output="The capital of France is Paris.",
        system_message="You are a helpful assistant."
    ),
]

# Run evaluation - results auto-upload!
results = evals.evaluate(
    dataset=dataset,
    metrics=["answer_relevancy", "faithfulness", "completeness"]
)

Environment Variables

# Required
FALLOM_API_KEY=your-fallom-api-key     # For uploading results & fetching datasets
OPENROUTER_API_KEY=your-openrouter-key # For judge model & model comparison
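If you prefer to set these from Python (for example in a notebook), a minimal sketch using the standard library is shown below; it assumes evals.init() falls back to these environment variables when called without arguments, as in the later examples.

import os

# Placeholder values, substitute your real keys
os.environ["FALLOM_API_KEY"] = "your-fallom-api-key"
os.environ["OPENROUTER_API_KEY"] = "your-openrouter-key"

from fallom import evals

evals.init()  # assumed to pick up FALLOM_API_KEY from the environment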

Available Metrics

| Metric | Description |
|--------|-------------|
| answer_relevancy | Does the response directly address the user’s question? |
| hallucination | Does the response contain fabricated information? |
| toxicity | Does the response contain harmful or offensive content? |
| faithfulness | Is the response factually accurate and consistent? |
| completeness | Does the response fully address all aspects of the request? |
All metrics return a score from 0.0 to 1.0, where higher is better (except hallucination and toxicity, where higher means more hallucinated/toxic content detected).
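Because hallucination and toxicity run in the opposite direction, it can help to flip them before aggregating or thresholding. The sketch below operates on a plain {metric: score} dict for illustration; it is not the SDK's result type.

# Illustrative helper: flip hallucination/toxicity so every metric reads "higher is better".
# Works on a plain dict of scores, not on the object returned by evals.evaluate().
def higher_is_better(scores: dict) -> dict:
    inverted = {"hallucination", "toxicity"}
    return {m: 1.0 - s if m in inverted else s for m, s in scores.items()}

print(higher_is_better({"answer_relevancy": 0.9, "hallucination": 0.2, "toxicity": 0.0}))
# {'answer_relevancy': 0.9, 'hallucination': 0.8, 'toxicity': 1.0}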

Using Datasets from Fallom

Instead of creating datasets locally, you can use datasets stored in Fallom. Just pass the dataset key:

from fallom import evals

evals.init()

# Just pass the dataset key - it auto-fetches from Fallom!
results = evals.evaluate(
    dataset="my-dataset-key",  # Dataset key from Fallom
    metrics=["answer_relevancy", "faithfulness"]
)

Model Comparison

Compare how different models perform on the same dataset:

from fallom import evals

evals.init()

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=["anthropic/claude-3.5-sonnet", "openai/gpt-4o", "google/gemini-2.0-flash"],
    metrics=["answer_relevancy", "faithfulness"],
    name="Model Comparison Q4 2024"
)

# Results show scores for each model:
# - production (your original outputs)
# - anthropic/claude-3.5-sonnet
# - openai/gpt-4o
# - google/gemini-2.0-flash

Custom & Fine-Tuned Models

You can include your own fine-tuned or self-hosted models in comparisons:

from fallom import evals

evals.init()

# Fine-tuned OpenAI model
fine_tuned = evals.create_openai_model(
    "ft:gpt-4o-2024-08-06:my-org::abc123",
    name="my-fine-tuned"
)

# Self-hosted model (vLLM, Ollama, etc.)
my_llama = evals.create_custom_model(
    name="my-llama-70b",
    endpoint="http://localhost:8000/v1/chat/completions",
    model_value="meta-llama/Llama-3.1-70B-Instruct"
)

# Compare all together
comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=[fine_tuned, my_llama, "openai/gpt-4o"],
    metrics=["answer_relevancy", "faithfulness"]
)

Custom Judge Model

By default, evaluations use openai/gpt-4o-mini via OpenRouter as the judge. You can specify a different judge:
results = evals.evaluate(
    dataset="my-dataset-key",
    metrics=["answer_relevancy"],
    judge_model="anthropic/claude-3.5-sonnet"  # Use Claude as judge
)

API Reference

evaluate()

Evaluate production outputs against metrics.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| dataset | list[DatasetItem] or str | required | Dataset items or Fallom dataset key |
| metrics | list[str] | all metrics | Metrics to run |
| judge_model | str | "openai/gpt-4o-mini" | Model to use as judge |
| name | str | auto-generated | Name for this evaluation run |
| description | str | None | Optional description |
| verbose | bool | True | Print progress |
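For example, a run that sets the optional parameters explicitly (the name and description strings are placeholders, and evals.init() is assumed to have been called as in Quick Start):

results = evals.evaluate(
    dataset="my-dataset-key",
    metrics=["answer_relevancy", "hallucination"],
    judge_model="openai/gpt-4o-mini",
    name="nightly-regression",                           # placeholder name
    description="Nightly check of production outputs",   # placeholder description
    verbose=False
)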

compare_models() / compareModels()

Compare multiple models on the same dataset.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| dataset | list[DatasetItem] or str | required | Dataset items or Fallom dataset key |
| models | list[str or Model] | required | Models to compare |
| metrics | list[str] | all metrics | Metrics to run |
| judge_model | str | "openai/gpt-4o-mini" | Model to use as judge |
| include_production | bool | True | Include original outputs |
| name | str | auto-generated | Name for this comparison run |
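For example, to score only the candidate models and skip re-evaluating your stored production outputs (again assuming evals.init() has been called):

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=["anthropic/claude-3.5-sonnet", "openai/gpt-4o"],
    metrics=["answer_relevancy", "faithfulness"],
    include_production=False  # don't re-score the original outputs
)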

Model Helpers

| Function | Description |
|----------|-------------|
| create_openai_model() / createOpenAIModel() | Create model for fine-tuned OpenAI or Azure OpenAI |
| create_custom_model() / createCustomModel() | Create model for any OpenAI-compatible endpoint |
| create_model_from_callable() / createModelFromCallable() | Create model from custom function |
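create_model_from_callable() is the one helper not shown earlier. The sketch below is hedged: the callable signature (prompt string in, reply string out) and the fn parameter name are assumptions, so check the SDK for the exact contract; evals.init() is assumed to have been called as in the examples above.

# Assumption: the callable takes the prompt/input text and returns the model's reply as a string.
def call_my_model(prompt: str) -> str:
    # Replace with a real call to your model (HTTP request, local inference, etc.)
    return "stubbed reply for: " + prompt

my_model = evals.create_model_from_callable(
    name="my-callable-model",  # illustrative name
    fn=call_my_model           # parameter name is an assumption
)

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=[my_model, "openai/gpt-4o"],
    metrics=["answer_relevancy"]
)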

Next Steps