Fallom Evals lets you evaluate your LLM outputs using G-Eval with an LLM-as-judge approach. You can run evaluations locally using our SDK, or directly from the Fallom dashboard on your production traces—no code required.

Features

  • 7 Built-in Metrics: Answer relevancy, hallucination, toxicity, faithfulness, completeness, coherence, bias
  • G-Eval Methodology: Chain-of-thought prompting for accurate scoring
  • Dashboard Evals: Run evals on production traces directly from the UI—no SDK needed
  • Model Comparison: Test multiple models on the same dataset
  • Custom Pipeline Support: Evaluate outputs from your own RAG or multi-agent systems
  • Auto-Upload: Results automatically sync to your dashboard
  • Fallom Datasets: Use datasets stored in Fallom or create them locally

Dashboard Evals (No Code Required)

If you’re already logging traces to Fallom, you can run evaluations directly from the dashboard without writing any code. Navigate to Evals Store → Evals to create evaluation configs that automatically sample and evaluate your production traces.

Creating an Eval Config

  1. Go to Evals Store → Evals in your dashboard
  2. Click New Config
  3. Configure your evaluation (the fields are sketched after this list):
    • Name: A descriptive name for your eval config
    • Sample Rate: Percentage of traces to evaluate (0.01% to 100%)
    • Judge Model: The LLM to use as the evaluator (e.g., openai/gpt-4o-mini)
    • Filter by Tags: Only evaluate traces with specific tags (optional)
    • Filter by Models: Only evaluate traces from specific models (optional)
    • Metrics: Select which metrics to run (answer relevancy, hallucination, etc.)
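
Taken together, an eval config is just a small bundle of settings. The sketch below is illustrative only (dashboard configs are created in the UI, not through the SDK); the keys mirror the fields above and the values are placeholders.

# Illustrative only: the same fields you set in the dashboard, written as a
# plain dict so the shape of a config is concrete.
eval_config = {
    "name": "prod-chat-quality",           # descriptive name
    "sample_rate": 5.0,                    # evaluate 5% of matching traces
    "judge_model": "openai/gpt-4o-mini",   # LLM used as the evaluator
    "filter_tags": ["customer-support"],   # optional: only traces with these tags
    "filter_models": ["openai/gpt-4o"],    # optional: only traces from these models
    "metrics": ["answer_relevancy", "hallucination", "toxicity"],
}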

Running Evaluations

Once you’ve created a config, click Run Now to start an evaluation. Fallom will:
  1. Sample recent traces from the last 15 minutes matching your filters (sampling behavior is sketched after this list)
  2. Queue them for evaluation using your selected judge model
  3. Score each trace against your chosen metrics
  4. Display results with per-metric scores and aggregated statistics
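
The sample rate is applied per trace: at 5%, each matching trace has roughly a one-in-twenty chance of being queued. A minimal sketch of that behavior, for illustration only (the real sampling happens server-side in Fallom):

import random

def should_evaluate(sample_rate_percent: float) -> bool:
    # Each matching trace is picked independently with probability sample_rate_percent / 100
    return random.random() * 100 < sample_rate_percent

picked = sum(should_evaluate(5.0) for _ in range(10_000))
print(f"sampled ~{picked / 100:.1f}% of 10,000 traces")  # close to 5%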

Viewing Results

Each evaluation run shows:
  • Sample Count: How many traces were evaluated
  • Scores: Average, min, and max scores for each metric
  • Individual Results: Click on a run to see detailed scores for each trace
  • Regression Detection: Automatic alerts when quality drops compared to previous runs (sketched below)
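
Regression detection conceptually compares a run's aggregate scores against the previous run. A rough sketch of the idea (the exact thresholding Fallom uses is not documented here; the 0.1 drop is an assumed value for illustration):

def is_regression(previous_avg: float, current_avg: float, max_drop: float = 0.1) -> bool:
    # Flag the run if the average metric score fell by more than max_drop
    return (previous_avg - current_avg) > max_drop

print(is_regression(previous_avg=0.88, current_avg=0.71))  # True: the average dropped by 0.17
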
Note: Dashboard evals require traces to be logged to Fallom first. Make sure you have tracing set up before running dashboard evals.

Quick Start

No code required? If you’re already logging traces to Fallom, you can skip the SDK setup and run evals directly from the dashboard.

Using the SDK

from fallom import evals

# Initialize (enables auto-upload to dashboard)
evals.init(api_key="your-fallom-api-key")

# Create a dataset
dataset = [
    evals.DatasetItem(
        input="What is the capital of France?",
        output="The capital of France is Paris.",
        system_message="You are a helpful assistant."
    ),
]

# Run evaluation - results auto-upload!
results = evals.evaluate(
    dataset=dataset,
    metrics=["answer_relevancy", "faithfulness", "completeness"]
)

Environment Variables

# Required
FALLOM_API_KEY=your-fallom-api-key     # For uploading results & fetching datasets
OPENROUTER_API_KEY=your-openrouter-key # For judge model & model comparison

Available Metrics

  • answer_relevancy: Does the response directly address the user’s question?
  • hallucination: Does the response contain fabricated information?
  • toxicity: Does the response contain harmful or offensive content?
  • faithfulness: Is the response factually accurate and consistent?
  • completeness: Does the response fully address all aspects of the request?
  • coherence: Is the response logically structured and easy to follow?
  • bias: Does the response contain unfair or prejudiced content?
All metrics return a score from 0.0 to 1.0, where higher is better (except hallucination, toxicity, and bias, where higher means more problematic content detected).
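
If you post-process scores yourself, keep that direction in mind. A small helper sketch; the 0.7 pass threshold is an assumption for illustration, not a Fallom default:

INVERTED_METRICS = {"hallucination", "toxicity", "bias"}  # higher = more problematic

def passes(metric: str, score: float, threshold: float = 0.7) -> bool:
    if metric in INVERTED_METRICS:
        return score <= 1.0 - threshold  # e.g. hallucination 0.20 passes
    return score >= threshold            # e.g. answer_relevancy 0.85 passes

print(passes("answer_relevancy", 0.85))  # True
print(passes("hallucination", 0.85))     # False: heavy fabrication detected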

Custom Metrics

Create your own evaluation metrics with custom criteria and evaluation steps:
from fallom import evals

evals.init()

# Define a custom metric
brand_metric = evals.CustomMetric(
    name="brand_alignment",
    criteria="Brand Alignment - Does the response follow brand voice guidelines?",
    steps=[
        "Check if the tone is professional yet friendly",
        "Verify no competitor brands are mentioned",
        "Ensure the response uses approved terminology",
        "Check for appropriate emoji usage (none in formal contexts)"
    ]
)

# Use with built-in metrics
results = evals.evaluate(
    dataset=dataset,
    metrics=["answer_relevancy", brand_metric]
)

Custom metrics use the same G-Eval methodology as built-in metrics - the LLM judge follows your steps and provides reasoning and a score.

Using Datasets from Fallom

Instead of creating datasets locally, you can use datasets stored in Fallom. Just pass the dataset key:
from fallom import evals

evals.init()

# Just pass the dataset key - it auto-fetches from Fallom!
results = evals.evaluate(
    dataset="my-dataset-key",  # Dataset key from Fallom
    metrics=["answer_relevancy", "faithfulness"]
)

Evaluating Custom Pipelines

If you have a complex LLM pipeline (RAG, multi-agent, custom routing), use EvaluationDataset to run your own pipeline and evaluate the outputs:
from fallom import evals

evals.init(api_key="your-fallom-api-key")

# Pull a dataset from Fallom
dataset = evals.EvaluationDataset()
dataset.pull("customer-support-qa")

# Run each input through YOUR pipeline
for golden in dataset.goldens:
    # Your custom pipeline (RAG, agent routing, etc.) returns the answer
    # plus any retrieved documents
    actual_output, retrieved_docs = my_rag_pipeline(golden.input)

    dataset.add_test_case(evals.LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        context=retrieved_docs  # Optional: for faithfulness eval
    ))

# Evaluate your outputs
results = evals.evaluate(
    test_cases=dataset.test_cases,
    metrics=["answer_relevancy", "faithfulness", "completeness"]
)

Auto-Generate Test Cases

For simpler pipelines, use generate_test_cases() to automatically run all inputs through your pipeline:
from fallom import evals

evals.init()

def my_pipeline(messages):
    """Your pipeline function"""
    user_query = messages[-1]["content"]
    response = my_rag_app(user_query)
    return {"content": response.text}

dataset = evals.EvaluationDataset()
dataset.pull("my-dataset")
dataset.generate_test_cases(my_pipeline)  # Runs all inputs

results = evals.evaluate(
    test_cases=dataset.test_cases,
    metrics=["answer_relevancy", "faithfulness"]
)

Model Comparison

Compare how different models perform on the same dataset:
from fallom import evals

evals.init()

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=["anthropic/claude-3.5-sonnet", "openai/gpt-4o", "google/gemini-2.0-flash"],
    metrics=["answer_relevancy", "faithfulness"],
    name="Model Comparison Q4 2024"
)

# Results show scores for each model:
# - production (your original outputs)
# - anthropic/claude-3.5-sonnet
# - openai/gpt-4o
# - google/gemini-2.0-flash

Custom & Fine-Tuned Models

You can include your own fine-tuned or self-hosted models in comparisons:
from fallom import evals

evals.init()

# Fine-tuned OpenAI model
fine_tuned = evals.create_openai_model(
    "ft:gpt-4o-2024-08-06:my-org::abc123",
    name="my-fine-tuned"
)

# Self-hosted model (vLLM, Ollama, etc.)
my_llama = evals.create_custom_model(
    name="my-llama-70b",
    endpoint="http://localhost:8000/v1/chat/completions",
    model_value="meta-llama/Llama-3.1-70B-Instruct"
)

# Compare all together
comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=[fine_tuned, my_llama, "openai/gpt-4o"],
    metrics=["answer_relevancy", "faithfulness"]
)

Custom Judge Model

By default, evaluations use openai/gpt-4o-mini via OpenRouter as the judge. You can specify a different judge:
results = evals.evaluate(
    dataset="my-dataset-key",
    metrics=["answer_relevancy"],
    judge_model="anthropic/claude-3.5-sonnet"  # Use Claude as judge
)

API Reference

evaluate()

Evaluate outputs against metrics. Use either dataset or test_cases.
  • dataset (list[DatasetItem] or str): Dataset items or Fallom dataset key
  • test_cases (list[LLMTestCase]): Test cases from EvaluationDataset
  • metrics (list[str | CustomMetric], default: all built-in): Metrics to run (built-in or custom)
  • judge_model (str, default: "openai/gpt-4o-mini"): Model to use as judge
  • name (str, default: auto-generated): Name for this evaluation run
  • description (str, default: None): Optional description
  • verbose (bool, default: True): Print progress

Either dataset or test_cases must be provided.
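
A call that exercises the optional parameters (values are placeholders):

results = evals.evaluate(
    dataset="my-dataset-key",
    metrics=["answer_relevancy", "coherence"],
    judge_model="openai/gpt-4o-mini",
    name="support-bot-nightly",
    description="Nightly eval of the support bot dataset",
    verbose=False,
)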

compare_models() / compareModels()

Compare multiple models on the same dataset.
  • dataset (list[DatasetItem] or str, required): Dataset items or Fallom dataset key
  • models (list[str or Model], required): Models to compare
  • metrics (list[str | CustomMetric], default: all built-in): Metrics to run (built-in or custom)
  • judge_model (str, default: "openai/gpt-4o-mini"): Model to use as judge
  • include_production (bool, default: True): Include original outputs
  • name (str, default: auto-generated): Name for this comparison run
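
For example, to score candidate models without re-evaluating your original production outputs (values are placeholders):

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=["openai/gpt-4o", "anthropic/claude-3.5-sonnet"],
    metrics=["answer_relevancy"],
    include_production=False,  # skip the original production outputs
)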

Model Helpers

  • create_openai_model() / createOpenAIModel(): Create a model for fine-tuned OpenAI or Azure OpenAI
  • create_custom_model() / createCustomModel(): Create a model for any OpenAI-compatible endpoint
  • create_model_from_callable() / createModelFromCallable(): Create a model from a custom function
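
create_model_from_callable() has no example elsewhere on this page, so the sketch below is an assumption: it guesses that the wrapped callable follows the same messages-in, {"content": ...}-out contract used by generate_test_cases() above, and the argument names are hypothetical. Check the SDK reference for the exact signature.

# Hypothetical sketch only: argument names and order are assumptions.
def call_my_model(messages):
    user_query = messages[-1]["content"]
    answer = my_in_house_model.generate(user_query)  # my_in_house_model is a placeholder
    return {"content": answer}

in_house = evals.create_model_from_callable("in-house-model", call_my_model)

# The wrapped model could then be compared like any other:
comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=[in_house, "openai/gpt-4o"],
    metrics=["answer_relevancy"],
)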

Metric Helpers

  • custom_metric() / customMetric(): Create a custom evaluation metric with G-Eval
  • CustomMetric: Class/interface for defining custom metrics

EvaluationDataset

A class for managing datasets and test cases when using your own LLM pipeline.
dataset = evals.EvaluationDataset()

# Methods
dataset.pull(alias, version=None)     # Pull dataset from Fallom
dataset.add_golden(golden)            # Add a golden record
dataset.add_test_case(test_case)      # Add a test case
dataset.generate_test_cases(llm_app)  # Auto-generate test cases
dataset.clear_test_cases()            # Clear all test cases

# Properties
dataset.goldens      # List of Golden records
dataset.test_cases   # List of LLMTestCase records
dataset.dataset_key  # Fallom dataset key (if pulled)

LLMTestCase

A test case for evaluation containing input and actual output from your LLM.
  • input (str): The user input/query
  • actual_output (str): Output from your LLM pipeline
  • expected_output (str, optional): Expected output for comparison
  • system_message (str, optional): System prompt used
  • context (list[str], optional): Retrieved docs for RAG eval
  • metadata (dict, optional): Additional metadata
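
A fully populated test case, tying the fields above together (values are placeholders):

test_case = evals.LLMTestCase(
    input="How do I reset my password?",
    actual_output="Go to Settings > Security and click 'Reset password'.",
    expected_output="Explain the reset flow under Settings > Security.",  # optional
    system_message="You are a support assistant.",                        # optional
    context=["Help Center: Resetting your password..."],                  # optional, for RAG eval
    metadata={"pipeline_version": "2024-11-01"},                          # optional
)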

Golden

A golden record from a dataset containing input and optionally expected output.
  • input (str): The user input/query
  • expected_output (str, optional): The expected/golden output
  • system_message (str, optional): System prompt
  • context (list[str], optional): Context documents
  • metadata (dict, optional): Additional metadata
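
To build a dataset locally from golden records, something like the sketch below should work; it assumes Golden is exposed as evals.Golden (only the field names are confirmed by the table above).

# Assumes evals.Golden exists with these keyword fields; values are placeholders.
dataset = evals.EvaluationDataset()
dataset.add_golden(evals.Golden(
    input="What plans do you offer?",
    expected_output="We offer Free, Pro, and Enterprise plans.",  # optional
    system_message="You are a sales assistant.",                  # optional
))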

Next Steps