Features
- 5 Built-in Metrics: Answer relevancy, hallucination, toxicity, faithfulness, completeness
- G-Eval Methodology: Chain-of-thought prompting for accurate scoring
- Model Comparison: Test multiple models on the same dataset
- Auto-Upload: Results automatically sync to your dashboard
- Fallom Datasets: Use datasets stored in Fallom or create them locally
Quick Start
- Python
- TypeScript
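Below is a minimal Python sketch of an end-to-end run. The module path fallom.evals and the DatasetItem field names (input, output) are assumptions rather than confirmed SDK names; the TypeScript flow is analogous.

```python
# Minimal sketch -- the module path and DatasetItem field names are assumptions.
from fallom.evals import DatasetItem, evaluate

# A small in-memory dataset of production outputs to score.
dataset = [
    DatasetItem(
        input="What is the capital of France?",    # user's question (assumed field name)
        output="The capital of France is Paris.",  # model's response (assumed field name)
    ),
]

# Run all five built-in metrics; results auto-upload to your Fallom dashboard.
results = evaluate(
    dataset=dataset,
    name="quick-start-run",
)
```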
Environment Variables
Available Metrics
| Metric | Description |
|---|---|
| answer_relevancy | Does the response directly address the user’s question? |
| hallucination | Does the response contain fabricated information? |
| toxicity | Does the response contain harmful or offensive content? |
| faithfulness | Is the response factually accurate and consistent? |
| completeness | Does the response fully address all aspects of the request? |
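Metrics are selected by the names in the left column. Continuing the quick-start sketch above, a hedged example of running only two of them via the metrics parameter documented under evaluate():

```python
# Sketch: run only a subset of the built-in metrics by name.
results = evaluate(
    dataset=dataset,  # the in-memory dataset built in the quick-start sketch
    metrics=["answer_relevancy", "hallucination"],
)
```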
Using Datasets from Fallom
Instead of creating datasets locally, you can use datasets stored in Fallom. Just pass the dataset key:
- Python
- TypeScript
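A hedged Python sketch, assuming evaluate() accepts a string dataset key in place of a list of items; the key value and module path are illustrative:

```python
# Sketch: pass a Fallom dataset key (illustrative value) instead of local items.
from fallom.evals import evaluate  # module path is an assumption

results = evaluate(
    dataset="support-tickets-prod",  # illustrative key of a dataset stored in Fallom
)
```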
Model Comparison
Compare how different models perform on the same dataset:
- Python
- TypeScript
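A hedged Python sketch of a comparison run; the module path and model identifiers are illustrative:

```python
# Sketch: compare two models on the same dataset.
from fallom.evals import compare_models  # module path is an assumption

comparison = compare_models(
    dataset="support-tickets-prod",  # local items or a Fallom dataset key
    models=["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"],  # illustrative ids
    include_production=True,         # also score the original production outputs
)
```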
Custom & Fine-Tuned Models
You can include your own fine-tuned or self-hosted models in comparisons:
- Python
- TypeScript
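A hedged Python sketch combining the create_openai_model() and create_custom_model() helpers (see Model Helpers below) with compare_models(); the keyword argument names are assumptions:

```python
# Sketch: mix a fine-tuned OpenAI model and a self-hosted OpenAI-compatible
# endpoint into a comparison. Keyword names below are assumptions.
import os

from fallom.evals import (  # module path is an assumption
    compare_models,
    create_custom_model,
    create_openai_model,
)

finetuned = create_openai_model(
    model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # illustrative fine-tune id
    api_key=os.environ["OPENAI_API_KEY"],
)

self_hosted = create_custom_model(
    model="my-llama-3-8b",                # illustrative model name
    base_url="http://localhost:8000/v1",  # any OpenAI-compatible endpoint
    api_key="unused",
)

comparison = compare_models(
    dataset="support-tickets-prod",
    models=["openai/gpt-4o-mini", finetuned, self_hosted],
)
```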
Custom Judge Model
By default, evaluations use openai/gpt-4o-mini via OpenRouter as the judge. You can specify a different judge:
- Python
- TypeScript
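A hedged Python sketch of overriding the judge; the model identifier is an illustrative OpenRouter id:

```python
# Sketch: override the default judge model.
from fallom.evals import evaluate  # module path is an assumption

results = evaluate(
    dataset="support-tickets-prod",             # illustrative Fallom dataset key
    judge_model="anthropic/claude-3.5-sonnet",  # illustrative OpenRouter model id
)
```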
API Reference
evaluate()
Evaluate production outputs against metrics.
- Python
- TypeScript
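A hedged Python sketch of a fully specified call; the parameters mirror the table below, while the module path, dataset key, and run name are illustrative:

```python
# Sketch of a fully specified evaluate() call; parameters mirror the table below.
from fallom.evals import evaluate  # module path is an assumption

results = evaluate(
    dataset="support-tickets-prod",            # list[DatasetItem] or a Fallom dataset key
    metrics=["answer_relevancy", "toxicity"],  # defaults to all five metrics
    judge_model="openai/gpt-4o-mini",
    name="nightly-eval",
    description="Nightly evaluation of production outputs",
    verbose=True,
)
```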
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset | list[DatasetItem] or str | required | Dataset items or Fallom dataset key |
| metrics | list[str] | all metrics | Metrics to run |
| judge_model | str | "openai/gpt-4o-mini" | Model to use as judge |
| name | str | auto-generated | Name for this evaluation run |
| description | str | None | Optional description |
| verbose | bool | True | Print progress |
compare_models() / compareModels()
Compare multiple models on the same dataset.
- Python
- TypeScript
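A hedged Python sketch of a fully specified call; the parameters mirror the table below, and the module path and model identifiers are illustrative:

```python
# Sketch of a fully specified compare_models() call; parameters mirror the table below.
from fallom.evals import compare_models  # module path is an assumption

comparison = compare_models(
    dataset="support-tickets-prod",
    models=["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"],
    metrics=["faithfulness", "completeness"],
    judge_model="openai/gpt-4o-mini",
    include_production=True,  # also score the original production outputs
    name="gpt-vs-claude",
)
```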
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset | list[DatasetItem] or str | required | Dataset items or Fallom dataset key |
| models | list[str or Model] | required | Models to compare |
| metrics | list[str] | all metrics | Metrics to run |
| judge_model | str | "openai/gpt-4o-mini" | Model to use as judge |
| include_production | bool | True | Include original outputs |
| name | str | auto-generated | Name for this comparison run |
Model Helpers
| Function | Description |
|---|---|
| create_openai_model() / createOpenAIModel() | Create model for fine-tuned OpenAI or Azure OpenAI |
| create_custom_model() / createCustomModel() | Create model for any OpenAI-compatible endpoint |
| create_model_from_callable() / createModelFromCallable() | Create model from custom function |
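For create_model_from_callable(), a hedged Python sketch of wrapping an arbitrary generation function; the keyword names and the callable contract (prompt string in, completion string out) are assumptions:

```python
# Sketch: wrap an arbitrary generation function as a model for compare_models().
from fallom.evals import create_model_from_callable  # module path is an assumption

def generate(prompt: str) -> str:
    # Call your own inference stack here; a canned reply keeps the sketch self-contained.
    return "Placeholder completion for: " + prompt

in_house = create_model_from_callable(
    name="in-house-model",  # assumed keyword
    fn=generate,            # assumed keyword
)
```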

