Fallom Evals lets you evaluate your LLM outputs using G-Eval, an LLM-as-judge approach. You can run evaluations locally with our SDK, or directly from the Fallom dashboard on your production traces (no code required).
If you’re already logging traces to Fallom, you can run evaluations directly from the dashboard without writing any code. Navigate to Evals Store → Evals to create evaluation configs that automatically sample and evaluate your production traces.
```python
from fallom import evals

# Initialize (enables auto-upload to dashboard)
evals.init(api_key="your-fallom-api-key")

# Create a dataset
dataset = [
    evals.DatasetItem(
        input="What is the capital of France?",
        output="The capital of France is Paris.",
        system_message="You are a helpful assistant.",
    ),
]

# Run evaluation - results auto-upload!
results = evals.evaluate(
    dataset=dataset,
    metrics=["answer_relevancy", "faithfulness", "completeness"],
)
```
```typescript
import fallom from "@fallom/trace";

// Initialize (enables auto-upload to dashboard)
fallom.evals.init({ apiKey: "your-fallom-api-key" });

// Create a dataset
const dataset = [
  {
    input: "What is the capital of France?",
    output: "The capital of France is Paris.",
    systemMessage: "You are a helpful assistant."
  }
];

// Run evaluation - results auto-upload!
const results = await fallom.evals.evaluate({
  dataset,
  metrics: ["answer_relevancy", "faithfulness", "completeness"]
});
```
```bash
# Required
FALLOM_API_KEY=your-fallom-api-key     # For uploading results & fetching datasets
OPENROUTER_API_KEY=your-openrouter-key # For judge model & model comparison
```
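The later examples call `evals.init()` with no arguments, which presumably picks the key up from the environment. A minimal sketch of that pattern, assuming you use `python-dotenv` (not required by the SDK) to load a local `.env` file:

```python
from dotenv import load_dotenv  # assumption: python-dotenv loads the .env file
from fallom import evals

load_dotenv()  # reads FALLOM_API_KEY and OPENROUTER_API_KEY into the environment

# No explicit api_key needed once FALLOM_API_KEY is set
evals.init()
```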
The built-in metrics:

| Metric | What it evaluates |
| --- | --- |
| `answer_relevancy` | Does the response directly address the user's question? |
| `hallucination` | Does the response contain fabricated information? |
| `toxicity` | Does the response contain harmful or offensive content? |
| `faithfulness` | Is the response factually accurate and consistent? |
| `completeness` | Does the response fully address all aspects of the request? |
| `coherence` | Is the response logically structured and easy to follow? |
| `bias` | Does the response contain unfair or prejudiced content? |
All metrics return a score from 0.0 to 1.0, where higher is better (except hallucination, toxicity, and bias, where higher means more problematic content detected).
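A simple pass/fail gate therefore has to invert those three metrics before thresholding. Here is a minimal sketch of that logic; the per-metric result shape (name/score pairs) is an assumption for illustration, not the SDK's documented return type:

```python
INVERTED = {"hallucination", "toxicity", "bias"}  # higher score = more problematic

def passed(metric: str, score: float, threshold: float = 0.7) -> bool:
    """Return True when the metric's 'goodness' clears the threshold."""
    goodness = 1.0 - score if metric in INVERTED else score
    return goodness >= threshold

# Hypothetical scores for illustration
for metric, score in [("answer_relevancy", 0.91), ("hallucination", 0.08)]:
    print(metric, "PASS" if passed(metric, score) else "FAIL")
```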
Create your own evaluation metrics with custom criteria and evaluation steps:
Python
TypeScript
```python
from fallom import evals

evals.init()

# Define a custom metric
brand_metric = evals.CustomMetric(
    name="brand_alignment",
    criteria="Brand Alignment - Does the response follow brand voice guidelines?",
    steps=[
        "Check if the tone is professional yet friendly",
        "Verify no competitor brands are mentioned",
        "Ensure the response uses approved terminology",
        "Check for appropriate emoji usage (none in formal contexts)",
    ],
)

# Use with built-in metrics
results = evals.evaluate(
    dataset=dataset,
    metrics=["answer_relevancy", brand_metric],
)
```
```typescript
import fallom from "@fallom/trace";

fallom.evals.init({});

// Define a custom metric
const brandMetric = {
  name: "brand_alignment",
  criteria: "Brand Alignment - Does the response follow brand voice guidelines?",
  steps: [
    "Check if the tone is professional yet friendly",
    "Verify no competitor brands are mentioned",
    "Ensure the response uses approved terminology",
    "Check for appropriate emoji usage (none in formal contexts)"
  ]
};

// Use with built-in metrics
const results = await fallom.evals.evaluate({
  dataset,
  metrics: ["answer_relevancy", brandMetric]
});
```
Custom metrics use the same G-Eval methodology as built-in metrics: the LLM judge follows your steps and provides reasoning and a score.
Instead of creating datasets locally, you can use datasets stored in Fallom. Just pass the dataset key:
Python
TypeScript
```python
from fallom import evals

evals.init()

# Just pass the dataset key - it auto-fetches from Fallom!
results = evals.evaluate(
    dataset="my-dataset-key",  # Dataset key from Fallom
    metrics=["answer_relevancy", "faithfulness"],
)
```
```typescript
import fallom from "@fallom/trace";

fallom.evals.init({});

// Just pass the dataset key - it auto-fetches from Fallom!
const results = await fallom.evals.evaluate({
  dataset: "my-dataset-key", // Dataset key from Fallom
  metrics: ["answer_relevancy", "faithfulness"]
});
```
If you have a complex LLM pipeline (RAG, multi-agent, custom routing), use `EvaluationDataset` to run your own pipeline and evaluate the outputs:
Python
TypeScript
```python
from fallom import evals

evals.init(api_key="your-fallom-api-key")

# Pull a dataset from Fallom
dataset = evals.EvaluationDataset()
dataset.pull("customer-support-qa")

# Run each input through YOUR pipeline
for golden in dataset.goldens:
    # Your custom pipeline (RAG, agent routing, etc.)
    # Assumption: it returns both the answer and the retrieved docs
    actual_output, retrieved_docs = my_rag_pipeline(golden.input)
    dataset.add_test_case(evals.LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        context=retrieved_docs,  # Optional: for faithfulness eval
    ))

# Evaluate your outputs
results = evals.evaluate(
    test_cases=dataset.test_cases,
    metrics=["answer_relevancy", "faithfulness", "completeness"],
)
```
```typescript
import fallom from "@fallom/trace";

fallom.evals.init({ apiKey: "your-fallom-api-key" });

// Pull a dataset from Fallom
const dataset = new fallom.evals.EvaluationDataset();
await dataset.pull("customer-support-qa");

// Run each input through YOUR pipeline
for (const golden of dataset.goldens) {
  // Your custom pipeline (RAG, agent routing, etc.)
  // Assumption: it returns both the answer and the retrieved docs
  const { output: actualOutput, docs: retrievedDocs } = await myRAGPipeline(golden.input);
  dataset.addTestCase({
    input: golden.input,
    actualOutput,
    context: retrievedDocs // Optional: for faithfulness eval
  });
}

// Evaluate your outputs
const results = await fallom.evals.evaluate({
  testCases: dataset.testCases,
  metrics: ["answer_relevancy", "faithfulness", "completeness"]
});
```
Compare how different models perform on the same dataset:
Python
```python
from fallom import evals

evals.init()

comparison = evals.compare_models(
    dataset="my-dataset-key",
    models=["anthropic/claude-3.5-sonnet", "openai/gpt-4o", "google/gemini-2.0-flash"],
    metrics=["answer_relevancy", "faithfulness"],
    name="Model Comparison Q4 2024",
)

# Results show scores for each model:
# - production (your original outputs)
# - anthropic/claude-3.5-sonnet
# - openai/gpt-4o
# - google/gemini-2.0-flash
```
A class for managing datasets and test cases when using your own LLM pipeline.
Python
TypeScript
```python
dataset = evals.EvaluationDataset()

# Methods
dataset.pull(alias, version=None)     # Pull dataset from Fallom
dataset.add_golden(golden)            # Add a golden record
dataset.add_test_case(test_case)      # Add a test case
dataset.generate_test_cases(llm_app)  # Auto-generate test cases
dataset.clear_test_cases()            # Clear all test cases

# Properties
dataset.goldens      # List of Golden records
dataset.test_cases   # List of LLMTestCase records
dataset.dataset_key  # Fallom dataset key (if pulled)
```
```typescript
const dataset = new fallom.evals.EvaluationDataset();

// Methods
await dataset.pull(alias, version?)      // Pull dataset from Fallom
dataset.addGolden(golden)                // Add a golden record
dataset.addTestCase(testCase)            // Add a test case
await dataset.generateTestCases(llmApp)  // Auto-generate test cases
dataset.clearTestCases()                 // Clear all test cases

// Properties
dataset.goldens     // Array of Golden records
dataset.testCases   // Array of LLMTestCase records
dataset.datasetKey  // Fallom dataset key (if pulled)
```
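Since `generate_test_cases` is listed above but not demonstrated, here is a sketch of how it might fit the pull-then-evaluate flow. The assumption (not confirmed by this section) is that `llm_app` is any callable mapping a golden's input string to an output string, and that the generated test cases land in `dataset.test_cases`:

```python
from fallom import evals

evals.init()

dataset = evals.EvaluationDataset()
dataset.pull("customer-support-qa")

def llm_app(user_input: str) -> str:
    # Stand-in for your real pipeline; return your model's answer here
    return f"(answer to: {user_input})"

# Assumption: runs llm_app over each golden's input to build test cases
dataset.generate_test_cases(llm_app)

results = evals.evaluate(
    test_cases=dataset.test_cases,
    metrics=["answer_relevancy", "faithfulness"],
)
```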