llm-evaluation-framework

Here are 47 public repositories matching this topic...

confident-ai / deepeval

The LLM Evaluation Framework

python evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Mar 4, 2026
Python

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Mar 4, 2026
TypeScript

msoedov / agentic_security

Star

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

agent-framework ai-red-team prompt-testing llm-security llm-vulnerabilities llm-evaluation llm-fuzzing llm-evaluation-framework llm-guardrails llm-scanner llm-jailbreaks llm-fuzzer llm-fuzzer-aggregator agent-security

Updated Feb 3, 2026
Python

rhesis-ai / rhesis

Star

Open-source platform & SDK for testing LLM and agentic apps. Define expected behavior, generate and run test scenarios, and review failures collaboratively.

open-source test-generation quality-assessment test-management test-execution responsible-ai trustworthy-ai generative-ai llmops llm-evaluation llm-evaluation-framework

Updated Mar 3, 2026
Python

JinjieNi / MixEval

Star

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture mixeval

Updated Nov 10, 2024
Python

cvs-health / langfair

Star

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

python ai artificial-intelligence bias fairness ai-safety fairness-testing bias-detection fairness-ai fairness-ml responsible-ai ethical-ai large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Jan 9, 2026
Python

Addepto / contextcheck

Star

MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via YAML and integrable into CI pipelines for automated testing.

open-source ci testing-tools chatbot-framework testing-framework chatbot-testing rag ai-chat large-language-models llm ai-testing llm-evaluation llm-evaluation-framework prompt-test llm-testing ai-testing-tool generative-ai-testing rag-testing summarization-testing

Updated Dec 11, 2024
Python

parea-ai / parea-sdk-py

Star

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Feb 13, 2025
Python

zli12321 / qa_metrics

Star

An easy python package to run quick basic QA evaluations. This package includes standardized QA evaluation metrics and semantic evaluation metrics: Black-box and Open-Source large language model prompting and evaluation, exact match, F1 Score, PEDANT semantic match, transformer match. Our package also supports prompting OPENAI and Anthropic API.

qa-automation-test rl-training llm exact-matching llm-evaluation llm-evaluation-toolkit llm-evaluation-framework reward-modeling

Updated Jul 18, 2025
Python

multinear / multinear

Star

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Sep 2, 2025
Python

flexpa / llm-fhir-eval

Star

Benchmarking Large Language Models for FHIR

fhir fhirpath fhir-resources llm evals llm-evaluation-framework fhir-llm

Updated Feb 4, 2026
TypeScript

zhuohaoyu / KIEval

Star

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

machine-learning explainable-ai llm llm-evaluation llm-evaluation-toolkit llm-evaluation-framework llm-evaluation-metrics acl2024

Updated Jul 19, 2024
Python

vero-labs-ai / vero-eval

Star

Open source framework for evaluating AI Agents

python testing evaluation datasets dataset-generation evaluation-metrics evaluation-framework testing-framework testing-library synthetic-dataset-generation user-persona evals llm-evaluation rag-evaluation llm-evaluation-framework langgraph rag-testing

Updated Feb 24, 2026
Python

aws-samples / fm-leaderboarder

Star

FM-Leaderboard-er allows you to create leaderboard to find the best LLM/prompt for your own business use case based on your data, task, prompts

llm-evaluation llm-evaluation-framework llm-benchmarking

Updated Oct 31, 2024
Python

honeyhiveai / realign

Star

Realign is a testing and simulation framework for AI applications.

ai simulation evaluation alignment red-teaming rag prompt-engineering llms llmops llm-eval llm-evaluation aiengineering llm-evaluation-framework

Updated Dec 4, 2024
Python

dokimos-dev / dokimos

Star

Evaluation Framework for LLM applications in Java and Kotlin

Updated Mar 1, 2026
Java

Human-Centric-Machine-Learning / prediction-powered-ranking

Star

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

ranking-algorithm llm-eval llm-evaluation llm-evaluation-framework prediction-powered-inference rank-sets

Updated Oct 28, 2024
Jupyter Notebook

ronniross / confidence-scorer

Sponsor

Star

Measure of estimated confidence for non-hallucinative nature of outputs generated by Transformer-based Language Models.

dataset datasets llm llms llm-training llm-evaluation llms-reasoning llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework llm-evaluation-metrics llms-efficency llms-evalution

Updated Feb 26, 2026
Python

pyladiesams / eval-llm-based-apps-jan2025

Star

Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

workshop llm llms llmops llm-eval llm-test llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-testing llm-evals

Updated May 6, 2025
Jupyter Notebook

petmal / MindTrial

Star

MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV reports.

Updated Feb 28, 2026
Go

Improve this page

Add a description, image, and links to the llm-evaluation-framework topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the llm-evaluation-framework topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm-evaluation-framework

Here are 47 public repositories matching this topic...

confident-ai / deepeval

promptfoo / promptfoo

msoedov / agentic_security

rhesis-ai / rhesis

JinjieNi / MixEval

cvs-health / langfair

Addepto / contextcheck

parea-ai / parea-sdk-py

zli12321 / qa_metrics

multinear / multinear

flexpa / llm-fhir-eval

zhuohaoyu / KIEval

vero-labs-ai / vero-eval

aws-samples / fm-leaderboarder

honeyhiveai / realign

dokimos-dev / dokimos

Human-Centric-Machine-Learning / prediction-powered-ranking

ronniross / confidence-scorer

pyladiesams / eval-llm-based-apps-jan2025

petmal / MindTrial

Improve this page

Add this topic to your repo