Case Study
Eval Studio
Which prompt, which model, at what cost?
A browser-based LLM evaluation tool. Test prompts and models against your own data with a multi-model judge council, per-row cost tracking, and a ranked leaderboard. BYO API keys, nothing stored.
Multi-provider
Anthropic, OpenAI, Gemini
Judge Council
Cross-provider scoring
Cost Tracking
Per-row, per-config
Live
eval.harshit.ai
Shipped in 2 Hours
Side project
At a Glance
Type: AI Eval Tool
Domain: LLM Infra
Status: Live
Stack: Next.js + TS
The Problem
I Needed This Myself
I was designing an AI interview agent and hit a dilemma: which model should power it, and which prompt would actually perform? I needed answers before committing either to the agent pipeline. I knew the options; what mattered for the product was choosing the right one.
I tried comparing outputs manually across Claude, GPT, and Gemini. It was slow, messy, and impossible to keep track of. A tool for this should have existed, so I built one as a side project and shipped Eval Studio in two hours.
When I talked to other founders and PMs building Gen AI and agentic AI products, they described the same problem. Everyone was manually iterating through models and prompts, copying outputs into spreadsheets, and losing track of which version performed better. It felt productive, but it was productive procrastination: no rigor, no reproducibility, no cost visibility.
Market Context
What Exists and Where It Falls Short
LLM evaluation is not a new idea. The tools exist. But they are all built for users other than the PM or founder who just needs to pick a model and a prompt before building.
Prompt Cannon
Casual model comparison
One prompt at a time, side-by-side outputs. No dataset, no scoring rubric, no cost tracking. Good for curiosity, not for product decisions.
Promptfoo (acquired by OpenAI)
Developer CLI tool
Powerful but requires Node.js, YAML config files, and terminal setup. Pivoted heavily toward red teaming and security testing. Not built for quick prompt/model comparison.
Braintrust
Enterprise eval platform
Full observability suite with CI/CD integration, tracing, and experiment tracking. Pro plan at $249/month. Built for teams with production AI, not for a PM prototyping an agent.
LangSmith
LangChain ecosystem
Tightly coupled to LangChain. Strong for teams already in that ecosystem, but requires SDK integration. Not model-agnostic in practice.
Google LLM Comparator
Research tool
Python library that ingests JSON files. Built for ML researchers comparing Gemma model versions, not for PMs comparing prompt strategies across providers.
Langfuse
Open-source observability
Excellent for production logging and tracing. Eval features exist but are secondary to observability. Requires self-hosting or cloud account setup.
The gap is not eval tooling. The gap is zero-setup eval tooling. Every tool above requires CLI installation, SDK integration, cloud accounts, or Python scripting before you can compare a single prompt. Eval Studio is a URL. Open it, paste your API keys, upload a CSV, and get scored results in minutes. That is the product bet: the fastest path from question to answer for anyone choosing a model or prompt.
Use Cases
Two Entry Points, Two Decisions
Test Models
Same prompt, different models. Upload your dataset, write one system prompt, pick 2-4 models across Anthropic, OpenAI, and Gemini. See which model serves your data best and at what cost.
Test Prompts
Same model, different prompts. Upload your dataset, pick one model, write 2-4 system prompts. See which prompt produces better outputs on your actual data.
Workflow
How It Works
Step 1
Enter API keys
Bring your own keys for Anthropic, OpenAI, or Google Gemini. Nothing stored, nothing logged. Keys live in React state only.
Step 2
Upload a dataset
Upload a CSV (max 50 rows) or use the built-in 50-row sample dataset covering 8 task categories.
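A minimal sketch of how the 50-row cap might be enforced when parsing the upload in the browser. The helper below is illustrative, not the production parser, and uses a naive comma split:

```typescript
// Illustrative sketch only: parse an uploaded CSV client-side and
// enforce the 50-row cap. A real parser must also handle quoted
// fields and embedded commas.
const MAX_ROWS = 50;

type DatasetRow = Record<string, string>;

function parseCsv(text: string): DatasetRow[] {
  const [headerLine, ...lines] = text.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim());
  return lines
    .filter((line) => line.length > 0)
    .slice(0, MAX_ROWS) // drop anything past the cap
    .map((line) => {
      const cells = line.split(",");
      return Object.fromEntries(
        headers.map((h, i) => [h, cells[i] ?? ""] as [string, string])
      );
    });
}
```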
Step 3
Configure 2-4 prompt/model combinations
Mix and match models and system prompts. Test Models mode locks the prompt; Test Prompts mode locks the model.
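The configuration shape is roughly this. A hypothetical sketch; the type names and validation messages are assumptions, not the actual data model:

```typescript
// Hypothetical shapes for an eval run (2-4 configs).
type Provider = "anthropic" | "openai" | "gemini";
type EvalMode = "test-models" | "test-prompts";

interface EvalConfig {
  id: string;
  provider: Provider;
  model: string;        // a model ID for the chosen provider
  systemPrompt: string;
}

// Returns an error message, or null when the set of configs is valid.
function validateConfigs(mode: EvalMode, configs: EvalConfig[]): string | null {
  if (configs.length < 2 || configs.length > 4) return "Choose 2-4 configurations.";
  const prompts = new Set(configs.map((c) => c.systemPrompt));
  const models = new Set(configs.map((c) => c.model));
  if (mode === "test-models" && prompts.size > 1)
    return "Test Models mode uses one shared prompt.";
  if (mode === "test-prompts" && models.size > 1)
    return "Test Prompts mode uses one shared model.";
  return null;
}
```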
Step 4
Define a scoring rubric
Named criteria, weights, and colors. The rubric tells the judge council exactly what to evaluate and how to weight it.
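A sketch of what a rubric and its weighted score could look like. The names and shapes here are assumptions for illustration:

```typescript
// Hypothetical rubric shape: named criteria with weights and a display color.
interface RubricCriterion {
  name: string;   // e.g. "Accuracy"
  weight: number; // relative weight, normalized below
  color: string;  // used only in the results UI
}

// Weighted score for one row, given a judge's per-criterion scores (0-100).
function weightedScore(
  rubric: RubricCriterion[],
  scores: Record<string, number>
): number {
  const totalWeight = rubric.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  return rubric.reduce(
    (sum, c) => sum + (scores[c.name] ?? 0) * (c.weight / totalWeight),
    0
  );
}
```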
Step 5
Choose a judge council
1-2 models, cross-provider recommended to reduce bias. A bias warning appears when any judge provider matches any config provider.
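The bias check itself is simple. A sketch, reusing the illustrative shapes from the config sketch above:

```typescript
// Sketch of the bias check: true when any judge's provider also appears
// among the configs under test (a judge scoring its own provider's output).
function hasJudgeBias(judgeProviders: Provider[], configs: EvalConfig[]): boolean {
  const configProviders = new Set(configs.map((c) => c.provider));
  return judgeProviders.some((p) => configProviders.has(p));
}
```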
Step 6
Run the eval
Each row is scored independently by each judge, scores averaged. Rows where judges disagree by more than 15 points are flagged as outliers.
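In code, the per-row aggregation might look like this illustrative sketch of the averaging and the 15-point outlier rule:

```typescript
// Illustrative per-row aggregation: average the judges' scores and flag
// the row as an outlier when the judges disagree by more than 15 points.
const OUTLIER_THRESHOLD = 15;

function aggregateRow(judgeScores: number[]): { score: number; outlier: boolean } {
  const score = judgeScores.reduce((a, b) => a + b, 0) / judgeScores.length;
  const spread = Math.max(...judgeScores) - Math.min(...judgeScores);
  return { score, outlier: spread > OUTLIER_THRESHOLD };
}
```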
Step 7
Get ranked results
A ranked leaderboard with per-row scores, cost breakdown, outlier flags, and CSV export.
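Ranking is then a sort over mean row scores. An illustrative sketch (the result shape is an assumption):

```typescript
// Illustrative ranking step: order configs by mean row score, descending.
interface ConfigResult {
  configId: string;
  rowScores: number[]; // averaged judge scores, one per dataset row
  totalCostUsd: number;
}

function rankLeaderboard(results: ConfigResult[]): ConfigResult[] {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return [...results].sort((a, b) => mean(b.rowScores) - mean(a.rowScores));
}
```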
Stack
Next.js, TypeScript, Anthropic API, OpenAI API, Gemini API, Vercel
Design Decisions
Why It Works This Way
Judge council over single judge
A single model judging its own provider's outputs shows a documented self-preference bias. Eval Studio uses two independent judges from different providers, averages their scores, and flags rows where judges disagree by more than 15 points. A bias warning appears when any judge provider matches any config provider.
BYO key, no backend storage
Users bring their own API keys. Keys live in React state only, never stored, never logged. API calls go through a server-side proxy to handle CORS, but nothing is persisted. This builds trust and eliminates server cost.
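A minimal sketch of what such a stateless pass-through could look like as a Next.js App Router route handler, shown for the Anthropic Messages API. The route path and request body shape are assumptions:

```typescript
// app/api/proxy/anthropic/route.ts -- hypothetical route path.
// The user's key arrives with each request, is forwarded upstream,
// and is never logged or written anywhere server-side.
export async function POST(req: Request) {
  const { apiKey, payload } = await req.json(); // request shape is an assumption
  const upstream = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify(payload),
  });
  // Relay the upstream response as-is; nothing is persisted.
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "content-type": upstream.headers.get("content-type") ?? "application/json",
    },
  });
}
```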
Cost tracking as a first-class metric
Every eval shows total tokens, total cost, and average cost per row per config. Pricing is hardcoded from provider docs with a last-updated timestamp. Cost is a real selection criterion alongside quality.
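A sketch of the pricing-table approach. The model key and dollar figures below are placeholders, not current provider rates:

```typescript
// Sketch of a hardcoded pricing table with a last-updated marker.
const PRICING_LAST_UPDATED = "2025-01-01"; // surfaced in the UI

// USD per million tokens, keyed by model ID (placeholder entry only).
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "example-model": { inputPerM: 3.0, outputPerM: 15.0 },
};

function rowCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  if (!p) return 0; // unknown model: report as untracked rather than guess
  return (inputTokens / 1e6) * p.inputPerM + (outputTokens / 1e6) * p.outputPerM;
}
```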
N-way configs (2-4) over fixed A vs B
Fixed A/B does not scale. N-way with dynamic add/remove is more flexible. Capped at 4 to stay within Tier 1 rate limits at 50 rows.
Ranked leaderboard over side-by-side columns
With N configs there is no natural A vs B. A ranked list with gold/silver/bronze badges scales cleanly and is easier to scan.
API key error recovery without losing state
When 50%+ of rows error (bad key), an inline fix panel appears on the results page with editable key inputs. Retry reruns the full eval without navigating away. Dataset, configs, and rubric are all preserved.
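The trigger condition might be as simple as this sketch (the function name and threshold handling are illustrative):

```typescript
// Illustrative trigger for the inline fix panel: at least half of the
// rows errored, which usually indicates a bad or expired API key.
function shouldShowKeyFixPanel(rowErrors: boolean[]): boolean {
  const errored = rowErrors.filter(Boolean).length;
  return rowErrors.length > 0 && errored / rowErrors.length >= 0.5;
}
```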
Limitations
Honest Limitations
No persistence
Closing the tab clears everything. This is intentional for V1 (no server cost, no auth complexity) but means you cannot compare runs across sessions.
50-row cap
Sequential processing (one row at a time to avoid rate limits) means larger datasets would be slow. Batch API support is the V2 path.
Judge scoring is LLM pattern matching
The rubric scores are the judges' assessments, not grounded metrics. Two-judge averaging reduces noise but does not eliminate subjectivity.
Cost data is hardcoded
Pricing tables are manually maintained from provider documentation, not pulled from a live API.
Product Screenshots
The Product
Roadmap
What's Next
- Synthetic golden dataset generator: create high-precision evaluation datasets from minimal inputs, so you do not need 50 hand-labeled rows to start
- Demo mode with pre-loaded mock results so visitors can browse without API keys
- Persistent run history across sessions
- Batch API support for larger datasets
- Hybrid scoring: auto-detect exact match for structured outputs, LLM judge for open-ended