Case Study
Eval Studio
Which prompt, which model, at what cost?
A browser-based LLM evaluation tool. Test prompts and models against your own data with a multi-model judge council, per-row cost tracking, and a ranked leaderboard. BYO API keys, nothing stored.
Multi-provider
Anthropic, OpenAI, Gemini
Judge Council
Cross-provider scoring
Cost Tracking
Per-row, per-config
Live
eval.harshit.ai
Shipped in 2 Hours
Side project
At a Glance
Type: AI Eval Tool
Domain: LLM Infra
Status: Live
Stack: Next.js + TS
The Problem
I Needed This Myself
I was designing an AI interview agent and hit a dilemma: which model should power it, and which prompt would actually perform? I needed answers before committing either to the agent pipeline. I knew the options; what mattered for the product was choosing the right one.
I tried comparing outputs manually across Claude, GPT, and Gemini. It was slow, messy, and impossible to keep track of. A tool for this should have existed, so I built one as a side project and shipped Eval Studio in two hours.
When I talked to other founders and PMs building Gen AI and agentic AI products, they described the same problem. Everyone was manually iterating through models and prompts, copying outputs into spreadsheets, and losing track of which version performed better. It felt productive, but it was productive procrastination: no rigor, no reproducibility, no cost visibility.
Market Context
What Exists and Where It Falls Short
LLM evaluation is not a new idea. The tools exist. But they are all built for users other than the PM or founder who just needs to pick a model and a prompt before building.
Prompt Cannon
Casual model comparison
One prompt at a time, side-by-side outputs. No dataset, no scoring rubric, no cost tracking. Good for curiosity, not for product decisions.
Promptfoo (acquired by OpenAI)
Developer CLI tool
Powerful but requires Node.js, YAML config files, and terminal setup. Pivoted heavily toward red teaming and security testing. Not built for quick prompt/model comparison.
Braintrust
Enterprise eval platform
Full observability suite with CI/CD integration, tracing, and experiment tracking. Pro plan at $249/month. Built for teams with production AI, not for a PM prototyping an agent.
LangSmith
LangChain ecosystem
Tightly coupled to LangChain. Strong for teams already in that ecosystem, but requires SDK integration. Not model-agnostic in practice.
Google LLM Comparator
Research tool
Python library that ingests JSON files. Built for ML researchers comparing Gemma model versions, not for PMs comparing prompt strategies across providers.
Langfuse
Open-source observability
Excellent for production logging and tracing. Eval features exist but are secondary to observability. Requires self-hosting or cloud account setup.
The gap is not eval tooling. The gap is zero-setup eval tooling. Every tool above requires CLI installation, SDK integration, cloud accounts, or Python scripting before you can compare a single prompt. Eval Studio is a URL. Open it, paste your API keys, upload a CSV, and get scored results in minutes. That is the product bet: the fastest path from question to answer for anyone choosing a model or prompt.
Use Cases
Two Entry Points, Two Decisions
Test Models
Same prompt, different models. Upload your dataset, write one system prompt, pick 2-4 models across Anthropic, OpenAI, and Gemini. See which model serves your data best and at what cost.
Test Prompts
Same model, different prompts. Upload your dataset, pick one model, write 2-4 system prompts. See which prompt produces better outputs on your actual data.
Workflow
How It Works
Step 1
Enter API keys
Bring your own keys for Anthropic, OpenAI, or Google Gemini. Nothing stored, nothing logged. Keys live in React state only.
Step 2
Upload a dataset
Upload a CSV (max 50 rows) or use the built-in 50-row sample dataset covering 8 task categories.
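A minimal sketch of how the 50-row cap might be enforced when parsing the upload in the browser. The helper below is illustrative, not the production parser, and uses a naive comma split:

```typescript
// Illustrative sketch only: parse an uploaded CSV client-side and
// enforce the 50-row cap. A real parser must also handle quoted
// fields and embedded commas.
const MAX_ROWS = 50;

type DatasetRow = Record<string, string>;

function parseCsv(text: string): DatasetRow[] {
  const [headerLine, ...lines] = text.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim());
  return lines
    .filter((line) => line.length > 0)
    .slice(0, MAX_ROWS) // drop anything past the cap
    .map((line) => {
      const cells = line.split(",");
      return Object.fromEntries(
        headers.map((h, i) => [h, cells[i] ?? ""] as [string, string])
      );
    });
}
```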
Step 3
Configure 2-4 prompt/model combinations
Mix and match models and system prompts. Test Models mode locks the prompt; Test Prompts mode locks the model.
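The configuration shape is roughly this. A hypothetical sketch; the type names and validation messages are assumptions, not the actual data model:

```typescript
// Hypothetical shapes for an eval run (2-4 configs).
type Provider = "anthropic" | "openai" | "gemini";
type EvalMode = "test-models" | "test-prompts";

interface EvalConfig {
  id: string;
  provider: Provider;
  model: string;        // a model ID for the chosen provider
  systemPrompt: string;
}

// Returns an error message, or null when the set of configs is valid.
function validateConfigs(mode: EvalMode, configs: EvalConfig[]): string | null {
  if (configs.length < 2 || configs.length > 4) return "Choose 2-4 configurations.";
  const prompts = new Set(configs.map((c) => c.systemPrompt));
  const models = new Set(configs.map((c) => c.model));
  if (mode === "test-models" && prompts.size > 1)
    return "Test Models mode uses one shared prompt.";
  if (mode === "test-prompts" && models.size > 1)
    return "Test Prompts mode uses one shared model.";
  return null;
}
```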
Step 4
Define a scoring rubric
Named criteria, weights, and colors. The rubric tells the judge council exactly what to evaluate and how to weight it.
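A sketch of what a rubric and its weighted score could look like. The names and shapes here are assumptions for illustration:

```typescript
// Hypothetical rubric shape: named criteria with weights and a display color.
interface RubricCriterion {
  name: string;   // e.g. "Accuracy"
  weight: number; // relative weight, normalized below
  color: string;  // used only in the results UI
}

// Weighted score for one row, given a judge's per-criterion scores (0-100).
function weightedScore(
  rubric: RubricCriterion[],
  scores: Record<string, number>
): number {
  const totalWeight = rubric.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  return rubric.reduce(
    (sum, c) => sum + (scores[c.name] ?? 0) * (c.weight / totalWeight),
    0
  );
}
```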
Step 5
Choose a judge council
1-2 models, cross-provider recommended to reduce bias. A bias warning appears when any judge provider matches any config provider.
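The bias check itself is simple. A sketch, reusing the illustrative shapes from the config sketch above:

```typescript
// Sketch of the bias check: true when any judge's provider also appears
// among the configs under test (a judge scoring its own provider's output).
function hasJudgeBias(judgeProviders: Provider[], configs: EvalConfig[]): boolean {
  const configProviders = new Set(configs.map((c) => c.provider));
  return judgeProviders.some((p) => configProviders.has(p));
}
```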
Step 6
Run the eval
Each row is scored independently by each judge, scores averaged. Rows where judges disagree by more than 15 points are flagged as outliers.
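In code, the per-row aggregation might look like this illustrative sketch of the averaging and the 15-point outlier rule:

```typescript
// Illustrative per-row aggregation: average the judges' scores and flag
// the row as an outlier when the judges disagree by more than 15 points.
const OUTLIER_THRESHOLD = 15;

function aggregateRow(judgeScores: number[]): { score: number; outlier: boolean } {
  const score = judgeScores.reduce((a, b) => a + b, 0) / judgeScores.length;
  const spread = Math.max(...judgeScores) - Math.min(...judgeScores);
  return { score, outlier: spread > OUTLIER_THRESHOLD };
}
```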
Step 7
Get ranked results
A ranked leaderboard with per-row scores, cost breakdown, outlier flags, and CSV export.
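Ranking is then a sort over mean row scores. An illustrative sketch (the result shape is an assumption):

```typescript
// Illustrative ranking step: order configs by mean row score, descending.
interface ConfigResult {
  configId: string;
  rowScores: number[]; // averaged judge scores, one per dataset row
  totalCostUsd: number;
}

function rankLeaderboard(results: ConfigResult[]): ConfigResult[] {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return [...results].sort((a, b) => mean(b.rowScores) - mean(a.rowScores));
}
```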
Stack
Next.js, TypeScript, Anthropic API, OpenAI API, Gemini API, Vercel
Design Decisions
Why It Works This Way
Judge council over single judge
A single model judging its own provider's outputs shows a documented self-preference bias. Eval Studio uses two independent judges from different providers, averages their scores, and flags rows where judges disagree by more than 15 points. A bias warning appears when any judge provider matches any config provider.
BYO key, no backend storage
Users bring their own API keys. Keys live in React state only, never stored, never logged. API calls go through a server-side proxy to handle CORS, but nothing is persisted. This builds trust and eliminates server cost.
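A minimal sketch of what such a stateless pass-through could look like as a Next.js App Router route handler, shown for the Anthropic Messages API. The route path and request body shape are assumptions:

```typescript
// app/api/proxy/anthropic/route.ts -- hypothetical route path.
// The user's key arrives with each request, is forwarded upstream,
// and is never logged or written anywhere server-side.
export async function POST(req: Request) {
  const { apiKey, payload } = await req.json(); // request shape is an assumption
  const upstream = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify(payload),
  });
  // Relay the upstream response as-is; nothing is persisted.
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "content-type": upstream.headers.get("content-type") ?? "application/json",
    },
  });
}
```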
Cost tracking as a first-class metric
Every eval shows total tokens, total cost, and average cost per row per config. Pricing is hardcoded from provider docs with a last-updated timestamp. Cost is a real selection criterion alongside quality.
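A sketch of the pricing-table approach. The model key and dollar figures below are placeholders, not current provider rates:

```typescript
// Sketch of a hardcoded pricing table with a last-updated marker.
const PRICING_LAST_UPDATED = "2025-01-01"; // surfaced in the UI

// USD per million tokens, keyed by model ID (placeholder entry only).
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "example-model": { inputPerM: 3.0, outputPerM: 15.0 },
};

function rowCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  if (!p) return 0; // unknown model: report as untracked rather than guess
  return (inputTokens / 1e6) * p.inputPerM + (outputTokens / 1e6) * p.outputPerM;
}
```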
N-way configs (2-4) over fixed A vs B
Fixed A/B does not scale. N-way with dynamic add/remove is more flexible. Capped at 4 to stay within Tier 1 rate limits at 50 rows.
Ranked leaderboard over side-by-side columns
With N configs there is no natural A vs B. A ranked list with gold/silver/bronze badges scales cleanly and is easier to scan.
API key error recovery without losing state
When 50%+ of rows error (bad key), an inline fix panel appears on the results page with editable key inputs. Retry reruns the full eval without navigating away. Dataset, configs, and rubric are all preserved.
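The trigger condition might be as simple as this sketch (the function name and threshold handling are illustrative):

```typescript
// Illustrative trigger for the inline fix panel: at least half of the
// rows errored, which usually indicates a bad or expired API key.
function shouldShowKeyFixPanel(rowErrors: boolean[]): boolean {
  const errored = rowErrors.filter(Boolean).length;
  return rowErrors.length > 0 && errored / rowErrors.length >= 0.5;
}
```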
Limitations
Honest Limitations
No persistence
Closing the tab clears everything. This is intentional for V1 (no server cost, no auth complexity) but means you cannot compare runs across sessions.
50-row cap
Sequential processing (one row at a time to avoid rate limits) means larger datasets would be slow. Batch API support is the V2 path.
Judge scoring is LLM pattern matching
The rubric scores are the judges' assessments, not grounded metrics. Two-judge averaging reduces noise but does not eliminate subjectivity.
Cost data is hardcoded
Pricing tables are manually maintained from provider documentation, not pulled from a live API.
Product Screenshots
The Product
Roadmap
What's Next
- Synthetic golden dataset generator: create high-precision evaluation datasets from minimal inputs, so you do not need 50 hand-labeled rows to start
- Demo mode with pre-loaded mock results so visitors can browse without API keys
- Persistent run history across sessions
- Batch API support for larger datasets
- Hybrid scoring: auto-detect exact match for structured outputs, LLM judge for open-ended