Evals are a scam. And we're being gaslit into believing they aren't.
Most AI teams don't need evals. They need logging, QA, and taste.
If you’re building a non-deterministic AI system, wouldn’t evaluation tools be essential?
That’s why VCs are investing hundreds of millions of dollars in startups building in the evals space. The prevailing discourse is focused on convincing engineering teams that evals are critical to building reliable AI products, and that they should spend budget on software that provides them. And I’m no exception: I run a VC-funded startup that raised money to sell evals!

But now I’ve had enough 1:1 conversations with AI engineers to firmly hold this conviction:
The evals discourse is wrong and distracting. Most AI products don’t need evals. Vendors selling evals are not actually selling evals.
Foundation Model evals != Product evals
First, I want to make an important distinction: Foundation model labs and product companies measure two very different things.
Model Evals approximate the general effectiveness of an LLM across various tasks and domains. For example:
Reasoning (ARC-AGI, MMLU)
Coding (Terminal Bench, SWE Bench)
Human feedback (Chatbot Arena, “Vibes”)
When I used to train ML models, most of my time was spent curating datasets and tweaking loss functions to measure the model’s performance in general. Foundation model labs have the same goal: train generally capable LLMs.
Unless you work at OpenAI, Anthropic, Meta, etc. you do not need model evals.
On the other hand, product evals represent the effectiveness and reliability of applied AI systems to real use cases. These are wildly subjective and extremely specific to the market segment you’re building against.
Eval vendors are trying to sell you something they will never be able to deliver: Expertise on your own product.
Product evals are subjective, ambiguous, and messy. And nobody wants to do them.
If you survey the docs of the most well-funded evals companies out there, they are all more or less identical.
They are usually some set of:
Prompt registry and experimentation sandbox
LLM-as-judge or human labeling mechanism for arbitrary metrics (e.g. correctness, toxicity, relevancy)
Binary labels or continuous scorers for comparing outputs
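For a sense of scale, the “LLM-as-judge” item above is roughly a one-prompt wrapper. A minimal sketch, assuming the OpenAI Python client; the model name, metric, and rubric are placeholders:

```python
# A minimal LLM-as-judge scorer sketch. Assumes the OpenAI Python client;
# the model name and grading rubric are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_relevancy(question: str, answer: str) -> int:
    """Ask a model to grade an answer's relevancy to a question, 1-5."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate how relevant this answer is to the question on a "
                "1-5 scale. Reply with a single digit.\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())
```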
These features may solve evals for some teams, but they don’t address the core challenges with building great AI products.
The best way to evaluate AI systems is to build a QA function within your org to measure the effectiveness of your product. And building effective QA is really, really hard.
Annotation is laborious and requires specialized knowledge.
Product experts and engineers want to build. Nobody wants to label data.
Quantitative measures of quality rarely capture the system’s “vibe.” Quality becomes obvious as you’re building product.
Concrete example: Agent simulations
At AgentOps, we ran a POC with a large enterprise consumer goods company. They wanted to survey LLM-simulated agents about new product ideas instead of recruiting expensive human panels. The underlying thesis is that agents can roughly approximate human opinions.
What’s the eval?
Survey a large, diverse demographic of humans (i.e. “how do you feel about coconut chocolate?”) with Likert scale rankings.
Run the same survey on a large sample of LLM respondents prompted to role play as the demographic.
Measure the difference between the results.
How do you measure the difference? A Chi-squared test of homogeneity is probably your best bet.
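A minimal sketch of that last step, assuming SciPy’s chi2_contingency and made-up response counts:

```python
# Compare Likert response distributions from human vs. LLM respondents with a
# chi-squared test of homogeneity. The counts below are made up for illustration.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: respondent pool; columns: Likert responses 1 (hate it) .. 5 (love it)
counts = np.array([
    [34, 51, 102, 88, 25],  # human panel
    [29, 60,  95, 94, 22],  # LLM agents role-playing the same demographic
])

chi2, p_value, dof, _expected = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3f}")
# A high p-value: can't reject that both pools answer alike.
# A low p-value: the simulated agents diverge from the human panel.
```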
Let’s compare this with the evals we see on the market.
Did we need “factuality” or any other kind of LLM-as-judge (or human) scores to measure this? No. We needed pandas and SciPy. Or heck, even Google Sheets. The dashboard here was, frankly, overkill.
Here’s my contention: The best way to build evals will always be to do it yourself.
It’s your product, it’s your eval. Build and own a system that supports your use case. Eval vendors will sell you dashboard theater. You can’t outsource taste.
Concrete example: AI video editing
A friend is an engineer at an AI video editing company. They’re building new functionality to allow users to edit videos with prompts.
What’s the eval?
Check whether the resulting edits align with the user’s intentions
Ensure the edits produced are high quality
First, you need someone with actually good taste to rank these outputs. ML engineers and PMs, although skilled in other ways, do not want to spend time labeling data.
Ok, so now you’ve recruited a team of (expensive) experienced video editors to rank your product.
They all disagree with each other on what quality means.
Given enough iterations, you’ll arrive at a quality product. But in the meantime, it’s expensive, boring, and worst of all, subjective. Almost every tangible AI use case will have this issue.
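Even measuring that disagreement is on you, not a vendor. A hedged sketch using scikit-learn’s cohen_kappa_score, with invented editors and ratings:

```python
# Quantify inter-annotator agreement between video editors before trusting
# their quality labels. Editors and ratings are invented for illustration.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

ratings = {
    "editor_a": [1, 0, 1, 1, 0, 1, 0, 1],  # 1 = "this edit is high quality"
    "editor_b": [1, 1, 0, 1, 0, 0, 0, 1],
    "editor_c": [0, 1, 1, 1, 1, 0, 0, 1],
}

for (a, ra), (b, rb) in combinations(ratings.items(), 2):
    print(f"{a} vs {b}: kappa={cohen_kappa_score(ra, rb):.2f}")
# Kappa near 1.0 means genuine agreement; near 0 means your "quality" label
# is mostly noise, and no dashboard fixes that.
```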
Off-the-shelf eval products will not solve this problem.
Evals are QA. Why would you ever outsource QA?
I asked a friend working at a notable AI voice company about a well-funded startup that claimed to provide their evals.
“We don’t buy evals from anyone.”
That was a surprise. I dug deeper. Was evaluating their product a priority? Absolutely. Then why not invest in a product for it?
“Evals are too core to our product experience to outsource to a vendor.”
Imagine building a product (with or without LLMs), but your QA is entirely outsourced to someone with zero expertise about your product. Why would you ever do such a thing?
That’s the problem.
“When our AI isn’t performing, our customers let us know. And before then, we usually see the limits when we’re working on it.”
If you’re building good software, you are already using evals. They’re called product requirements.
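To make “product requirements as evals” concrete, here’s a hedged sketch; the system under test and its requirements are invented for illustration:

```python
# A product requirement expressed as an ordinary test. `summarize_ticket` is a
# hypothetical stand-in for your AI system; the requirements are invented.
def summarize_ticket(ticket_text: str) -> str:
    # Imagine an LLM call here; hardcoded so the sketch runs on its own.
    return "Customer was double-billed on invoice #4521 after upgrading."

def test_summary_meets_product_requirements():
    summary = summarize_ticket(
        "Customer reports double billing on invoice #4521 after upgrading."
    )
    # Requirement: summaries fit the support dashboard (max 40 words).
    assert len(summary.split()) <= 40
    # Requirement: billing summaries must surface the invoice number.
    assert "4521" in summary

test_summary_meets_product_requirements()
```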
Answer relevancy? Toxicity? Correctness? All bullshit.
You don’t need to know these. You need to know the most important metrics specific to your product experience to build against them.
The only evals that matter are the relationship between you and your user. You will never be able to buy that from a vendor.
Merchants of Complexity
If evals are a fake product, then why are they raising so much VC? After running countless 1:1 interviews with engineers using products like these, I’ve come to a surprising and unintuitive conclusion:
None of these companies actually sell evals.
They are Merchants of Complexity. That is, they pitch the idea of a framework or mental model in order to convince you to offload design thinking to an expert who has already sunk time into it. They market courses, e-books, blog posts, webinars, YouTube channels, and conference talks. The material sounds impartial, but it all arrives at the same conclusion: “You must agree with the ontology we are proposing and the software attached to it.”
After all, if a panel of experts has arrived at a conclusive methodology, why challenge them and recreate it yourself? In reality, the Evals Guild is trying to sell you something else:
Logging.

Logging is the gateway to evals. But logging ends up being the only service you actually use. It’s a very sticky product. And expensive. In fact, ~10-30% of all cloud spend goes to observability alone!
If you read an e-book or join a webinar about the benefits of evals and why you absolutely must have them, buying some seems like the obvious next step. Prima facie, it makes sense.
But if you agree with my points above, evals clearly don’t cut it. Logging, however, has innate value, both for observability and for its potential to unlock evals.
Announcing our latest feature… Evals!
So why do evals companies exist? Marketing. Merchants of Complexity win the mindshare.
And that’s why I’m officially announcing Evals for AgentOps 🤡
Shaped by years of ML engineering experience, I guarantee AgentOps “Evals” will address (well, at least document) most of the pain points discussed in public today.
It will look complex. It will look pretty. And I can also guarantee this feature will be almost entirely useless. Satirical even. But hey, it’s marketing.
And our next batch of customers will think they’re buying evals too.