I did not set out to build an AI evaluation company. I was a CRO deploying AI tools to accelerate revenue. Forecasting models. Lead scoring. Pricing optimization. The standard enterprise AI playbook.

Then the forecasting model started lying to me.

Not obviously. Not with error messages or crashed pipelines. The model produced confident, well-formatted predictions every week. The predictions were wrong with disturbing regularity, but they looked right. They had the structure of insight without the substance.

When I dug in, I found something that changed the trajectory of my career: the LLM powering our forecasting synthesis was evaluating its own outputs as part of the pipeline, and it was incapable of identifying its own errors. It would generate a forecast, evaluate the forecast as “high confidence,” and surface that to my team as a reliable number. The evaluation was as hallucinated as the forecast itself.
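
Stripped of the business context, the pipeline looked something like the sketch below. The function names and prompts are illustrative, not our production code; the point is that one model plays both author and grader.

```python
# Hypothetical sketch of the self-evaluating pipeline anti-pattern.
# `call_llm` stands in for whatever model API the pipeline uses; the
# names and prompts here are illustrative, not the actual system.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the same underlying model."""
    raise NotImplementedError

def weekly_forecast(pipeline_data: str) -> dict:
    # Step 1: the model generates the forecast.
    forecast = call_llm(f"Synthesize a revenue forecast from: {pipeline_data}")

    # Step 2: the *same* model grades its own output. This is the flaw:
    # whatever blind spot produced a bad forecast also produces an
    # optimistic grade for it.
    confidence = call_llm(f"Rate the reliability of this forecast: {forecast}")

    # Step 3: both numbers reach the revenue team looking equally solid.
    return {"forecast": forecast, "confidence": confidence}
```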

That discovery became the seed of Tepis AI and the MetaTruth framework. What I found in that forecasting pipeline turned out to be a fundamental property of how large language models process information. And it has implications that every enterprise deploying AI needs to understand.

Recursive Evaluation Collapse

The phenomenon I discovered has a name now: Recursive Evaluation Collapse (REC). It is the central finding of my research at USP and the mechanism that Tepis AI was built to detect.

What Is Recursive Evaluation Collapse?

When an LLM is asked to evaluate its own output, it applies the same knowledge representations, the same reasoning patterns, and the same blind spots that produced the original output. The evaluation inherits the errors of the generation.

This is not a bug in a specific model. It is a structural property of self-referential evaluation in neural networks. The model cannot step outside its own epistemic boundaries to assess whether those boundaries are correct.

The practical consequence: LLM self-evaluation scores are inflated by 15-40% compared to independent evaluation, depending on the task domain and the model’s training distribution.
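
If you want to see this gap in your own stack, score the same outputs twice, once by the generating model and once by an outside judge, and compare. A minimal sketch, assuming hypothetical self_score and independent_score helpers that you would wire to your model and to a separate evaluator:

```python
from statistics import mean

def self_score(model_output: str) -> float:
    """Hypothetical: ask the generating model to grade its own output, 0-1."""
    raise NotImplementedError

def independent_score(model_output: str, reference: str) -> float:
    """Hypothetical: grade the same output with a separate evaluator
    (a different model, a rubric, or human review), 0-1."""
    raise NotImplementedError

def self_evaluation_inflation(samples: list[tuple[str, str]]) -> float:
    """Average gap between self-assigned and independent scores.

    `samples` is a list of (model_output, reference) pairs. A positive
    result means the model rates its own work higher than an outside
    evaluator does, which is the REC signature."""
    gaps = [self_score(out) - independent_score(out, ref) for out, ref in samples]
    return mean(gaps)
```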

Think about what this means for enterprises. Every company using an LLM to check its own work, to validate its own outputs, to assess its own confidence, is operating with systematically inflated quality metrics. They think their AI is performing at 92% accuracy. The real number might be 65%.

And they have no way to know, because the evaluation tool is compromised by the same failure modes as the system it is evaluating.

Temperature Orthogonality

The second discovery was equally counterintuitive. In the AI industry, temperature is treated as a creativity dial: low temperature for deterministic outputs, high temperature for creative variation. That mental model is dangerously incomplete.

What Is Temperature Orthogonality?

Temperature does not uniformly scale output variation along a single axis. Different capability dimensions respond to temperature changes independently and sometimes inversely. A model might become more factually accurate at higher temperatures for certain query types while simultaneously becoming less logically consistent.

Temperature is not a knob. It is a multi-dimensional control surface, and most enterprises are adjusting it blind.

I discovered this while running systematic evaluations across temperature ranges for our revenue models. The same model at temperature 0.2 would produce better numerical forecasts but worse qualitative reasoning. At 0.7, the reasoning improved but the numbers drifted. There was no single “correct” temperature. The optimal setting depended on which capability dimension you cared about, and those dimensions were orthogonal to each other.
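
A sketch of that kind of sweep, with hypothetical run_model, numerical_accuracy, and reasoning_quality stand-ins for the model call and the per-dimension rubrics; the point is that each dimension gets its own curve instead of one aggregate score:

```python
# Sketch of a temperature sweep that scores capability dimensions
# separately. The three helpers are hypothetical stand-ins for your
# model API and your evaluation rubrics.

TEMPERATURES = [0.0, 0.2, 0.4, 0.7, 1.0]

def run_model(prompt: str, temperature: float) -> str:
    raise NotImplementedError  # your model call goes here

def numerical_accuracy(output: str, expected: float) -> float:
    raise NotImplementedError  # e.g. relative error on the forecast number

def reasoning_quality(output: str) -> float:
    raise NotImplementedError  # e.g. rubric-scored qualitative reasoning

def sweep(prompts_with_expected: list[tuple[str, float]]) -> dict[float, dict[str, float]]:
    """Score each dimension independently at each temperature.

    If temperature were a single creativity dial, the two columns would
    move together. Orthogonality shows up when they do not."""
    results = {}
    for temp in TEMPERATURES:
        num_scores, reason_scores = [], []
        for prompt, expected in prompts_with_expected:
            output = run_model(prompt, temperature=temp)
            num_scores.append(numerical_accuracy(output, expected))
            reason_scores.append(reasoning_quality(output))
        results[temp] = {
            "numerical_accuracy": sum(num_scores) / len(num_scores),
            "reasoning_quality": sum(reason_scores) / len(reason_scores),
        }
    return results
```

Run it over a representative prompt set and you get one row per temperature, with the dimensions free to move in different directions.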

For enterprise AI deployments, this means the standard practice of setting temperature once during configuration and never revisiting it is leaving performance on the table. Worse, it might be optimizing one capability at the cost of another you have not measured.

Why Benchmarks Lie

These findings led me to a broader investigation of how the AI industry evaluates its own products. The answer is: badly.

Public benchmarks like MMLU, HumanEval, and HellaSwag have become the equivalent of credit ratings before 2008: widely trusted, structurally compromised, and dangerously misleading for high-stakes decisions.

The problems are well-documented but poorly addressed: benchmark questions leak into training data, vendors tune their models against the very tests they report, and the tasks being scored look nothing like any specific enterprise's production workload.

This is not a theoretical concern. I have worked with enterprises that selected AI vendors based on benchmark performance and discovered, after deployment, that the model could not handle their specific data formats, regulatory constraints, or edge cases. The benchmark said “state of the art.” The production deployment said “not ready.”

The MetaTruth Framework

MetaTruth is the evaluation framework we built at Tepis AI to solve these problems. It is not another benchmark. It is a methodology for evaluating evaluations, a meta-evaluation layer that exposes the failure modes benchmarks miss.

MetaTruth Core Principles

  1. Independent evaluation. The evaluator must be structurally independent from the system being evaluated. No self-assessment. No vendor-provided test suites.
  2. Multi-dimensional scoring. A single accuracy number is meaningless. Evaluate factual accuracy, logical consistency, calibration (does the model know what it does not know?), and behavioral stability across conditions (see the sketch after this list).
  3. Adversarial probing. Standard test cases show you the best-case behavior. Adversarial probes show you the failure boundaries. Both are necessary for risk assessment.
  4. Cross-benchmark validation. No single evaluation methodology is sufficient. Cross-reference results across multiple independent evaluation approaches to identify convergent findings.
  5. Continuous, not one-time. Evaluation must be an ongoing process, not a procurement checkbox. Model behavior changes over time, and your evaluation cadence must match your deployment cadence.
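
To make principle 2 concrete, here is a minimal sketch of what a per-dimension result can look like in code. The field names mirror the list above; the structure is illustrative, not the actual MetaTruth schema:

```python
from dataclasses import dataclass

@dataclass
class EvaluationReport:
    """One evaluation run, reported per dimension instead of as a single
    accuracy number. Illustrative structure, not the MetaTruth schema."""
    factual_accuracy: float      # correctness against references
    logical_consistency: float   # agreement across related questions
    calibration_error: float     # gap between stated confidence and accuracy
    behavioral_stability: float  # variance across paraphrases, temperatures, runs

    def summary(self) -> str:
        # A single headline number hides exactly the trade-offs that matter,
        # so report every dimension side by side.
        return (f"factual={self.factual_accuracy:.2f} "
                f"consistency={self.logical_consistency:.2f} "
                f"calibration_err={self.calibration_error:.2f} "
                f"stability={self.behavioral_stability:.2f}")
```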

MetaTruth has now been applied across 14 distinct evaluation mechanisms. Each one tests a specific failure mode: Can the model detect its own hallucinations? Does it maintain consistency across semantically equivalent prompts? How does it behave at the boundaries of its knowledge? What happens when you decompose a complex question into sub-questions and compare the reassembled answer?
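
As one illustration of what such a mechanism can look like (again, a sketch rather than the MetaTruth implementation), a consistency probe rephrases the same question several ways and measures how often the answers agree:

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # hypothetical call to the model under test

def normalize(answer: str) -> str:
    return answer.strip().lower()

def consistency_probe(paraphrases: list[str]) -> float:
    """Fraction of semantically equivalent prompts that yield the modal answer.

    `paraphrases` are rewordings of one underlying question. A model that
    truly knows the answer should return it regardless of phrasing; a score
    well below 1.0 flags prompt-sensitive, unstable behavior."""
    answers = [normalize(ask_model(p)) for p in paraphrases]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)
```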

What This Means for Enterprise AI

If you are a CXO deploying AI in your organization, here is what I want you to take away from my journey from CRO to AI researcher:

Do not trust vendor evaluations. This is not because vendors are dishonest. It is because the evaluation methodology itself is structurally biased toward positive results. Get independent evaluation before making procurement or deployment decisions.

Do not trust self-evaluating AI. If your AI system assesses its own confidence, validates its own outputs, or scores its own quality, those scores are inflated. Recursive Evaluation Collapse is not a sometimes problem. It is a structural property.

Demand multi-dimensional evaluation. “95% accuracy” is not informative. Accuracy on what distribution? Measured how? At what temperature? Under what conditions? With what failure modes? If your vendor cannot answer these questions, their evaluation is incomplete.

Build evaluation into operations, not just procurement. The model you evaluated six months ago is not the model running today. Continuous evaluation is not optional for enterprise-grade AI deployment.

The biggest risk in enterprise AI is not that the AI fails. It is that the AI fails and tells you it succeeded.

I started as a CRO who wanted better forecasts. I ended up building a company dedicated to the proposition that you cannot deploy AI responsibly without independent, rigorous, continuous evaluation. That proposition has only become more urgent as AI penetrates deeper into enterprise decision-making.

The models are getting better. But “better” without “measurably better, by independent evaluation, on the dimensions that matter for your use case” is just vendor marketing. Tepis AI exists because the gap between those two statements is where enterprise risk lives.

Andre Magrini

Founder, Tepis AI • Global CRO at OGI Systems • AI Researcher

Andre Magrini is the founder of Tepis AI and creator of the MetaTruth evaluation framework. His research on Recursive Evaluation Collapse and Temperature Orthogonality is conducted through USP’s M.S. in Data Science program. As Global CRO at OGI Systems, he bridges AI research with enterprise revenue operations. Author of 7 books.