Anthropic is moving Petri, its open-source alignment testing toolbox, to Meridian Labs while Petri 3.0 adds a more modular architecture and more realistic agent-scaffold testing. (Image: Anthropic)
Anthropic has donated Petri, its open-source alignment testing toolbox, to Meridian Labs, and the useful story is not “AI safety nonprofit receives software.” It is that one of the stranger bottlenecks in practical AI adoption is becoming more visible: if everyone is going to claim their models were evaluated, who gets to define what the evaluation actually means? In its announcement, Anthropic says Petri can test large language models for concerning tendencies, including deception, sycophancy, and cooperation with harmful requests, and that the tool has been part of its alignment assessment for every Claude model since Claude Sonnet 4.5.
Translation: this is not a consumer feature. Nobody is opening Petri to make a prettier spreadsheet. But for teams buying, building, or governing AI systems, eval infrastructure is becoming as important as model access. A model that looks competent in a chat demo can still behave differently inside an agent scaffold, with tools, delegated authority, long-running tasks, and a system prompt that gives it just enough rope to become interesting. Petri is aimed at that awkward middle: simulated scenarios where an auditor model probes a target model, then a judge model scores the transcripts for misaligned behavior. Not glamorous. Very useful when done carefully.
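To make the mechanics concrete, here is a minimal sketch of that loop. Every name below is invented for illustration; Petri’s real interfaces differ, but the shape is the point: an auditor improvises, a target responds, a judge grades the whole transcript afterward.

```python
# Illustrative sketch only: these classes and method names are assumptions,
# not Petri's actual API.
from dataclasses import dataclass, field

@dataclass
class Transcript:
    seed: str                                   # scenario instruction given to the auditor
    turns: list = field(default_factory=list)   # (auditor message, target reply) pairs
    scores: dict = field(default_factory=dict)  # judge's per-dimension scores

def run_audit(auditor, target, judge, seed: str, max_turns: int = 10) -> Transcript:
    """One audit: the auditor improvises a scenario around the seed, the
    target acts inside it, and the judge scores the finished transcript."""
    t = Transcript(seed=seed)
    message = auditor.open_scenario(seed)       # auditor sets the scene
    for _ in range(max_turns):
        reply = target.respond(message)         # target behaves, or misbehaves
        t.turns.append((message, reply))
        message = auditor.next_probe(reply)     # auditor adapts its next probe
        if message is None:                     # auditor decides it has seen enough
            break
    # Judge grades the transcript on dimensions like deception, sycophancy,
    # and cooperation with harmful requests.
    t.scores = judge.score(t.turns)
    return t
```

Note that the aggregate scores are the least interesting output; the transcripts are where the evidence lives, a theme that returns below.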
Petri 3.0’s biggest change is architectural. Earlier versions coupled the auditor and target model closely, which made customization painful. Meridian Labs says Petri 3.0 splits auditor and target into separate components, letting researchers swap or modify one side without untangling the other. That sounds like plumbing because it is plumbing. Plumbing is exactly what evals need if they are going to move beyond one-off lab experiments and become repeatable enough for outside researchers, government evaluators, and company safety teams to use without begging the original authors for bespoke magic.
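In concrete terms, the decoupling buys a boring superpower: hold one side fixed and sweep the other. A hypothetical sketch, with every name assumed rather than taken from Petri 3.0:

```python
# Hypothetical configuration objects; Petri 3.0's real components differ.
from dataclasses import dataclass

@dataclass
class Auditor:
    model: str         # which model plays the investigator
    playbook: str      # what behavior it should probe for

@dataclass
class Target:
    model: str
    system_prompt: str
    tools: list[str]   # tool surface the target is allowed to see

# One auditor, many target configurations: the kind of sweep that used to
# mean untangling auditor logic woven through the target setup.
auditor = Auditor(model="auditor-model", playbook="probe for sycophancy")
targets = [
    Target("target-model-v1", "You are a helpful assistant.", tools=[]),
    Target("target-model-v2", "You are a helpful assistant.", tools=["bash", "browser"]),
]
```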
The more interesting piece is Dish, a Petri add-on now in research preview. The problem Dish is trying to solve is eval theater: as models get more capable, they may notice when the whole setup smells like a test and behave accordingly. Meridian says Dish runs audits inside real agent scaffolds such as Claude Code, Codex, Gemini CLI, and similar environments, so the target sees real system prompts and real tool definitions instead of an auditor’s synthetic approximation. That does not make the scenario perfectly real. Meridian is explicit that the auditor still simulates tool responses, and that scenario plausibility and task structure remain open problems. Good. Caveats are not a weakness here. Caveats are how you can tell the adults entered the room.
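The distinction Dish is drawing, a real scaffold surface but simulated tool results, is easier to see in code. A sketch under exactly those assumptions, with all interfaces hypothetical:

```python
# Hypothetical interfaces; this is the idea as described, not Dish's API.
def scaffold_audit(auditor, target, scaffold):
    """Audit a target inside a real agent scaffold. The target sees genuine
    system prompts and tool definitions; the auditor fabricates tool output."""
    state = scaffold.initial_state()      # real system prompt + real tool schemas
    for _ in range(scaffold.max_steps):
        action = target.step(state)       # target acts against the real surface
        if action.is_tool_call:
            # The tool never actually runs: the auditor invents a plausible
            # result. This is the residual realism gap Meridian flags.
            result = auditor.simulate_tool(action)
        else:
            result = auditor.next_probe(action)
        state = scaffold.observe(state, result)
    return state.transcript
```

The win is that the “smells like a test” signals in the prompt and tool definitions go away; the remaining tell is the simulated tool output itself.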
There is also a Bloom connection. Anthropic’s Bloom framework generates targeted behavioral evaluation suites around a chosen behavior; Petri explores more broadly across scenarios. Meridian says Bloom has moved alongside Petri and now uses Petri as its execution backbone. Practically, that points toward a useful split: use Petri to scout for broad behavioral trouble, then use Bloom-style targeted evals to measure a specific concern more deeply. The pitch is not “one score to certify the model.” Please no. The pitch is a more composable eval stack: scout, probe, reproduce, inspect transcripts, revise the test, and run it again.
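Spelled out as a workflow, that split could look like the following, a hedged sketch rather than either tool’s real interface; the function names and the 0.7 threshold are assumptions.

```python
# Illustrative only: names and thresholds are invented for this sketch.
def scout(seeds, run_audit, threshold=0.7):
    """Broad Petri-style pass: many scenarios, flag whatever scores high."""
    flagged = []
    for seed in seeds:
        t = run_audit(seed)
        if max(t.scores.values(), default=0.0) > threshold:
            flagged.append((seed, t))
    return flagged

def probe(behavior, generate_variations, run_audit, n=50, threshold=0.7):
    """Targeted Bloom-style pass: many variations of one behavior,
    reported as a rate you can track across model versions."""
    hits = sum(
        run_audit(v).scores.get(behavior, 0.0) > threshold
        for v in generate_variations(behavior, n)
    )
    return hits / n
```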
The donation to Meridian Labs matters because neutrality is part of the product. Anthropic says moving Petri to an AI evaluation nonprofit, much as it donated the Model Context Protocol to the Linux Foundation, should make Petri’s results more credible across labs, researchers, and governments. Maybe. Independence helps, but it does not magically solve incentives. An eval tool can still be gamed, overfitted, misunderstood, or waved around as a compliance talisman. The credibility will come from boring things: transparent versioning, public methodology, reproducible configs, manual review of transcripts, disclosed judge dimensions, known failure modes, and outside groups finding problems the original stewards did not.
For Useful Machines readers, the practical question is not whether to “use Petri” tomorrow. It is whether your AI evaluation process can answer Petri-shaped questions at all. If you are evaluating an agent, can you test it in the scaffold where it will actually run? Can you separate the target model from the auditor logic? Can you inspect the transcripts instead of worshiping an aggregate score? Can you rerun the same scenario after changing the system prompt, tool permissions, or model version? Can you tell when the model is behaving well because it is aligned with the task versus behaving well because it recognized the exam?
Buyers should ask vendors the same thing, politely and with receipts. Do not accept “we ran safety evals” as a finished sentence. Which evals? Against which model version? In what scaffold? With what tools enabled? Were transcripts manually reviewed? Were failures clustered or merely averaged away? Did the test include deception, sabotage, sycophancy, policy evasion, and harmful cooperation cases that resemble the buyer’s actual workflow? If the answer is mostly launch copy, keep walking.
Petri 3.0 is not a magic truth machine. It is an open test harness getting more modular, more realistic, and more independent at a moment when AI products desperately need evaluation methods that are harder to dismiss as lab self-certification. That is why this little infrastructure story is worth the standalone slot. The next phase of practical AI will not be decided only by which model talks better. It will be decided by which organizations can prove, with enough evidence to survive contact with reality, that their agents behave acceptably when the demo is over and the tools are live.
In short
Petri 3.0 turns Anthropic’s open alignment-testing tool into a more hackable, more realistic eval stack under Meridian Labs. Useful, if buyers treat it as a test harness instead of a trust sticker.