Supervising models we cannot directly evaluate.
Palaestra Research is an independent nonprofit working on scalable oversight: the protocols, evaluations, and training mechanisms that let weaker judges, human or model, reliably supervise systems more capable than themselves.
Oversight as a discipline of structured contests.
The central problem of scalable oversight is that we want to train models to do things we cannot directly verify: write research, plan experiments, take long sequences of actions in a codebase. Naive reward modeling fails here because judges reward what is persuasive rather than what is true. We study the protocols that close that gap.
The palaestra was the wrestling court of the ancient Greek gymnasium, a place of disciplined, adversarial practice. We borrow the name deliberately. Our work treats oversight as a discipline of structured contests: between a proposer and an independent critic, between a claim and the local check that would falsify it, between what a system asserts and what a weaker judge can verify. We build these contests, run them against the strongest models we can access, and report when they work and when they do not.
This is also a sequencing claim. Many of the safety questions we care about most, including research sabotage, subtle data poisoning, and the slow erosion of oversight across many low-stakes decisions, are diffuse: they cannot be detected from any single action and require an oversight signal aggregated across a trajectory. We do not think the study of these diffuse threats is tractable until train-time debate and related oversight protocols actually work. So we build the protocols first, and treat diffuse control as the regime we earn the right to study once the supervision signal is reliable.
In the limit, alignment requires us to train systems whose outputs we cannot fully check. The question is not whether to delegate evaluation, but how to design the delegation so the answers we get back are trustworthy.
An empirical program on oversight.
Our current focus is the oversight stack itself: the protocols, training methods, and benchmarks that determine whether a weaker judge can supervise a stronger model. Diffuse control is the regime we want to study; the work below is the infrastructure we need to study it honestly.
Optimizing debaters and judges under RL
Inference-time debate gives weaker judges a better correctness signal in some regimes but not others. We are building the training stack (RL-trained debaters, judge-aware reward shaping, and preference data over transcripts) to test whether optimization pressure converts these inconsistent inference-time gains into robust improvements.
Debate, consultancy, and the limits of judge capability
Our recent work (Elasky & Nakasako, 2026) finds that debate helps weak judges when the critic actually beats the judge as a classifier and the judge treats the critique as a claim to verify. We are extending this to rubric-graded math, agentic code trajectories, and longer-horizon fuzzy tasks.
Cleaner ground truth so oversight results mean something
Oversight results are only as trustworthy as the tasks they run on. Our remediation of BigCodeBench (released as BigCodeBench+) lifted judge accuracy by 20–25 points; we are doing similar work on math contamination and ARC-style logic. Without this, we cannot tell whether a protocol failed or the benchmark did.
Aggregating weak signals across long trajectories
When each individual action is only mildly suspicious, as in research sabotage, subtle data poisoning, or the slow erosion of oversight, classical control techniques break down. This is the regime we are ultimately aiming at, but it is gated on the train-time oversight stack above. We are scoping evaluations and threat models now; serious empirical work begins once the supervision signal is good enough to be informative.
Datasets & releases.
Artifacts we release for the wider community: cleaned benchmarks, evaluation code, and the supporting material behind our results.
- BigCodeBench+ v0.1.0 (2026, HuggingFace): cleaned coding benchmark for verdict-accuracy research
- Uncontaminated Math Olympiad 2026 (UCMO) (2026, HuggingFace): recent olympiad problems for contamination-free reasoning evaluation
Founders & collaborators.
Palaestra was founded in 2026 by researchers previously supported by Coefficient Giving, with operating support from collaborators in finance, infrastructure, and policy.
Ethan Elasky
AI researcher previously supported by Coefficient Giving, studying scalable oversight, debate, and control. UC Berkeley (data science, mathematics, Chinese; Phi Beta Kappa). Former research assistant at Academia Sinica.
Frank Nakasako
Mathematics and AI researcher focused on the empirical foundations of control. Co-author of work on generative debate; co-creator of the BigCodeBench+ remediation pipeline.
Laurence Tarquinio
Investments analyst and operating advisor. Background in special-situations private equity and institutional real estate. Supports Palaestra on grants strategy, financial planning, and organizational governance.
Get in touch.
General Correspondence
For research collaboration, press, grants, partnerships, or any other inquiry. A single inbox, monitored daily.
Following Our Work
New papers and notes are posted to the Alignment Forum and our LinkedIn page.