Technical AI Safety Research

Studying the diffuse failures of advanced AI systems.

Palaestra Research is an independent nonprofit advancing the science of AI control in regimes where threats arise not from a single decisive act, but from the slow, plausible accumulation of many small ones.

Adversarial training, in the old sense.

The palaestra was the wrestling court of the ancient Greek gymnasium — a place of disciplined, controlled adversarial practice. We borrow the name deliberately. Our research treats AI control as a discipline of structured contests: between models and their monitors, between claims and their refutations, between what a system can do and what we can verify it has done.

Most published work in AI control assumes a concentrated threat model — a small number of highly incriminating actions that, if caught, betray a model's misalignment. Real safety failures will more often be diffuse: research sabotage, subtle data poisoning, or the steady erosion of oversight across thousands of low-stakes decisions. These regimes demand different protocols, different evaluations, and a different theory of evidence.

Palaestra exists to build that theory empirically.

To derail the entire project of solving alignment, the model probably has to take a large number of malign actions — and individual bad actions are only weak evidence of malign intent.

— Redwood Research, on diffuse threat models

Four lines of active inquiry.

Our agenda is empirical and cumulative. We design control protocols, test them against the strongest models we can access, and publish results — including the negative ones.

01  /  Diffuse Control Protocols

Online training and incrimination under weak signals

When each individual bad action is only mildly suspicious, classical control techniques break down. We study which training-time and inference-time interventions remain effective when catastrophe requires aggregation across many decisions.
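The core statistical intuition — that individually weak signals can compound into strong evidence — can be sketched in a few lines. This is purely our illustrative toy model (the Gaussian suspicion scores, the 0.6 mean shift, and the mixture likelihood ratio are hypothetical assumptions, not Palaestra's actual protocol):

```python
import math
import random

def gauss_pdf(x, mu, sigma=1.0):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def suspicion_scores(p_bad, n, rng, mu_bad=0.6):
    """Simulate n monitor scores; malign actions shift the score mean only slightly."""
    return [rng.gauss(mu_bad, 1.0) if rng.random() < p_bad else rng.gauss(0.0, 1.0)
            for _ in range(n)]

def accumulated_evidence(scores, p_bad, mu_bad=0.6):
    """Log-likelihood ratio: 'occasionally malign' policy vs 'always benign' policy."""
    llr = 0.0
    for s in scores:
        p_malign_policy = p_bad * gauss_pdf(s, mu_bad) + (1 - p_bad) * gauss_pdf(s, 0.0)
        p_benign_policy = gauss_pdf(s, 0.0)
        llr += math.log(p_malign_policy / p_benign_policy)
    return llr

rng = random.Random(0)
benign = suspicion_scores(0.0, 2000, rng)   # policy that never misbehaves
malign = suspicion_scores(0.2, 2000, rng)   # policy that misbehaves 20% of the time

# Any single action contributes only a tiny log-likelihood-ratio increment,
# but across thousands of decisions the malign policy accumulates far more
# evidence against itself than the benign one.
gap = accumulated_evidence(malign, 0.2) - accumulated_evidence(benign, 0.2)
```

The point of the sketch: no one action is incriminating (each per-action increment is a fraction of a nat), yet aggregation across many decisions separates the two policies cleanly — which is why diffuse regimes call for protocols built around accumulation rather than single-shot incrimination.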

02  /  Scalable Oversight

Debate, consultancy, and the limits of judge capability

Our recent work (Elasky & Nakasako, 2026) finds that generative debate underperforms simpler baselines because judges are too credulous. We are extending this to RL-trained debaters, multi-turn settings, and judge debiasing.

03  /  Research Sabotage Evaluations

Measuring whether AI assistants quietly degrade safety work

If labs use models to do safety research, those models could withhold their best ideas or introduce subtle bugs. We are building benchmarks that detect sabotage in research-like settings without false-positive saturation.

04  /  Specification Quality

Cleaner ground truth for control evaluations

Control results are only as trustworthy as the tasks they run on. Our remediation of BigCodeBench (released as BigCodeBench+) lifted judge accuracy by 20–25 percentage points. We continue to invest in evaluation infrastructure.

Selected output.

  • FEB 2026

    Inference-time Generative Debates on Coding and Reasoning Tasks for Scalable Oversight

    Ethan Elasky, Frank Nakasako (equal contribution)

    Preprint  ↗
  • 2026

    BigCodeBench+ v0.1.0 — Cleaned coding benchmark for verdict-accuracy research

    Palaestra Research

    HuggingFace  ↗
  • FORTHCOMING

    NeurIPS 2026 submission

    Elasky, Nakasako, et al.

    In preparation

Founders & collaborators.

Palaestra was founded in 2026 by researchers previously supported by Coefficient Giving, with operating support from collaborators in finance, infrastructure, and policy.


Ethan Elasky

Co-founder  ·  Researcher

AI researcher previously supported by Coefficient Giving, studying scalable oversight, debate, and control. UC Berkeley (data science, mathematics, Chinese; Phi Beta Kappa). Former research assistant at Academia Sinica.


Frank Nakasako

Co-founder  ·  Researcher

Mathematics and AI researcher focused on the empirical foundations of control. Co-author of work on generative debate; co-creator of the BigCodeBench+ remediation pipeline.


Laurence Tarquinio

Advisor  ·  Operations

Investment analyst and operating advisor. Background in special-situations private equity and institutional real estate. Supports Palaestra on grants strategy, financial planning, and organizational governance.

Working notes.

Short writing on diffuse control, evaluation methodology, and adjacent questions. Less polished than papers; more frequent.

Notes are in preparation. Subscribe for updates via the contact below.

Get in touch.

General Correspondence

For research collaboration, press, grants, partnerships, or any other inquiry — a single inbox, monitored daily.

info@palaestraresearch.org

Following Our Work

New papers and notes are posted to the Alignment Forum and our LinkedIn page.

LinkedIn ↗