long-horizon coding agents · preliminary release
Given as much time and resources as needed, can a frontier model build a full working implementation of real-world software in Rust?
Even if a model scores 100% on a task, I do not claim that the generated code is production-ready. The only guarantee is that it passes the tests that make up the eval.
The end goal of this project: one-shot safe, efficient, and production-grade Linux in Rust. All of it.
| # | Harness | Model | % completion | Time | Cost | Tokens |
|---|---|---|---|---|---|---|
| 1 | codex | GPT-5.4 (xhigh) | 100% | 5 hrs | $52 | 139 M |
| 2 | codex | GPT-5.5 (xhigh) | 99.91% | 8 hrs 54 min | $258 | 386 M |
| 3 | codex | GPT-5.4-mini | 68.75% | 2 hrs | $9 | 64 M |
| 4 | opencode | DeepSeek-v4-pro | 36.59% | 4 hrs 54 min | $4.61 | 155 M |
The task instructions ask for a one-shot port. None manage one. A turn is one model-invocation cycle bounded by the harness's idle signal. Time, cost, and tokens are measured at the first evaluated turn boundary; % completion is the eval of that boundary commit.
| Harness | Model | % completion | Time | Cost | Tokens |
|---|---|---|---|---|---|
| codex | GPT-5.4 (xhigh) | 27.70% | 29 min | ~$5.17 | 11.8 M |
| codex | GPT-5.5 (xhigh) | 48.45% | 44 min | ~$24.61 | 37.4 M |
| codex | GPT-5.4-mini | 11.92% | 11 min | ~$1.23 | 9.5 M |
| opencode | DeepSeek-v4-pro | 12.20% | 1 hr 33 min | $1.26 | 29.7 M |
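The first-turn measurement described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual tooling: the `TurnBoundary` record, its field names, and every number in the sample run are hypothetical. The only rule it encodes is the one stated above: take the earliest turn boundary whose commit was evaluated, and read time, cost, tokens, and % completion at that point.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnBoundary:
    """Hypothetical record of one turn boundary (the harness went idle)."""
    turn: int
    elapsed_min: float               # wall-clock minutes since the run started
    cost_usd: float                  # cumulative cost at this boundary
    tokens: int                      # cumulative reported tokens at this boundary
    completion_pct: Optional[float]  # eval of the boundary commit, None if not evaluated


def first_evaluated_boundary(boundaries: List[TurnBoundary]) -> Optional[TurnBoundary]:
    """Return the earliest boundary whose commit has an eval, or None if there is none."""
    for boundary in sorted(boundaries, key=lambda b: b.turn):
        if boundary.completion_pct is not None:
            return boundary
    return None


# Made-up run: the first boundary was never evaluated, the second one was.
sample_run = [
    TurnBoundary(turn=1, elapsed_min=10.0, cost_usd=1.00, tokens=2_000_000, completion_pct=None),
    TurnBoundary(turn=2, elapsed_min=25.0, cost_usd=3.50, tokens=7_000_000, completion_pct=20.0),
    TurnBoundary(turn=3, elapsed_min=60.0, cost_usd=9.00, tokens=18_000_000, completion_pct=55.0),
]

first = first_evaluated_boundary(sample_run)
```

Under these assumptions, the first-turn row for the sample run would report the values stored in `first`, and everything the agent does afterwards is ignored.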
| # | Harness | Model | % completion | Time | Cost | Tokens |
|---|---|---|---|---|---|---|
| 1 | codex | GPT-5.5 (xhigh) | 97.05% | 10 hrs 6 min | $165 | 231 M |
| 2 | codex | GPT-5.4 (xhigh) | 96.63% | 12 hrs | $108 | 267 M |
| 3 | codex | GPT-5.4-mini | 91.67%* | 3 hrs 33 min | $20 | 160 M |
| 4 | opencode | DeepSeek-v4-pro | 55.90% | 3 hrs 54 min | $4.50 | 171 M |
| 5 | opencode | Qwen 3.6 plus | 45.04% | 1 hr 54 min | $5.80 | 81 M |
* Run cheated. Score reported is the best clean eval before cheating started.
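As a rough sketch of the footnote's rule, the score for a run that cheated can be picked like this. The function name, the `(commit, score)` tuples, and the idea of passing in the index of the first cheating eval are assumptions made for illustration; how cheating is detected is not covered here.

```python
from typing import List, Optional, Tuple

Eval = Tuple[str, float]  # (hypothetical commit label, % completion)


def best_clean_eval(evals: List[Eval], first_cheating_index: Optional[int]) -> Optional[Eval]:
    """Return the highest-scoring eval recorded before cheating started.

    `evals` must be in chronological order. `first_cheating_index` is the
    position of the first eval whose commit is flagged as cheating, or None
    if the run never cheated.
    """
    clean = evals if first_cheating_index is None else evals[:first_cheating_index]
    if not clean:
        return None
    # max() keeps the earliest entry when two evals tie on score.
    return max(clean, key=lambda e: e[1])


# Made-up history: the run starts cheating at the fourth eval.
history = [("c1", 40.0), ("c2", 75.0), ("c3", 70.0), ("c4", 100.0)]
reported = best_clean_eval(history, first_cheating_index=3)
```

Here the 100% eval at `c4` is discarded because it comes after cheating began, so the reported score is the best of the three earlier evals.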
The task instructions ask for a one-shot port. None manage one. A turn is one model-invocation cycle bounded by the harness's idle signal. Time, cost, and tokens are measured at the first evaluated turn boundary; % completion is the eval of that boundary commit.
| Harness | Model | % completion | Time | Cost | Tokens |
|---|---|---|---|---|---|
| codex | GPT-5.5 (xhigh) | 93.95% | 1 hr 50 min | ~$60.06 | 86.8 M |
| codex | GPT-5.4 (xhigh) | 42.81% | 30 min | ~$6.76 | 17.9 M |
| opencode | DeepSeek-v4-pro | 0% | 48 min | $0.71 | 3.7 M |
| opencode | Qwen 3.6 plus | 0.12% | 42 min | $1.81 | 26.0 M |
| codex | GPT-5.4-mini | 31.60% | 15 min | ~$1.69 | 10.8 M |
| # | Harness | Model | % completion | Time | Cost | Tokens |
|---|---|---|---|---|---|---|
| 1 | codex | GPT-5.5 (xhigh) | 27.60%* | 1 hr 46 min | $31 | 45 M |
| 2 | codex | GPT-5.4 (xhigh) | 18.75%* | 49 min | $5.14 | 11 M |
* Run cheated. Score reported is the best clean eval before cheating started.
The task instructions ask for a one-shot port. None manage one. A turn is one model-invocation cycle bounded by the harness's idle signal. Time, cost, and tokens are measured at the first evaluated turn boundary; % completion is the eval of that boundary commit.
| Harness | Model | % completion | Time | Cost | Tokens |
|---|---|---|---|---|---|
| codex | GPT-5.5 (xhigh) | 24.93% | 34 min | ~$15.73 | 22.6 M |
| codex | GPT-5.4 (xhigh) | 18.75% | 38 min | ~$4.51 | 10.8 M |
% completion is the counted milestone-completion percentage. For rows marked * (best clean eval), time, cost, and tokens are measured at that eval, not at the later cheating commit or the end of the run. Tokens are the total cumulative reported tokens: Codex input includes cached reads, and for OpenCode Go, cache reads and writes are added. Per-environment details are on the tasks page; thesis and methodology are on the about page.
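The accounting above can be sketched in a few lines. Everything named here is an assumption for illustration: the function names, the argument names, and the reading of "counted milestone-completion percentage" as passed milestones divided by counted milestones. The harnesses' real usage reports may be shaped differently.

```python
def completion_pct(passed_milestones: int, counted_milestones: int) -> float:
    """Percentage of counted milestones that pass (assumed definition)."""
    if counted_milestones <= 0:
        raise ValueError("counted_milestones must be positive")
    return round(100.0 * passed_milestones / counted_milestones, 2)


def total_tokens_codex(input_tokens: int, output_tokens: int) -> int:
    """Codex: reported input already includes cached reads, so nothing is added."""
    return input_tokens + output_tokens


def total_tokens_opencode(input_tokens: int, output_tokens: int,
                          cache_read: int, cache_write: int) -> int:
    """OpenCode: cache reads and writes are reported separately and added on top."""
    return input_tokens + output_tokens + cache_read + cache_write


# Made-up numbers, only to show the arithmetic.
pct = completion_pct(3, 4)
codex_total = total_tokens_codex(9_000_000, 1_000_000)
opencode_total = total_tokens_opencode(2_000_000, 500_000, 6_000_000, 1_500_000)
```

The point of splitting the two token functions is that the same headline "Tokens" column is assembled from differently shaped reports, so totals are comparable across harnesses only after this normalization.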
@misc{hayat2026riirbench,
  title  = {RIIR Bench: A long-horizon benchmark for AI coding agents on real engineering ports to Rust},
  author = {Hassan Hayat},
  year   = {2026},
  url    = {https://riirbench.com},
  note   = {Independent researcher. Contact: hassan.hayat7@gmail.com}
}