long-horizon coding agents · preliminary release

Can a frontier model Rewrite It In Rust?

Given as much time and resources as needed, can a frontier model build a full working implementation of real-world software in Rust?

Even if a model scores 100% on a task, I do not claim that the generated code is production-ready. The only guarantee is that it passes the tests that make up the eval.

The end goal of this project: a one-shot port of Linux to safe, efficient, production-grade Rust. All of it.

Best run per model, per task

Tasks (number of runs): SQLite (4), C compiler (5), Game Boy (2). The leaderboards below follow that order.

SQLite

#	Harness	Model	% completion	Time	Cost	Tokens
1	codex	GPT-5.4 (xhigh)	100%	5 hrs	$52	139 M
2	codex	GPT-5.5 (xhigh)	99.91%	8 hrs 54 min	$258	386 M
3	codex	GPT-5.4-mini	68.75%	2 hrs	$9	64 M
4	opencode	DeepSeek-v4-pro	36.59%	4 hrs 54 min	$4.61	155 M

Turn 1 — after the first model turn

The task instructions ask for a one-shot port. No model manages one. A turn is one model-invocation cycle, bounded by the harness's idle signal. Time, cost, and tokens are measured at the first evaluated turn boundary; % completion is the eval of the commit at that boundary.
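
As a rough illustration of the turn-1 measurement described above, the sketch below picks the first turn boundary that has an eval and reads the cumulative time, cost, and tokens at that point. The `TurnBoundary` record, its field names, and the idea that some boundaries may lack an eval are assumptions made for this example, not the benchmark's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnBoundary:
    """Hypothetical snapshot taken when the harness goes idle after a turn."""
    turn: int
    elapsed_min: float               # cumulative wall-clock minutes
    cost_usd: float                  # cumulative cost in USD
    tokens_m: float                  # cumulative tokens, in millions
    completion_pct: Optional[float]  # None if this boundary commit was not evaluated


def first_evaluated_boundary(boundaries: List[TurnBoundary]) -> Optional[TurnBoundary]:
    """Return the earliest boundary that has an eval, or None if there is none."""
    for b in sorted(boundaries, key=lambda b: b.turn):
        if b.completion_pct is not None:
            return b
    return None


# Placeholder values for illustration only, not benchmark data.
run = [
    TurnBoundary(1, 10.0, 1.00, 5.0, None),
    TurnBoundary(2, 25.0, 2.50, 12.0, 20.0),
    TurnBoundary(3, 60.0, 6.00, 30.0, 55.0),
]
snapshot = first_evaluated_boundary(run)
print(snapshot)
```

All four reported columns come from the same boundary, so the score and the resources spent to reach it stay aligned.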

Harness	Model	% completion	Time	Cost	Tokens
codex	GPT-5.4 (xhigh)	27.70%	29 min	~$5.17	11.8 M
codex	GPT-5.5 (xhigh)	48.45%	44 min	~$24.61	37.4 M
codex	GPT-5.4-mini	11.92%	11 min	~$1.23	9.5 M
opencode	DeepSeek-v4-pro	12.20%	1 hr 33 min	$1.26	29.7 M

C compiler

#	Harness	Model	% completion	Time	Cost	Tokens
1	codex	GPT-5.5 (xhigh)	97.05%	10 hrs 6 min	$165	231 M
2	codex	GPT-5.4 (xhigh)	96.63%	12 hrs	$108	267 M
3	codex	GPT-5.4-mini	91.67%^*	3 hrs 33 min	$20	160 M
4	opencode	DeepSeek-v4-pro	55.90%	3 hrs 54 min	$4.50	171 M
5	opencode	Qwen 3.6 plus	45.04%	1 hr 54 min	$5.80	81 M

^* Run cheated. Score reported is the best clean eval before cheating started.

Turn 1 — after the first model turn

Harness	Model	% completion	Time	Cost	Tokens
codex	GPT-5.5 (xhigh)	93.95%	1 hr 50 min	~$60.06	86.8 M
codex	GPT-5.4 (xhigh)	42.81%	30 min	~$6.76	17.9 M
opencode	DeepSeek-v4-pro	0%	48 min	$0.71	3.7 M
opencode	Qwen 3.6 plus	0.12%	42 min	$1.81	26.0 M
codex	GPT-5.4-mini	31.60%	15 min	~$1.69	10.8 M

Game Boy

#	Harness	Model	% completion	Time	Cost	Tokens
1	codex	GPT-5.5 (xhigh)	27.60%^*	1 hr 46 min	$31	45 M
2	codex	GPT-5.4 (xhigh)	18.75%^*	49 min	$5.14	11 M

^* Run cheated. Score reported is the best clean eval before cheating started.

Turn 1 — after the first model turn

Harness	Model	% completion	Time	Cost	Tokens
codex	GPT-5.5 (xhigh)	24.93%	34 min	~$15.73	22.6 M
codex	GPT-5.4 (xhigh)	18.75%	38 min	~$4.51	10.8 M

% completion is the percentage of counted milestones completed. For rows marked as best clean eval (^*), time, cost, and tokens are measured at that eval, not at the later cheating commit or at the end of the run. Tokens are the cumulative total of reported tokens; Codex input includes cached reads, and OpenCode Go cache reads/writes are added. Per-environment details are on the tasks page. Thesis and methodology are on the about page.
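
The "best clean eval" rule can be sketched as follows: walk the evals in commit order, stop at the first one flagged as cheating, and report the highest-scoring eval seen before that point along with its own time, cost, and tokens. The `Eval` record, its field names, and the `cheated` flag are hypothetical; how cheating is actually detected is not covered here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Eval:
    """Hypothetical eval record for one commit of a run."""
    commit_index: int
    completion_pct: float
    elapsed_min: float   # cumulative wall-clock minutes at this eval
    cost_usd: float      # cumulative cost at this eval
    tokens_m: float      # cumulative tokens (millions) at this eval
    cheated: bool        # assumed to be set by some external review step


def best_clean_eval(evals: List[Eval]) -> Optional[Eval]:
    """Return the highest-scoring eval that precedes the first cheating commit."""
    clean = []
    for e in sorted(evals, key=lambda e: e.commit_index):
        if e.cheated:
            break  # everything from the first cheating commit onward is excluded
        clean.append(e)
    if not clean:
        return None
    # max() keeps the earliest eval when scores tie.
    return max(clean, key=lambda e: e.completion_pct)


# Placeholder values for illustration only, not benchmark data.
history = [
    Eval(1, 10.0, 15.0, 1.0, 4.0, False),
    Eval(2, 30.0, 40.0, 3.0, 9.0, False),
    Eval(3, 25.0, 55.0, 4.0, 12.0, False),
    Eval(4, 90.0, 70.0, 5.0, 15.0, True),   # inflated score after cheating
    Eval(5, 95.0, 90.0, 6.0, 20.0, True),
]
reported = best_clean_eval(history)
print(reported)
```

In this example the reported row is the eval at commit 2 (30%), with its own time, cost, and tokens rather than the totals at the end of the run.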

How to cite

@misc{hayat2026riirbench,
  title  = {RIIR Bench: A long-horizon benchmark for AI coding
            agents on real engineering ports to Rust},
  author = {Hassan Hayat},
  year   = {2026},
  url    = {https://riirbench.com},
  note   = {Independent researcher. Contact: hassan.hayat7@gmail.com}
}