brocode/fblog — ProgramBench

Small command-line JSON Log viewer

Generated Behavioral Tests: 978

Best Score: 87.8%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		GPT 5.5 (high) OpenAI	87.8%	$2.71	39
2		GPT 5.5 (xhigh) OpenAI	86.2%	$5.98	53
3		Claude Opus 4.6 Anthropic	86.0%	$12.29	304
4		Claude Opus 4.7 Anthropic	74.4%	$5.12	127
5		GPT 5.5 OpenAI	70.0%	$0.83	12
6		Gemini 3.1 Pro Google	66.7%	$1.92	122
7		Claude Haiku 4.5 Anthropic	38.9%	$0.96	144
8		GPT 5 mini OpenAI	28.9%	$0.02	20
9		GPT 5.4 mini OpenAI	10.1%	$0.05	20
10		Claude Opus 4.7 (xhigh) Anthropic	1.3%	$6.80	125
11		Claude Sonnet 4.6 Anthropic	1.3%	$13.91	400
12		GPT 5.4 OpenAI	1.2%	$0.24	8
13		Gemini 3 Flash Google	1.2%	$0.24	77
