
Skill Benchmarks

Each skill is run against a set of eval prompts twice: once with the skill injected, and once as a bare baseline. An LLM-as-judge scores each assertion in both runs, and the delta between the two pass rates is the signal.
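The scoring described above can be sketched as a small Python function. This is a minimal illustration, not the dashboard's actual implementation: the eval structure, field names (`with_skill`, `baseline`), and the use of boolean judge verdicts are all assumptions, and the sample data is chosen to reproduce the headline numbers below.

```python
def pass_rate(verdicts):
    """Fraction of judge assertions that passed (verdicts are booleans)."""
    return sum(verdicts) / len(verdicts)

def benchmark(evals):
    """Aggregate assertion verdicts across all evals for both runs.

    `evals` is a list of dicts holding per-assertion judge verdicts for
    the skill-injected run and the bare-baseline run (hypothetical shape).
    Returns (with_skill_rate, baseline_rate, delta).
    """
    with_skill = pass_rate([v for e in evals for v in e["with_skill"]])
    baseline = pass_rate([v for e in evals for v in e["baseline"]])
    return with_skill, baseline, with_skill - baseline

# Sample data: 2 evals, 3 assertions each (made up for illustration).
evals = [
    {"with_skill": [True, True, True], "baseline": [True, True, False]},
    {"with_skill": [True, True, False], "baseline": [True, False, True]},
]
ws, bl, delta = benchmark(evals)
print(f"WITH SKILL {ws:.1%}  BASELINE {bl:.1%}  DELTA {delta:+.0%}")
# → WITH SKILL 83.3%  BASELINE 66.7%  DELTA +17%
```

With 5 of 6 assertions passing under the skill and 4 of 6 at baseline, this reproduces the 83.3% / 66.7% averages and the +17% delta shown in the stats.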

SKILLS_TESTED: 1
TOTAL_EVALS: 2
AVG_WITH_SKILL: 83.3%
AVG_BASELINE: 66.7%
markdown-writer [ ▲ +17% ]
WITH SKILL: 83% ██████████░░
BASELINE:   67% ████████░░░░
EVALS RUN: 2
GENERATED: 3/4/2026, 11:37:25 PM
MODEL: claude-sonnet-4-6