[ EVALUATION_METRICS ]
Skill Benchmarks
Each skill is run against a set of eval prompts twice — once with the skill injected, once as a bare baseline. An LLM-as-judge scores each assertion. The delta is the signal. Expand any eval to see the full prompt, outputs, and assertion results.
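The scoring arithmetic behind the numbers below can be sketched as follows. The harness internals aren't shown in this report, so the function names and the boolean results are illustrative stand-ins for the LLM-as-judge verdicts:

```python
def score(assertion_results: list[bool]) -> float:
    """Percent of judge-passed assertions for one run."""
    return 100 * sum(assertion_results) / len(assertion_results)

def delta(with_skill: float, baseline: float) -> float:
    """The headline signal: with-skill score minus bare-baseline score."""
    return with_skill - baseline

# One eval, three assertions per condition, each judged by the LLM judge.
with_skill = score([True, False, True])    # ~66.7
baseline = score([True, False, False])     # ~33.3
signal = delta(with_skill, baseline)       # ~33.3
```

Skill-level and run-level averages in the report are then plain means over each skill's evals.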
SKILLS_TESTED
3
TOTAL_EVALS
8
AVG_WITH_SKILL
72.4%
AVG_BASELINE
34.5%
DELTA
+37.9%
humanize · claude-sonnet-4-6 · Mar 6, 2026
[ ▲ +53% ]
WITH SKILL
81% ██████████░░
BASELINE
28% ███░░░░░░░░░
EVALS
3
AVG TOKENS
3027
AVG TIME
54.2s
EVAL_1 · 67% vs 33%
PROMPT
Write the 'Our Values' section for a fintech company's about page. The company builds payment infrastructure for developers. Include 4 values with short descriptions (2-3 sentences each). Just write the section directly, no preamble.
WITH_SKILL · 67% · 3562 tok · 60.1s
OUTPUT (1349 chars)
## Our Values
**Reliability is the product.**
Payment infrastructure works best when developers forget it's there. We run at 99.99% uptime across all regions because a single failed transaction can cost a merchant their customer's trust permanently.
**Developer experience drives everything.**
APIs should behave exactly as documented, and error messages should tell you what went wrong and how to fix it. We invest as much in our documentation and SDK design as in the underlying infrastructure, because an API that takes a week to integrate — or one that surfaces opaque error codes at 2am — loses developer trust faster than almost any reliability issue could.
**Transparency in failure.**
When something breaks, we say what happened and what we changed. Our status page publishes raw uptime data and latency percentiles so you can see exactly what our systems are doing, and our post-mortems go public within 48 hours so the broader developer community can learn from what we got wrong.
**Security built to hold.**
Payments attract adversarial attention. PCI DSS compliance, end-to-end encryption, fraud detection, and threat monitoring are woven into the architecture from the start; every API key is scoped and every webhook is signed, and we run red team exercises quarterly to expose gaps in our assumptions before they become incidents.
PASS
The output does not use binary contrast patterns such as 'not X — it's Y', 'X isn't the problem, Y is', 'It's not about X, it's about Y', 'X aren't nice-to-haves — they're Y', or 'stops being X and starts being Y'. Check all 4 value descriptions.
None of the four value descriptions use binary contrast patterns like 'not X — it's Y'; each value is stated positively without an explicit contrast structure.
FAIL
The 4 value descriptions do NOT all follow the same sentence structure. At least 2 of the 4 should differ meaningfully in form — e.g., one might use a question, one might lead with an example, one might be a single long sentence while another uses two short ones. Identical patterns like (title, 2 sentences, punchy closer) repeated 4 times is a failure.
All four values follow the same structural pattern: a bold declarative title followed by exactly two sentences (the second being a long explanatory sentence with a 'because' or causal clause), with no value using a question, a single sentence, a leading example, or any other meaningfully different form.
PASS
Value descriptions do not end with three parallel items (tricolons) like 'speed, clarity, and trust' or 'we build, we test, we ship'. Two-item or four-item lists are fine; the specific pattern of three parallel items at the end of a value is the problem.
No value description ends with a three-item parallel list; Value 4 contains a four-item list mid-sentence ('PCI DSS compliance, end-to-end encryption, fraud detection, and threat monitoring'), which is explicitly permitted.
BASELINE · 33% · 594 tok · 25.0s
OUTPUT (974 chars)
## Our Values
**Reliability is a feature.**
Every payment that fails is a promise broken—to your users, and to you. We build infrastructure you can stake your business on, because downtime and data loss aren't acceptable tradeoffs for anyone.
**Developers deserve great tools.**
Good infrastructure shouldn't require a support ticket to understand. We write clear documentation, design intuitive APIs, and sweat the details that turn frustrating integrations into afternoons well spent.
**Transparency by default.**
Hidden fees, vague status pages, and opaque pricing erode trust. We show our work: real uptime numbers, honest error messages, and pricing that doesn't require a call with sales to decode.
**Security without compromise.**
Cutting corners on security means cutting corners on the people whose money moves through your product. We treat compliance and data protection as a baseline, not a differentiator—because that's what handling financial data demands.
PASS
The output does not use binary contrast patterns such as 'not X — it's Y', 'X isn't the problem, Y is', 'It's not about X, it's about Y', 'X aren't nice-to-haves — they're Y', or 'stops being X and starts being Y'. Check all 4 value descriptions.
None of the four value descriptions use the listed binary contrast patterns ('not X — it's Y', 'X isn't the problem, Y is', etc.); 'as a baseline, not a differentiator' is a different construction and does not match any of the specified patterns.
FAIL
The 4 value descriptions do NOT all follow the same sentence structure. At least 2 of the 4 should differ meaningfully in form — e.g., one might use a question, one might lead with an example, one might be a single long sentence while another uses two short ones. Identical patterns like (title, 2 sentences, punchy closer) repeated 4 times is a failure.
All four values follow an identical structural pattern — bold title followed by exactly two sentences — with no variation such as questions, single long sentences, or example-first openings, which the assertion explicitly flags as a failure.
FAIL
Value descriptions do not end with three parallel items (tricolons) like 'speed, clarity, and trust' or 'we build, we test, we ship'. Two-item or four-item lists are fine; the specific pattern of three parallel items at the end of a value is the problem.
Value 2 ends with three parallel items ('write clear documentation, design intuitive APIs, and sweat the details…') and Value 3 ends with three parallel items ('real uptime numbers, honest error messages, and pricing that doesn't require a call with sales to decode'), both violating the no-tricolon-conclusion rule.
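The binary-contrast assertion in this eval is scored by an LLM judge, but the named patterns are regular enough that a rough regex pre-filter can catch the obvious cases. A sketch, with illustrative patterns only; it will miss paraphrases and reworded contrasts that the judge would still catch:

```python
import re

# Rough heuristic pre-filter for the binary-contrast patterns this
# assertion names. Illustrative only: the actual scoring is done by an
# LLM judge.
CONTRAST_PATTERNS = [
    r"\bnot\b[^.;]{1,60}[—–-]\s*it'?s\b",                 # "not X — it's Y"
    r"\bisn'?t\b[^.;]{1,40},\s*it'?s\b",                  # "isn't X, it's Y"
    r"\bisn'?t\s+the\s+problem\b",                        # "X isn't the problem, Y is"
    r"\bstops?\s+being\b[^.;]{1,40}\bstarts?\s+being\b",  # "stops being X and starts being Y"
]

def has_binary_contrast(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(p, lowered) for p in CONTRAST_PATTERNS)
```

A check like this is only useful as a cheap first pass before the more expensive judge call.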
EVAL_2 · 100% vs 0%
PROMPT
Write a 2-paragraph section for a landing page explaining why your cloud hosting platform is better than the competition. Key differentiators: automatic scaling, pay-per-second billing, and built-in observability. Target audience is developers and CTOs. Tone should be confident and direct. Just write the copy, no questions.
WITH_SKILL · 100% · 3171 tok · 59.5s
OUTPUT (1221 chars)
Your app scales from 10 requests to 10 million without touching a config file. Capacity adjusts in under 30 seconds — new instances spin up, traffic routes automatically, and the moment the spike ends, you scale back down just as fast. Most platforms charge for reserved capacity you'll use twice a year; we bill per second of actual compute, so a function that runs for 400ms costs you 400ms. Teams migrating from fixed-instance providers typically cut monthly infrastructure spend by 40–60% in the first 90 days, without touching their architecture.
Every deployment ships with observability wired in from day one. Request traces, error rates, latency percentiles, and p99 breakdowns appear in your dashboard the moment traffic hits your service — the instrumentation is pre-configured and ready. When something breaks at 2am, you get a flame graph and a structured error log with full context, already correlated and pointing at the exact line of code that failed. Engineers we work with report spending 70% less time on incident diagnosis compared to their previous setup, because the relevant data is available immediately rather than scattered across four separate tools that each require their own query language.
PASS
The output does not use binary contrast patterns such as 'not X — it's Y', 'X isn't just Y — it's Z', 'isn't a nice-to-have, it's a necessity', or 'stop doing X, start doing Y'. Direct claims are fine; the 'not X, but Y' rhetorical setup is the problem.
The output makes direct claims without using 'not X — it's Y' or similar rhetorical binary contrast setups.
PASS
The three differentiators are not presented as three sentences with identical structure (e.g., 'X does A. Y does B. Z does C.' where all three follow the same grammatical pattern). Also checks for 'no X, no Y, no Z' patterns. At least one differentiator should be presented in a structurally different way from the other two.
Only two paragraphs are present in the output, and they differ structurally; no three-sentence parallel tricolon pattern appears.
PASS
The final sentence does not follow a formulaic summary pattern like 'X does A, Y does B, and Z does C' or 'Whether you need A, B, or C, we've got you covered' or a tricolon wrap-up of the three features. Also should not be a binary contrast closer like 'That's not X. That's Y.'
The final sentence is a substantive comparative claim about incident diagnosis time, not a tricolon wrap-up or binary contrast closer.
PASS
Neither paragraph ends with a short (under 10 words) dramatic summary sentence like 'That's the difference.' or 'This changes everything.' or 'You focus on code. We handle the rest.' Paragraph endings should be substantive, not punchy mic-drops.
Both paragraphs end with long, substantive sentences that provide data or context rather than short dramatic mic-drop statements.
BASELINE · 0% · 545 tok · 24.6s
OUTPUT (796 chars)
---
Most cloud platforms make you guess your traffic in advance, charge you for idle servers, and leave you debugging in the dark. We don't. Our infrastructure scales automatically the moment demand spikes—no pre-warming, no capacity planning, no 3am pages because you under-provisioned. You ship, we handle the rest.
You also stop paying for compute you're not using. With per-second billing, your costs track actual usage—not whatever estimate you punched into a calculator last quarter. And when something goes wrong, you won't be stitching together logs from three different tools. Built-in observability gives every team member—from the engineer on-call to the CTO reviewing the post-mortem—a single, complete picture of what happened and why. Less guessing, fewer war rooms, faster fixes.
FAIL
The output does not use binary contrast patterns such as 'not X — it's Y', 'X isn't just Y — it's Z', 'isn't a nice-to-have, it's a necessity', or 'stop doing X, start doing Y'. Direct claims are fine; the 'not X, but Y' rhetorical setup is the problem.
The output uses 'your costs track actual usage—not whatever estimate you punched into a calculator last quarter,' which is a 'Y—not X' binary contrast setup, and 'Most cloud platforms make you… We don't' is a direct binary contrast framing.
FAIL
The three differentiators are not presented as three sentences with identical structure (e.g., 'X does A. Y does B. Z does C.' where all three follow the same grammatical pattern). Also checks for 'no X, no Y, no Z' patterns. At least one differentiator should be presented in a structurally different way from the other two.
The phrase 'no pre-warming, no capacity planning, no 3am pages' is explicitly a 'no X, no Y, no Z' tricolon pattern, which the assertion flags as prohibited.
FAIL
The final sentence does not follow a formulaic summary pattern like 'X does A, Y does B, and Z does C' or 'Whether you need A, B, or C, we've got you covered' or a tricolon wrap-up of the three features. Also should not be a binary contrast closer like 'That's not X. That's Y.'
The final sentence 'Less guessing, fewer war rooms, faster fixes.' is a tricolon wrap-up of three benefit phrases, which matches the formulaic summary closer pattern the assertion prohibits.
FAIL
Neither paragraph ends with a short (under 10 words) dramatic summary sentence like 'That's the difference.' or 'This changes everything.' or 'You focus on code. We handle the rest.' Paragraph endings should be substantive, not punchy mic-drops.
Paragraph one ends with 'You ship, we handle the rest.' (7 words) and paragraph two ends with 'Less guessing, fewer war rooms, faster fixes.' (7 words), both short dramatic mic-drop sentences under 10 words.
EVAL_3 · 75% vs 50%
PROMPT
Write a short blog post (3 paragraphs) about why startups should invest in developer experience early. Target audience: CTOs and engineering leads. Just write the post directly.
WITH_SKILL · 75% · 2347 tok · 43.0s
OUTPUT (1579 chars)
**Developer Experience Debt Is Expensive, and You're Already Paying It**
The cost of a bad developer experience compounds faster than most CTOs expect. When engineers spend 45 minutes getting a dev environment running, or wait 20 minutes for a CI pipeline to report a broken test, that friction accumulates into real weeks of lost output. At a 10-person engineering team, even a 30-minute daily drag per developer equals roughly 65 engineer-days lost annually — time that isn't coming back.
Teams that invest early in local environment setup and fast test feedback loops retain engineers longer than those that skip it. Onboarding a senior engineer costs north of $30K when you factor in recruiting fees and ramp-up time alone. That investment pays for itself the first time a frustrated engineer decides to stay because the work finally feels worth doing, because shipping a feature takes hours instead of days, and because a broken build surfaces in two minutes rather than two hours after the fact.
The real argument for early investment is that building good habits at 8 engineers is far cheaper than retrofitting them at 80. A team of 8 can agree on a deployment process in an afternoon; a team of 80 will spend months debating it while the old broken process stays in production. The workflows you establish in year one — CI discipline, local reproducibility, sensible runbooks, and clean infrastructure abstractions — tend to calcify into the culture, and that calcification works in your favor when you're trying to scale an engineering organization worth replicating.
PASS
The output does not use binary contrast patterns such as 'not X — it's Y', 'X isn't a nice-to-have — it's Y', 'This isn't X. This is Y.', or 'treating X as Y. This is a mistake.' Direct claims without the 'not X' setup are fine.
The output makes direct claims throughout without using any 'not X — it's Y' or similar binary contrast constructions.
PASS
Not all 3 paragraphs end with a short (under 15 words) dramatic summary sentence. At most 1 of the 3 paragraphs may end with a punchy one-liner. The others should end mid-thought, with a longer sentence, or flow into the next idea.
Only paragraph 1 ends with a short punchy line ('time that isn't coming back' — 6 words); paragraphs 2 and 3 both end with longer sentences well over 15 words.
PASS
The output avoids the phrases 'highest-leverage', 'compound returns', 'first-class' (as in 'first-class citizen' or 'first-class product'), 'false economy', 'force multiplier', 'table stakes', and 'nice-to-have'. These are AI-oversampled business terms.
None of the banned phrases — 'highest-leverage', 'compound returns', 'first-class', 'false economy', 'force multiplier', 'table stakes', or 'nice-to-have' — appear in the output.
FAIL
The output does not contain tricolon lists — three parallel items joined with commas and 'and' (e.g., 'ship faster, hire better, and retain longer' or 'slower shipping, higher churn, and talent retention problems'). Two-item lists are fine. Four-item lists are fine. Exactly three parallel items is the AI pattern.
Paragraph 2 contains a tricolon of three parallel 'because' clauses: 'because the work finally feels worth doing, because shipping a feature takes hours instead of days, and because a broken build surfaces in two minutes rather than two hours after the fact.'
BASELINE · 50% · 609 tok · 24.4s
OUTPUT (1951 chars)
## Developer Experience Is a Business Decision, Not a Luxury
Early-stage startups often treat developer experience (DX) as something to optimize later—after product-market fit, after the Series A, after the team grows. This is exactly backwards. The decisions you make in the first 18 months about tooling, local development speed, CI/CD pipelines, and documentation compound relentlessly. A codebase that takes 20 minutes to build locally doesn't just frustrate your five engineers—it quietly erodes their focus, their willingness to experiment, and their ability to ship with confidence. Every friction point you tolerate early becomes load-bearing technical debt that costs ten times more to fix once your team is 30 people deep.
The return on investing in DX early is asymmetric. Fast feedback loops—sub-minute test suites, reliable hot reload, clear error messages—directly translate into faster iteration cycles. When engineers can confidently make a change and know within seconds whether it's correct, they take smarter risks. They refactor when they should instead of leaving things alone out of fear. They onboard faster. You spend less time in debugging purgatory and more time in product conversations. For a startup, where iteration speed is your primary competitive advantage over incumbents, this is not a nice-to-have.
The CTO's job is to make good engineering culture a structural property of the system, not a personal virtue. That means setting standards early: reproducible builds, a single command to get the dev environment running, automated linting and formatting so reviews focus on logic not style, and deployment pipelines engineers actually trust. None of this requires a platform team or months of investment. It requires intentionality and a willingness to protect engineering time now. The startups that wait until they're "big enough" to care about DX are the ones that spend their growth stage paying down the bill.
PASS
The output does not use binary contrast patterns such as 'not X — it's Y', 'X isn't a nice-to-have — it's Y', 'This isn't X. This is Y.', or 'treating X as Y. This is a mistake.' Direct claims without the 'not X' setup are fine.
The output uses 'this is not a nice-to-have' but never completes the binary contrast with '— it's Y', and no other instances of the listed patterns appear.
PASS
Not all 3 paragraphs end with a short (under 15 words) dramatic summary sentence. At most 1 of the 3 paragraphs may end with a punchy one-liner. The others should end mid-thought, with a longer sentence, or flow into the next idea.
All three paragraph-ending sentences are long (22+ words) and none qualify as a short dramatic one-liner under 15 words.
FAIL
The output avoids the phrases 'highest-leverage', 'compound returns', 'first-class' (as in 'first-class citizen' or 'first-class product'), 'false economy', 'force multiplier', 'table stakes', and 'nice-to-have'. These are AI-oversampled business terms.
The phrase 'nice-to-have' appears explicitly in paragraph two: 'this is not a nice-to-have'.
FAIL
The output does not contain tricolon lists — three parallel items joined with commas and 'and' (e.g., 'ship faster, hire better, and retain longer' or 'slower shipping, higher churn, and talent retention problems'). Two-item lists are fine. Four-item lists are fine. Exactly three parallel items is the AI pattern.
Paragraph one contains a tricolon: 'their focus, their willingness to experiment, and their ability to ship with confidence' — three parallel items joined with commas and 'and'.
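The tricolon assertion recurs across this skill's evals: exactly three parallel items joined as 'A, B, and C' fail, while two- or four-item lists pass. A crude structural check for that shape might look like the sketch below. It is illustrative only; asyndetic tricolons such as 'we build, we test, we ship' (no 'and') are not matched, where the LLM judge would catch them:

```python
import re

# Detect exactly-three-item "A, B, and C" lists. Two- and four-item
# lists are allowed by the assertions, so only a count of three fires.
LIST_RE = re.compile(r"((?:[^,.;:]+,\s*){1,5})(?:and|or)\s+[^,.;:]+", re.IGNORECASE)

def tricolons(text: str) -> list[str]:
    hits = []
    for match in LIST_RE.finditer(text):
        items = [part for part in match.group(1).split(",") if part.strip()]
        if len(items) == 2:  # two comma items + the "and X" item = three total
            hits.append(match.group(0).strip())
    return hits
```

Note that comma-joined clause tricolons (like the three 'because' clauses flagged in EVAL_3) also match, since the clauses contain no internal commas.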
geo-optimizer · claude-sonnet-4-6 · Mar 6, 2026
[ ▲ +50% ]
WITH SKILL
83% ██████████░░
BASELINE
33% ████░░░░░░░░
EVALS
2
AVG TOKENS
2658
AVG TIME
60.2s
EVAL_1 · 67% vs 33%
PROMPT
Audit https://example.com for AI search visibility. Check the HTML size, JavaScript dependency, and hedge density. Provide a GEO readiness report with scores for technical visibility and content authority.
WITH_SKILL · 67% · 3878 tok · 80.7s
OUTPUT (2000 chars)
All data collected. Here's the full GEO readiness report:
---
## GEO Readiness Report — example.com
---
### Technical Visibility — Score: 95/100
| Check | Result | Status |
|---|---|---|
| HTML payload size | **528 bytes** (limit: 1MB) | ✅ Excellent |
| JavaScript dependency | **None** — pure static HTML + inline CSS | ✅ No risk |
| Server-side rendering | Full content in raw HTML | ✅ Crawler-safe |
| robots.txt | **404 Not Found** | ⚠️ Missing |
| CDN/delivery | Cloudflare | ✅ Fast |
The page is essentially a perfect technical target for AI crawlers. At 528 bytes it's ~2,000x under the danger threshold, and every byte of content is in the raw HTML — no JavaScript rendering required. ClaudeBot, GPTBot, PerplexityBot, and all others will see exactly what a human sees.
**Only issue:** No `robots.txt` means no explicit crawler directives. Not harmful, but a gap.
---
### Content Authority — Score: 72/100
```
Word Count: 15 words
Hedge Words: 0
Hedge Density: 0.0% → EXCELLENT (target: <0.2%)
```
| Check | Result | Status |
|---|---|---|
| Hedge density | **0.0%** | ✅ Excellent |
| Tone | Declarative, imperative | ✅ Confident |
| Content depth | 2 sentences, 15 words | ❌ Extremely thin |
| Structured data (JSON-LD) | None | ❌ Missing |
| Entity density | Near-zero | ❌ No named entities |
The existing content is assertive and hedge-free — but there's almost nothing there. AI systems need substance to cite. A 15-word page with no named entities, dates, facts, or schema markup offers very little for an LLM to quote authoritatively.
---
### Agent Infrastructure — Score: 10/100
| Check | Result | Status |
|---|---|---|
| `/.well-known/agent-facts` | **404** | ❌ Missing |
| NANDA/AgentFacts schema | None | ❌ Missing |
| Structured data | None | ❌ Missing |
| Sitemap | Not checked (no robots.txt to reference) | ⚠️ Unknown |
No machine-readable metadata for AI agent discovery whatsoever.
---
### Discovery Strategy
example.com is operated by IANA — a decade
PASS
The response discusses HTML payload size and mentions the 1MB threshold for AI crawler compatibility.
The report explicitly discusses HTML payload size (528 bytes) and references the 1MB threshold as the 'danger threshold' and 'limit'.
PASS
The response explains hedge density scoring and identifies hedge words that reduce AI search ranking confidence.
The report explains hedge density scoring with a target of <0.2% and calculates 0.0% hedge density, though it notes zero hedge words were found rather than identifying specific hedge words that reduce confidence.
FAIL
The response distinguishes between rendering crawlers (GPTBot) and non-rendering crawlers (ClaudeBot) and their implications.
The report lists ClaudeBot, GPTBot, and PerplexityBot together without distinguishing between rendering and non-rendering crawlers or explaining their different implications for JavaScript-heavy sites.
BASELINE · 33% · 2857 tok · 80.1s
OUTPUT (2000 chars)
Got what I need. Here's the full audit.
---
## GEO Readiness Report — example.com
*Generative Engine Optimization (GEO) audit for AI search visibility*
---
### Page Overview
| Signal | Finding |
|--------|---------|
| **H1** | "Example Domain" |
| **Body text** | ~250 characters total |
| **Script tags** | None |
| **Structured data (JSON-LD)** | None |
| **Meta description** | None |
| **Author attribution** | None |
| **Date / freshness signal** | None |
| **Outbound link** | 1 (IANA) |
| **robots.txt** | 404 — not found |
---
### Signal Analysis
#### 1. HTML Size — `FAIL`
At ~250 characters of actual content, this page is effectively empty for GEO purposes. AI systems extract passages and answer blocks — there is nothing here to extract. A page needs **at minimum 500–800 words** of structured, topically relevant content to have any chance of being cited.
**Score: 0/10** — Far below the threshold for extractable content.
---
#### 2. JavaScript Dependency — `PASS`
No JavaScript detected. The page renders as pure static HTML, meaning AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) can read it without a JS runtime. This is the only positive signal on the page.
**Score: 10/10** — No JS barrier to crawling.
---
#### 3. Hedge Density — `N/A (but a deeper problem)`
Hedge density measures qualifiers like "may," "could," "might," "some argue" — which reduce AI confidence in a passage and lower citation probability. Example.com has no hedges. It also has no claims of any kind. There is nothing to hedge. The content reads as a system notice, not as an authoritative source on any query.
**Hedge density: 0%** — Not because of good writing; because there is no content.
---
### Scores
#### Technical Visibility — `12 / 100`
| Check | Score | Notes |
|-------|:-----:|-------|
| Crawlable (no JS dependency) | 10/10 | Clean static HTML |
| robots.txt present & AI bots allowed | 0/10 | 404 — unknown bot access |
| Structured data (JSON-LD / sche
FAIL
The response discusses HTML payload size and mentions the 1MB threshold for AI crawler compatibility.
The response discusses HTML payload size and content sufficiency but never mentions the 1MB threshold for AI crawler compatibility — it focuses on word count minimums instead.
PASS
The response explains hedge density scoring and identifies hedge words that reduce AI search ranking confidence.
The response explains hedge density, defines hedge words ('may,' 'could,' 'might,' 'some argue'), and explains how they reduce AI confidence and lower citation probability.
FAIL
The response distinguishes between rendering crawlers (GPTBot) and non-rendering crawlers (ClaudeBot) and their implications.
The response lists GPTBot, ClaudeBot, PerplexityBot, and Google-Extended together as bots that 'can read it without a JS runtime' but does not distinguish between rendering and non-rendering crawlers or explain their different implications.
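Both reports in this eval compute a hedge-density figure against a <0.2% target. As described, the metric reduces to hedge words over total words; a minimal sketch follows, where the hedge word list is illustrative because the audited tool's full list isn't shown in the report:

```python
# Minimal hedge-density calculator matching the report's definition:
# hedge words / total words, expressed as a percentage.
HEDGES = {"may", "could", "might", "perhaps", "possibly", "arguably", "somewhat"}

def hedge_density(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    hedges = sum(1 for w in words if w.strip(".,;:!?'\"") in HEDGES)
    return 100 * hedges / len(words)
```

On a 15-word page with no qualifiers, this returns 0.0%, matching the report; note that a zero score on near-empty content says nothing about authority, as the baseline run correctly points out.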
EVAL_2 · 100% vs 33%
PROMPT
Generate an AgentFacts schema for a SaaS product at myapp.io that supports text and image capabilities with OAuth2 authentication. Follow the NANDA protocol.
WITH_SKILL · 100% · 1438 tok · 39.6s
OUTPUT (1492 chars)
Here's your NANDA-compliant AgentFacts schema:
```json
{
"@context": "https://nanda.dev/ns/agent-facts/v1",
"id": "nanda:myapp.io",
"agent_name": "urn:agent:myapp:io",
"version": "1.0.0",
"description": "SaaS platform providing AI-powered text and image processing capabilities",
"homepage": "https://myapp.io",
"endpoints": {
"static": [
"https://api.myapp.io/v1/agent"
]
},
"capabilities": {
"modalities": [
"text",
"image"
],
"authentication": {
"methods": [
"oauth2"
]
}
},
"trust": {
"certification": "self-attested",
"human_oversight": "true"
},
"metadata": {
"created": "2026-03-06T09:29:17Z",
"modified": "2026-03-06T09:29:17Z",
"ttl": 86400
}
}
```
**Deploy this at:** `https://myapp.io/.well-known/agent-facts`
Key fields to customize before deploying:
| Field | Current Value | Action Needed |
|---|---|---|
| `endpoints.static` | `https://api.myapp.io/v1/agent` | Update to your real API endpoint |
| `trust.certification` | `self-attested` | Upgrade to `verified` once you get third-party certification |
| `metadata.ttl` | `86400` (24h) | Tune based on how often your schema changes |
| `version` | `1.0.0` | Bump when capabilities change |
**Serve it with the correct content type:**
```
Content-Type: application/json
Cache-Control: max-age=86400
```
This makes your SaaS discoverable by AI agents (ChatGPT, Perplexity, Claude) that respect the NANDA protocol.
PASS
The generated schema includes the @context field pointing to nanda.dev namespace.
The schema includes `"@context": "https://nanda.dev/ns/agent-facts/v1"` which points to the nanda.dev namespace.
PASS
The schema specifies both text and image modalities in the capabilities section.
The capabilities section includes `"modalities": ["text", "image"]` specifying both text and image capabilities.
PASS
The response indicates the schema should be placed at /.well-known/agent-facts path.
The response explicitly states 'Deploy this at: `https://myapp.io/.well-known/agent-facts`' indicating the correct path.
BASELINE · 33% · 2417 tok · 42.8s
OUTPUT (2000 chars)
Here's an **AgentFacts** schema for `myapp.io` following the NANDA protocol. This file would be served at `https://myapp.io/.well-known/agent.json`:
```json
{
"agentfacts_version": "1.0.0",
"schema": "https://nanda-registry.com/schemas/agentfacts/v1.0.0",
"identity": {
"id": "did:web:myapp.io",
"name": "MyApp AI Agent",
"description": "A SaaS AI agent supporting text and image capabilities with secure OAuth2 access.",
"version": "1.0.0",
"url": "https://myapp.io",
"logo": "https://myapp.io/assets/logo.png",
"terms_of_service": "https://myapp.io/terms",
"privacy_policy": "https://myapp.io/privacy"
},
"provider": {
"name": "MyApp Inc.",
"url": "https://myapp.io",
"contact": "support@myapp.io"
},
"capabilities": {
"modalities": ["text", "image"],
"input": {
"text": {
"formats": ["plain", "markdown", "html"],
"max_tokens": 32768,
"languages": ["en"]
},
"image": {
"formats": ["jpeg", "png", "webp", "gif"],
"max_size_mb": 20,
"max_resolution": "4096x4096"
}
},
"output": {
"text": {
"formats": ["plain", "markdown"],
"max_tokens": 8192,
"streaming": true
},
"image": {
"formats": ["jpeg", "png", "webp"],
"generation": true,
"editing": true
}
},
"features": [
"streaming",
"multi-turn",
"function-calling",
"vision"
]
},
"api": {
"base_url": "https://api.myapp.io/v1",
"protocol": "https",
"endpoints": {
"chat": {
"path": "/chat/completions",
"method": "POST",
"description": "Send text or image messages and receive completions.",
"openapi": "https://api.myapp.io/v1/openapi.json"
},
"images": {
"path": "/images/generate",
"method": "POST",
"description": "Generate images from a text prompt."
},
"health": {
"path": "/h
FAIL
The generated schema includes the @context field pointing to nanda.dev namespace.
The schema uses 'schema' field pointing to nanda-registry.com, not an '@context' field pointing to a nanda.dev namespace.
PASS
The schema specifies both text and image modalities in the capabilities section.
The schema includes both 'text' and 'image' in capabilities.modalities array with detailed input/output definitions for each.
FAIL
The response indicates the schema should be placed at /.well-known/agent-facts path.
The response specifies the file should be served at '/.well-known/agent.json', not '/.well-known/agent-facts' as required.
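Two of this eval's three assertions reduce to structural checks on the emitted JSON. A sketch that mirrors them, using the field names from the with-skill output above; treat those field names as assumptions taken from that output, not as the canonical NANDA schema, which this report does not reproduce:

```python
def check_agent_facts(schema: dict) -> dict[str, bool]:
    """Mirror the two JSON-level assertions against a candidate schema."""
    modalities = schema.get("capabilities", {}).get("modalities", [])
    return {
        # Assertion 1: @context points at the nanda.dev namespace
        "nanda_context": "nanda.dev" in str(schema.get("@context", "")),
        # Assertion 2: both text and image modalities are declared
        "text_and_image": {"text", "image"} <= set(modalities),
    }
```

The third assertion (the /.well-known/agent-facts deployment path) is about the surrounding prose rather than the JSON, so it isn't checked here.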
code-audit-scripts · claude-sonnet-4-6 · Mar 6, 2026
[ ▲ +17% ]
WITH SKILL
58% ███████░░░░░
BASELINE
42% █████░░░░░░░
EVALS
3
AVG TOKENS
1926
AVG TIME
44.5s
EVAL_1 · 25% vs 0%
PROMPT
Audit the code at /Users/satchmo/code/prompts/benchmarks/fixtures/sample-project for security issues and ship-readiness. Which findings are ship-blockers that must be fixed before deploying, and which are nice-to-fix? Categorize everything by priority.
WITH_SKILL · 25% · 3341 tok · 67.2s
OUTPUT (2000 chars)
Here's the full audit report.
---
## Security Audit: `sample-project`
**23 total findings** · 3 secrets · 9 debug artifacts · 11 TODOs/FIXMEs
---
## 🚨 Ship-Blockers — Fix Before Deploying
### 1. Hardcoded credentials in source (CRITICAL)
**`src/config.ts:9-11`**
```ts
API_KEY: "sk-proj-abc123def456ghi789jkl012mno345pqr678stu901vwx234",
DATABASE_URL: "postgresql://admin:s3cretPassw0rd@db.example.com:5432/myapp",
JWT_SECRET: "super-secret-jwt-signing-key-do-not-share-2024",
```
Three real-looking credentials committed to source. **Assume all three are compromised.** Rotate immediately, move to `process.env.*`.
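An editorial sketch of the env-var remediation suggested above (variable names assumed to mirror the config keys; not part of the audited project):

```typescript
// Read secrets from the environment instead of hardcoding them in source.
// Failing loudly on a missing variable surfaces misconfiguration at startup
// rather than as an undefined value deep in a request handler.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Build the config lazily so tests and tooling can set the environment first.
function loadConfig() {
  return {
    API_KEY: requireEnv("API_KEY"),
    DATABASE_URL: requireEnv("DATABASE_URL"),
    JWT_SECRET: requireEnv("JWT_SECRET"),
  };
}
```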
---
### 2. Config dump logged to stdout (CRITICAL)
**`src/server.ts:10`**
```ts
console.log("Debug: config loaded", JSON.stringify(config));
```
`config` contains all three hardcoded credentials above. This logs them to stdout on every server start — anyone with log access gets the keys. Remove this line.
---
### 3. `hashPassword` is not a hash (CRITICAL)
**`src/utils.ts:25-28`**
```ts
export function hashPassword(password: string): string {
return Buffer.from(password).toString("base64");
}
```
Base64 is **reversible encoding**, not a hash. If the DB is breached, every password is trivially decoded. Replace with bcrypt/argon2/scrypt.
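One stdlib-only replacement, sketched here with Node's built-in scrypt (bcrypt or argon2 via a library are equally valid; the salt size and key length are illustrative choices, not the project's):

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "crypto";

// Stored value = saltHex:hashHex. A fresh random salt per password defeats
// rainbow tables; scrypt is deliberately expensive to brute-force.
function hashPassword(password: string): string {
  const salt = randomBytes(16);
  const derived = scryptSync(password, salt, 64);
  return `${salt.toString("hex")}:${derived.toString("hex")}`;
}

function verifyPassword(password: string, stored: string): boolean {
  const [saltHex, hashHex] = stored.split(":");
  const derived = scryptSync(password, Buffer.from(saltHex, "hex"), 64);
  // timingSafeEqual avoids leaking match progress through comparison timing.
  return timingSafeEqual(derived, Buffer.from(hashHex, "hex"));
}
```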
---
### 4. JWT tokens are not signed or verified (CRITICAL)
**`src/auth.ts:18-30`**
```ts
export function generateToken(user: User): string {
return JSON.stringify(payload); // no signature
}
export function verifyToken(token: string) {
return JSON.parse(token); // no verification
}
```
`JWT_SECRET` is referenced in a comment but never used. Tokens are plain JSON — completely forgeable. Any caller can craft a valid-looking token. Use a real JWT library.
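For illustration of the missing signing step, a minimal HMAC-signed token can be sketched with Node's crypto module. This is not a full JWT (no header, expiry, or algorithm negotiation) and a real library such as jsonwebtoken remains the right fix:

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// token = base64url(payload) + "." + HMAC-SHA256(payload, secret).
// Without the signature half, any caller could forge the payload.
function generateToken(payload: object, secret: string): string {
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const sig = createHmac("sha256", secret).update(body).digest("base64url");
  return `${body}.${sig}`;
}

function verifyToken(token: string, secret: string): object | null {
  const [body, sig] = token.split(".");
  const expected = createHmac("sha256", secret).update(body).digest("base64url");
  // Reject outright if the signature does not match the recomputed HMAC.
  if (!sig || sig.length !== expected.length ||
      !timingSafeEqual(Buffer.from(sig), Buffer.from(expected))) {
    return null;
  }
  return JSON.parse(Buffer.from(body, "base64url").toString());
}
```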
---
### 5. `debugger` statement in production code (HIGH)
**`src/utils.ts:7`**
```ts
debugger; // Left from debugging session
```
Called on every request via `validateInput`. Will halt execution in any debug-attached Node.js process
FAIL
The response identifies exactly 5 high-priority/ship-blocker TODO-type items (the FIXME, BUG, HACK, and XXX tags), distinguishing them from the 5 regular TODO items. The specific count of 5 high-priority items must be stated or clearly derivable from the listing.
The response does not identify exactly 5 high-priority FIXME/HACK/XXX/BUG items; it places XXX and FIXME in ship-blockers but demotes HACK to 'Low' in the nice-to-fix table, and the total count of such tag-based items is not stated or clearly derivable as 5.
FAIL
The response uses a two-tier or multi-tier severity system (like high/low, or critical/warning/info) where FIXME, BUG, HACK, and XXX are in the highest tier and plain TODO is in a lower tier. Not just a flat list of all TODO-like comments treated equally.
While a two-tier system (Ship-Blockers vs Nice-to-Fix) exists, the HACK tag is placed in the lower tier rather than the highest tier, violating the assertion that FIXME, BUG, HACK, and XXX must all be in the highest severity tier.
FAIL
The response explicitly distinguishes between real hardcoded secrets (API_KEY with 'sk-proj-...' value, JWT_SECRET with 'super-secret-...' value) and env var references (STRIPE_KEY = process.env.STRIPE_SK, REDIS_URL = process.env.REDIS_URL) — noting the env var references are NOT secrets. Both the flagging of real secrets AND the clearing of env var references must be present.
The response flags the real hardcoded secrets correctly but never explicitly mentions or clears the env var references (STRIPE_KEY = process.env.STRIPE_SK, REDIS_URL = process.env.REDIS_URL) as non-secrets; both halves of the assertion must be present.
PASS
The response correctly identifies that the DATABASE_URL finding at src/server.ts:13 is a reference to config.DATABASE_URL (not a new hardcoded secret) and treats it differently from the actual hardcoded value in config.ts. The response should NOT count this as an additional hardcoded credential.
The response counts exactly three hardcoded credentials all located in config.ts and does not separately flag src/server.ts:13 as an additional hardcoded secret, satisfying the requirement to not double-count the config reference.
BASELINE0% 5055tok 98.2s
VIEW_OUTPUT (2000 chars)
I've read all 10 files. Here's my manual pre-analysis plus the proposed Semgrep scan plan.
---
## Pre-Analysis: Manual Findings
I found significant issues already. Here's my categorized assessment before Semgrep runs:
---
### SHIP-BLOCKERS
**1. Hardcoded secrets in source code — `src/config.ts:9-11`**
```ts
API_KEY: "sk-proj-abc123def456ghi789jkl012mno345pqr678stu901vwx234",
DATABASE_URL: "postgresql://admin:s3cretPassw0rd@db.example.com:5432/myapp",
JWT_SECRET: "super-secret-jwt-signing-key-do-not-share-2024",
```
The code itself has a `FIXME` comment acknowledging this. All three must be rotated and moved to env vars before any commit hits a public repo, let alone production.
**2. JWT is fake — `src/auth.ts:18-29`**
`generateToken` is `JSON.stringify(payload)`. `verifyToken` is `JSON.parse(token)`. No signing, no verification. Any attacker can forge a token with any `sub`/`email`. The entire auth system is broken.
**3. Password "hashing" is base64 — `src/utils.ts:28`**
```ts
return Buffer.from(password).toString("base64");
```
Base64 is trivially reversible. This is not hashing. Passwords are stored in effectively plaintext. Use bcrypt/argon2/scrypt.
**4. Config dumped to logs — `src/server.ts:10`**
```ts
console.log("Debug: config loaded", JSON.stringify(config));
```
This prints the full config — including all hardcoded secrets — to stdout on every server start.
**5. `debugger` statement in production code — `src/utils.ts:7`**
Left from a debugging session. Will pause execution in Node inspector/debug mode.
---
### HIGH — Should Fix Before Ship
**6. No rate limiting — `src/server.ts:35`**
Acknowledged via TODO. Without rate limiting, auth endpoints are wide open to brute force.
**7. Request size not validated — `cmd/main.go:31`**
FIXME comment confirms it. The `dataHandler` reads the body with no size cap — trivial DoS vector.
**8. Known race condition — `src/server.ts:25`**
XXX comment: "race condition under heavy load." Unaddressed concurrency b
FAIL
The response identifies exactly 5 high-priority/ship-blocker TODO-type items (the FIXME, BUG, HACK, and XXX tags), distinguishing them from the 5 regular TODO items. The specific count of 5 high-priority items must be stated or clearly derivable from the listing.
The output lists 5 SHIP-BLOCKERS but these are general security issues (hardcoded secrets, broken JWT, etc.), not specifically a count of the 5 FIXME/BUG/HACK/XXX tag instances — the response never states or derives that there are exactly 5 such tagged comments as distinct from TODO comments.
FAIL
The response uses a two-tier or multi-tier severity system (like high/low, or critical/warning/info) where FIXME, BUG, HACK, and XXX are in the highest tier and plain TODO is in a lower tier. Not just a flat list of all TODO-like comments treated equally.
While the output uses multiple tiers (SHIP-BLOCKERS, HIGH, MEDIUM, LOW), plain TODO items (rate limiting, graceful shutdown) are placed in HIGH tier alongside XXX/FIXME items, so TODOs are not consistently in a lower tier than FIXME/BUG/HACK/XXX as required.
FAIL
The response explicitly distinguishes between real hardcoded secrets (API_KEY with 'sk-proj-...' value, JWT_SECRET with 'super-secret-...' value) and env var references (STRIPE_KEY = process.env.STRIPE_SK, REDIS_URL = process.env.REDIS_URL) — noting the env var references are NOT secrets. Both the flagging of real secrets AND the clearing of env var references must be present.
The output correctly flags real hardcoded secrets in config.ts but never mentions STRIPE_KEY = process.env.STRIPE_SK or REDIS_URL = process.env.REDIS_URL as env var references that are NOT secrets — the clearing of false positives is entirely absent.
FAIL
The response correctly identifies that the DATABASE_URL finding at src/server.ts:13 is a reference to config.DATABASE_URL (not a new hardcoded secret) and treats it differently from the actual hardcoded value in config.ts. The response should NOT count this as an additional hardcoded credential.
The output never addresses src/server.ts:13 as a reference to config.DATABASE_URL rather than a new hardcoded credential; it only mentions server.ts:10 (the console.log) and treats config.ts as the sole source of the DATABASE_URL secret without distinguishing the reference.
EVAL_2Scan /Users/satchmo/code/prompts/benchmarks/fixtures/sample-project for leftover debug statements. How many are there? G…50% vs 25%▸
PROMPT
Scan /Users/satchmo/code/prompts/benchmarks/fixtures/sample-project for leftover debug statements. How many are there? Give me an exact count and list them all.
WITH_SKILL50% 993tok 31.0s
VIEW_OUTPUT (1252 chars)
**12 debug artifacts total.**
### Source files (`src/`, `cmd/`) — 9 findings
| File | Line | Statement |
|------|------|-----------|
| `src/utils.ts` | 7 | `debugger;` — leftover from a debugging session |
| `src/server.ts` | 9 | `console.log("Server starting on port", config.port)` |
| `src/server.ts` | 10 | `console.log("Debug: config loaded", JSON.stringify(config))` |
| `src/server.ts` | 17 | `console.log("Connection pool created")` |
| `src/server.ts` | 28 | `console.debug("Processing request:", req.id)` |
| `src/server.ts` | 31 | `console.warn("Invalid input received:", input.errors)` |
| `cmd/main.go` | 15 | `fmt.Println("Starting Go service on port", port)` |
| `cmd/main.go` | 16 | `fmt.Printf("Debug: environment = %s\n", ...)` |
| `cmd/main.go` | 22 | `fmt.Println("Server ready")` |
### Test files — 3 findings
| File | Line | Statement |
|------|------|-----------|
| `test/server.test.ts` | 10 | `console.log("Test result:", result)` |
| `test/server.test.ts` | 17 | `console.log("Test result:", result)` |
| `test/server.test.ts` | 22 | `console.debug("Running edge case tests")` |
The `debugger;` statement in `src/utils.ts:7` is the highest-priority fix — it will pause execution in any environment with devtools attached.
FAIL
The response states exactly 9 debug artifacts as the count. If it says 12, it included test files which should be excluded from production audit. The number must be 9, not 12 or any other number.
The response states '12 debug artifacts total' as the headline count, not 9; it includes the 3 test-file findings in the total even though it separately categorizes them.
FAIL
The response does NOT list any findings from test/server.test.ts in the debug artifact results. Test file console.log statements must be excluded or explicitly noted as excluded.
The response explicitly lists three findings from test/server.test.ts in a 'Test files — 3 findings' table rather than excluding them from the results.
PASS
The response specifically identifies the 'debugger' statement in src/utils.ts:7 as a finding and flags it as particularly important (it's a breakpoint that would pause execution in debug mode, not just a log statement).
The response identifies the debugger statement at src/utils.ts:7 and flags it as the highest-priority fix, noting it will pause execution in any environment with devtools attached.
PASS
The response identifies Go debug artifacts (fmt.Println and fmt.Printf in cmd/main.go) in addition to JavaScript/TypeScript ones, showing multi-language scanning capability.
The response correctly identifies both fmt.Println and fmt.Printf calls in cmd/main.go as Go debug artifacts alongside the TypeScript findings.
BASELINE25% 1188tok 26.1s
VIEW_OUTPUT (1188 chars)
**Total: 13 debug statements**
---
### `src/server.ts` — 5 statements
| Line | Statement |
|------|-----------|
| 9 | `console.log("Server starting on port", config.port)` |
| 10 | `console.log("Debug: config loaded", JSON.stringify(config))` |
| 17 | `console.log("Connection pool created")` |
| 28 | `console.debug("Processing request:", req.id)` |
| 31 | `console.warn("Invalid input received:", input.errors)` |
### `src/utils.ts` — 1 statement
| Line | Statement |
|------|-----------|
| 7 | `debugger; // Left from debugging session` |
### `cmd/main.go` — 3 statements
| Line | Statement |
|------|-----------|
| 15 | `fmt.Println("Starting Go service on port", port)` |
| 16 | `fmt.Printf("Debug: environment = %s\n", os.Getenv("GO_ENV"))` |
| 22 | `fmt.Println("Server ready")` |
### `test/server.test.ts` — 3 statements
| Line | Statement |
|------|-----------|
| 10 | `console.log("Test result:", result)` |
| 17 | `console.log("Test result:", result)` |
| 22 | `console.debug("Running edge case tests")` |
---
**Note on the test file:** The 3 hits in `test/server.test.ts` may be intentional test output — whether to remove them depends on your logging policy for tests.
FAIL
The response states exactly 9 debug artifacts as the count. If it says 12, it included test files which should be excluded from production audit. The number must be 9, not 12 or any other number.
The response states a total of 13 debug statements, not 9; it failed to exclude test files and also miscounted production artifacts.
FAIL
The response does NOT list any findings from test/server.test.ts in the debug artifact results. Test file console.log statements must be excluded or explicitly noted as excluded.
The response explicitly lists 3 findings from test/server.test.ts rather than excluding them from the production audit results.
FAIL
The response specifically identifies the 'debugger' statement in src/utils.ts:7 as a finding and flags it as particularly important (it's a breakpoint that would pause execution in debug mode, not just a log statement).
The response lists the debugger statement in src/utils.ts:7 in a table but does not flag it as particularly important or note that it would pause execution in debug mode.
PASS
The response identifies Go debug artifacts (fmt.Println and fmt.Printf in cmd/main.go) in addition to JavaScript/TypeScript ones, showing multi-language scanning capability.
The response correctly identifies all three Go debug artifacts (fmt.Println at lines 15 and 22, fmt.Printf at line 16) in cmd/main.go.
EVAL_3Run a comprehensive code audit on /Users/satchmo/code/prompts/benchmarks/fixtures/sample-project. I need the exact break…100% vs 100%▸
PROMPT
Run a comprehensive code audit on /Users/satchmo/code/prompts/benchmarks/fixtures/sample-project. I need the exact breakdown: how many secrets, how many debug artifacts, how many TODOs, and how many of those TODOs are high-priority (FIXME, BUG, HACK, XXX) vs low-priority (plain TODO)? Give me a category-by-category breakdown with exact counts.
WITH_SKILL100% 1445tok 35.3s
VIEW_OUTPUT (2000 chars)
Here's the full breakdown:
---
## Audit Results: `sample-project`
**Total findings: 23**
---
### Secrets — 3
| File | Line | Pattern |
|------|------|---------|
| `src/config.ts` | 9 | `API_KEY` — hardcoded `sk-proj-abc123...` value |
| `src/config.ts` | 11 | `JWT_SECRET` — hardcoded signing key |
| `src/server.ts` | 13 | `DATABASE_URL` — connection string reference |
> The first two are real hardcoded values and need to be rotated + moved to env vars immediately. The `DATABASE_URL` in `server.ts` is a reference (lower risk) but flagged by the pattern scanner.
---
### Debug Artifacts — 9
| File | Count | Types |
|------|-------|-------|
| `src/server.ts` | 5 | `console.log`, `console.debug`, `console.warn` |
| `cmd/main.go` | 3 | `fmt.Println`, `fmt.Printf` |
| `src/utils.ts` | 1 | `debugger` statement |
> The `debugger` statement in `utils.ts:7` is the most critical — it will pause execution in any JS debugger.
---
### TODOs — 11 total
**High-priority (FIXME / BUG / HACK / XXX) — 5**
| File | Line | Tag |
|------|------|-----|
| `src/config.ts` | 8 | `FIXME` — hardcoded credentials, must rotate |
| `src/utils.ts` | 3 | `BUG` — regex doesn't handle unicode |
| `src/server.ts` | 4 | `HACK` — temporary connection pooling workaround |
| `src/server.ts` | 25 | `XXX` — race condition under heavy load |
| `cmd/main.go` | 31 | `FIXME` — handler doesn't validate request size |
**Low-priority (plain TODO) — 5**
| File | Line | Note |
|------|------|------|
| `src/config.ts` | 2 | Move to env vars before prod |
| `src/utils.ts` | 17 | Add phone number validation |
| `src/server.ts` | 35 | Add rate limiting |
| `cmd/main.go` | 21 | Add graceful shutdown |
| `scripts/deploy.sh` | 11 | Replace with proper deployment tool |
**Other — 1** (`DEPRECATED` tag in `scripts/deploy.sh:3`)
---
### Summary
| Category | Count |
|----------|-------|
| Secrets | **3** |
| Debug artifacts | **9** |
| TODOs (total) | **11** |
| → High-priority | **5** |
| → Low-priority | *
PASS
The response breaks down the 11 TODO-type findings into sub-categories: exactly 3 FIXME/BUG (high), 2 HACK/XXX (high), 5 TODO (low), and 1 DEPRECATED (other/medium). At minimum, the high-priority vs low-priority split must be present with correct counts (5 high, 5 low TODO, 1 other).
The response breaks down the 11 TODOs into exactly 5 high-priority (3 FIXME/BUG + 2 HACK/XXX) and 5 low-priority TODO and 1 DEPRECATED other, matching the required subcategory counts.
PASS
The response states the total number of findings across all categories is 23 (3 secrets + 9 debug + 11 TODOs). The exact number 23 must appear.
The response explicitly states 'Total findings: 23' and the summary table confirms 3+9+11=23.
PASS
The debug artifact count is 9 (not 12). The response did not count the 3 console.log/debug statements in test/server.test.ts as debug artifacts.
The debug artifact count is exactly 9, not 12, indicating test files were correctly excluded from the count.
PASS
The response identifies the DEPRECATED tag in scripts/deploy.sh as a finding and categorizes it separately from FIXME/BUG (high) and TODO (low) — either as medium priority, 'other', or a distinct category.
The DEPRECATED tag in scripts/deploy.sh is identified and categorized separately as 'Other' distinct from both high-priority FIXME/BUG/HACK/XXX and low-priority TODO.
BASELINE100% 5578tok 89.1s
VIEW_OUTPUT (2000 chars)
Here is the comprehensive audit breakdown:
---
## Code Audit: `sample-project`
### Secrets — **3**
All in `src/config.ts`, all hardcoded values (not env var references):
| # | File | Line | Finding |
|---|------|------|---------|
| 1 | `src/config.ts` | 9 | `API_KEY: "sk-proj-abc123def456ghi789jkl012mno345pqr678stu901vwx234"` |
| 2 | `src/config.ts` | 10 | `DATABASE_URL: "postgresql://admin:s3cretPassw0rd@db.example.com:5432/myapp"` |
| 3 | `src/config.ts` | 11 | `JWT_SECRET: "super-secret-jwt-signing-key-do-not-share-2024"` |
**Not counted:** `STRIPE_KEY = process.env.STRIPE_SK` and `REDIS_URL = process.env.REDIS_URL` are env var references, not secrets. `.env.example` is a template file. `server.ts:13` references `config.DATABASE_URL` — not a new hardcoded credential.
---
### Debug Artifacts — **9** (test files excluded)
| # | File | Line | Type |
|---|------|------|------|
| 1 | `src/server.ts` | 9 | `console.log("Server starting on port", ...)` |
| 2 | `src/server.ts` | 10 | `console.log("Debug: config loaded", JSON.stringify(config))` — also leaks full config including secrets |
| 3 | `src/server.ts` | 17 | `console.log("Connection pool created")` |
| 4 | `src/server.ts` | 29 | `console.debug("Processing request:", req.id)` |
| 5 | `src/server.ts` | 31 | `console.warn("Invalid input received:", ...)` |
| 6 | `src/utils.ts` | 7 | `debugger;` — breakpoint left in production code |
| 7 | `cmd/main.go` | 15 | `fmt.Println("Starting Go service on port", ...)` |
| 8 | `cmd/main.go` | 16 | `fmt.Printf("Debug: environment = %s\n", ...)` |
| 9 | `cmd/main.go` | 22 | `fmt.Println("Server ready")` |
**Excluded:** 3 `console.log`/`console.debug` statements in `test/server.test.ts` — test file output is intentional.
---
### TODOs — **11 total**
#### High-priority FIXME / BUG — **3**
| # | File | Line | Tag | Comment |
|---|------|------|-----|---------|
| 1 | `src/config.ts` | 8 | `FIXME` | Hardcoded credentials — must rotate before launch |
| 2 | `src/utils.t
PASS
The response breaks down the 11 TODO-type findings into sub-categories: exactly 3 FIXME/BUG (high), 2 HACK/XXX (high), 5 TODO (low), and 1 DEPRECATED (other/medium). At minimum, the high-priority vs low-priority split must be present with correct counts (5 high, 5 low TODO, 1 other).
The response explicitly breaks down 11 TODOs into 3 FIXME/BUG (high), 2 HACK/XXX (high), 5 plain TODO (low), and 1 DEPRECATED (medium/other), matching all required counts.
PASS
The response states the total number of findings across all categories is 23 (3 secrets + 9 debug + 11 TODOs). The exact number 23 must appear.
The response explicitly states 'Grand Total: 23' and shows a summary table confirming the total is 23.
PASS
The debug artifact count is 9 (not 12). The response did not count the 3 console.log/debug statements in test/server.test.ts as debug artifacts.
The response counts exactly 9 debug artifacts and explicitly states that 3 console.log/console.debug statements in test/server.test.ts were excluded as intentional test output.
PASS
The response identifies the DEPRECATED tag in scripts/deploy.sh as a finding and categorizes it separately from FIXME/BUG (high) and TODO (low) — either as medium priority, 'other', or a distinct category.
The response identifies the DEPRECATED tag in scripts/deploy.sh line 3 as a distinct finding categorized separately under 'Medium / Other', separate from FIXME/BUG and plain TODO categories.
RUN_BY: satchmo