Document stats & auditing

analyzeDocument and contentbit stats — outlines, block usage, link domains, and validation summaries as JSON.

contentbit doctor is the ranked audit command: validation, links, and content health in one repair plan. contentbit stats is the raw read tool next to it: it analyzes documents and prints JSON to stdout, exiting 0 whenever at least one file matched, even when documents have validation errors. Use stats when you need measurements to feed a custom report, dashboard, or LLM prompt.

Try it

Edit the source — the report recomputes as you type. Notice the per-section word counts: that's the thin-section detector.

Sample output from the demo (contentbit stats, exit 0):

37 words · 1 min read · ~92 tokens · 19 lines · ✓ valid

outline: Getting started (20 words), Deployment (17 words)
blocks: callout ×1, steps ×1
links: 1 external / 0 internal (contentbit.dev)

The CLI

Point stats at files or globs (quote globs so your shell doesn't expand them):

contentbit stats "content/**/*.md" --registry ./blocks/registry.ts

If your registry owns the full block set and includes names from the generic pack, add --no-generic-blocks.

One matched file prints a single stats object; multiple files print an array. Validation runs against the same registry validate uses and lands in a validation summary per file — skip it with --no-validate. No files matched is the only failure mode (exit 2).
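Because the output shape depends on how many files matched, a script that consumes it may want to normalize first. The following is a minimal sketch, not part of contentbit: `parseStatsOutput` and the `FileStats` type are hypothetical names, and the fields are only the ones visible in the trimmed report on this page.

```typescript
// Assumed shape, limited to fields shown in the trimmed report; the real output carries more.
interface FileStats {
  file: { path: string; lines: number };
  length: { words: number; readingMinutes: number; approxTokens: number };
}

// Hypothetical helper: accept either a single stats object or an array of them
// and always hand back an array.
function parseStatsOutput(stdout: string): FileStats[] {
  const parsed: FileStats | FileStats[] = JSON.parse(stdout);
  return Array.isArray(parsed) ? parsed : [parsed];
}

// One matched file: a bare object.
const single = parseStatsOutput(
  '{"file":{"path":"content/a.md","lines":19},"length":{"words":37,"readingMinutes":1,"approxTokens":92}}'
);

// Several matched files: an array.
const many = parseStatsOutput(
  '[{"file":{"path":"content/a.md","lines":19},"length":{"words":37,"readingMinutes":1,"approxTokens":92}},' +
    '{"file":{"path":"content/b.md","lines":102},"length":{"words":601,"readingMinutes":4,"approxTokens":1164}}]'
);

console.log(single.length, many.length);
```

With the output normalized this way, downstream code can iterate without checking how many files the glob happened to match.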

A report, trimmed to the interesting fields:

{
  "file": { "path": "content/getting-started.md", "lines": 102 },
  "length": { "words": 601, "readingMinutes": 4, "approxTokens": 1164 },
  "outline": [
    { "level": 2, "text": "Getting started", "line": 20, "words": 127 },
    { "level": 2, "text": "Deployment", "line": 37, "words": 2 }
  ],
  "blocks": { "total": 7, "byName": { "callout": 1, "steps": 1 }, "maxDepth": 2 },
  "links": { "total": 3, "external": 1, "internal": 2, "domains": ["contentbit.dev"] },
  "validation": { "errors": 0, "warnings": 0 }
}

What's in the report

outline — every ATX heading with its level, source line, and the prose word count from that heading until the next one. A section with 2 words is a stub; the outline finds it instantly.
blocks — total count, byName usage, nesting depth, and per-instance source lines. An empty byName on a long page usually means missed structure: prose that wants to be steps, a comparison, or an FAQ.
links — external/internal split plus the deduplicated external domains list. Good for spotting link rot targets and unexpected domains. This is inline Markdown link analysis. For the frontmatter-authored internal graph with slug, linksTo, aliases, and backlinks, use contentbit links.
length — prose words only (frontmatter, code, and markup syntax are excluded), reading minutes, and a rough token estimate.
frontmatter, images (with missingAlt), code, and structure (lists, tables, blockquotes) round out the picture.
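The stub check described under outline can be sketched in a few lines. This is an illustration under stated assumptions: `findThinSections` is a hypothetical helper, and the 25-word threshold is an arbitrary choice, not a contentbit default.

```typescript
// Assumed shape of one outline entry, based on the trimmed report above.
interface OutlineEntry {
  level: number;
  text: string;
  line: number;
  words: number;
}

// Hypothetical helper: return sections whose prose word count falls below minWords.
// The default of 25 is illustrative only.
function findThinSections(outline: OutlineEntry[], minWords = 25): OutlineEntry[] {
  return outline.filter((section) => section.words < minWords);
}

// Outline values copied from the sample report.
const outline: OutlineEntry[] = [
  { level: 2, text: "Getting started", line: 20, words: 127 },
  { level: 2, text: "Deployment", line: 37, words: 2 },
];

for (const s of findThinSections(outline)) {
  console.log(`line ${s.line}: "${s.text}" has ${s.words} words`);
}
```

Tune the threshold to your content: a changelog entry and a tutorial section have very different definitions of thin.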

The library API

The same analysis is one import, with no filesystem access and nothing environment-specific:

import { analyzeDocument } from '@contentbit/core'

const stats = analyzeDocument(source, { path: 'content/post.md' })
stats.outline // [{ level, text, line, words }, ...]

analyzeDocument takes the raw source string and needs no registry. The CLI adds the validation summary by running the validate pipeline alongside it — do the same if you need both:

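import { compileDocument } from '@contentbit/core'

// source: the same raw string passed to analyzeDocument
// registry: the block registry your validate step uses
const result = compileDocument(source, registry)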

The live demo above is exactly this — analyzeDocument plus validateDocument, running in your browser.

Auditing with an LLM agent

The contentbit-audit skill installed by contentbit agents starts with contentbit doctor for a ranked report, then uses stats when raw metrics are useful. Doctor's priority order is worth stealing for your own tooling:

Validation errors and warnings — broken content ships broken pages.
Thin documents — outline sections with very low word counts.
Block-less documents — empty blocks.byName where sibling documents use blocks.
Missing or inconsistent frontmatter compared to siblings.
Structural imbalance — skipped heading levels, single-section walls of text.

Ask your LLM agent to "audit my content" and this is the loop it runs.
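If you want to borrow that priority order for your own tooling, here is a rough sketch over stats output. It is not doctor's implementation: `rankFindings`, the `Finding` type, and the thresholds are all hypothetical, the report type is an assumed subset of the fields shown earlier, and the frontmatter comparison (priority 4) is left out for brevity.

```typescript
// Assumed subset of the stats report; field names follow the trimmed report on this page.
interface StatsReport {
  file: { path: string };
  outline: { level: number; text: string; words: number }[];
  blocks: { total: number };
  validation: { errors: number; warnings: number };
}

interface Finding {
  priority: number;
  path: string;
  message: string;
}

// Hypothetical ranking pass. thinWords is an illustrative threshold, not a contentbit default.
function rankFindings(reports: StatsReport[], thinWords = 25): Finding[] {
  const findings: Finding[] = [];
  const siblingsUseBlocks = reports.some((r) => r.blocks.total > 0);

  for (const r of reports) {
    const path = r.file.path;
    const { errors, warnings } = r.validation;

    // 1. Validation errors and warnings.
    if (errors + warnings > 0) {
      findings.push({ priority: 1, path, message: `${errors} errors, ${warnings} warnings` });
    }

    let prevLevel: number | undefined;
    for (const s of r.outline) {
      // 2. Thin sections.
      if (s.words < thinWords) {
        findings.push({ priority: 2, path, message: `thin section "${s.text}" (${s.words} words)` });
      }
      // 5. Skipped heading level: more than one level deeper than the previous heading.
      if (prevLevel !== undefined && s.level > prevLevel + 1) {
        findings.push({ priority: 5, path, message: `heading level skipped at "${s.text}"` });
      }
      prevLevel = s.level;
    }

    // 3. Block-less document while at least one sibling uses blocks.
    if (r.blocks.total === 0 && siblingsUseBlocks) {
      findings.push({ priority: 3, path, message: "no blocks, unlike sibling documents" });
    }
  }

  // Array.prototype.sort is stable, so equal priorities keep discovery order.
  return findings.sort((a, b) => a.priority - b.priority);
}

// Made-up sample data for illustration.
const sampleReports: StatsReport[] = [
  {
    file: { path: "content/a.md" },
    outline: [
      { level: 2, text: "Intro", words: 120 },
      { level: 4, text: "Deep", words: 40 },
    ],
    blocks: { total: 2 },
    validation: { errors: 0, warnings: 0 },
  },
  {
    file: { path: "content/b.md" },
    outline: [{ level: 2, text: "Deployment", words: 2 }],
    blocks: { total: 0 },
    validation: { errors: 1, warnings: 0 },
  },
];

for (const f of rankFindings(sampleReports)) {
  console.log(`[${f.priority}] ${f.path}: ${f.message}`);
}
```

Keeping each check as a flat finding with a numeric priority makes the result easy to sort, filter, or paste into a prompt.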

Recipes

JSON to stdout means jq does the rest. These recipes assume the glob matches more than one file, so the output is an array:

# Rank files by word count, shortest first
contentbit stats "content/**/*.md" | jq 'sort_by(.length.words) | .[] | {path: .file.path, words: .length.words}'

# Files with no blocks at all
contentbit stats "content/**/*.md" | jq '.[] | select(.blocks.total == 0) | .file.path'

# Every external domain you link to
contentbit stats "content/**/*.md" | jq '[.[].links.domains[]] | unique'