The Understanding Group — Internal
Step 2

Content Enrichment

Classify documents by persona, concepts, pain points, and format
Output: data.json
Requires: scikit-learn
Status: Working

What It Does

The Enrichment skill assigns structured metadata to each document in a corpus. For every document, it determines who the content is for, what problems it addresses, what key ideas it covers, and what kind of content it is. It also computes related articles using TF-IDF cosine similarity.
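
As an illustration of the related-articles step, here is a minimal sketch using scikit-learn's TfidfVectorizer and cosine_similarity. The function name related_articles, the top_n parameter, and the dict-of-text input shape are assumptions made for this example; the actual enrich_corpus.py script may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def related_articles(docs, top_n=3):
    """Rank the other documents by TF-IDF cosine similarity.

    docs: dict mapping doc id -> plain text (hypothetical input shape).
    Returns: dict mapping doc id -> list of (other_id, score), best first.
    """
    ids = list(docs)
    texts = [docs[doc_id] for doc_id in ids]

    # One TF-IDF row per document; English stop words are dropped.
    matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)

    # Square matrix where scores[i, j] is the similarity of doc i to doc j.
    scores = cosine_similarity(matrix)

    related = {}
    for row, doc_id in enumerate(ids):
        ranked = sorted(
            ((ids[col], float(scores[row, col]))
             for col in range(len(ids)) if col != row),  # skip the doc itself
            key=lambda pair: pair[1],
            reverse=True,
        )
        related[doc_id] = ranked[:top_n]
    return related
```

Calling related_articles(corpus, top_n=5) on a dict of document texts would return, for each document, its five closest neighbours with their similarity scores.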

This is the Classification step of the Librarian pipeline — the work that transforms a pile of documents into a browsable, filterable, targetable collection.

Classification Facets

Each document is classified along independent facets, in Ranganathan's tradition of faceted classification: a document can be reached through any one of them. The facets listed here are primary persona, pain points, concepts, and format.

Primary Persona
    Who is this content primarily written for?
    Values: sara, laura, ben, sue

Pain Points
    What problems does it address?
    Values: ai-anxiety, digital-strategy, skill-building

Concepts
    What key ideas does it cover?
    Values: modeling, blueprints, ai-ethics, taxonomies

Format
    What kind of content is it?
    Values: how-to, thought-leadership, case-study, framework
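
To make the idea of independent facets concrete, the sketch below builds a small inverted index so that any facet can serve as an entry point. The record layout and field names (primary_persona, pain_points, concepts, format) are assumptions for illustration and are not taken from the real data.json schema.

```python
from collections import defaultdict

# Hypothetical enriched records; field names are assumptions.
DOCS = [
    {"id": "doc-1", "primary_persona": "sara", "pain_points": ["ai-anxiety"],
     "concepts": ["ai-ethics", "modeling"], "format": "thought-leadership"},
    {"id": "doc-2", "primary_persona": "ben", "pain_points": ["skill-building"],
     "concepts": ["taxonomies"], "format": "how-to"},
    {"id": "doc-3", "primary_persona": "sara",
     "pain_points": ["digital-strategy", "ai-anxiety"],
     "concepts": ["blueprints"], "format": "case-study"},
]

FACETS = ("primary_persona", "pain_points", "concepts", "format")


def build_facet_index(docs):
    """Map facet -> value -> set of doc ids."""
    index = {facet: defaultdict(set) for facet in FACETS}
    for doc in docs:
        for facet in FACETS:
            values = doc[facet]
            if isinstance(values, str):  # single-valued facet
                values = [values]
            for value in values:
                index[facet][value].add(doc["id"])
    return index


def find(index, **criteria):
    """Return ids of documents matching every facet=value pair given."""
    matches = [index[facet].get(value, set())
               for facet, value in criteria.items()]
    return set.intersection(*matches) if matches else set()
```

With this index, find(index, primary_persona="sara") and find(index, format="case-study") reach the same document by two different routes, and combining criteria narrows the result by intersection.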

Two Approaches

LLM-Assisted Classification
Use Claude's judgment to read each document and assign metadata based on content analysis. Most accurate for small corpora.
Best for: New corpora, < 100 docs, exploring what categories make sense

Script-Based Classification
When personas and vocabularies are already established, use the enrichment script to apply them systematically with computed similarity.
Best for: Established schemas, repeatable pipeline, > 50 docs

How to Use

Via Claude:

"I have a collection of [N] documents. Please classify each one by audience persona, key concepts, and pain points addressed."

Claude will read each document and propose classifications. Review and refine, then save as an enrichments JSON file.
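
The review step can be partly automated before saving. The sketch below checks proposed classifications against the vocabularies listed on this page and then writes the file. The JSON layout (one record per document, keyed by filename), the field names, and the helper names are assumptions; the format that enrich_corpus.py actually expects is not documented here.

```python
import json

# Vocabularies limited to the values this page lists.
VOCAB = {
    "primary_persona": {"sara", "laura", "ben", "sue"},
    "pain_points": {"ai-anxiety", "digital-strategy", "skill-building"},
    "concepts": {"modeling", "blueprints", "ai-ethics", "taxonomies"},
    "format": {"how-to", "thought-leadership", "case-study", "framework"},
}
MULTI_VALUED = {"pain_points", "concepts"}  # assumption: these hold lists

# Hypothetical proposal, keyed by a made-up filename.
PROPOSED = {
    "example-article.md": {
        "primary_persona": "laura",
        "pain_points": ["digital-strategy"],
        "concepts": ["taxonomies", "modeling"],
        "format": "how-to",
    },
}


def validate_enrichments(enrichments):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for doc_id, record in enrichments.items():
        for facet, allowed in VOCAB.items():
            if facet not in record:
                problems.append(f"{doc_id}: missing {facet}")
                continue
            values = record[facet] if facet in MULTI_VALUED else [record[facet]]
            for value in values:
                if value not in allowed:
                    problems.append(f"{doc_id}: unknown {facet} value {value!r}")
    return problems


def save_enrichments(enrichments, path="enrichments.json"):
    """Write the file only when validation finds no problems."""
    problems = validate_enrichments(enrichments)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(enrichments, handle, indent=2, sort_keys=True)
```

Rejecting unknown values at save time keeps typos such as "how to" from silently creating a new category downstream.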

Via command line (once you have enrichments):

    python skills/librarian-enrichment/scripts/enrich_corpus.py \
        --input ./raw \
        --enrichments enrichments.json \
        --output data.json

Gap Analysis

After enrichment, the script automatically identifies gaps in your content strategy:

Underserved personas — Which audiences have few articles targeted at them?
Thin pain points — Which problems have only 1-2 articles addressing them?
Missing concepts — Are there important topics with no coverage?
Format imbalance — Too much thought-leadership, not enough practical how-to?
Persona imbalance — Detects when one persona has 3x more content than another
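
The checks above can be approximated with simple counting. The following is a sketch under stated assumptions, not the script's actual logic: the record field names, the gap_report function, and its parameters are hypothetical, "thin" is read as at most 2 articles, and "3x more" is read as at least three times as many.

```python
from collections import Counter


def gap_report(docs, known_personas, known_concepts,
               thin_max=2, imbalance_ratio=3.0):
    """Summarise coverage gaps in a list of enriched records.

    Each record is assumed to have primary_persona (str), pain_points
    (list), concepts (list), and format (str).
    """
    if not docs:
        raise ValueError("no documents to analyse")

    seen_personas = Counter(doc["primary_persona"] for doc in docs)
    # Include personas with zero articles so they are not overlooked.
    personas = {p: seen_personas.get(p, 0) for p in known_personas}
    pain_points = Counter(p for doc in docs for p in doc["pain_points"])
    concepts = Counter(c for doc in docs for c in doc["concepts"])
    formats = Counter(doc["format"] for doc in docs)

    most = max(personas.values(), default=0)
    least = min(personas.values(), default=0)
    return {
        # Fewest articles first, so underserved personas lead the list.
        "persona_counts": sorted(personas.items(), key=lambda kv: kv[1]),
        "thin_pain_points": sorted(
            p for p, n in pain_points.items() if n <= thin_max),
        "missing_concepts": sorted(set(known_concepts) - set(concepts)),
        # Fraction of the corpus in each format.
        "format_share": {f: n / len(docs) for f, n in formats.items()},
        "persona_imbalance": most > 0 and most >= imbalance_ratio * least,
    }
```

Passing the full persona and concept vocabularies matters here: a persona or concept with no articles at all never appears in the corpus counts, so it can only be detected by comparing against the known list.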