The Understanding Group — Internal

Librarian Skills

Four-stage pipeline for corpus analysis and organization

1. Ingestion: Profile the corpus, detect structure, assess quality and feasibility. Output: collection-record.json
2. Enrichment: Classify documents by persona, concepts, pain points, format. Output: data.json
3. Similarity: Compute document relationships via TF-IDF cosine similarity. Output: graph-data.json
4. Briefing: Present options to the human with honest tradeoffs. Output: briefing.md

The Librarian pipeline applies formal information science methods to analyze and organize collections of documents. It's built on the same discipline that professional librarians use when they receive a new collection — profiling what's there, assessing condition, determining what analysis is feasible, and presenting options to the human who will make decisions.

Each skill produces a structured artifact that the next skill consumes. You don't have to run all four — start with Ingestion and it will tell you what's worth doing next.

Skills

1. Ingestion
Profile a corpus: document count, vocabulary analysis, prior organization, quality flags, and feasibility assessment for different analysis methods. Produces the Collection Record that all other skills reference.
Output: collection-record.json. Dependencies: stdlib only. Status: working.

2. Enrichment
Classify documents by audience persona, concepts, pain points, and content format. Compute related articles via TF-IDF similarity. Run gap analysis to find underserved personas and thin content areas.
Output: data.json. Dependencies: scikit-learn. Status: working.

3. Similarity
Compute the full cosine similarity matrix across all documents. Identify tight clusters, bridge documents, and isolated content. Feeds the Relationship Map visualization.
Output: graph-data.json. Dependencies: scikit-learn + numpy. Status: working.

4. Briefing
Generate a human-readable capabilities briefing from a Collection Record. Presents 2-3 recommended approaches with effort, output, and honest tradeoffs. The human decides what to do.
Output: briefing.md + package.json. Dependencies: stdlib only. Status: working.
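The TF-IDF cosine similarity that the Enrichment and Similarity skills rely on can be illustrated with a short sketch. The skills themselves are listed as using scikit-learn and numpy. The version below uses only the Python standard library so the arithmetic is visible, and it is not the pipeline's actual code. The tokenizer, the function names, and the three sample documents are assumptions made for illustration. The IDF term uses the smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight = raw term count * smoothed IDF.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def similarity_matrix(docs):
    """Full pairwise cosine similarity matrix as nested lists."""
    vectors = tfidf_vectors(docs)
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical sample documents, invented for this sketch.
docs = [
    "cataloging rules for serials and monographs",
    "cataloging rules for maps and serials",
    "budget planning for the reading room",
]
matrix = similarity_matrix(docs)
```

In this sample the first two documents share most of their vocabulary, so `matrix[0][1]` is larger than `matrix[0][2]`. A row whose off-diagonal values are all low would correspond to what the Similarity skill calls isolated content.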
Quick Start

1. Prepare your documents
Get documents as .txt files in a single folder. If they are PDFs, web pages, or CSV files, extract the text first. Naming patterns like category--title.txt are detected automatically.

2. Run Ingestion
Tell Claude: "I have a folder of documents at [path]. Run the librarian ingestion skill." This produces a Collection Record profiling your corpus.

3. Review the Collection Record
Check document count, vocabulary characteristics, quality flags, feasibility assessments, and prioritized recommendations.

4. Choose next steps
Based on the recommendations, run Enrichment (to classify), Similarity (to map relationships), or Briefing (to get structured options).

5. View results in the Organizer Console
Copy output JSON files to organizer-data/ and open the Organizer Console for interactive exploration.
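As an illustration of the naming-pattern detection mentioned in the first step, the sketch below tallies the categories implied by file names of the form category--title.txt. It is a guess at how such detection could work, not the Ingestion skill's implementation. The regular expression, the `profile_folder` function, and the keys of the returned dictionary are hypothetical and do not reflect the real collection-record.json schema.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical pattern for names like "category--title.txt".
NAME_PATTERN = re.compile(r"^(?P<category>.+?)--(?P<title>.+)\.txt$")


def profile_folder(folder):
    """Count .txt files and tally the categories implied by their names."""
    files = sorted(Path(folder).glob("*.txt"))
    categories = Counter()
    unmatched = []
    for path in files:
        match = NAME_PATTERN.match(path.name)
        if match:
            categories[match.group("category")] += 1
        else:
            unmatched.append(path.name)
    return {
        "document_count": len(files),
        "categories": dict(categories),
        "unmatched_names": unmatched,
    }


# Demonstration on a throwaway folder with invented file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["policy--loan-periods.txt", "policy--fines.txt",
                 "faq--hours.txt", "notes.txt"]:
        (Path(tmp) / name).write_text("placeholder text", encoding="utf-8")
    profile = profile_folder(tmp)
```

Files that do not follow the pattern are reported separately, so a reviewer can decide whether the naming scheme is a reliable signal of prior organization.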
Theoretical Foundation

The Librarian pipeline is grounded in Robert Glushko's framework from The Discipline of Organizing (MIT Press). Every organizing system involves identifying resources, describing and classifying them, designing the interactions they support, and maintaining the organization over time.

The three-phase human workflow — Reference Interview, Capabilities Briefing, Collaborative Triage — is adapted from professional library science practice. The reference interview encodes the insight that the first question asked is rarely the actual need. A reference librarian works backward from the question to the task, from the task to the need, from the need to the situation.

The pipeline also draws on S.R. Ranganathan's faceted classification tradition, which holds that resources should be analyzed along multiple independent dimensions and recombined rather than forced into a single hierarchy. In practice, this means a document might be classified by persona, topic, content type, and concept simultaneously — and any of those facets can be the entry point for retrieval.
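The idea that any facet can serve as an entry point can be sketched as an inverted index keyed by facet and value. This is a minimal illustration, not the pipeline's data model. The facet names follow the paragraph above, while the document IDs, facet values, and the functions `build_facet_index` and `find` are invented for the example.

```python
from collections import defaultdict


def build_facet_index(records):
    """Map each (facet, value) pair to the set of document ids carrying it."""
    index = defaultdict(set)
    for doc_id, facets in records.items():
        for facet, values in facets.items():
            for value in values:
                index[(facet, value)].add(doc_id)
    return index


def find(index, **criteria):
    """Return the ids that match every facet=value pair given."""
    sets = [index.get((facet, value), set()) for facet, value in criteria.items()]
    return set.intersection(*sets) if sets else set()


# Hypothetical records: each document is described along independent facets.
records = {
    "doc-1": {"persona": ["new-patron"], "topic": ["borrowing"], "content_type": ["how-to"]},
    "doc-2": {"persona": ["new-patron"], "topic": ["accounts"], "content_type": ["faq"]},
    "doc-3": {"persona": ["staff"], "topic": ["borrowing"], "content_type": ["policy"]},
}
index = build_facet_index(records)

by_persona = find(index, persona="new-patron")
by_topic = find(index, topic="borrowing")
both = find(index, persona="new-patron", topic="borrowing")
```

Because every facet is indexed the same way, a query can start from persona, topic, or content type and combine them freely, which is the recombination a single hierarchy does not allow.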
