The Understanding Group — Internal
Step 1

Corpus Ingestion

Profile a corpus — the foundational step logically prior to all other analysis
Output: collection-record.json
Dependencies: Python stdlib only
Status: Working
What It Does

The Ingestion skill receives, examines, and characterizes a collection of documents. It produces a Collection Record — a structured description of what's in the corpus, its condition, statistical characteristics, what organization already exists, and what analytical methods are feasible given the material.

This is logically prior to all other analysis. Before topics can be discovered, documents clustered, or taxonomies built, someone needs to look at what's actually there.

The Librarian's Discipline at Ingestion

When a box of donated materials arrives at a library, the librarian doesn't dump it on a shelf. There is a discipline to receiving material:

  1. What do we have here?
    Document count, file types, naming conventions, folder structures. Are there labels or patterns that suggest someone already organized this?
  2. What's the provenance?
    Where did this come from? Who created it, for what purpose? A carefully curated policy collection behaves differently under analysis than a web crawl.
  3. What's the scope and boundaries?
    Is this everything, or a sample? What's included and excluded? Be attentive to what's absent, not just what's present.
  4. What's the condition?
    Is text clean or noisy? Duplicates? Mixed languages? Inconsistent formatting?
  5. What are the resources, really?
    Not just "500 documents" but what kind. A 200-page report and a one-paragraph FAQ are both documents but need fundamentally different treatment.
  6. What descriptions already exist?
    File names, folder structures, metadata fields, tags, categories — these are prior organizing decisions. They're clues about how the collection was intended to be used.
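Questions 1 and 6 above lend themselves to a quick first pass with the standard library. The sketch below is illustrative only (the function name and return keys are not part of the skill): it counts documents, tallies file types, and surfaces the top-level folder structure as evidence of prior organizing decisions.

```python
from collections import Counter
from pathlib import Path

def quick_inventory(root):
    """First-pass look at a corpus: how many files, what types, what structure."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    by_ext = Counter(p.suffix.lower() or "(none)" for p in files)
    # Top-level folders are clues about how the collection was meant to be used.
    top_dirs = Counter(p.relative_to(root).parts[0] for p in files
                       if len(p.relative_to(root).parts) > 1)
    return {
        "document_count": len(files),
        "formats": dict(by_ext),
        "top_level_folders": dict(top_dirs),
    }
```

Provenance, scope, and condition (questions 2 through 5) still require human judgment; no script can tell you why a collection was assembled.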
How to Use

Via Claude: Tell any TUG team Claude session:

"I have a folder of documents at [path]. Run the librarian ingestion skill to profile this corpus."

Via command line:

python skills/librarian-ingestion/scripts/ingest_corpus.py \
  --input /path/to/documents \
  --name "My Corpus Name" \
  --source "https://example.com" \
  --purpose "Content audit for redesign" \
  --output collection-record.json

The script accepts .txt files by default. Use --glob "*.md" for other formats.

Collection Record Schema
| Section | What It Contains |
|---|---|
| provenance | Source, collector, purpose, date, sampling info |
| inventory | Document count, total chars/tokens, formats, date range |
| statistical_profile | Length distribution (min/max/mean/median/std/quartiles), vocabulary stats (unique terms, hapax legomena, type-token ratio, richness), language detection |
| structure | Folder hierarchy, naming patterns, prior organization (categories, coverage), granularity, heterogeneity |
| quality_assessment | Flags, duplicates, stubs, encoding issues, overall quality rating (good / acceptable / needs_cleaning) |
| feasibility | Topic modeling, clustering, taxonomy extraction, faceted analysis, each with feasibility rating, recommended approach, and notes |
| recommendations | Prioritized next actions with rationale and skill references |
| documents | Per-document inventory: ID, path, title, char count, metadata |
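The statistical_profile and quality_assessment sections can be illustrated with stdlib-only computations. This is a sketch, not the skill's actual implementation: the whitespace tokenizer and the exact-hash duplicate check are simplifying assumptions.

```python
import hashlib
import statistics
from collections import Counter

def profile_texts(texts):
    """Length distribution, vocabulary stats, and exact-duplicate count."""
    lengths = [len(t) for t in texts]
    tokens = [tok for t in texts for tok in t.lower().split()]  # naive tokenizer
    freqs = Counter(tokens)
    hapax = sum(1 for c in freqs.values() if c == 1)  # terms seen exactly once
    seen, duplicates = set(), 0
    for t in texts:
        digest = hashlib.sha256(t.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1  # byte-identical copy of an earlier document
        seen.add(digest)
    return {
        "length": {"min": min(lengths), "max": max(lengths),
                   "mean": statistics.mean(lengths),
                   "median": statistics.median(lengths)},
        "unique_terms": len(freqs),
        "hapax_legomena": hapax,
        "type_token_ratio": len(freqs) / len(tokens) if tokens else 0.0,
        "exact_duplicates": duplicates,
    }
```

A high hapax count or a type-token ratio near 1.0 warns that term-frequency methods like topic modeling will have little repeated vocabulary to work with.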
Feasibility Decision Logic
| Corpus Size | Recommendation |
|---|---|
| < 10 docs | Too small for topic modeling. Manual classification or direct LLM analysis. |
| 10–100 docs | NMF for topics (more interpretable than LDA at small scale). Hierarchical clustering. LLM classification feasible. |
| 100–1000 docs | LDA viable. BERTopic also works. K-means or HDBSCAN for clustering. |
| > 1000 docs | LDA with sampling. Consider pre-processing vocabulary. Partition-based clustering. |
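The decision table above reduces to a simple threshold function. The sketch below restates the table as code; the return keys and labels are illustrative, not the skill's actual output schema.

```python
def feasibility(n_docs):
    """Map corpus size to the recommended approach from the decision table."""
    if n_docs < 10:
        return {"topic_modeling": "infeasible",
                "recommendation": "manual classification or direct LLM analysis"}
    if n_docs <= 100:
        return {"topic_modeling": "NMF",
                "clustering": "hierarchical",
                "recommendation": "LLM classification feasible"}
    if n_docs <= 1000:
        return {"topic_modeling": "LDA or BERTopic",
                "clustering": "k-means or HDBSCAN",
                "recommendation": "LDA viable"}
    return {"topic_modeling": "LDA with sampling",
            "clustering": "partition-based",
            "recommendation": "consider pre-processing vocabulary"}
```

Size is only one input; in practice the quality_assessment flags (duplicates, stubs, mixed languages) also gate which methods are worth running.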
What Comes Next