The Understanding Group — Internal
Step 1

Corpus Ingestion

Profile a corpus — the foundational step logically prior to all other analysis
Output: collection-record.json
Dependencies: Python stdlib only
Status: Working
What It Does

The Ingestion skill receives, examines, and characterizes a collection of documents. It produces a Collection Record — a structured description of what's in the corpus, its condition, statistical characteristics, what organization already exists, and what analytical methods are feasible given the material.

This is logically prior to all other analysis. Before topics can be discovered, documents clustered, or taxonomies built, someone needs to look at what's actually there.

The Librarian's Discipline at Ingestion

When a box of donated materials arrives at a library, the librarian doesn't dump it on a shelf. There is a discipline to receiving material:

  1. What do we have here?
    Document count, file types, naming conventions, folder structures. Are there labels or patterns that suggest someone already organized this?
  2. What's the provenance?
    Where did this come from? Who created it, for what purpose? A carefully curated policy collection behaves differently under analysis than a web crawl.
  3. What's the scope and boundaries?
    Is this everything, or a sample? What's included and excluded? Be attentive to what's absent, not just what's present.
  4. What's the condition?
    Is text clean or noisy? Duplicates? Mixed languages? Inconsistent formatting?
  5. What are the resources, really?
    Not just "500 documents" but what kind. A 200-page report and a one-paragraph FAQ are both documents but need fundamentally different treatment.
  6. What descriptions already exist?
    File names, folder structures, metadata fields, tags, categories — these are prior organizing decisions. They're clues about how the collection was intended to be used.
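Questions 1 and 6 above lend themselves to a quick first pass with the standard library. The sketch below is illustrative only (the function name and return keys are not part of the skill): it counts documents, tallies file types, and surfaces the top-level folder structure as evidence of prior organizing decisions.

```python
from collections import Counter
from pathlib import Path

def quick_inventory(root):
    """First-pass look at a corpus: how many files, what types, what structure."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    by_ext = Counter(p.suffix.lower() or "(none)" for p in files)
    # Top-level folders are clues about how the collection was meant to be used.
    top_dirs = Counter(p.relative_to(root).parts[0] for p in files
                       if len(p.relative_to(root).parts) > 1)
    return {
        "document_count": len(files),
        "formats": dict(by_ext),
        "top_level_folders": dict(top_dirs),
    }
```

Provenance, scope, and condition (questions 2 through 5) still require human judgment; no script can tell you why a collection was assembled.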
How to Use

Via Claude: Tell any TUG team Claude session:

"I have a folder of documents at [path]. Run the librarian ingestion skill to profile this corpus."

Via command line:

python skills/librarian-ingestion/scripts/ingest_corpus.py \
  --input /path/to/documents \
  --name "My Corpus Name" \
  --source "https://example.com" \
  --purpose "Content audit for redesign" \
  --output collection-record.json

The script accepts .txt files by default. Use --glob "*.md" for other formats.

Collection Record Schema
| Section | What It Contains |
|---|---|
| provenance | Source, collector, purpose, date, sampling info |
| inventory | Document count, total chars/tokens, formats, date range |
| statistical_profile | Length distribution (min/max/mean/median/std/quartiles), vocabulary stats (unique terms, hapax legomena, type-token ratio, richness), language detection |
| structure | Folder hierarchy, naming patterns, prior organization (categories, coverage), granularity, heterogeneity |
| quality_assessment | Flags, duplicates, stubs, encoding issues, overall quality rating (good / acceptable / needs_cleaning) |
| feasibility | Topic modeling, clustering, taxonomy extraction, faceted analysis, each with feasibility rating, recommended approach, and notes |
| recommendations | Prioritized next actions with rationale and skill references |
| documents | Per-document inventory: ID, path, title, char count, metadata |
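The statistical_profile and quality_assessment sections can be illustrated with stdlib-only computations. This is a sketch, not the skill's actual implementation: the whitespace tokenizer and the exact-hash duplicate check are simplifying assumptions.

```python
import hashlib
import statistics
from collections import Counter

def profile_texts(texts):
    """Length distribution, vocabulary stats, and exact-duplicate count."""
    lengths = [len(t) for t in texts]
    tokens = [tok for t in texts for tok in t.lower().split()]  # naive tokenizer
    freqs = Counter(tokens)
    hapax = sum(1 for c in freqs.values() if c == 1)  # terms seen exactly once
    seen, duplicates = set(), 0
    for t in texts:
        digest = hashlib.sha256(t.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1  # byte-identical copy of an earlier document
        seen.add(digest)
    return {
        "length": {"min": min(lengths), "max": max(lengths),
                   "mean": statistics.mean(lengths),
                   "median": statistics.median(lengths)},
        "unique_terms": len(freqs),
        "hapax_legomena": hapax,
        "type_token_ratio": len(freqs) / len(tokens) if tokens else 0.0,
        "exact_duplicates": duplicates,
    }
```

A high hapax count or a type-token ratio near 1.0 warns that term-frequency methods like topic modeling will have little repeated vocabulary to work with.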
Feasibility Decision Logic
| Corpus Size | Recommendation |
|---|---|
| < 10 docs | Too small for topic modeling. Manual classification or direct LLM analysis. |
| 10–100 docs | NMF for topics (more interpretable than LDA at small scale). Hierarchical clustering. LLM classification feasible. |
| 100–1000 docs | LDA viable. BERTopic also works. K-means or HDBSCAN for clustering. |
| > 1000 docs | LDA with sampling. Consider pre-processing vocabulary. Partition-based clustering. |
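The decision table above reduces to a simple threshold function. The sketch below restates the table as code; the return keys and labels are illustrative, not the skill's actual output schema.

```python
def feasibility(n_docs):
    """Map corpus size to the recommended approach from the decision table."""
    if n_docs < 10:
        return {"topic_modeling": "infeasible",
                "recommendation": "manual classification or direct LLM analysis"}
    if n_docs <= 100:
        return {"topic_modeling": "NMF",
                "clustering": "hierarchical",
                "recommendation": "LLM classification feasible"}
    if n_docs <= 1000:
        return {"topic_modeling": "LDA or BERTopic",
                "clustering": "k-means or HDBSCAN",
                "recommendation": "LDA viable"}
    return {"topic_modeling": "LDA with sampling",
            "clustering": "partition-based",
            "recommendation": "consider pre-processing vocabulary"}
```

Size is only one input; in practice the quality_assessment flags (duplicates, stubs, mixed languages) also gate which methods are worth running.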
What Comes Next