The Ingestion skill receives, examines, and characterizes a collection of documents. It produces a Collection Record: a structured description of what the corpus contains, its condition and statistical characteristics, what organization already exists, and which analytical methods are feasible given the material.
This is logically prior to all other analysis. Before topics can be discovered, documents clustered, or taxonomies built, someone needs to look at what's actually there.
When a box of donated materials arrives at a library, the librarian doesn't dump it on a shelf. There is a discipline to receiving material:
Via Claude: Tell any TUG team Claude session:
Via command line:
The script accepts `.txt` files by default. Use `--glob "*.md"` for other formats.
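The script itself is not reproduced here, so the following is only a minimal sketch of how glob-based file selection with a `.txt` default could work. The function name `collect_documents`, the recursive search, and the sorted output are all assumptions made for illustration, not the script's actual behavior.

```python
from pathlib import Path


def collect_documents(root, pattern="*.txt"):
    """Return files under `root` matching a glob pattern (hypothetical helper).

    Assumption: the search is recursive and results are sorted so that
    document order is stable between runs.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    # rglob walks subfolders; is_file() skips directories that match the pattern.
    return sorted(p for p in root_path.rglob(pattern) if p.is_file())
```

Calling `collect_documents("corpus/", "*.md")` would then mirror passing `--glob "*.md"` on the command line.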
| Section | What It Contains |
|---|---|
| provenance | Source, collector, purpose, date, sampling info |
| inventory | Document count, total chars/tokens, formats, date range |
| statistical_profile | Length distribution (min/max/mean/median/std/quartiles), vocabulary stats (unique terms, hapax legomena, type-token ratio, richness), language detection |
| structure | Folder hierarchy, naming patterns, prior organization (categories, coverage), granularity, heterogeneity |
| quality_assessment | Flags, duplicates, stubs, encoding issues, overall quality rating (good / acceptable / needs_cleaning) |
| feasibility | Topic modeling, clustering, taxonomy extraction, faceted analysis — each with feasibility rating, recommended approach, and notes |
| recommendations | Prioritized next actions with rationale and skill references |
| documents | Per-document inventory: ID, path, title, char count, metadata |
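To make the `statistical_profile` section more concrete, here is a rough sketch of how the length distribution and basic vocabulary statistics could be computed with the Python standard library. The field names inside the returned dictionary, the naive regex tokenizer, and the use of population standard deviation are assumptions for illustration. The actual Collection Record schema may differ, and the richness measure and language detection are left out.

```python
import re
import statistics
from collections import Counter


def statistical_profile(texts):
    """Compute length and vocabulary statistics for a list of document strings."""
    if not texts:
        raise ValueError("texts must contain at least one document")
    lengths = [len(t) for t in texts]
    # Naive tokenizer (assumption): lowercased runs of word characters.
    tokens = [tok for t in texts for tok in re.findall(r"\w+", t.lower())]
    counts = Counter(tokens)
    # Hapax legomena are terms that occur exactly once in the whole corpus.
    hapax = sum(1 for c in counts.values() if c == 1)
    # Quartiles are only computed when there are at least two documents.
    quartiles = statistics.quantiles(lengths, n=4) if len(lengths) >= 2 else None
    return {
        "length": {
            "min": min(lengths),
            "max": max(lengths),
            "mean": statistics.mean(lengths),
            "median": statistics.median(lengths),
            "std": statistics.pstdev(lengths),
            "quartiles": quartiles,
        },
        "vocabulary": {
            "total_tokens": len(tokens),
            "unique_terms": len(counts),
            "hapax_legomena": hapax,
            # Type-token ratio: distinct terms divided by total tokens.
            "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
        },
    }
```

A high share of hapax legomena or a very wide length spread is the kind of signal that would feed into the quality and feasibility sections.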
| Corpus Size | Recommendation |
|---|---|
| < 10 docs | Too small for topic modeling. Manual classification or direct LLM analysis. |
| 10–100 docs | NMF for topics (more interpretable than LDA at small scale). Hierarchical clustering. LLM classification feasible. |
| 100–1000 docs | LDA viable. BERTopic also works. K-means or HDBSCAN for clustering. |
| > 1000 docs | LDA with sampling. Consider pre-processing vocabulary. Partition-based clustering. |
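The size tiers above can be expressed as a simple lookup. This is a sketch only: the function name `recommend_for_size` is hypothetical, and because the table's ranges meet at 100 and 1000, the code assumes those boundary values belong to the smaller tier.

```python
def recommend_for_size(doc_count):
    """Map a document count to the size tier and recommendation from the table.

    Assumption: boundary values (100, 1000) fall into the lower tier.
    """
    if doc_count < 0:
        raise ValueError("doc_count must be non-negative")
    if doc_count < 10:
        return ("< 10 docs",
                "Too small for topic modeling. Manual classification or direct LLM analysis.")
    if doc_count <= 100:
        return ("10-100 docs",
                "NMF for topics. Hierarchical clustering. LLM classification feasible.")
    if doc_count <= 1000:
        return ("100-1000 docs",
                "LDA viable. BERTopic also works. K-means or HDBSCAN for clustering.")
    return ("> 1000 docs",
            "LDA with sampling. Consider pre-processing vocabulary. Partition-based clustering.")
```

For example, `recommend_for_size(250)` returns the 100-1000 tier, where LDA becomes viable.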