Ingestion Pipeline — Capability Architecture

Three-layer pipeline: normalize any format, enrich with AI, extract domain structure
Status: Layers 1-3 implemented; canonical model integration complete
Backends: Docling (fidelity), Unstructured (default)
Code: ~1,800 lines Python across 24 modules
Layer 1 — Normalization
Layer 2 — Enrichment
Layer 3 — Extraction
Ingestion Pipeline
The pipeline accepts documents in any supported format, normalizes them to structured text + images, enriches them with AI-generated metadata, and optionally extracts domain-specific structure. Each layer is independent and fails gracefully.
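One way to read "independent and fails gracefully" is that a failure in Layer 2 or Layer 3 still leaves the Layer 1 result usable. The sketch below illustrates that reading only; `PipelineResult`, `run_pipeline`, and the three callables are hypothetical names, not the project's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class PipelineResult:
    """Hypothetical container for whatever each layer managed to produce."""
    normalized: Optional[Any] = None
    enrichment: Optional[Any] = None
    extraction: Optional[Any] = None
    errors: dict = field(default_factory=dict)  # layer name -> error message


def run_pipeline(
    path: str,
    normalize: Callable[[str], Any],
    enrich: Optional[Callable[[Any], Any]] = None,
    extract: Optional[Callable[[Any, Any], Any]] = None,
) -> PipelineResult:
    """Run the layers in order, recording failures instead of raising."""
    result = PipelineResult()

    try:
        result.normalized = normalize(path)
    except Exception as exc:
        result.errors["normalize"] = str(exc)
        return result  # nothing downstream can run without Layer 1 output

    if enrich is not None:
        try:
            result.enrichment = enrich(result.normalized)
        except Exception as exc:
            result.errors["enrich"] = str(exc)

    if extract is not None:
        try:
            # Extraction still runs when enrichment failed; it just sees None.
            result.extraction = extract(result.normalized, result.enrichment)
        except Exception as exc:
            result.errors["extract"] = str(exc)

    return result
```

Under this design a caller always gets a result object back and inspects `errors` to see which layers were skipped, rather than wrapping the whole pipeline in one try/except.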
1 Format Normalization
Converts any supported format into a NormalizedDocument: structured text, extracted images, headings, tables, lists, and metadata. An Intake Router selects the best backend based on file type and structural complexity.
Supported formats: PDF, DOCX, PPTX, HTML, Markdown, CSV, XML, Email, Images, Plain Text
Unstructured (Default): Broad format coverage and good general-purpose extraction. Handles most document types well.
Docling (Fidelity): Superior hierarchy preservation for regulations, dense tables, and multi-column layouts.
Components: Intake Router, MIME Detection, Image Extraction, .txt Sidecar, Passthrough (text/md)
Layer 1 output: NormalizedDocument ↓
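The routing decision in Layer 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Intake Router: the `NormalizedDocument` field names are inferred from the description, `select_backend` and `detect_mime` are hypothetical helpers, and the boolean `complex_layout` flag stands in for whatever structural-complexity signal the real router computes.

```python
import mimetypes
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class NormalizedDocument:
    """Hypothetical shape; field names are inferred from the description."""
    text: str
    headings: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    lists: list = field(default_factory=list)
    images: list = field(default_factory=list)    # extracted image references
    metadata: dict = field(default_factory=dict)


# Assumption: plain text and Markdown bypass both backends (passthrough).
PASSTHROUGH_SUFFIXES = {".txt", ".md"}


def detect_mime(path: str) -> str:
    """Guess a MIME type from the file name, with a generic fallback."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def select_backend(path: str, complex_layout: bool = False) -> str:
    """Return the name of the backend that should normalize this file."""
    if Path(path).suffix.lower() in PASSTHROUGH_SUFFIXES:
        return "passthrough"
    if complex_layout:
        return "docling"       # fidelity backend
    return "unstructured"      # default backend
```

For example, `select_backend("regulation.pdf", complex_layout=True)` picks the fidelity backend, while the same file without the flag falls through to the default.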
2 Semantic Enrichment
Two LLM passes run against the TUG controlled vocabulary. The text pass classifies the document (subject, artifact type, methodological phase, audience). The vision pass describes each extracted image and identifies artifact types in diagrams.
Text Pass
Claude API classifies the document against the controlled vocabulary: artifact type, PASS phase, subject domain, audience, abstraction level.
Vision Pass
Claude Vision describes each extracted image. The prompt is tuned to identify TUG artifact types: current-state models, ecosystem maps, org charts, journey maps.
TUG Vocabulary v1.0: 22 Artifact Types, PASS Phases, Subject Domains, Audience Types, Abstraction Levels
Layer 2 output: EnrichmentMetadata + tagged images ↓
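Classifying against a controlled vocabulary usually implies checking the model's answer before storing it. The sketch below shows one possible validation step and deliberately leaves out the LLM call itself. The `EnrichmentMetadata` fields are inferred from the description, `validate_classification` is a hypothetical helper, and the vocabulary values are placeholders: the artifact types come from the examples named in the vision pass, while the audience terms are invented for illustration and are not TUG v1.0 terms.

```python
from dataclasses import dataclass, field
from typing import Optional

# Placeholder vocabulary for illustration only (not the real TUG v1.0 lists).
VOCABULARY = {
    "artifact_type": {"current_state_model", "ecosystem_map",
                      "org_chart", "journey_map"},
    "audience": {"executive", "practitioner"},  # invented example terms
}


@dataclass
class EnrichmentMetadata:
    """Hypothetical shape of the Layer 2 output."""
    artifact_type: Optional[str] = None
    audience: Optional[str] = None
    keywords: list = field(default_factory=list)
    rejected: dict = field(default_factory=dict)  # facet -> out-of-vocabulary value


def validate_classification(raw: dict, vocabulary: dict) -> EnrichmentMetadata:
    """Keep only facet values that appear in the controlled vocabulary."""
    meta = EnrichmentMetadata(keywords=list(raw.get("keywords", [])))
    for facet, allowed in vocabulary.items():
        value = raw.get(facet)
        if value is None:
            continue
        # Normalize "Org Chart" -> "org_chart" before comparing.
        normalized = str(value).strip().lower().replace(" ", "_")
        if normalized in allowed:
            setattr(meta, facet, normalized)
        else:
            meta.rejected[facet] = value  # keep for logging, do not store as a tag
    return meta
```

Recording rejected values instead of raising keeps the layer tolerant of an occasional off-vocabulary answer while still preventing free-form tags from reaching downstream consumers.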
3 Domain Extraction
Pluggable extractors that produce structured domain output. When an extractor produces canonical model JSON (matching the structure-engine schema), the output can be opened directly in the Structure Editor for visualization and editing.
Components: ExtractorRegistry, DomainExtractor ABC, Org Chart Extractor, Canonical Model Builder, Schema Validator
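A pluggable extractor layer of this kind is often built as an abstract base class plus a registry that picks the first matching extractor. The sketch below assumes that pattern. The class names follow the component list, but every method signature, the "X reports to Y" parsing rule, and the nodes/edges output shape are assumptions for illustration; the output is not the structure-engine schema.

```python
from abc import ABC, abstractmethod
from typing import Optional


class DomainExtractor(ABC):
    """Assumed interface: decide whether to run, then produce structured output."""
    name = "base"

    @abstractmethod
    def matches(self, text: str, metadata: dict) -> bool:
        ...

    @abstractmethod
    def extract(self, text: str, metadata: dict) -> dict:
        ...


class OrgChartExtractor(DomainExtractor):
    name = "org_chart"

    def matches(self, text: str, metadata: dict) -> bool:
        return metadata.get("artifact_type") == "org_chart"

    def extract(self, text: str, metadata: dict) -> dict:
        # Toy rule for illustration: parse lines like "Alice reports to Bob".
        people, edges = set(), []
        for line in text.splitlines():
            if " reports to " in line:
                person, manager = (p.strip() for p in line.split(" reports to ", 1))
                people.update([person, manager])
                edges.append({"from": manager, "to": person})
        return {"nodes": sorted(people), "edges": edges}


class ExtractorRegistry:
    def __init__(self) -> None:
        self._extractors = []

    def register(self, extractor: DomainExtractor) -> None:
        self._extractors.append(extractor)

    def run(self, text: str, metadata: dict) -> Optional[dict]:
        """Return the first matching extractor's output, or None if none match."""
        for extractor in self._extractors:
            if extractor.matches(text, metadata):
                return extractor.extract(text, metadata)
        return None
```

Returning `None` when nothing matches mirrors the "optionally extracts" wording: documents without a matching extractor simply stop at Layer 2.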
Pipeline Output
What the Pipeline Produces
Normalized Text: .txt sidecar file for compatibility with existing Librarian skills. Used by: Librarian Skills, Similarity, Briefing.
Enrichment Metadata: Classified artifact type, phase, subject, audience, keywords. Used by: Organizer, Library Collections.
Canonical JSON: Domain-specific structured output (when extractor matches). Used by: Structure Editor, Modeling Tools.
Cross-cutting: The pipeline feeds the Library (text + metadata), Modeling (canonical JSON), and all capabilities that consume enriched documents.
Standalone tool: Users can download canonical JSON and load it into the Structure Editor independently.
The MVP uses open-source backends only (Docling and Unstructured). Commercial backends can slot in via the NormalizationBackend ABC.
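To make the extension point concrete, here is a guess at what a backend contract behind `NormalizationBackend` could look like. Only the class name comes from the text; the `supports` and `normalize` methods, the dict return value, and `PassthroughBackend` are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class NormalizationBackend(ABC):
    """Assumed contract: any backend, open source or commercial, implements this."""
    name = "base"

    @abstractmethod
    def supports(self, mime_type: str) -> bool:
        """Whether this backend can handle the given MIME type."""

    @abstractmethod
    def normalize(self, path: str) -> dict:
        """Return normalized content for the file at `path`."""


class PassthroughBackend(NormalizationBackend):
    """Simplest possible backend: plain text and Markdown need no conversion."""
    name = "passthrough"

    def supports(self, mime_type: str) -> bool:
        return mime_type in {"text/plain", "text/markdown"}

    def normalize(self, path: str) -> dict:
        text = Path(path).read_text(encoding="utf-8")
        return {"text": text, "metadata": {"source": path, "backend": self.name}}
```

A commercial backend would subclass `NormalizationBackend` the same way, so the router only ever depends on the abstract interface.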