Ingestion Pipeline — Understanding Workbench

Status: Layers 1-3 implemented; canonical model integration complete

Backends: Docling (fidelity), Unstructured (default)

Code: ~1,800 lines of Python across 24 modules

Layer 1 — Normalization

Layer 2 — Enrichment

Layer 3 — Extraction

Ingestion Pipeline

Accepts documents in any supported format, normalizes them to structured text and images, enriches them with AI-generated metadata, and optionally extracts domain-specific structure. Each layer is independent and fails gracefully.
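One way to picture "independent and fails gracefully" is a runner that records each layer's error instead of aborting. This is only a sketch: `PipelineResult`, `run_pipeline`, and the three layer callables are hypothetical names, not taken from the codebase, and it assumes enrichment and extraction failures are non-fatal while a normalization failure leaves nothing for later layers to work on.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class PipelineResult:
    """Hypothetical result container; the field names are assumptions."""
    normalized: Optional[Any] = None
    enrichment: Optional[Any] = None
    extraction: Optional[Any] = None
    errors: dict[str, str] = field(default_factory=dict)


def run_pipeline(
    path: str,
    normalize: Callable[[str], Any],
    enrich: Callable[[Any], Any],
    extract: Callable[[Any, Any], Any],
) -> PipelineResult:
    result = PipelineResult()

    # Layer 1: without normalized content there is nothing to pass on.
    try:
        result.normalized = normalize(path)
    except Exception as exc:
        result.errors["normalization"] = str(exc)
        return result

    # Layer 2: a failure here is recorded, and the pipeline continues.
    try:
        result.enrichment = enrich(result.normalized)
    except Exception as exc:
        result.errors["enrichment"] = str(exc)

    # Layer 3: runs even if enrichment failed (enrichment may be None).
    try:
        result.extraction = extract(result.normalized, result.enrichment)
    except Exception as exc:
        result.errors["extraction"] = str(exc)

    return result
```

With this shape, a caller can still use the normalized text when the enrichment call fails, and can inspect `errors` to see which layers need a retry.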

1 Format Normalization

Converts any supported format into a NormalizedDocument: structured text, extracted images, headings, tables, lists, and metadata. An Intake Router selects the best backend based on file type and structural complexity.

Supported formats: PDF, DOCX, PPTX, HTML, Markdown, CSV, XML, Email, Images, Plain Text

Unstructured (Default)

Broad format coverage and good general-purpose extraction. Handles most document types well.

Docling (Fidelity)

Superior hierarchy preservation for regulations, dense tables, and multi-column layouts.

Components: Intake Router, MIME Detection, Image Extraction, .txt Sidecar, Passthrough (text/md)
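The routing decision could look roughly like the following sketch. It is not the actual Intake Router: `choose_backend`, the suffix sets, and the `complex_layout` flag are assumptions made for illustration, and a plain file-extension lookup stands in for real MIME detection.

```python
from pathlib import Path

# Assumption: plain text and Markdown bypass the backends entirely.
PASSTHROUGH_SUFFIXES = {".txt", ".md"}

# Assumption: only these formats are ever routed to the fidelity backend.
FIDELITY_SUFFIXES = {".pdf", ".docx"}


def choose_backend(filename: str, complex_layout: bool = False) -> str:
    """Return a backend name for a file (hypothetical routing rules).

    complex_layout is a stand-in for whatever structural-complexity
    signal the real router uses (dense tables, multi-column pages, ...).
    """
    suffix = Path(filename).suffix.lower()
    if suffix in PASSTHROUGH_SUFFIXES:
        return "passthrough"
    if complex_layout and suffix in FIDELITY_SUFFIXES:
        return "docling"
    return "unstructured"
```

Under these rules a plain PDF goes to the default backend, and only a PDF or DOCX flagged as structurally complex is sent to Docling.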

Layer 1 output: NormalizedDocument ↓
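As a rough picture of the Layer 1 output, a NormalizedDocument could be modeled as a dataclass carrying the parts listed above. The field names and types below, and the `ExtractedImage` helper, are assumptions for illustration; the real class may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedImage:
    """Hypothetical record for one image pulled out of a document."""
    path: str
    page: Optional[int] = None


@dataclass
class NormalizedDocument:
    """Sketch of the Layer 1 output; every field name is an assumption."""
    text: str
    images: list[ExtractedImage] = field(default_factory=list)
    headings: list[str] = field(default_factory=list)
    # Each table is a list of rows; each row is a list of cell strings.
    tables: list[list[list[str]]] = field(default_factory=list)
    # Each list is a sequence of item strings.
    lists: list[list[str]] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


doc = NormalizedDocument(
    text="1 Scope\nThis regulation applies to ...",
    headings=["1 Scope"],
    metadata={"source_format": "pdf"},
)
```

Using `default_factory` keeps each document's collections separate, so a passthrough file with no images or tables simply carries empty lists.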

2 Semantic Enrichment

Two LLM passes run against the TUG controlled vocabulary. The text pass classifies the document (subject, artifact type, methodological phase, audience). The vision pass describes each extracted image and identifies artifact types in diagrams.

Text Pass

Claude API classifies the document against the controlled vocabulary: artifact type, PASS phase, subject domain, audience, abstraction level.

Vision Pass

Claude Vision describes each extracted image. Tuned to identify TUG artifact types: current-state models, ecosystem maps, org charts, journey maps.

TUG Vocabulary v1.0: 22 Artifact Types, PASS Phases, Subject Domains, Audience Types, Abstraction Levels

Layer 2 output: EnrichmentMetadata + tagged images ↓
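Because classification is done against a controlled vocabulary, model output presumably has to be checked before it is stored. The sketch below shows one way to do that. `validate_classification` and the field keys are hypothetical, and the example vocabulary contains only the four artifact types named above rather than the full TUG v1.0 list.

```python
def validate_classification(
    classification: dict[str, str],
    vocabulary: dict[str, set[str]],
) -> tuple[dict[str, str], dict[str, str]]:
    """Split a raw classification into accepted and rejected fields.

    A field is accepted only if the vocabulary knows the field and its
    value (trimmed and lowercased) is one of the allowed terms.
    """
    accepted: dict[str, str] = {}
    rejected: dict[str, str] = {}
    for field_name, value in classification.items():
        allowed = vocabulary.get(field_name)
        normalized = value.strip().lower()
        if allowed is not None and normalized in allowed:
            accepted[field_name] = normalized
        else:
            rejected[field_name] = value
    return accepted, rejected


# Illustrative subset only: four of the 22 artifact types.
EXAMPLE_VOCABULARY = {
    "artifact_type": {
        "current-state model",
        "ecosystem map",
        "org chart",
        "journey map",
    },
}
```

Rejected fields can be logged or sent back for a retry, so off-vocabulary tags never reach the EnrichmentMetadata.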

3 Domain Extraction

Pluggable extractors that produce structured domain output. When an extractor produces canonical model JSON (matching the structure-engine schema), the output can be opened directly in the Structure Editor for visualization and editing.

Components: ExtractorRegistry, DomainExtractor ABC, Org Chart Extractor, Canonical Model Builder, Schema Validator
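The pluggable-extractor idea can be sketched with an abstract base class and a small registry. Only the names `DomainExtractor` and `ExtractorRegistry` come from the component list above. The method names, the matching rule, and the toy output shape are assumptions, and the output is not the structure-engine schema.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class DomainExtractor(ABC):
    """Hypothetical interface; method names are assumptions."""

    @abstractmethod
    def matches(self, artifact_type: str) -> bool:
        """Return True if this extractor handles the artifact type."""

    @abstractmethod
    def extract(self, text: str) -> dict[str, Any]:
        """Produce structured output from normalized text."""


class ExtractorRegistry:
    def __init__(self) -> None:
        self._extractors: list[DomainExtractor] = []

    def register(self, extractor: DomainExtractor) -> None:
        self._extractors.append(extractor)

    def find(self, artifact_type: str) -> Optional[DomainExtractor]:
        # First registered extractor that claims the artifact type wins.
        for extractor in self._extractors:
            if extractor.matches(artifact_type):
                return extractor
        return None


class OrgChartExtractor(DomainExtractor):
    """Toy extractor: reads 'Manager -> Report' lines as reporting edges."""

    def matches(self, artifact_type: str) -> bool:
        return artifact_type == "org chart"

    def extract(self, text: str) -> dict[str, Any]:
        edges = []
        for line in text.splitlines():
            if "->" in line:
                manager, report = (part.strip() for part in line.split("->", 1))
                edges.append({"from": manager, "to": report})
        return {"type": "org_chart", "edges": edges}
```

When `find` returns `None`, the pipeline can skip Layer 3 for that document, which fits extraction being optional.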

Pipeline Output

What the Pipeline Produces

Normalized Text: .txt sidecar file for compatibility with existing Librarian skills. Used by: Librarian Skills, Similarity, Briefing.

Enrichment Metadata: Classified artifact type, phase, subject, audience, and keywords. Used by: Organizer, Library Collections.

Canonical JSON: Domain-specific structured output (when an extractor matches). Used by: Structure Editor, Modeling Tools.

Cross-cutting: Feeds the Library (text + metadata), Modeling (canonical JSON), and all capabilities that consume enriched documents.

Standalone tool: Users can download canonical JSON and load it into the Structure Editor independently.

Backends: The MVP uses open-source backends only (Docling + Unstructured). Commercial backends can slot in via the NormalizationBackend ABC.
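To illustrate how an additional backend might slot in, here is a minimal sketch of a backend interface. Only the name `NormalizationBackend` appears in the source. The `supports` and `normalize` signatures, the `BACKENDS` table, and `register_backend` are assumptions, and a dictionary stands in for the NormalizedDocument.

```python
from abc import ABC, abstractmethod
from typing import Any


class NormalizationBackend(ABC):
    """Hypothetical contract every backend would implement."""

    name: str

    @abstractmethod
    def supports(self, suffix: str) -> bool:
        """Return True if the backend can handle this file suffix."""

    @abstractmethod
    def normalize(self, data: bytes) -> dict[str, Any]:
        """Convert raw file bytes into a normalized representation."""


class PassthroughBackend(NormalizationBackend):
    """Simplest possible backend: text files are decoded as-is."""

    name = "passthrough"

    def supports(self, suffix: str) -> bool:
        return suffix in {".txt", ".md"}

    def normalize(self, data: bytes) -> dict[str, Any]:
        return {"text": data.decode("utf-8"), "images": []}


# Assumption: backends are looked up by name in a module-level table.
BACKENDS: dict[str, NormalizationBackend] = {}


def register_backend(backend: NormalizationBackend) -> None:
    BACKENDS[backend.name] = backend


register_backend(PassthroughBackend())
```

A commercial backend would subclass the same ABC and register itself the same way, leaving the rest of the pipeline unchanged.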

Library Consumer

Normalized text feeds the Librarian skills pipeline: Ingestion, Enrichment, Similarity, Briefing.

Modeling Consumer

Canonical JSON output from domain extractors can be visualized and edited in the Structure Editor.

Librarian Skills (4 skills)

The .txt sidecar bridge ensures existing skills work on rich-format documents without modification.
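A sidecar writer can be very small. In the sketch below, `write_sidecar` is a hypothetical helper, and the naming rule (appending `.txt` to the full original filename) is an assumption chosen so that `report.pdf` and `report.docx` do not collide; the real pipeline may name sidecars differently.

```python
from pathlib import Path


def write_sidecar(source: Path, normalized_text: str) -> Path:
    """Write normalized text next to the source file and return its path.

    Assumed naming rule: report.pdf -> report.pdf.txt
    """
    sidecar = source.with_name(source.name + ".txt")
    sidecar.write_text(normalized_text, encoding="utf-8")
    return sidecar
```

Existing skills that only read `.txt` files can then be pointed at the sidecar rather than the rich-format original.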

Ingestion Pipeline — Capability Architecture