Ingestion Pipeline
Status: Layers 1-3 implemented, canonical model integration complete
Backends: Docling (fidelity), Unstructured (default)
Code: ~1,800 lines of Python across 24 modules
Accepts documents in any supported format, normalizes them to structured text + images,
enriches them with AI-generated metadata, and optionally extracts domain-specific
structure. Each layer is independent and fails gracefully.
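One way to picture "independent and fails gracefully" is the minimal sketch below. It is an illustration under stated assumptions, not the pipeline's actual code: `PipelineResult`, `run_pipeline`, and the three layer callables are hypothetical names, and the sketch assumes that Layers 2 and 3 each read the normalized document and that a failure in one of them leaves the other outputs intact.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class PipelineResult:
    """Whatever each layer managed to produce (hypothetical shape)."""
    normalized: Optional[Any] = None
    enrichment: Optional[Any] = None
    canonical: Optional[Any] = None
    errors: Dict[str, str] = field(default_factory=dict)


def run_pipeline(path: str,
                 normalize: Callable[[str], Any],
                 enrich: Callable[[Any], Any],
                 extract: Callable[[Any], Any]) -> PipelineResult:
    result = PipelineResult()
    try:
        result.normalized = normalize(path)
    except Exception as exc:
        # Without normalized text the later layers have nothing to work on.
        result.errors["normalized"] = str(exc)
        return result
    # Assumption: enrichment and extraction each consume the normalized
    # document, so one failing does not block the other.
    for attr, layer in (("enrichment", enrich), ("canonical", extract)):
        try:
            setattr(result, attr, layer(result.normalized))
        except Exception as exc:
            result.errors[attr] = str(exc)
    return result


def broken_enrich(doc: Any) -> Any:
    """Stand-in for an enrichment layer whose LLM call fails."""
    raise RuntimeError("LLM unavailable")


result = run_pipeline(
    "report.pdf",
    normalize=lambda path: {"text": "hello"},
    enrich=broken_enrich,
    extract=lambda doc: {"nodes": []},
)
print(result.errors)  # {'enrichment': 'LLM unavailable'}
```

In this sketch the normalized text and the extraction output survive even though enrichment failed; only a Layer 1 failure stops the run.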
1 Format Normalization
Converts any supported format into a NormalizedDocument: structured text, extracted images,
headings, tables, lists, and metadata. An Intake Router selects the best backend based on
file type and structural complexity.
PDF
DOCX
PPTX
HTML
Markdown
CSV
XML
Email
Images
Plain Text
Unstructured (default): Broad format coverage, good general-purpose extraction. Handles most document types well.
Docling (fidelity): Superior hierarchy preservation for regulations, dense tables, multi-column layouts.
Intake Router
MIME Detection
Image Extraction
.txt Sidecar
Passthrough (text/md)
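The routing idea in this layer can be sketched roughly as follows. This is a hedged illustration only: `detect_mime`, `select_backend`, and the `complex_layout` flag are hypothetical names, MIME detection here is a simple extension-based guess via the standard library, and how the real Intake Router measures structural complexity is not described in this section.

```python
import mimetypes
from pathlib import Path

# Extensions resolved explicitly so the result does not depend on the
# platform's MIME database.
EXPLICIT_TYPES = {
    ".txt": "text/plain",
    ".md": "text/markdown",
    ".markdown": "text/markdown",
}
PASSTHROUGH_TYPES = {"text/plain", "text/markdown"}


def detect_mime(path: str) -> str:
    """Guess a MIME type from the file extension (no content sniffing)."""
    suffix = Path(path).suffix.lower()
    if suffix in EXPLICIT_TYPES:
        return EXPLICIT_TYPES[suffix]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"


def select_backend(path: str, complex_layout: bool = False) -> str:
    """Pick a backend name; `complex_layout` is a caller-supplied assumption."""
    mime = detect_mime(path)
    if mime in PASSTHROUGH_TYPES:
        return "passthrough"   # text/md needs no conversion
    if complex_layout:
        return "docling"       # fidelity backend for dense structure
    return "unstructured"      # default backend


print(select_backend("notes.md"))                             # passthrough
print(select_backend("regulation.pdf", complex_layout=True))  # docling
print(select_backend("slides.pptx"))                          # unstructured
```

Returning a backend name rather than an instance keeps the sketch small; a real router would presumably hand back a backend object.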
Layer 1 output: NormalizedDocument ↓
2 Semantic Enrichment
Two LLM passes against the TUG controlled vocabulary. The text pass classifies the document
(subject, artifact type, methodological phase, audience). The vision pass describes each
extracted image and identifies artifact types in diagrams.
Text Pass
Claude API classifies the document against the controlled vocabulary: artifact type, PASS phase, subject domain, audience, abstraction level.
Vision Pass
Claude Vision describes each extracted image. It is tuned to identify TUG artifact types: current-state models, ecosystem maps, org charts, and journey maps.
TUG Vocabulary v1.0
22 Artifact Types
PASS Phases
Subject Domains
Audience Types
Abstraction Levels
Layer 2 output: EnrichmentMetadata + tagged images ↓
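Classifying "against a controlled vocabulary" implies that labels returned by the model are checked before they are stored. The sketch below shows one way that check could look. It is an assumption-laden illustration: `validate_labels` and the fields of `EnrichmentMetadata` are hypothetical, and the vocabulary is deliberately partial, containing only the four artifact types named in this section rather than the full TUG Vocabulary v1.0.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Partial, illustrative vocabulary. The real vocabulary lists 22 artifact
# types plus phases, subject domains, audiences, and abstraction levels.
VOCABULARY: Dict[str, Set[str]] = {
    "artifact_type": {
        "current-state model",
        "ecosystem map",
        "org chart",
        "journey map",
    },
}


@dataclass
class EnrichmentMetadata:
    """Hypothetical container: accepted labels and what was turned away."""
    labels: Dict[str, str] = field(default_factory=dict)
    rejected: Dict[str, str] = field(default_factory=dict)


def validate_labels(raw: Dict[str, str],
                    vocabulary: Dict[str, Set[str]]) -> EnrichmentMetadata:
    """Keep only labels whose facet and value appear in the vocabulary."""
    meta = EnrichmentMetadata()
    for facet, value in raw.items():
        allowed = vocabulary.get(facet)
        normalized = str(value).strip().lower()
        if allowed is not None and normalized in allowed:
            meta.labels[facet] = normalized
        else:
            # Unknown facet or off-vocabulary value: keep it for review
            # instead of silently storing it as metadata.
            meta.rejected[facet] = value
    return meta


# Example model output with one valid label and one facet that this
# partial vocabulary does not cover.
meta = validate_labels(
    {"artifact_type": " Org Chart ", "audience": "executives"},
    VOCABULARY,
)
print(meta.labels)    # {'artifact_type': 'org chart'}
print(meta.rejected)  # {'audience': 'executives'}
```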
3 Domain Extraction
Pluggable extractors that produce structured domain output. When an extractor produces
canonical model JSON (matching the structure-engine schema), the output can be opened
directly in the Structure Editor for visualization and editing.
ExtractorRegistry
DomainExtractor ABC
Org Chart Extractor
Canonical Model Builder
Schema Validator
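The pluggable-extractor pattern named above (a registry plus an abstract base class) might look something like the following. Only the class names `ExtractorRegistry` and `DomainExtractor` come from this page; the method names, the `"Manager -> Report"` input format, and the output shape are assumptions made for illustration. In particular, the returned dictionary is a placeholder and does not claim to match the structure-engine schema.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DomainExtractor(ABC):
    """Base class for extractors (method names are assumptions)."""

    @abstractmethod
    def matches(self, metadata: Dict[str, str]) -> bool:
        """Return True if this extractor applies to the document."""

    @abstractmethod
    def extract(self, text: str) -> dict:
        """Produce structured output from normalized text."""


class OrgChartExtractor(DomainExtractor):
    def matches(self, metadata: Dict[str, str]) -> bool:
        return metadata.get("artifact_type") == "org chart"

    def extract(self, text: str) -> dict:
        # Toy parser: each "Manager -> Report" line becomes one edge.
        nodes: List[str] = []
        edges: List[Dict[str, str]] = []
        for line in text.splitlines():
            if "->" not in line:
                continue
            parent, child = (part.strip() for part in line.split("->", 1))
            for name in (parent, child):
                if name not in nodes:
                    nodes.append(name)
            edges.append({"from": parent, "to": child})
        return {"nodes": nodes, "edges": edges}


class ExtractorRegistry:
    def __init__(self) -> None:
        self._extractors: List[DomainExtractor] = []

    def register(self, extractor: DomainExtractor) -> None:
        self._extractors.append(extractor)

    def run(self, text: str, metadata: Dict[str, str]) -> Optional[dict]:
        """Return the first matching extractor's output, else None."""
        for extractor in self._extractors:
            if extractor.matches(metadata):
                return extractor.extract(text)
        return None


registry = ExtractorRegistry()
registry.register(OrgChartExtractor())
chart = registry.run("CEO -> CTO\nCEO -> CFO", {"artifact_type": "org chart"})
skipped = registry.run("Some memo text", {"artifact_type": "journey map"})
```

Returning `None` when no extractor matches mirrors the "optional" nature of Layer 3: documents without a matching extractor simply produce no canonical JSON.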
Normalized Text: .txt sidecar file for compatibility with existing Librarian skills. Used by: Librarian Skills, Similarity, Briefing.
Enrichment Metadata: Classified artifact type, phase, subject, audience, keywords. Used by: Organizer, Library Collections.
Canonical JSON: Domain-specific structured output (when extractor matches). Used by: Structure Editor, Modeling Tools.
Cross-cutting: Feeds the Library (text + metadata), Modeling (canonical JSON), and all capabilities that consume enriched documents.
Standalone tool: users download canonical JSON and load it into the Structure Editor independently.
The MVP uses open-source backends only (Docling + Unstructured). Commercial backends can slot in via the NormalizationBackend ABC.
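To make the "slot in" idea concrete, here is a minimal sketch of what a backend interface could look like. Only the names `NormalizationBackend` and `NormalizedDocument` appear on this page; the method signatures, the trimmed-down document fields, and `PassthroughBackend` are assumptions for illustration. A commercial backend would, under this assumption, subclass the same ABC and wrap its vendor's own client.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class NormalizedDocument:
    """Trimmed-down stand-in; the real model also carries images, tables, lists."""
    text: str
    headings: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


class NormalizationBackend(ABC):
    """Hypothetical interface: the actual ABC's methods are not shown here."""

    @abstractmethod
    def supports(self, suffix: str) -> bool:
        """Return True if this backend can handle the file extension."""

    @abstractmethod
    def normalize(self, path: Path) -> NormalizedDocument:
        """Convert the file at `path` into a NormalizedDocument."""


class PassthroughBackend(NormalizationBackend):
    """Reads text/Markdown as-is and collects Markdown-style headings."""

    def supports(self, suffix: str) -> bool:
        return suffix.lower() in {".txt", ".md"}

    def normalize(self, path: Path) -> NormalizedDocument:
        text = path.read_text(encoding="utf-8")
        headings = [
            line.lstrip("#").strip()
            for line in text.splitlines()
            if line.startswith("#")
        ]
        return NormalizedDocument(
            text=text,
            headings=headings,
            metadata={"backend": "passthrough"},
        )


# Usage: normalize a small Markdown file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.md"
    sample.write_text("# Title\nBody text\n", encoding="utf-8")
    backend = PassthroughBackend()
    doc = backend.normalize(sample)

print(doc.headings)  # ['Title']
```

Keeping the interface to two methods means the router only needs to ask each backend whether it supports a file and then call `normalize`.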