Output: graph-data.json
Requires: scikit-learn + numpy
Status: Working
What It Does
The Similarity skill computes a cosine similarity matrix across all documents in a corpus. Each document is converted to a TF-IDF vector (term frequency weighted by inverse document frequency), and the cosine of the angle between each pair of vectors gives a similarity score from 0 (no terms in common) to 1 (the same terms in the same proportions).
The output feeds the Organizer Console's Relationship Map — a force-directed graph where documents are nodes and similarity scores are edges.
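To make the two steps concrete, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is an illustration only: the skill itself relies on scikit-learn, and the whitespace tokenizer, the smoothed IDF formula, the sample documents, and the function names below are assumptions rather than the script's actual code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (assumed weighting scheme)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a small weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Pairwise cosine similarity between all documents."""
    vectors = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Hypothetical three-document corpus.
docs = [
    "volunteer schedule for the spring canvass",
    "spring canvass volunteer sign up sheet",
    "quarterly budget report and expenses",
]
matrix = similarity_matrix(docs)
```

In this toy corpus the first two documents share three terms and score roughly 0.37, while the first and third share none and score exactly 0. Dividing by the vector lengths is what makes the score independent of document length.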
Interpreting Similarity Scores
Score        Meaning           Interpretation
< 0.1        Unrelated         Different topics and vocabulary
0.1 – 0.3    Loosely related   Some shared concepts
0.3 – 0.5    Related           Shared themes or vocabulary
> 0.5        Very similar      Same topic, possible near-duplicates
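If you want to apply these bands programmatically, a small helper like the following would do. The function name is hypothetical, and because the bands above share their endpoints, the handling of exact boundary values (0.1 and 0.3 go to the higher band, 0.5 stays in "Related") is an assumption.

```python
def describe_similarity(score):
    """Map a cosine similarity score to one of the four labels."""
    if score < 0.1:
        return "Unrelated"
    if score < 0.3:
        return "Loosely related"
    if score <= 0.5:
        # Only scores strictly above 0.5 count as "Very similar".
        return "Related"
    return "Very similar"


for value in (0.05, 0.2, 0.45, 0.8):
    print(f"{value:.2f}  {describe_similarity(value)}")
```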
What to Look For
Tight Clusters
Groups of documents with high mutual similarity suggest natural categories that may or may not match existing labels.
Bridge Documents
Documents similar to multiple clusters may be good entry points or connecting pieces between topic areas.
Isolated Nodes
Documents whose similarity to every other document stays at or below 0.1. These may be unique content, outliers, or off-topic material.
Unexpected Connections
Documents from different categories that are highly similar reveal cross-cutting themes worth investigating.
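Two of these patterns are easy to check directly against a similarity matrix. The sketch below assumes the matrix is a list of rows aligned with a list of document names and that each document has a category label. The function names, file names, categories, and scores are all made up for illustration.

```python
def isolated_documents(names, matrix, floor=0.1):
    """Documents with no similarity above `floor` to any other document."""
    isolated = []
    for i, row in enumerate(matrix):
        others = [score for j, score in enumerate(row) if j != i]
        if all(score <= floor for score in others):
            isolated.append(names[i])
    return isolated


def unexpected_connections(names, categories, matrix, threshold=0.5):
    """Pairs from different categories whose similarity exceeds `threshold`."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if categories[i] != categories[j] and matrix[i][j] > threshold:
                pairs.append((names[i], names[j], matrix[i][j]))
    # Strongest cross-category links first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical data: four documents, three categories.
names = ["agenda.md", "minutes.md", "budget.md", "recipe.md"]
categories = ["meetings", "meetings", "finance", "misc"]
matrix = [
    [1.00, 0.62, 0.55, 0.02],
    [0.62, 1.00, 0.20, 0.05],
    [0.55, 0.20, 1.00, 0.08],
    [0.02, 0.05, 0.08, 1.00],
]

print(isolated_documents(names, matrix))
print(unexpected_connections(names, categories, matrix))
```

With this data, recipe.md is the only isolated document, and agenda.md and budget.md form the only cross-category pair above 0.5.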
How to Use
Via Claude:
"Compute the similarity matrix for the documents in [path].
Show me which documents are most related to each other."
Via command line:
python skills/librarian-similarity/scripts/compute_similarity.py \
--input ./raw \
--data dashboard/data.json \
--output dashboard/graph-data.json
The script prints the top 5 most similar pairs and any isolated documents automatically.
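The ranking behind a "top 5 pairs" report can be sketched as follows. This is not the script's code: the function name, the sample names, and the scores are assumptions, and the real script may break ties or format its output differently.

```python
def top_pairs(names, matrix, k=5):
    """Return the k most similar document pairs as (score, name_a, name_b)."""
    pairs = [
        (matrix[i][j], names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))  # upper triangle: each pair once
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:k]


# Hypothetical three-document matrix.
names = ["a.md", "b.md", "c.md"]
matrix = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]

for score, left, right in top_pairs(names, matrix):
    print(f"{score:.2f}  {left} <-> {right}")
```

Only the upper triangle is scanned because the matrix is symmetric and the diagonal (each document compared with itself) is always 1.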
How It Works
TF-IDF Vectorization — Each document becomes a vector in a 1000-dimensional space. Terms that appear frequently in one document but rarely across the corpus get high weight. Common words get low weight.
Cosine Similarity — The cosine of the angle between two document vectors. Documents that use similar distinctive vocabulary end up close together regardless of document length.
Threshold Filtering — In the Relationship Map visualization, a slider controls the minimum similarity for an edge to appear. This lets you focus on strong connections or see the full web.
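The effect of the slider can be sketched as a filter over document pairs. The edge dictionaries below use hypothetical "source", "target", and "weight" keys. The actual schema of graph-data.json is not documented here, so treat the structure, names, and scores as assumptions.

```python
def edges_above(names, matrix, min_similarity):
    """Build one edge per document pair whose similarity meets the minimum."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= min_similarity:
                edges.append({
                    "source": names[i],
                    "target": names[j],
                    "weight": matrix[i][j],
                })
    return edges


# Hypothetical three-document matrix.
names = ["a.md", "b.md", "c.md"]
matrix = [
    [1.0, 0.7, 0.05],
    [0.7, 1.0, 0.25],
    [0.05, 0.25, 1.0],
]

full_web = edges_above(names, matrix, 0.1)     # slider set low
strong_only = edges_above(names, matrix, 0.5)  # slider set high
print(len(full_web), len(strong_only))
```

Raising the minimum from 0.1 to 0.5 drops the weak b.md to c.md link and leaves only the strong a.md to b.md edge, which is the same trade-off the slider exposes in the visualization.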