Similarity Skill — Understanding Workbench

What It Does

The Similarity skill computes a cosine similarity matrix across all documents in a corpus. Each document is converted to a TF-IDF vector (term frequency weighted by inverse document frequency), then the cosine of the angle between every pair of vectors gives a similarity score from 0 (no shared vocabulary) to 1 (the same terms in the same proportions).

The output feeds the Organizer Console's Relationship Map — a force-directed graph where documents are nodes and similarity scores are edges.

Interpreting Similarity Scores

< 0.1: Unrelated. Different topics and vocabulary.

0.1 – 0.3: Loosely related. Some shared concepts.

0.3 – 0.5

> 0.5: Very similar. Same topic, possible near-duplicates.

What to Look For

Tight Clusters

Groups of documents with high mutual similarity suggest natural categories that may or may not match existing labels.

Bridge Documents

Documents similar to multiple clusters may be good entry points or connecting pieces between topic areas.

Isolated Nodes

Documents whose similarity to every other document is 0.1 or lower. These may be unique content, outliers, or off-topic material.
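As a rough illustration of the isolation rule, the sketch below flags documents whose scores against every other document stay at or below 0.1. The function name `isolated_documents`, the file names, and the sample matrix are hypothetical and are not taken from the skill's script.

```python
def isolated_documents(matrix, names, floor=0.1):
    """Return names of documents with no similarity above `floor` to any other document."""
    isolated = []
    for i, name in enumerate(names):
        # Skip the diagonal: every document has similarity 1.0 with itself.
        others = [matrix[i][j] for j in range(len(names)) if j != i]
        if not any(score > floor for score in others):
            isolated.append(name)
    return isolated


# Hypothetical 4-document similarity matrix (symmetric, 1.0 on the diagonal).
names = ["a.md", "b.md", "c.md", "d.md"]
matrix = [
    [1.00, 0.62, 0.05, 0.31],
    [0.62, 1.00, 0.08, 0.27],
    [0.05, 0.08, 1.00, 0.02],
    [0.31, 0.27, 0.02, 1.00],
]

print(isolated_documents(matrix, names))
```

In this sample only `c.md` is reported, because its highest score against any other document is 0.08.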

Unexpected Connections

Documents from different categories that are highly similar reveal cross-cutting themes worth investigating.

How to Use

Via Claude:

"Compute the similarity matrix for the documents in [path].
Show me which documents are most related to each other."

Via command line:

python skills/librarian-similarity/scripts/compute_similarity.py \
  --input ./raw \
  --data dashboard/data.json \
  --output dashboard/graph-data.json

The script prints the top 5 most similar pairs and any isolated documents automatically.
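The ranking of the most similar pairs can be approximated as follows. This is a sketch under assumptions, not the code in compute_similarity.py: the helper name `top_pairs`, the file names, and the sample matrix are made up for illustration.

```python
import heapq
from itertools import combinations


def top_pairs(matrix, names, k=5):
    """Return the k highest-scoring document pairs as (score, name_a, name_b)."""
    # Visit each unordered pair once; the matrix is assumed to be symmetric.
    pairs = (
        (matrix[i][j], names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
    )
    return heapq.nlargest(k, pairs)


# Hypothetical 4-document similarity matrix.
names = ["a.md", "b.md", "c.md", "d.md"]
matrix = [
    [1.00, 0.62, 0.05, 0.31],
    [0.62, 1.00, 0.08, 0.27],
    [0.05, 0.08, 1.00, 0.02],
    [0.31, 0.27, 0.02, 1.00],
]

for score, left, right in top_pairs(matrix, names, k=2):
    print(f"{left} <-> {right}: {score:.2f}")
```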

How It Works

TF-IDF Vectorization — Each document becomes a vector in a 1000-dimensional space. Terms that appear frequently in one document but rarely across the corpus get high weight. Common words get low weight.
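To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF. It assumes whitespace tokenization and a smoothed IDF formula, and it does not cap the vocabulary at 1000 terms. The actual script may use a library vectorizer with different settings, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts; absent terms count as 0
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


docs = [
    "cats chase mice".split(),
    "dogs chase cats".split(),
    "stocks fell sharply".split(),
]
vocab, vectors = tfidf_vectors(docs)

# "mice" appears in one document and "cats" in two, so "mice" gets more weight.
mice = vectors[0][vocab.index("mice")]
cats = vectors[0][vocab.index("cats")]
print(f"mice={mice:.3f} cats={cats:.3f}")
```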

Cosine Similarity — The cosine of the angle between two document vectors. Documents that use similar distinctive vocabulary end up close together regardless of document length.
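The length independence can be seen in a small sketch. The function below is a plain-Python version of the standard cosine formula (dot product divided by the product of the vector norms), and the sample vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector (empty document) is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


short_doc = [1.0, 2.0, 0.0]    # hypothetical term weights
long_doc = [10.0, 20.0, 0.0]   # same proportions, ten times the magnitude
other = [0.0, 0.0, 3.0]        # no terms in common with the first two

print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other))
```

Scaling a vector changes its length but not its direction, so `short_doc` and `long_doc` score as identical while `other` scores 0.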

Threshold Filtering — In the Relationship Map visualization, a slider controls the minimum similarity for an edge to appear. This lets you focus on strong connections or see the full web.
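The slider's effect can be sketched as a filter over the similarity matrix. This is an illustration only: the name `edges_above`, the tuple layout, and the sample data are assumptions, and the real graph-data.json format is not shown in this document.

```python
def edges_above(matrix, names, threshold):
    """Return (source, target, score) edges whose score meets the threshold."""
    edges = []
    for i in range(len(names)):
        # Only look above the diagonal so each pair yields at most one edge.
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                edges.append((names[i], names[j], matrix[i][j]))
    return edges


# Hypothetical 4-document similarity matrix.
names = ["a.md", "b.md", "c.md", "d.md"]
matrix = [
    [1.00, 0.62, 0.05, 0.31],
    [0.62, 1.00, 0.08, 0.27],
    [0.05, 0.08, 1.00, 0.02],
    [0.31, 0.27, 0.02, 1.00],
]

strong = edges_above(matrix, names, threshold=0.3)
full_web = edges_above(matrix, names, threshold=0.0)
print(len(strong), len(full_web))
```

Raising the threshold keeps only the strongest connections, while a threshold of 0 shows every pair.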

What Comes Next

Organizer Console

Briefing

Present findings and analysis options to the team
