Verse Similarity Analysis
Cross-algorithm comparison of verse-level similarity across 4.29 million Estonian and Finnish runosong verses, with formulaic cluster analysis and geographic spread.
What this page shows
A statistical analysis of verse-level similarity across the entire Finnic runosong corpus (4.29 million verse occurrences, 289,702 poems). The analysis compares how four different algorithms find similar verses and identifies formulaic patterns.
Algorithm Comparison Dashboard
- Jaccard — exact wordform overlap between two verse lines (|intersection| / |union|), with adaptive minimum shared words and IDF weighting.
- TF-IDF — lemma-level cosine similarity, weighted by term rarity across the corpus.
- Translation — cross-lingual similarity via English translation vectors, enabling Estonian-Finnish matching.
- Char Bigram — character bigram overlap using FAISS approximate nearest neighbor search, capturing orthographic similarity between verses.
- Each card shows coverage (how many verses have matches), average score, and match type distribution (s = same language, x = cross-lingual, w = within-poem).
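To make the difference between word-level and character-level measures concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the pipeline behind this page: the Jaccard function omits the adaptive minimum shared words and IDF weighting, the bigram comparison is a direct cosine over bigram counts rather than a FAISS approximate search, and the function names and example verses are hypothetical.

```python
import math
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Plain Jaccard overlap of the wordform sets of two verses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def char_bigrams(text: str) -> Counter:
    """Count character bigrams of the lowercased verse (spaces included)."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two spelling variants of the same line (illustrative only).
verse_a = "vaka vanha väinämöinen"
verse_b = "vaka vanha väinämöini"

word_score = jaccard(verse_a, verse_b)  # 2 shared words out of 4 distinct
bigram_score = cosine(char_bigrams(verse_a), char_bigrams(verse_b))
```

In this example the word-level score is only 0.5 because the last wordform differs by one ending, while the bigram score stays much higher. That is the kind of orthographic variation a character-level measure is meant to absorb.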
Cross-Algorithm Discordance
For verses that have matches in more than one algorithm, how much do their neighbor lists overlap? High discordance (low overlap) means the algorithms retrieve largely different neighbors for the same verse, which suggests they capture complementary aspects of similarity.
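One simple way to quantify this is the set overlap of the top-k neighbor lists that two algorithms return for the same verse. The sketch below assumes that definition (Jaccard overlap of the top-k sets, with discordance as its complement); the page does not state the exact measure used, and the verse IDs are hypothetical.

```python
def neighbor_overlap(a: list[str], b: list[str], k: int = 10) -> float:
    """Jaccard overlap of the top-k neighbor IDs from two algorithms."""
    sa, sb = set(a[:k]), set(b[:k])
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical neighbor lists for one query verse.
jaccard_neighbors = ["v12", "v40", "v7"]
tfidf_neighbors = ["v40", "v99", "v12"]

overlap = neighbor_overlap(jaccard_neighbors, tfidf_neighbors)
discordance = 1.0 - overlap
```

Here the two lists share two of four distinct neighbors, so overlap and discordance are both 0.5.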
Formulaic Verse Clusters
- The 200 largest groups of near-identical verses found by RRF neighborhood clustering on the combined similarity graph.
- Size = total verse occurrences in the cluster. Places = distinct collection locations.
- Click a cluster row to see its verse variants, geographic spread, and explore links.
- Use language filters and text search to narrow the table. Click column headers to sort.
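The cluster list depends on combining neighbor rankings from several algorithms. Assuming RRF here stands for reciprocal rank fusion, the fusion step can be sketched as follows. The constant k = 60 is a conventional default rather than a value taken from this page, the IDs are hypothetical, and the clustering applied after fusion is not shown because the page does not describe it.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked neighbor lists: each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        for rank, verse_id in enumerate(ranked, start=1):
            scores[verse_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical neighbor rankings for one query verse.
rankings = {
    "jaccard": ["v1", "v2", "v3"],
    "tfidf": ["v2", "v1", "v4"],
    "char_bigram": ["v2", "v5"],
}
fused = rrf_fuse(rankings)
```

A verse that ranks well in several lists (v2 above) rises to the top even if no single algorithm scores it far ahead of the rest, since only ranks, not raw scores, enter the sum.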
Geographic Spread Charts
The scatter plot shows how cluster size relates to geographic distribution. The histogram shows how many clusters occur at each number of distinct places. Widely distributed formulas represent the most universal elements of Finnic oral poetry.
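The data behind both charts can be derived from the cluster table alone. The sketch below assumes each cluster is reduced to a (size, places) pair and that the histogram groups place counts into fixed-width bins; the bin width and the sample values are hypothetical.

```python
from collections import Counter


def spread_summary(clusters: list[tuple[int, int]], bin_width: int = 10):
    """Return scatter points and a binned histogram of distinct places."""
    points = [(size, places) for size, places in clusters]
    # Each bin is labelled by its lower edge, e.g. 10 covers 10..19 places.
    hist = Counter((places // bin_width) * bin_width for _, places in clusters)
    return points, hist


# Hypothetical (size, places) pairs for four clusters.
sample = [(1200, 85), (300, 12), (250, 17), (90, 3)]
points, hist = spread_summary(sample)
```

With these sample values, two clusters fall in the 10 to 19 places bin and one each in the 0 to 9 and 80 to 89 bins.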