Poem & Verse Similarity Guide
What is this page?
A reference guide to the similarity system in RunoVerse. It documents the five poem-level and four verse-level similarity algorithms, along with the tools that use them to find related poems and verses across 292K poems and 4.29M verse lines.
How to navigate
Each section describes one tool: what it does and its key features, with a direct link to open it. The final table summarises the poem-level algorithms and how they differ. Scroll or use the browser's find-in-page (Ctrl/Cmd+F) to jump to a section.
Related explorer pages
Similarity Explorer — poem and verse similarity search · Poem Reader — per-poem similarity tabs · Verse Network — interactive similarity graph · Path Finder — shortest verse chains · Formula Explorer — formulaic patterns · Verse Analysis — cross-algorithm dashboard
RunoVerse provides multiple ways to explore how poems and verses relate to each other across the Finnish and Estonian runosong traditions. Whether you are tracing the spread of a formulaic verse line, comparing thematically related poems from different regions, or mapping cross-lingual parallels, the tools below offer complementary perspectives on the 292,092-poem, 4.29-million-verse corpus.
Poem Reader
The Poem Reader is an interactive viewer for all 292,092 poems from the SKVR (Finnish, published), JR (Finnish, unpublished), and ERAB (Estonian) collections. Each word in a poem is annotated with its standard orthography, English gloss, part-of-speech tag, and a link to its lemma in the dictionary. Clicking a word opens a dictionary lookup panel with full definitions from up to nine lexicographic sources.
- Word-level annotations: standard orthography, English glosses, POS tags, and lemma links for every token
- Dictionary lookups on click, drawing from Estonian and Finnish lexicographic sources
- Per-verse similarity expansion — click the arrow beside any verse line to see similar verses from other poems across the corpus
- Related Poems panel with five similarity algorithm tabs, showing the nearest neighbor poems ranked by each method
- Geographic map displaying where similar poems were collected
- Direct linking via URL: open any poem with ?poem=ID (e.g., reader.html?poem=skvr01001001)
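The direct-link pattern above can be generated programmatically. The following is a minimal sketch: the helper name poem_link is hypothetical, and the link is kept relative because this guide does not give the site's base URL.

```python
from urllib.parse import urlencode


def poem_link(poem_id: str, page: str = "reader.html") -> str:
    """Return a relative deep link to a poem in the Poem Reader.

    `page` defaults to the reader.html page named in this guide;
    the function name itself is a hypothetical helper.
    """
    return f"{page}?{urlencode({'poem': poem_id})}"


print(poem_link("skvr01001001"))  # reader.html?poem=skvr01001001
```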
Similarity Explorer
The Similarity Explorer is a standalone tool for investigating poem and verse similarity in depth. It operates in two modes.
Poem mode lets you select any poem and view its nearest neighbors across all five similarity algorithms. Each algorithm displays its top matches with numeric similarity scores. A side-by-side comparison panel lets you read two poems together, line by line. A network graph visualizes the similarity neighborhood, showing how poems cluster. Geographic and temporal analytics reveal where and when similar poems were collected. Cross-algorithm agreement badges highlight poems that appear as top matches in multiple algorithms.
Verse mode lets you search for individual verse lines and see similar verses from across the corpus. You can also browse the top 200 formulaic patterns — the most widely distributed recurring verse lines, ranked by how many poems and collection places they span.
- Five poem-level similarity algorithms with ranked results and scores
- Side-by-side poem comparison panel
- Network graph showing similarity neighborhoods
- Geographic and temporal distribution of similar poems
- Algorithm agreement badges (BOTH) for cross-algorithm matches and ET/FI badges for cross-lingual matches
- Verse search with similar-verse results
- Top 200 formulaic verse patterns
Verse Concordance
The Verse Concordance provides full-text search across 2,906,535 unique verse types drawn from 4.29 million total verse lines. Enter any text fragment to find matching verses. Results include a geographic distribution map showing where each verse was collected, a language breakdown (Estonian vs. Finnish), and occurrence tables listing which poems contain the verse and how frequently it appears.
- Full-text search across 2.9 million unique verse lines
- Geographic distribution map for each matching verse
- Language breakdown showing Estonian and Finnish occurrences
- Occurrence tables with poem IDs and frequencies
Verse Network
The Verse Network displays an interactive force-directed graph of verse similarity neighborhoods. Start from any verse and explore multi-hop connections to see how verses are linked through shared similarity across the corpus. Each edge shows a per-algorithm score breakdown, so you can see which similarity measures contribute to each connection. Results can be exported as CSV for further analysis.
- Interactive force-directed graph centered on any starting verse
- Multi-hop exploration — follow chains of similar verses
- Per-algorithm score breakdown on each connection (Jaccard, TF-IDF, Translation, CharBigram)
- CSV export of network data
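Multi-hop exploration and CSV export can be pictured with a small sketch. Everything here is an assumption made for illustration: the toy graph, the verse identifiers, the scores, and the column layout are invented, and the real Verse Network export may be structured differently.

```python
import csv
import io
from collections import deque

ALGORITHMS = ("jaccard", "tfidf", "translation", "charbigram")

# Hypothetical toy data: verse -> {neighbor: per-algorithm scores}.
GRAPH = {
    "v1": {"v2": {"jaccard": 0.5, "charbigram": 0.9}},
    "v2": {"v3": {"tfidf": 0.7, "translation": 0.6}},
    "v3": {"v4": {"jaccard": 0.4}},
}


def neighborhood(graph, start, max_hops):
    """Collect edges leaving verses within max_hops - 1 hops of start (breadth-first)."""
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        verse, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, scores in graph.get(verse, {}).items():
            edges.append((verse, neighbor, scores))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return edges


def edges_to_csv(edges):
    """Serialize edges with one column per algorithm; missing scores stay blank."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", *ALGORITHMS])
    for source, target, scores in edges:
        writer.writerow([source, target, *(scores.get(a, "") for a in ALGORITHMS)])
    return buffer.getvalue()


print(edges_to_csv(neighborhood(GRAPH, "v1", max_hops=2)))
```

Raising max_hops widens the neighborhood one step at a time, which mirrors following chains of similar verses outward from the starting verse.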
Verse Path Finder
The Verse Path Finder locates the shortest chain of similar verses connecting any two verses in the Finnic runosong corpus. The result is displayed as an interactive chain showing each intermediate verse and the similarity scores between consecutive steps. This reveals how seemingly unrelated verses may be connected through a sequence of incremental textual similarities.
- Shortest-path search between any two verses
- Interactive chain visualization with intermediate verses
- Similarity scores shown between each step
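One common way to find such a chain is breadth-first search over a graph whose edges join sufficiently similar verses. The sketch below assumes that "shortest" means fewest hops and that edges are unweighted; the guide does not say which search method or threshold the Path Finder uses, and the toy graph is invented.

```python
from collections import deque


def shortest_chain(graph, start, goal):
    """Return the fewest-hop chain of verses from start to goal, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        verse = queue.popleft()
        for neighbor in graph.get(verse, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = verse
            if neighbor == goal:
                # Walk parent links back to the start, then reverse.
                chain = [goal]
                while parents[chain[-1]] is not None:
                    chain.append(parents[chain[-1]])
                return chain[::-1]
            queue.append(neighbor)
    return None


# Hypothetical adjacency list: an edge means "similar enough to link".
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}

print(shortest_chain(toy_graph, "a", "e"))  # ['a', 'b', 'd', 'e']
```

If the similarity scores were to be treated as edge costs instead, a weighted search such as Dijkstra's algorithm would replace the plain queue.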
Formula Explorer
The Formula Explorer lets you browse 200 formulaic verse patterns ranked by frequency and geographic spread. A “formula” here is a cluster of similar verse lines found across many poems — evidence of oral tradition transmission, where singers in different times and places used recognizably similar wording. Each cluster shows its variant texts, the number of member verse occurrences across poems, geographic spread across collection places, and cross-links to the network visualization for further exploration.
- 200 formulaic verse clusters ranked by frequency and geographic spread
- Variant texts within each cluster
- Occurrence counts and poem distribution
- Geographic spread across collection places
- Cross-links to the Verse Network for deeper exploration
Verse Similarity Analysis
The Verse Similarity Analysis is a cross-algorithm dashboard showing how the four verse-level similarity algorithms — Jaccard, TF-IDF, Translation-pivot, and CharBigram — compare across the 4.29 million verse lines in the corpus. It includes formulaic cluster analysis and geographic coverage metrics, providing a high-level view of how similarity patterns distribute across the material.
- Cross-algorithm comparison dashboard for four verse similarity measures
- Formulaic cluster analysis
- Geographic coverage metrics
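Two of the four verse-level measures can be illustrated without any corpus statistics. The sketch below shows a Jaccard index over wordform sets and a cosine similarity over character bigram counts. Tokenization by whitespace, lowercasing, and the example variant spelling are assumptions, not details taken from RunoVerse. TF-IDF and Translation-pivot are left out because they depend on corpus-level weights and translations.

```python
import math
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Jaccard index over the sets of whitespace-separated wordforms."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def char_bigrams(text: str) -> Counter:
    """Count overlapping two-character substrings, spaces included."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def charbigram_cosine(a: str, b: str) -> float:
    """Cosine similarity between character bigram count vectors."""
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Illustrative pair: the second spelling is an invented dialectal variant.
verse_a = "vaka vanha väinämöinen"
verse_b = "vaka vanha väinämöini"

print(jaccard(verse_a, verse_b))            # 0.5
print(charbigram_cosine(verse_a, verse_b))  # close to 1
```

The pair shows why the measures complement each other: a single changed word ending halves the Jaccard score, while the character bigram cosine stays high.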
Poem-level similarity algorithms
Five algorithms are used to identify related poems. Each captures a different aspect of similarity — from shared vocabulary to cross-lingual thematic overlap to structural verse alignment. Results from all five are available in the Poem Reader and the Similarity Explorer.
| Algorithm | Basis | Description |
|---|---|---|
| TF-IDF Lemma | Lemma-level | Cosine similarity on TF-IDF vectors of lemmatized poem texts. Captures thematic similarity through shared vocabulary, weighted by corpus-level term importance. |
| Wordform Overlap (Jaccard) | Exact wordforms | Jaccard index over raw wordform sets. Identifies poems sharing exact surface forms, useful for detecting formulaic lines and direct textual parallels. |
| Thematic (Translation-pivot) | Cross-lingual | Boolean-IDF cosine similarity over English translations derived from DeepSeek annotations. Enables cross-lingual comparison between Estonian and Finnish poems via a shared semantic space. |
| Alignment | Character n-gram | Verse sequence alignment using character bigram cosine similarity and dynamic programming, from the FILTER project (Janicki, Kallio & Sarv 2023). Captures structural similarity — poems that follow the same verse order score high. |
| Verse-level RRF | Verse-level fusion | Fuses Jaccard, TF-IDF, Translation, and CharBigram similarity at the verse level using Average-Best-Per-Verse aggregation, then combines all four via Reciprocal Rank Fusion into a single poem-level ranking. |
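The last row of the table combines two steps that a short sketch can make concrete. Below, Average-Best-Per-Verse is read as "for each verse of poem A, take its best match in poem B, then average", and Reciprocal Rank Fusion uses the standard formula that sums 1 / (k + rank) across rankings. The constant k = 60 is the conventional default, not a value stated in this guide, and the poem identifiers and rankings are invented.

```python
def average_best_per_verse(poem_a, poem_b, sim):
    """Average, over verses of poem_a, of the best similarity to any verse of poem_b.

    This reading is directional (A against B); whether RunoVerse
    symmetrizes the score is not stated in the guide.
    """
    if not poem_a or not poem_b:
        return 0.0
    return sum(max(sim(va, vb) for vb in poem_b) for va in poem_a) / len(poem_a)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of poem IDs into a single ranking."""
    fused = {}
    for ranking in rankings:
        for rank, poem_id in enumerate(ranking, start=1):
            fused[poem_id] = fused.get(poem_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by poem ID for determinism.
    return sorted(fused, key=lambda p: (-fused[p], p))


# Hypothetical rankings, one per verse-level measure.
rankings = [
    ["p1", "p2", "p3"],  # Jaccard
    ["p2", "p1", "p3"],  # TF-IDF
    ["p2", "p3", "p1"],  # Translation
    ["p1", "p2", "p3"],  # CharBigram
]
print(reciprocal_rank_fusion(rankings))  # ['p2', 'p1', 'p3']


def exact_match(a, b):
    """Stand-in verse similarity for the demo: 1.0 on identical text."""
    return 1.0 if a == b else 0.0


print(average_best_per_verse(["x", "y"], ["x", "z"], exact_match))  # 0.5
```

Because the fusion works on ranks rather than raw scores, the four measures do not need to share a common score scale before they are combined.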