RunoVerse

About the RunoVerse

What this page covers

This About page is the reference guide for the entire RunoVerse platform. It documents the three source corpora (SKVR, JR, ERAB), corpus-wide statistics (439K lemmas, 15.3M tokens, 292K poems), and the methodology behind the lemmatization and AI annotation pipelines.

Sections on this page

Source corpora — descriptions and poem counts for each collection, with a visual bar chart.
Lexicon statistics — counts for lemmas, word forms, tokens, cognate pairs, translation/etymology confidence, and gloss coverage.
Data sources — explains the three source categories (Corpus Only, Both Sources, DeepSeek Only) and what the color-coded word forms and agreement badges mean.
Similarity — documents all 5 poem-level and 4 verse-level similarity algorithms with their basis and coverage.
Dictionary annotations — lists all 9 lexicographic sources (EMS, EKSS, IMS, ERLA, VMS, Seto, SMS, KKS, VKS).
Lemmatization — describes the Estonian (EstNLTK) and Finnish (Omorfi/Voikko/Stanza) processing pipelines, plus the DeepSeek AI annotation layer.

Explore cards

The card grid below links to all 70+ pages in RunoVerse. Each card shows a short description of the tool. For longer explanations, see the five Feature Guides (Lexicon, Similarity, Poetics, Cross-Lingual, Corpus).

Navigating the site

Use the top navigation bar to reach the main pages (Dictionary, Reader, Similarity, About). The More dropdown provides access to every explorer page. The Site Map organizes all pages into 11 categories. The Dashboard offers a visual starting point with hero statistics and a research question guide.

What is this?

The RunoVerse is a combined word index of Finnish and Estonian runosong (folk poetry) corpora. It brings together lemmatized word data from three major collections, allowing cross-linguistic exploration of the shared Finnic poetic tradition.

Please note that the RunoVerse is under active development. The lemmatization of historical dialectal texts is inherently approximate, and the AI-generated translations, etymological analyses, and similarity metrics should be considered experimental. The statistics and counts shown may change as the data is refined. This tool is intended as an exploratory aid, not as a definitive reference.

Lemmas: 439,746
Tokens: 15.3M
Poems: 292,092
Corpora: 3
Dictionaries: 9
Verse lines: 4.29M

Source corpora

Collection | Language | Description
SKVR | Finnish | Suomen Kansan Vanhat Runot – published Kalevala-metre poetry. Finnish Literature Society (SKS). 89,247 poems.
JR | Finnish | Julkaisemattomat Runot – unpublished folk poetry from SKS folklore archives. 96,129 poems.
ERAB | Estonian | Eesti Regilaulude Andmebaas – Database of Estonian Runosongs. Estonian Folklore Archives, Estonian Literary Museum. 108,969 poems.

Lexicon statistics

Measure | Count | Notes
Unique lemmas | 439,746 | Distinct base forms across all corpora (incl. 183,137 DeepSeek-only)
Unique wordforms | 1,480,455 | Distinct word tokens occurring across all poem texts
Wordform–lemma mappings | 2,083,995 | Total mappings from inflected forms to lemmas (one wordform can map to multiple lemmas)
Total tokens | 15,264,640 | Total word occurrences in source texts
Poems | 292,092 | Unique poems with full verse texts available in the poem context viewer (294,345 total in source corpora; some excluded due to missing verse text data)
Finnish-only lemmas | 206,518 | Lemmas from Finnish corpus only (SKVR/JR collections)
Estonian-only lemmas | 100,835 | Lemmas from Estonian corpus only (ERAB)
Shared (Finnic) lemmas | 1,240 | Lemmas found in both Finnish and Estonian sources
Cognate pairs (ET↔FI) | 6,382 | Automatically discovered Estonian-Finnish cognate pairs based on translation overlap, etymological roots, and orthographic similarity (1,114 exact, 2,390 near-exact, 2,873 bridged, 5 orthographic)
Translation confidence | 192,236 | Lemmas with DeepSeek translation consistency score (8,446 strong, 11,069 good, 41,650 moderate, 131,071 low; 113,502 no data)
Etymology confidence | 211,919 | Lemmas with DeepSeek etymology consistency score (11,748 strong, 13,183 good, 46,078 moderate, 140,910 low; 93,819 no data)
Gloss coverage | 91.3% | Word forms with English translation (1,344,094 of 1,472,442), including 11,423 Claude Opus supplementary glosses
Corpus attestations | 15,264,640 | SKVR: 4,522,811 · JR: 3,398,967 · ERAB: 7,341,908

Statistics reflect the current state of the lemmatized data and may change as lemmatization is refined.

Data sources and source filter

Each lemma in the lexicon has been tagged with one of three source categories, reflecting how it was identified. The source filter dropdown in the main view lets you filter by these categories:

Source | Lemmas | Meaning
Corpus Only | 5,217 | Lemma was identified by the corpus lemmatization pipeline. None of the word forms listed under this lemma were matched to DeepSeek annotations during the merge. However, the lemma string itself may still appear as a word form in DeepSeek data, which means some “Corpus Only” entries can still have AI-generated translations visible via the A–Z browse.
Both Sources | 251,392 | Lemma comes from the corpus pipeline, and at least one of its word forms also appears in the DeepSeek annotations (possibly under a different lemma). These entries typically include AI-generated translations and may have cross-references (dsLemma) to alternative lemmatizations. Word forms in “Both Sources” entries are color-coded: green when both systems agree on the lemma, amber when DeepSeek (DS) assigns a different lemma, and gray when the word form is not in the DS data.
DeepSeek Only | 183,137 | Lemma exists only in the DeepSeek annotations. The underlying word forms often appear in the corpus under different lemmas (96% of cases), but this particular lemmatization is unique to the AI analysis.

The source categories reflect word-form-level overlap between the two lemmatization systems, not whether an entry has translations. Because the corpus pipeline and DeepSeek sometimes lemmatize the same word forms differently, a word form can belong to a “Corpus Only” lemma while also appearing independently in the DeepSeek data under a different lemma. The “Both Sources” category captures entries where the same word forms were recognized by both systems.
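
The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration, not the RunoVerse merge code: the function name `classify_lemmas`, the dictionary layout (lemma mapped to a set of word forms), and the toy word forms are all assumptions made for the example.

```python
from typing import Dict, Set


def classify_lemmas(
    corpus: Dict[str, Set[str]],
    deepseek: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Assign each lemma one of the three source categories.

    Both arguments map a lemma to the set of word forms listed under it
    (hypothetical data layout, chosen for this sketch).
    """
    # Every word form that DeepSeek annotated, regardless of its lemma.
    ds_wordforms = set().union(*deepseek.values())

    categories: Dict[str, str] = {}
    for lemma, forms in corpus.items():
        # "Both Sources": at least one word form is also in the DeepSeek
        # data, possibly under a different lemma.
        categories[lemma] = "Both Sources" if forms & ds_wordforms else "Corpus Only"
    for lemma in deepseek:
        # Lemmas the corpus pipeline never produced.
        categories.setdefault(lemma, "DeepSeek Only")
    return categories


# Invented toy data: DeepSeek files "laulun" under a different lemma.
corpus = {"laulu": {"laulun", "laulua"}, "kulta": {"kullan"}}
deepseek = {"laulaa": {"laulun"}, "mesi": {"mettä"}}
categories = classify_lemmas(corpus, deepseek)
print(categories)
```

In the toy data, "laulu" lands in "Both Sources" even though DeepSeek assigns its word form to another lemma, which mirrors the point that the categories track word-form overlap rather than lemma agreement.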

The agreement badge in the DeepSeek tab shows a ratio like “30/35 agree +15 n/a”, meaning 30 out of 35 DS-covered word forms have the same lemma in both systems, and 15 word forms are not present in the DS data. Hover over the badge for a full breakdown.
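
As a rough illustration of how such a badge string could be derived, the sketch below counts agreeing, DS-covered, and uncovered word forms for one lemma. The function name `agreement_badge` and the simplification that each word form has at most one DeepSeek lemma are assumptions for this example, not a description of the site's code.

```python
from typing import Dict, Iterable


def agreement_badge(
    corpus_lemma: str,
    wordforms: Iterable[str],
    ds_lemma_of: Dict[str, str],
) -> str:
    """Format an 'X/Y agree +Z n/a' badge for one corpus lemma.

    ds_lemma_of maps a word form to its DeepSeek lemma (hypothetical layout;
    assumes one DeepSeek lemma per word form for simplicity).
    """
    agree = covered = missing = 0
    for form in wordforms:
        ds_lemma = ds_lemma_of.get(form)
        if ds_lemma is None:
            missing += 1          # word form absent from the DS data
            continue
        covered += 1              # word form is DS-covered
        if ds_lemma == corpus_lemma:
            agree += 1            # both systems chose the same lemma
    return f"{agree}/{covered} agree +{missing} n/a"


# Invented toy data for the lemma "laulu".
badge = agreement_badge(
    "laulu",
    ["laulun", "laulua", "laulut", "lauluja"],
    {"laulun": "laulu", "laulua": "laulu", "laulut": "laulaa"},
)
print(badge)
```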

DeepSeek AI annotations

A subset of the corpus was independently annotated using DeepSeek, a large language model, to provide additional linguistic analysis. The AI annotations include:

Measure | Count | Notes
DeepSeek tokens | 5,962,070 | AI-annotated token occurrences (ET: 2,867,388 + FI: 3,094,682)
DeepSeek-only lemmas | 183,137 | Lemmas unique to the AI analysis
English translations | 241,141 | Unique English terms extracted from AI annotations, browsable via A–Z (1,252,781 total mappings)
Cross-references | 91,754 | Entries linking to alternative lemmatizations between corpus and DeepSeek

AI-generated annotations are provided as supplementary material and have not been manually verified. They should be used with appropriate caution, particularly for etymological claims and translations of rare dialectal forms.

Similarity and embedding data

The lexicon includes two similarity systems, one at the poem level and one at the verse level, for exploring relationships between texts:

Poem similarity

Five poem-level similarity algorithms identify related poems across the 292,092-poem corpus. Results are available in the Poem Reader (Related Poems panel) and the standalone Similarity Explorer with side-by-side comparison, network graphs, and geographic/temporal analytics.

Algorithm | Basis | Description
TF-IDF Lemma | Lemma-level | Cosine similarity on TF-IDF vectors of lemmatized poem texts. Captures thematic similarity through shared vocabulary, weighted by corpus-level term importance. Top 50 neighbors per poem.
Wordform Overlap (Jaccard) | Exact wordforms | Jaccard index (|A∩B| / |A∪B|) over raw wordform sets. Identifies poems sharing exact surface forms, useful for detecting formulaic lines and direct textual parallels.
Thematic (Translation-pivot) | Cross-lingual | Boolean-IDF cosine similarity over English translations derived from DeepSeek annotations, with lemma-level fallback for improved coverage. Enables cross-lingual comparison between Estonian and Finnish poems via a shared semantic space. Top 50 neighbors per poem.
Alignment | Character n-gram | Verse sequence alignment using character bigram cosine similarity and Wagner-Fischer dynamic programming, from the FILTER project (Janicki, Kallio & Sarv 2023). Covers 256,970 poems across SKVR, JR, KR, and ERAB. Captures structural similarity: poems that follow the same verse order score high. Shows aligned verse pair excerpts for top matches. Top 50 neighbors per poem.
Verse-level RRF | Verse-level fusion | Fuses Jaccard, TF-IDF, Translation, and CharBigram similarity at the verse level using Average-Best-Per-Verse aggregation, then combines all four via Reciprocal Rank Fusion (k=60) into a single poem-level ranking. Shows T/J/Tr/C algorithm contribution badges.
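
To make the first two rows of the table concrete, here is a small pure-Python sketch of Jaccard overlap on wordform sets and cosine similarity on TF-IDF vectors. It is an illustration under stated assumptions: the smoothed IDF formula, the helper names, and the three toy "poems" are choices made for this example, and the weighting actually used by RunoVerse is not specified on this page.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard index |A∩B| / |A∪B| over two token sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors; smoothed IDF is an assumption of this sketch."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def top_neighbors(
    index: int, vectors: List[Dict[str, float]], k: int = 50
) -> List[Tuple[float, int]]:
    """Return up to k (score, poem index) pairs, best first."""
    scored = [
        (cosine(vectors[index], v), j) for j, v in enumerate(vectors) if j != index
    ]
    return sorted(scored, reverse=True)[:k]


# Toy lemmatized poems (invented for illustration).
poems = [
    ["kulta", "laulu", "meri"],
    ["kulta", "laulu", "metsä"],
    ["hevonen", "pelto", "aura"],
]
vecs = tfidf_vectors(poems)
print(jaccard(poems[0], poems[1]))  # 2 shared of 4 distinct tokens -> 0.5
print(top_neighbors(0, vecs, k=2))
```

Jaccard treats every shared form equally, while TF-IDF down-weights terms that occur in many poems, which is why the two rankings can disagree on the same pair.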

The Similarity Explorer shows cross-algorithm agreement badges (BOTH) when poems appear in multiple algorithms' results, and ET↔FI badges for cross-lingual matches in the Translation-pivot, Alignment, and Verse-level RRF algorithms.
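
Reciprocal Rank Fusion, which the Verse-level RRF algorithm uses with k=60, can be sketched as follows. The formula is the standard one (each list contributes 1/(k + rank) to an item's score); the use of 1-based ranks, the function name, and the poem identifiers are assumptions of this example rather than details confirmed by the page.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists into one ranking.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1 (assumed convention).
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings.values():
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical per-algorithm neighbor lists for one query poem.
rankings = {
    "jaccard": ["p2", "p3", "p1"],
    "tfidf": ["p3", "p2", "p4"],
    "translation": ["p3", "p1", "p2"],
    "charbigram": ["p2", "p3", "p4"],
}
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Because only ranks enter the score, the four algorithms need no score normalization before being combined, which is the usual reason for choosing rank fusion over averaging raw similarities.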

Verse similarity

Four algorithms (Jaccard, TF-IDF, Translation-pivot, CharBigram) operate at the individual verse level across 4.29 million verse occurrences. Each verse is compared against all verses in other poems, with up to 20 nearest neighbors stored per algorithm.
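
The sketch below illustrates one of the four verse-level measures, character bigram cosine, and one plausible reading of Average-Best-Per-Verse aggregation: for each verse of poem A, take its best match in poem B, then average those best scores. That reading, the function names, and the use of a few Kalevala opening lines as toy data are assumptions of this example.

```python
import math
from collections import Counter
from typing import List


def char_bigrams(verse: str) -> Counter:
    """Count overlapping character bigrams in a lowercased verse."""
    text = verse.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two bigram count vectors."""
    dot = sum(c * v[g] for g, c in u.items())  # Counter yields 0 if missing
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def average_best_per_verse(poem_a: List[str], poem_b: List[str]) -> float:
    """Average, over verses of A, of the best-matching verse score in B.

    Note that this aggregation is asymmetric: swapping A and B can
    change the result.
    """
    if not poem_a or not poem_b:
        return 0.0
    grams_b = [char_bigrams(v) for v in poem_b]
    best = [max(cosine(char_bigrams(v), g) for g in grams_b) for v in poem_a]
    return sum(best) / len(best)


# Toy poems built from Kalevala opening lines (illustrative only).
poem_a = ["mieleni minun tekevi", "aivoni ajattelevi"]
poem_b = ["mieleni minun tekevi", "lähteäni laulamahan"]
score = average_best_per_verse(poem_a, poem_b)
print(round(score, 3))
```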

Metric | Value
Total verses indexed | 4,291,553
Poems with verse data | 289,702
Unique verse types (search index) | 2,906,535
Formulaic pattern clusters | 200

Verse similarity is available in the Poem Reader (click the expand arrow on any verse line) and in the Verse Similarity Explorer, which also provides full-text verse search and a browser for the top 200 formulaic patterns: recurring verse lines ranked by cluster size across the Finnish and Estonian corpora.

Explore the lexicon

Feature guides

The RunoVerse contains over 30 interconnected tools for exploring the Finnic runosong tradition. These guides describe each tool in detail — what it shows, what data powers it, and how to use it.

Lexicon & Dictionary
Main dictionary, word comparison, coverage analysis, categories, frequency, and ambiguity
Poem & Verse Similarity
Poem reader, 5 similarity algorithms, verse search, network graphs, path finder, formulas
Poetic Structure & Style
Alliteration, parallelism, meter, phrases, collocates, and emotion vocabulary
Cross-Lingual Analysis
Cognates, etymology, shared vocabulary, dialects, concepts, and thematic domains
Corpus & Geography
Statistics, frequency distributions, regional map, timeline, and collector explorer

Key features

Dictionary annotations

Word entries are enriched with definitions from nine Estonian and Finnish lexicographic sources: EMS, EKSS, IMS, ERLA, VMS, Seto, SMS, KKS, and VKS.

Lemmatization

Estonian texts were lemmatized using EstNLTK morphological analysis combined with multiple lexical resources (EMS, EKSS, VES, ERLA, and others), expert manual annotations (37% of the corpus), and iterative automated correction cycles.

Finnish texts were lemmatized using a combinatory approach with a multi-tier fallback chain including Omorfi, Voikko, and Stanza, supplemented by a dialectal dictionary derived from Suomen murteiden sanakirja (SMS).
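
A multi-tier fallback chain of this kind can be sketched generically: try each analyzer in order and keep the first lemma returned. The example below deliberately does not call Omorfi, Voikko, or Stanza. Each tier is a plain dictionary lookup standing in for a real analyzer, and the tier names, their order, and the toy entries are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An analyzer takes a word form and returns a lemma, or None if it has no analysis.
Analyzer = Callable[[str], Optional[str]]


def dictionary_analyzer(table: Dict[str, str]) -> Analyzer:
    """Build a stand-in analyzer from a lookup table (hypothetical helper)."""
    return table.get


def lemmatize(wordform: str, chain: List[Tuple[str, Analyzer]]) -> Tuple[str, str]:
    """Return (lemma, tier name) from the first tier that yields a lemma."""
    for name, analyze in chain:
        lemma = analyze(wordform)
        if lemma:
            return lemma, name
    # No tier recognised the form: fall back to the surface form itself.
    return wordform, "unanalyzed"


# Stub tiers with invented entries; the order shown is an assumption.
chain = [
    ("omorfi_stub", dictionary_analyzer({"laulun": "laulu"})),
    ("voikko_stub", dictionary_analyzer({"kullan": "kulta"})),
    ("dialect_dict_stub", dictionary_analyzer({"laulamahan": "laulaa"})),
]

for form in ["laulun", "laulamahan", "xyzzy"]:
    print(form, "->", lemmatize(form, chain))
```

Recording which tier produced each lemma, as the second tuple element does here, makes it possible to audit how often the dialectal fallback is needed.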

In addition, approximately 165,000 poems were independently annotated using DeepSeek-R1, a large language model, run on the LUMI supercomputer. The AI analysis produced lemmatizations, English translations, morphological descriptions, and etymological roots for each word token. These annotations yielded 183,137 additional lemmas not present in the corpus pipeline, and provided cross-references between the two lemmatization systems for entries where both recognized the same word forms.

References and acknowledgements

Corpora

Lemmatization tools

Lexical resources

Contact

For questions about this lexicon, contact kaarel.veskis@kirmus.ee
