About the RunoVerse

What this page covers

This About page is the reference guide for the entire RunoVerse platform. It documents the three source corpora (SKVR, JR, ERAB), corpus-wide statistics (439K lemmas, 15.3M tokens, 292K poems), and the methodology behind the lemmatization and AI annotation pipelines.

Sections on this page

Source corpora — descriptions and poem counts for each collection, with a visual bar chart.
Lexicon statistics — counts for lemmas, word forms, tokens, cognate pairs, translation/etymology confidence, and gloss coverage.
Data sources — explains the three source categories (Corpus Only, Both Sources, DeepSeek Only) and what the color-coded word forms and agreement badges mean.
Similarity — documents all 5 poem-level and 4 verse-level similarity algorithms with their basis and coverage.
Dictionary annotations — lists all 9 lexicographic sources (EMS, EKSS, IMS, ERLA, VMS, Seto, SMS, KKS, VKS).
Lemmatization — describes the Estonian (EstNLTK) and Finnish (Omorfi/Voikko/Stanza) processing pipelines, plus the DeepSeek AI annotation layer.

Explore cards

The card grid below links to all 70+ pages in RunoVerse. Each card shows a short description of the tool. For longer explanations, see the five Feature Guides (Lexicon, Similarity, Poetics, Cross-Lingual, Corpus).

Navigating the site

Use the top navigation bar to reach the main pages (Dictionary, Reader, Similarity, About). The More dropdown provides access to every explorer page. The Site Map organizes all pages into 11 categories. The Dashboard offers a visual starting point with hero statistics and a research question guide.

What is this?

The RunoVerse is a combined word index of Finnish and Estonian runosong (folk poetry) corpora. It brings together lemmatized word data from three major collections, allowing cross-linguistic exploration of the shared Finnic poetic tradition.

Please note that the RunoVerse is under active development. The lemmatization of historical dialectal texts is inherently approximate, and the AI-generated translations, etymological analyses, and similarity metrics should be considered experimental. The statistics and counts shown may change as the data is refined. This tool is intended as an exploratory aid, not as a definitive reference.

439,746 lemmas · 15.3M tokens · 292,092 poems · 3 corpora · 9 dictionaries · 4.29M verse lines

Source corpora

Collection	Language	Description
SKVR	Finnish	Suomen Kansan Vanhat Runot – published Kalevala-metre poetry. Finnish Literature Society (SKS). 89,247 poems.
JR	Finnish	Julkaisemattomat Runot – unpublished folk poetry from SKS folklore archives. 96,129 poems.
ERAB	Estonian	Eesti Regilaulude Andmebaas – Database of Estonian Runosongs. Estonian Folklore Archives, Estonian Literary Museum. 108,969 poems.

Lexicon statistics

Measure	Count	Notes
Unique lemmas	439,746	Distinct base forms across all corpora (incl. 183,137 DeepSeek-only)
Unique wordforms	1,480,455	Distinct surface forms (types, not occurrences) found across all poem texts
Wordform–lemma mappings	2,083,995	Total mappings from inflected forms to lemmas (one wordform can map to multiple lemmas)
Total tokens	15,264,640	Total word occurrences in source texts
Poems	292,092	Unique poems with full verse texts available in the poem context viewer (294,345 total in source corpora; some excluded due to missing verse text data)
Finnish-only lemmas	206,518	Lemmas from Finnish corpus only (SKVR/JR collections)
Estonian-only lemmas	100,835	Lemmas from Estonian corpus only (ERAB)
Shared (Finnic) lemmas	1,240	Lemmas found in both Finnish and Estonian sources
Cognate pairs (ET↔FI)	6,382	Automatically discovered Estonian-Finnish cognate pairs based on translation overlap, etymological roots, and orthographic similarity (1,114 exact, 2,390 near-exact, 2,873 bridged, 5 orthographic)
Translation confidence	192,236	Lemmas with DeepSeek translation consistency score (8,446 strong, 11,069 good, 41,650 moderate, 131,071 low; 113,502 no data)
Etymology confidence	211,919	Lemmas with DeepSeek etymology consistency score (11,748 strong, 13,183 good, 46,078 moderate, 140,910 low; 93,819 no data)
Gloss coverage	91.3%	Word forms with English translation (1,344,094 of 1,472,442), including 11,423 Claude Opus supplementary glosses
Corpus attestations	15,264,640	SKVR: 4,522,811 · JR: 3,398,967 · ERAB: 7,341,908

Statistics reflect the current state of the lemmatized data and may change as lemmatization is refined.
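The bucket breakdowns quoted in the table can be checked against their totals with simple arithmetic. The sketch below uses only figures from the table; the variable names are illustrative and not part of the RunoVerse data model.

```python
# Figures copied from the Lexicon statistics table; names are illustrative.
cognate_pairs = {"exact": 1_114, "near_exact": 2_390, "bridged": 2_873, "orthographic": 5}
translation_conf = {"strong": 8_446, "good": 11_069, "moderate": 41_650, "low": 131_071}
etymology_conf = {"strong": 11_748, "good": 13_183, "moderate": 46_078, "low": 140_910}

# Gloss coverage = glossed word forms / all word forms considered.
glossed_forms, total_forms = 1_344_094, 1_472_442
gloss_coverage = round(100 * glossed_forms / total_forms, 1)

print(sum(cognate_pairs.values()))     # 6382
print(sum(translation_conf.values()))  # 192236
print(sum(etymology_conf.values()))    # 211919
print(gloss_coverage)                  # 91.3
```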

Data sources and source filter

Each lemma in the lexicon has been tagged with one of three source categories, reflecting how it was identified. The source filter dropdown in the main view lets you filter by these categories:

Source	Lemmas	Meaning
Corpus Only	5,217	Lemma was identified by the corpus lemmatization pipeline. None of the word forms listed under this lemma were matched to DeepSeek annotations during the merge. However, the lemma string itself may still appear as a word form in DeepSeek data, which means some “Corpus Only” entries can still have AI-generated translations visible via the A–Z browse.
Both Sources	251,392	Lemma comes from the corpus pipeline, and at least one of its word forms also appears in the DeepSeek annotations (possibly under a different lemma). These entries typically include AI-generated translations and may have cross-references (dsLemma) to alternative lemmatizations. Word forms in “Both Sources” entries are color-coded: green when both systems agree on the lemma, amber when DeepSeek (DS) assigns a different lemma, and gray when the word form is absent from the DS data.
DeepSeek Only	183,137	Lemma exists only in the DeepSeek annotations. The underlying word forms often appear in the corpus under different lemmas (96% of cases), but this particular lemmatization is unique to the AI analysis.

The source categories reflect word-form-level overlap between the two lemmatization systems, not whether an entry has translations. Because the corpus pipeline and DeepSeek sometimes lemmatize the same word forms differently, a word form can belong to a “Corpus Only” lemma while also appearing independently in the DeepSeek data under a different lemma. The “Both Sources” category captures entries where the same word forms were recognized by both systems.
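The three-way categorization described above can be sketched as a small function. This is an illustration of the stated rules, not the actual merge code; the function name, its parameters, and the sample word forms are assumptions.

```python
def source_category(from_corpus: bool, corpus_forms: set[str], ds_forms: set[str]) -> str:
    """Assign a source category to one lemma (hypothetical helper).

    from_corpus  -- True if the corpus pipeline produced this lemma
    corpus_forms -- word forms listed under the lemma by the corpus pipeline
    ds_forms     -- every word form that occurs anywhere in the DeepSeek data
    """
    if not from_corpus:
        return "DeepSeek Only"   # lemma exists only in the AI annotations
    if corpus_forms & ds_forms:
        return "Both Sources"    # at least one word form is shared
    return "Corpus Only"         # no word-form overlap with DeepSeek


# Toy data, for illustration only.
ds_forms = {"laulu", "kulda"}
print(source_category(True, {"laul", "laulu"}, ds_forms))  # Both Sources
print(source_category(True, {"meri", "merd"}, ds_forms))   # Corpus Only
print(source_category(False, set(), ds_forms))             # DeepSeek Only
```

Note that the test is made on word forms, not on the lemma string itself, which is why a “Corpus Only” lemma can still surface in the DeepSeek data.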

The agreement badge in the DeepSeek tab shows a ratio like “30/35 agree +15 n/a”. This means that 30 of the 35 word forms covered by DeepSeek (DS) have the same lemma in both systems, and 15 further word forms are not present in the DS data. Hover over the badge for a full breakdown.
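As a rough sketch, the badge text and the green/amber/gray color of each word form could be derived as follows. The function names and the input layout (a mapping from word form to DeepSeek lemma, with None for forms missing from the DeepSeek data) are assumptions made for this example.

```python
from typing import Optional


def form_color(corpus_lemma: str, ds_lemma: Optional[str]) -> str:
    """Color for one word form under a corpus lemma (hypothetical helper)."""
    if ds_lemma is None:
        return "gray"                                   # form not in DeepSeek data
    return "green" if ds_lemma == corpus_lemma else "amber"


def agreement_badge(corpus_lemma: str, ds_lemmas: dict[str, Optional[str]]) -> str:
    """Build a badge such as '30/35 agree +15 n/a' from per-form DeepSeek lemmas."""
    covered = [lemma for lemma in ds_lemmas.values() if lemma is not None]
    agree = sum(1 for lemma in covered if lemma == corpus_lemma)
    missing = len(ds_lemmas) - len(covered)
    return f"{agree}/{len(covered)} agree +{missing} n/a"


# Toy entry for the corpus lemma "laul": word form -> DeepSeek lemma (None = not covered).
forms = {"laulu": "laul", "laulud": "laul", "laulen": "laulma", "laulun": None}
print(agreement_badge("laul", forms))   # 2/3 agree +1 n/a
print(form_color("laul", forms["laulen"]))  # amber
```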

DeepSeek AI annotations

A subset of the corpus was independently annotated using DeepSeek, a large language model, to provide additional linguistic analysis. The AI annotations include:

English translations of Estonian and Finnish word forms
Etymological notes and cognate identification
Morphological descriptions (case, number, tense, etc.)
Part-of-speech tagging

Measure	Count	Notes
DeepSeek tokens	5,962,070	AI-annotated token occurrences (ET: 2,867,388 + FI: 3,094,682)
DeepSeek-only lemmas	183,137	Lemmas unique to the AI analysis
English translations	241,141	Unique English terms extracted from AI annotations, browsable via A–Z (1,252,781 total mappings)
Cross-references	91,754	Entries linking to alternative lemmatizations between corpus and DeepSeek

AI-generated annotations are provided as supplementary material and have not been manually verified. They should be used with appropriate caution, particularly for etymological claims and translations of rare dialectal forms.

Similarity and embedding data

The lexicon includes two word-level similarity systems to help explore relationships between word forms:

Word form similarity – Edit-distance and phonological similarity between inflected forms across the corpus. Covers 1,166,348 word forms with ranked nearest neighbors and lemma-agreement indicators.
BERT embeddings – Contextual nearest neighbors from a BERT model fine-tuned on Estonian runosong texts. Provides 190,975 query lemmas with their 10 nearest semantic neighbors, capturing meaning-based rather than form-based similarity.
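To illustrate the form-based side, here is a minimal sketch of edit-distance ranking using plain Levenshtein distance. It is an assumption that this matches the spirit of the word form similarity data; the phonological weighting mentioned above is not modeled, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nearest_forms(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Rank candidate word forms by edit distance to the query (ties broken alphabetically)."""
    scored = sorted((edit_distance(query, c), c) for c in candidates if c != query)
    return [form for _, form in scored[:top_n]]


print(edit_distance("laulu", "laulud"))                              # 1
print(nearest_forms("laulu", ["laulud", "kuld", "laul", "laulu"], 2))  # ['laul', 'laulud']
```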

Poem similarity

Five poem-level similarity algorithms identify related poems across the 292,092-poem corpus. Results are available in the Poem Reader (Related Poems panel) and the standalone Similarity Explorer with side-by-side comparison, network graphs, and geographic/temporal analytics.

Algorithm	Basis	Description
TF-IDF Lemma	Lemma-level	Cosine similarity on TF-IDF vectors of lemmatized poem texts. Captures thematic similarity through shared vocabulary, weighted by corpus-level term importance. Top 50 neighbors per poem.
Wordform Overlap (Jaccard)	Exact wordforms	Jaccard index (|A∩B| / |A∪B|) over raw wordform sets. Identifies poems sharing exact surface forms, useful for detecting formulaic lines and direct textual parallels.
Thematic (Translation-pivot)	Cross-lingual	Boolean-IDF cosine similarity over English translations derived from DeepSeek annotations, with lemma-level fallback for improved coverage. Enables cross-lingual comparison between Estonian and Finnish poems via a shared semantic space. Top 50 neighbors per poem.
Alignment	Character n-gram	Verse sequence alignment using character bigram cosine similarity and Wagner-Fischer dynamic programming, from the FILTER project (Janicki, Kallio & Sarv 2023). Covers 256,970 poems across SKVR, JR, KR, and ERAB. Captures structural similarity — poems that follow the same verse order score high. Shows aligned verse pair excerpts for top matches. Top 50 neighbors per poem.
Verse-level RRF	Verse-level fusion	Fuses Jaccard, TF-IDF, Translation, and CharBigram similarity at the verse level using Average-Best-Per-Verse aggregation, then combines all four via Reciprocal Rank Fusion (k=60) into a single poem-level ranking. Shows T/J/Tr/C algorithm contribution badges.
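The two simplest measures in the table, the Jaccard index and TF-IDF cosine similarity, can be sketched in a few lines. This is a toy illustration under stated assumptions: the IDF formula (log(N/df) + 1) is one common smoothed variant and may differ from the weighting actually used, and the three "poems" are invented token lists reused for both measures.

```python
import math
from collections import Counter


def jaccard(a: set[str], b: set[str]) -> float:
    """|A∩B| / |A∪B| over two sets of word forms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Sparse TF-IDF vectors; idf = log(N/df) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Toy poems as token lists, for illustration only.
poems = [["kuld", "laul", "meri"], ["kuld", "laul", "mets"], ["hobune", "tee"]]
vecs = tfidf_vectors(poems)
print(jaccard(set(poems[0]), set(poems[1])))  # 0.5
print(cosine(vecs[0], vecs[2]))               # 0.0
```

Rare terms receive a higher IDF weight, so two poems sharing an uncommon lemma score higher under TF-IDF than two poems sharing only frequent ones, whereas Jaccard treats every shared form equally.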

The Similarity Explorer shows cross-algorithm agreement badges (BOTH) when poems appear in multiple algorithms' results, and ET↔FI badges for cross-lingual matches in the Translation-pivot, Alignment, and Verse-level RRF algorithms.
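The Reciprocal Rank Fusion step named in the Verse-level RRF row combines several ranked lists by summing 1 / (k + rank) for each item, with k = 60. The sketch below shows that formula only; the function name, the input layout, and the poem identifiers are assumptions, and the per-verse aggregation that precedes fusion is not modeled.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        for rank, poem_id in enumerate(ranked, start=1):
            scores[poem_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical neighbor lists from the four verse-level algorithms (T, J, Tr, C).
rankings = {
    "T": ["p2", "p1", "p3"],
    "J": ["p1", "p2"],
    "Tr": ["p1", "p3"],
    "C": ["p2", "p1"],
}
fused = reciprocal_rank_fusion(rankings)
print([poem_id for poem_id, _ in fused])  # ['p1', 'p2', 'p3']
```

Because only ranks enter the formula, the four algorithms do not need comparable score scales, and a poem that appears in several lists tends to outrank one that tops a single list.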

Verse similarity

Four algorithms (Jaccard, TF-IDF, Translation-pivot, CharBigram) operate at the individual verse level across 4.29 million verse occurrences. Each verse is compared against all verses in other poems, with up to 20 nearest neighbors stored per algorithm.
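As an illustration of verse-level neighbor search, the following sketch scores verses by cosine similarity over character bigram counts and keeps the best matches from other poems. It is a brute-force toy under stated assumptions: the function names, the (poem id, verse text) layout, and the sample verses are hypothetical, and a corpus of this size would need indexing rather than all-pairs comparison.

```python
import heapq
import math
from collections import Counter


def bigrams(text: str) -> Counter:
    """Counts of overlapping character bigrams."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_cosine(a: str, b: str) -> float:
    """Cosine similarity between the bigram count vectors of two verses."""
    u, v = bigrams(a), bigrams(b)
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def verse_neighbors(query: tuple[str, str], verses: list[tuple[str, str]], top_k: int = 20):
    """Return up to top_k (score, poem_id, text) tuples, skipping verses from the query's own poem."""
    poem_id, text = query
    scored = ((bigram_cosine(text, t), pid, t) for pid, t in verses if pid != poem_id)
    return heapq.nlargest(top_k, scored)


# Toy verses with hypothetical poem ids.
verses = [
    ("p1", "vaka vanha väinämöinen"),
    ("p2", "vaka vanha väinämöini"),
    ("p3", "tietäjä iän ikuinen"),
]
best = verse_neighbors(verses[0], verses, top_k=1)
print(best[0][1])  # p2
```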

Metric	Value
Total verses indexed	4,291,553
Poems with verse data	289,702
Unique verse types (search index)	2,906,535
Formulaic pattern clusters	200

Verse similarity is available in the Poem Reader (click the expand arrow on any verse line) and in the Verse Similarity Explorer. The Explorer also provides full-text verse search and a browser for the top 200 formulaic patterns: recurring verse lines ranked by cluster size across the Finnish and Estonian corpora.

Explore the lexicon

Feature guides

The RunoVerse contains over 30 interconnected tools for exploring the Finnic runosong tradition. These guides describe each tool in detail — what it shows, what data powers it, and how to use it.

Lexicon & Dictionary

Main dictionary, word comparison, coverage analysis, categories, frequency, and ambiguity

Poem & Verse Similarity

Poem reader, 5 similarity algorithms, verse search, network graphs, path finder, formulas

Poetic Structure & Style

Alliteration, parallelism, meter, phrases, collocates, and emotion vocabulary

Cross-Lingual Analysis

Cognates, etymology, shared vocabulary, dialects, concepts, and thematic domains

Corpus & Geography

Statistics, frequency distributions, regional map, timeline, and collector explorer

Key features

Search by lemma, word form, or English translation with diacritics-insensitive matching
Filter by language, part of speech, and data source (corpus, DeepSeek, or both)
Dictionary annotations from 9 Estonian and Finnish lexicographic sources
DeepSeek AI translations, etymology, and morphological descriptions for 165K poems
Five poem similarity algorithms with network graphs, geographic maps, and temporal analytics
Verse-level similarity across 4.29M verses with inline expansion and full-text search
Cross-lingual exploration: 6,382 cognate pairs, 1,240 shared lemmas, 49K etymology families
Poetic analysis: alliteration patterns, semantic parallelism, formulaic phrases, and meter
Poem reader with word-level glosses, POS tags, and per-verse similarity
Geographic and temporal corpus analysis across 803 collection places and four centuries
Bookmarkable deep links, keyboard navigation, and CSV export
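The diacritics-insensitive matching mentioned in the first item of the list can be approximated with Unicode normalization. This is a minimal sketch using Python's standard unicodedata module, not the site's actual search code; the function names are assumptions.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase the text and strip combining marks (ä -> a, õ -> o, š -> s)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, candidate: str) -> bool:
    """True if the two strings are equal once case and diacritics are ignored."""
    return fold(query) == fold(candidate)


print(fold("Sõna"))                             # sona
print(matches("vainamoinen", "Väinämöinen"))    # True
```

Folding both the query and the indexed forms in the same way lets a user type plain ASCII and still find entries written with ä, ö, õ, ü, or š.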

Dictionary annotations

Word entries are enriched with definitions from Estonian and Finnish lexicographic sources:

EMS – Eesti murrete sõnaraamat (Dictionary of Estonian Dialects). Institute of the Estonian Language.
EKSS – Eesti keele seletav sõnaraamat (Explanatory Dictionary of Estonian). Institute of the Estonian Language.
IMS – Ida-Eesti murdesõnastik (Eastern Estonian Dialect Dictionary). Institute of the Estonian Language.
ERLA – Harva ja vähem-kasutatavate sõnade sõnastik (Glossary of Rare Folk-Song Words). Estonian Literary Museum.
VMS – Vähemtuntud murdesõnade seletusi (Glossary of Lesser-Known Dialect Words). Estonian Literary Museum.
Seto – Seto sõnastik (Seto Dictionary). Inge Käsi, Institute of the Estonian Language, 2016.
SMS – Suomen murteiden sanakirja (Dictionary of Finnish Dialects). Kotimaisten kielten keskus (Kotus). CC BY 4.0.
KKS – Karjalan kielen sanakirja (Dictionary of the Karelian Language). Kotimaisten kielten keskus (Kotus). CC BY 4.0.
VKS – Vanhan kirjasuomen sanakirja (Dictionary of Old Literary Finnish). Kotimaisten kielten keskus (Kotus). CC BY 4.0.

Lemmatization

Estonian texts were lemmatized using EstNLTK morphological analysis combined with multiple lexical resources (EMS, EKSS, VES, ERLA, and others), expert manual annotations (37% of the corpus), and iterative automated correction cycles.

Finnish texts were lemmatized using a combinatory approach with a multi-tier fallback chain including Omorfi, Voikko, and Stanza, supplemented by a dialectal dictionary derived from Suomen murteiden sanakirja (SMS).
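A multi-tier fallback chain of this kind can be sketched generically: each tier is tried in turn and the first one that returns a lemma wins. The sketch deliberately uses stand-in dictionary lookups rather than the real Omorfi, Voikko, or Stanza APIs; the tier names, their order, and the sample forms are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

# An analyzer takes a word form and returns a lemma, or None if it has no analysis.
Analyzer = Callable[[str], Optional[str]]


def lemmatize(form: str, analyzers: Sequence[tuple[str, Analyzer]]) -> tuple[str, str]:
    """Try each tier in order; return (lemma, tier name) from the first tier that succeeds."""
    for name, analyze in analyzers:
        lemma = analyze(form)
        if lemma:
            return lemma, name
    return form, "unanalyzed"   # last resort: keep the surface form


# Stand-in tiers backed by toy lookup tables (hypothetical data).
chain = [
    ("analyzer_1", {"laulun": "laulu"}.get),
    ("analyzer_2", {"meren": "meri"}.get),
    ("dialect_dictionary", {"laulavi": "laulaa"}.get),
]

print(lemmatize("laulun", chain))   # ('laulu', 'analyzer_1')
print(lemmatize("laulavi", chain))  # ('laulaa', 'dialect_dictionary')
print(lemmatize("xyz", chain))      # ('xyz', 'unanalyzed')
```

Recording which tier produced each lemma makes it possible to audit later how much of the output rests on the less reliable fallbacks.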

In addition, approximately 165,000 poems were independently annotated using DeepSeek-R1, a large language model, run on the LUMI supercomputer. The AI analysis produced a lemmatization, an English translation, a morphological description, and an etymological root for each word token. These annotations yielded 183,137 additional lemmas not present in the corpus pipeline, and provided cross-references between the two lemmatization systems for entries where both recognized the same word forms.

References and acknowledgements

Corpora

SKVR – Finnish Literature Society (Suomalaisen Kirjallisuuden Seura, SKS). Suomen Kansan Vanhat Runot. Digital corpus. skvr.fi. CC BY 4.0.
JR – Finnish Literature Society (SKS). Julkaisemattomat Runot. Available within the SKVR database.
ERAB – Oras, J.; Saarlo, L.; Sarv, M.; Labi, K.; Uus, M.; Šmitaite, R. (comps.). Eesti Regilaulude Andmebaas. Estonian Folklore Archives, Estonian Literary Museum. 2003–present. folklore.ee/regilaul/andmebaas

Lemmatization tools

EstNLTK – Laur, S.; Orasmaa, S.; Särg, D.; Tammo, P. (2020). EstNLTK 1.6: Remastered Estonian NLP Pipeline. Proceedings of LREC 2020, pp. 7154–7162. github.com/estnltk/estnltk
Vabamorf – Kaalep, H. J.; Vaino, T. (2001). Complete morphological analysis in the linguist’s toolbox. Congressus Nonus Internationalis Fenno-Ugristarum, 5, pp. 9–16.
Omorfi – Pirinen, T. A. (2015). Omorfi – Free and open source morphological lexical database for Finnish. Proceedings of NODALIDA 2015, pp. 313–315. github.com/flammie/omorfi
Voikko – Pitkänen, H. Voikko – Free linguistic software for Finnish. voikko.puimula.org
Stanza – Qi, P.; Zhang, Y.; Zhang, Y.; Bolton, J.; Manning, C. D. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. Proceedings of ACL 2020: System Demonstrations. stanfordnlp.github.io/stanza
DeepSeek-R1 – DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. deepseek.com

Lexical resources

SMS – Kotimaisten kielten keskus (Kotus). Suomen murteiden sanakirja (Dictionary of Finnish Dialects). kaino.kotus.fi/sms. CC BY 4.0.
KKS – Kotimaisten kielten keskus (Kotus). Karjalan kielen sanakirja (Dictionary of the Karelian Language). kaino.kotus.fi/kks. CC BY 4.0.
VKS – Kotimaisten kielten keskus (Kotus). Vanhan kirjasuomen sanakirja (Dictionary of Old Literary Finnish). CC BY 4.0.
EMS – Institute of the Estonian Language. Eesti murrete sõnaraamat. eki.ee/dict/ems
EKSS – Institute of the Estonian Language. Eesti keele seletav sõnaraamat. eki.ee/dict/ekss
VES – Võro Institute. Võro-eesti synaraamat (comp. Jüvä Sullõv). folklore.ee/Synaraamat
ERLA – Estonian Literary Museum. Harva ja vähem-kasutatavate sõnade sõnastik. folklore.ee/laulud/erla

Contact

For questions about this lexicon, contact kaarel.veskis@kirmus.ee
