Corpus & Geography Guide
What is this page?
This guide introduces five RunoVerse tools for exploring corpus metadata: who collected the poems, where and when they were gathered, and how many there are. Each section below describes one tool and the data it provides, and links directly to it.
How to use this guide
Read from top to bottom for a full overview, or jump to the summary table at the end to find the right tool for your question. Each section ends with a direct link to the explorer page it describes.
Related guides
Dictionary Guide covers annotation sources and word lookup. Languages Guide explains the Estonian/Finnish bilingual corpus. Poetics Guide covers alliteration, parallelism, and meter. Similarity Guide explains the 7 similarity algorithms.
Tip
The Dashboard and Regional Heritage pages combine corpus statistics with other data layers for cross-cutting exploration.
The RunoVerse brings together three major folklore collections spanning four centuries of documentation across Estonia and Finland. These tools help you explore the who, where, when, and how of the corpus itself: its collection history, geographic distribution, and statistical properties.
Corpus Statistics
The Statistics page provides an interactive dashboard of corpus-wide metrics. It gives you an overview of how the three source corpora — SKVR, JR, and ERAB — compare in size, vocabulary diversity, and linguistic features. Use it to understand the overall shape of the data before diving into specific explorations.
- Frequency distributions showing how often words occur across the corpus
- Part-of-speech breakdowns for nouns, verbs, adjectives, and other categories
- Language breakdowns comparing Estonian and Finnish vocabulary
- Vocabulary richness metrics revealing how diverse each corpus is
- Side-by-side comparisons of the three source collections
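As a rough illustration of what a vocabulary richness metric can look like, the sketch below computes a type-token ratio (distinct lemmas divided by total tokens). This is an assumption for illustration only: the guide does not say which richness metrics the Statistics page uses, and the corpus names and token lists here are invented toy data.

```python
def type_token_ratio(lemmas):
    """Distinct lemmas divided by total tokens; 0.0 for an empty list."""
    if not lemmas:
        return 0.0
    return len(set(lemmas)) / len(lemmas)


# Invented toy token lists standing in for two corpora (not real RunoVerse data).
toy_corpora = {
    "A": ["laul", "laul", "meri", "laul"],
    "B": ["laul", "meri", "kuu", "tamm"],
}

# One richness score per toy corpus, suitable for a side-by-side comparison.
richness = {name: type_token_ratio(tokens) for name, tokens in toy_corpora.items()}

for name, score in richness.items():
    print(f"{name}: {score:.2f}")
```

Type-token ratio tends to fall as a text grows, so scores for corpora of very different sizes are not directly comparable without some normalization.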
Frequency Distribution
The Distribution page shows how the corpus's 439,746 lemmas are distributed by frequency. Like most natural language corpora, runosong vocabulary follows Zipf's law: a word's frequency falls off roughly in inverse proportion to its frequency rank, so a small number of very common words account for most of the text, while the vast majority of words are rare. This tool visualizes that pattern and shows what it means for vocabulary coverage.
- Frequency band visualization showing how many lemmas fall into each range
- Hapax legomena — words that occur only once in the entire corpus
- Cumulative coverage curves showing how many of the most frequent words are needed to cover 50%, 90%, or 99% of all tokens
- Estonian versus Finnish frequency comparisons
- Interactive exploration of specific frequency ranges
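The hapax and coverage figures listed above can be derived from a plain lemma-to-count table. The following is a minimal sketch, assuming such a table is available; the function names and the toy counts are hypothetical and do not come from RunoVerse.

```python
from collections import Counter


def hapax_legomena(freqs):
    """Return the lemmas that occur exactly once, in alphabetical order."""
    return sorted(lemma for lemma, count in freqs.items() if count == 1)


def lemmas_needed_for_coverage(freqs, percent):
    """Smallest number of top-frequency lemmas covering `percent` % of tokens.

    `percent` is an integer so the comparison stays in exact integer arithmetic.
    """
    total = sum(freqs.values())
    covered = 0
    for rank, count in enumerate(sorted(freqs.values(), reverse=True), start=1):
        covered += count
        if covered * 100 >= percent * total:
            return rank
    return len(freqs)


# Invented toy frequency table (100 tokens in total), not real corpus counts.
toy = Counter({"ja": 50, "on": 24, "laul": 15, "meri": 8,
               "kuu": 1, "tamm": 1, "hobu": 1})

print(hapax_legomena(toy))
for pct in (50, 90, 99):
    print(pct, lemmas_needed_for_coverage(toy, pct))
```

In this toy table one lemma already covers half of the tokens, while reaching 99% requires most of the vocabulary, which is the shape the coverage curves are meant to reveal.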
Regional Vocabulary
The Places page maps the geographic dimension of the corpus across 803 collection places and 292,092 poems. An interactive Leaflet map shows where poems were gathered, and you can search for any word to see where it appears geographically. This is particularly useful for studying dialectal variation and regional poetic traditions.
- Interactive map with 803 collection locations across Estonia and Finland
- Word search to see the geographic distribution of any term
- 10 map modes showing different aspects of the data (poem density, vocabulary richness, language distribution, and more)
- Region filtering to discover locally distinctive vocabulary
- Click any location to see the poems collected there
Corpus Timeline
The Timeline page shows when the 285,946 dated poems were collected across four centuries of Finnic folklore documentation, from the 1560s through the 1970s. It reveals the waves of collection activity that built up these corpora, and how the focus shifted between regions and traditions over time.
- Temporal distribution of poem collection from the 1560s to the 1970s
- Collection activity peaks corresponding to major folklore campaigns
- Per-corpus timelines showing when SKVR, JR, and ERAB poems were gathered
- Collector activity periods showing who was active when
- Interactive filtering by decade, century, or custom time range
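Filtering by decade or by a custom time range amounts to bucketing collection years. A minimal sketch, assuming each dated poem carries a single integer collection year; the function names and the sample years are hypothetical and not taken from the corpus.

```python
from collections import Counter


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1888 -> 1880."""
    return year // 10 * 10


def poems_per_decade(years, start=None, end=None):
    """Count poems per decade, keeping only years within [start, end] inclusive."""
    counts = Counter()
    for year in years:
        if start is not None and year < start:
            continue
        if end is not None and year > end:
            continue
        counts[decade_of(year)] += 1
    return dict(sorted(counts.items()))


# Invented collection years for illustration only.
toy_years = [1564, 1888, 1889, 1891, 1905, 1972]

print(poems_per_decade(toy_years))
print(poems_per_decade(toy_years, start=1880, end=1899))
```

The same bucketing generalizes to centuries by dividing by 100 instead of 10.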
Collector Explorer
The Collector Explorer lets you browse the 7,482 individuals who gathered the 292,092 poems in the corpus. Behind every poem is a collector — from well-known folklorists like Jakob Hurt and Elias Lönnrot to anonymous local contributors who recorded a handful of songs from their communities. This tool lets you explore the human effort behind the data.
- Searchable index of 7,482 collectors with poem counts
- Collection place information showing where each collector worked
- Time period data revealing when each collector was active
- Links to the poems gathered by each collector
- Sorting and filtering to find the most prolific collectors or those active in specific regions
Overview
| Tool | Key Data | Best For |
|---|---|---|
| Statistics | 3 corpora, 439,746 lemmas, 15.3M tokens | Understanding corpus composition and size |
| Distribution | 439,746 lemmas across frequency bands | Exploring vocabulary frequency and coverage |
| Places | 803 locations, 292,092 poems, 10 map modes | Geographic patterns and dialectal variation |
| Timeline | 285,946 dated poems, 1560s–1970s | Historical collection patterns |
| Collectors | 7,482 collectors, 292,092 poems | People behind the corpus |