Corpus & Geography Guide

What is this page?

This guide introduces five RunoVerse tools for exploring corpus metadata: who collected the poems, where and when they were gathered, and how many there are. Each section below describes one tool and the data it provides, then links directly to it.

How to use this guide

Read from top to bottom for a full overview, or jump to the summary table at the end to find the right tool for your question. Each section ends with a direct link to the explorer page it describes.

Related guides

Dictionary Guide covers annotation sources and word lookup. Languages Guide explains the Estonian/Finnish bilingual corpus. Poetics Guide covers alliteration, parallelism, and meter. Similarity Guide explains the 7 similarity algorithms.

Tip

The Dashboard and Regional Heritage pages combine corpus statistics with other data layers for cross-cutting exploration.

RunoVerse brings together three major folklore collections spanning four centuries of documentation across Estonia and Finland. These tools help you explore the who, where, when, and how of the corpus itself: its collection history, geographic distribution, and statistical properties.

Corpus Statistics

The Statistics page provides an interactive dashboard of corpus-wide metrics. It gives you an overview of how the three source corpora — SKVR, JR, and ERAB — compare in size, vocabulary diversity, and linguistic features. Use it to understand the overall shape of the data before diving into specific explorations.

Frequency distributions showing how often words occur across the corpus
Part-of-speech breakdowns for nouns, verbs, adjectives, and other categories
Language breakdowns comparing Estonian and Finnish vocabulary
Vocabulary richness metrics revealing how diverse each corpus is
Side-by-side comparisons of the three source collections
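
As an illustration of what a vocabulary richness metric can look like, the sketch below computes a type-token ratio (distinct lemmas divided by total tokens) per corpus. This guide does not say which richness measure the Statistics page actually uses, so the function name, the toy token lists, and the choice of metric are all assumptions made for the example.

```python
def type_token_ratio(lemmas):
    """Return distinct lemmas divided by total tokens (0.0 for empty input)."""
    if not lemmas:
        return 0.0
    return len(set(lemmas)) / len(lemmas)


# Hypothetical toy samples; these are not real corpus extracts.
samples = {
    "SKVR": ["vaka", "vanha", "väinämöinen", "vaka", "laulaja"],
    "ERAB": ["laul", "laul", "laul", "lind"],
}

# One richness value per corpus, for side-by-side comparison.
richness = {name: type_token_ratio(tokens) for name, tokens in samples.items()}

for name, value in richness.items():
    print(f"{name}: {value:.2f}")
```

A plain type-token ratio falls as a text gets longer, so it is only a fair comparison between samples of similar size.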

Open Corpus Statistics →

Frequency Distribution

The Distribution page shows how the 439,746 lemmas are distributed by frequency. Like most natural language corpora, runosong vocabulary follows Zipf's law: a small number of very common words account for most of the text, while the vast majority of words are rare. This tool visualizes that pattern and lets you explore its implications.

Frequency band visualization showing how many lemmas fall into each range
Hapax legomena — words that occur only once in the entire corpus
Cumulative coverage curves showing how many words you need to cover 50%, 90%, or 99% of all tokens
Estonian versus Finnish frequency comparisons
Interactive exploration of specific frequency ranges
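
The ideas behind hapax legomena and cumulative coverage can be sketched in a few lines. The example below is a minimal illustration on a made-up token list, not the code behind the Distribution page; the function names and sample data are hypothetical.

```python
from collections import Counter


def hapax_legomena(counts):
    """Return the lemmas that occur exactly once, in alphabetical order."""
    return sorted(lemma for lemma, n in counts.items() if n == 1)


def lemmas_needed_for_coverage(counts, target):
    """Return how many of the most frequent lemmas are needed so that
    their tokens make up at least `target` (0 to 1) of all tokens."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)


# Hypothetical toy data: 10 tokens over 4 lemmas.
tokens = ["olla"] * 5 + ["ja"] * 3 + ["laulu", "kulta"]
counts = Counter(tokens)

print(hapax_legomena(counts))
for target in (0.5, 0.9, 1.0):
    print(target, lemmas_needed_for_coverage(counts, target))
```

In this toy sample a single lemma already covers half of the tokens, which is the same shape, on a small scale, that a cumulative coverage curve shows for a Zipf-like vocabulary.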

Open Frequency Distribution →

Regional Vocabulary

The Places page maps the geographic dimension of the corpus across 803 collection places and 292,092 poems. An interactive Leaflet map shows where poems were gathered, and you can search for any word to see where it appears geographically. This is particularly useful for studying dialectal variation and regional poetic traditions.

Interactive map with 803 collection locations across Estonia and Finland
Word search to see the geographic distribution of any term
10 map modes showing different aspects of the data (poem density, vocabulary richness, language distribution, and more)
Region filtering to discover locally distinctive vocabulary
Click any location to see the poems collected there

Open Regional Vocabulary →

Corpus Timeline

The Timeline page shows when the 285,946 dated poems were collected across four centuries of Finnic folklore documentation, from the 1560s through the 1970s. It reveals the waves of collection activity that built up these corpora, and how the focus shifted between regions and traditions over time.

Temporal distribution of poem collection from the 1560s to the 1970s
Collection activity peaks corresponding to major folklore campaigns
Per-corpus timelines showing when SKVR, JR, and ERAB poems were gathered
Collector activity periods showing who was active when
Interactive filtering by decade, century, or custom time range
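
Filtering by decade amounts to grouping collection years into ten-year bins. The sketch below shows one way to do that, assuming each dated poem carries a single integer collection year; the function name and the sample years are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def poems_per_decade(years):
    """Count poems per decade, keyed by the decade's first year (e.g. 1880)."""
    return Counter((year // 10) * 10 for year in years)


# Hypothetical collection years, for illustration only.
years = [1564, 1888, 1889, 1891, 1905, 1972]
by_decade = poems_per_decade(years)

for decade in sorted(by_decade):
    print(f"{decade}s: {by_decade[decade]}")
```

The same grouping works for centuries by dividing by 100 instead of 10.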

Open Corpus Timeline →

Collector Explorer

The Collector Explorer lets you browse the 7,482 individuals who gathered the 292,092 poems in the corpus. Behind every poem is a collector — from well-known folklorists like Jakob Hurt and Elias Lönnrot to anonymous local contributors who recorded a handful of songs from their communities. This tool lets you explore the human effort behind the data.

Searchable index of 7,482 collectors with poem counts
Collection place information showing where each collector worked
Time period data revealing when each collector was active
Links to the poems gathered by each collector
Sorting and filtering to find the most prolific collectors or those active in specific regions

Open Collector Explorer →

Overview

Tool	Key Data	Best For
Statistics	3 corpora, 439K lemmas, 15.3M tokens	Understanding corpus composition and size
Distribution	439,746 lemmas across frequency bands	Exploring vocabulary frequency and coverage
Places	803 locations, 292,092 poems, 10 map modes	Geographic patterns and dialectal variation
Timeline	285,946 dated poems, 1560s–1970s	Historical collection patterns
Collectors	7,482 collectors, 292,092 poems	People behind the corpus

