Corpus Statistics - RunoVerse

Statistics are computed from the full RunoVerse: 439,746 lemmas with 15.3M token attestations across three collections (SKVR, JR, ERAB). This includes 251,392 lemmas with both corpus and DeepSeek analysis, and 183,137 DeepSeek-only lemmas.
Each lemma has a corpus frequency (total occurrences), POS tag, language assignment, and word form list.

Zipf's Law: Plots lemma rank (x-axis) against frequency (y-axis) on a log-log scale. Natural-language frequencies fall approximately on a straight line in this view (Zipf's law). The slope of that line shows how quickly frequency falls off with rank, which serves as an indicator of vocabulary diversity.
POS Distribution: Part-of-speech breakdown. Nouns dominate in runosong, as in most corpora; the high INTJ (interjection) count reflects formulaic exclamations.
Language & Source: Donut charts showing Estonian vs Finnish breakdown and corpus vs DeepSeek-derived lemma sources.
Vocabulary Overlap (Venn): Two-circle diagram showing lemmas unique to Estonian (ERAB only), unique to Finnish (SKVR+JR only), and shared across both.
Collection Distribution: SKVR (Finnish), JR (Finnish), ERAB (Estonian) token counts. SKVR is the largest collection.
Vocabulary Richness: TTR (type-token ratio), hapax legomena (words occurring once), dis legomena (occurring twice), and other standard corpus statistics.
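The Zipf, Vocabulary Overlap, and Vocabulary Richness panels above each reduce to a short computation over the lemma table. The following is a minimal sketch, not the page's actual implementation. It assumes a mapping from lemma to corpus frequency and a mapping from lemma to the set of collections it occurs in. The function names (`richness`, `zipf_slope`, `overlap`) and the toy data are hypothetical, and the least-squares fit is one common way to estimate the slope, not necessarily the one used here.

```python
import math
from collections import Counter

# Language assignment of collections, as stated on this page.
FINNISH = {"SKVR", "JR"}
ESTONIAN = {"ERAB"}

def richness(freqs):
    """Type-token ratio, hapax and dis legomena for {lemma: frequency}."""
    tokens = sum(freqs.values())
    by_count = Counter(freqs.values())  # how many lemmas occur n times
    return {
        "types": len(freqs),
        "tokens": tokens,
        "ttr": len(freqs) / tokens,
        "hapax": by_count[1],  # lemmas occurring exactly once
        "dis": by_count[2],    # lemmas occurring exactly twice
    }

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two lemmas, all with positive frequency.
    """
    ranked = sorted(freqs.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def overlap(collections_by_lemma):
    """Count lemmas that are Estonian-only, Finnish-only, or shared."""
    counts = {"estonian_only": 0, "finnish_only": 0, "shared": 0}
    for colls in collections_by_lemma.values():
        colls = set(colls)
        in_et = bool(colls & ESTONIAN)
        in_fi = bool(colls & FINNISH)
        if in_et and in_fi:
            counts["shared"] += 1
        elif in_et:
            counts["estonian_only"] += 1
        elif in_fi:
            counts["finnish_only"] += 1
    return counts

# Toy data with placeholder lemma names (not real corpus values).
toy_freqs = {"a": 8, "b": 4, "c": 2, "d": 2, "e": 1, "f": 1, "g": 1}
toy_colls = {"a": {"SKVR", "ERAB"}, "b": {"ERAB"}, "c": {"JR"}, "d": {"SKVR", "JR"}}

stats = richness(toy_freqs)
slope = zipf_slope(toy_freqs)
venn = overlap(toy_colls)
```

A slope near -1 corresponds to the classic Zipfian shape. In this sketch, TTR is computed at the lemma level (lemmas divided by token attestations), which is an assumption about how the page defines it.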

Annotation Confidence: tc = translation confidence (0-100%), ec = etymology confidence. The distributions of these scores indicate how reliable DeepSeek's annotations are.
Top Lemmas: Sortable table of the most frequent lemmas with POS, language badge, and frequency. Filterable by language and POS.
Morphological Complexity: Distribution of word forms per lemma. Most lemmas have 1-5 word forms; highly inflected lemmas (20+) are listed separately.
Grammatical Categories: Keyword tags extracted from grammatical annotations, grouped by frequency tier.
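The Morphological Complexity panel can be pictured as a simple binning of lemmas by their number of distinct word forms. The following is a rough sketch under stated assumptions. The 1-5 bin and the 20+ threshold come from the description above, while the middle bin (6-19), the function name `form_count_bins`, and the placeholder data are hypothetical. It also assumes every lemma has at least one word form.

```python
from collections import Counter

def form_count_bins(forms_by_lemma, high=20):
    """Bin lemmas by number of distinct word forms.

    Returns (bin counts, list of (lemma, n_forms) for lemmas with
    at least `high` forms, most inflected first).
    """
    bins = Counter()
    highly_inflected = []
    for lemma, forms in forms_by_lemma.items():
        n = len(set(forms))  # count each distinct form once
        if n >= high:
            bins[f"{high}+"] += 1
            highly_inflected.append((lemma, n))
        elif n <= 5:
            bins["1-5"] += 1
        else:
            bins[f"6-{high - 1}"] += 1  # assumed middle bin
    # Sort by descending form count, then alphabetically for stable output.
    highly_inflected.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(bins), highly_inflected

# Placeholder lemmas and forms (not real corpus data).
toy_forms = {
    "lemma_a": ["form1", "form2", "form2"],
    "lemma_b": [f"form{i}" for i in range(8)],
    "lemma_c": [f"form{i}" for i in range(25)],
    "lemma_d": ["form1"],
}

bins, top = form_count_bins(toy_forms)
```

Keeping the highly inflected lemmas in a separate list mirrors the way the panel lists them apart from the main distribution.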

Frequency Distribution (Zipf's Law)

7 bands, hapax, coverage, POS

4.3M verses, ET + FI