Persian corpora
Scholarly databases that document Persian and its dialects. Each card opens the resource in a new tab.
- Corpus
UD Persian Seraji Treebank
Universal Dependencies treebank for Iranian Persian (~6,000 sentences, 152k tokens) converted from the Uppsala Persian Dependency Treebank (Uppsala University / Universal Dependencies).
Dialects: Farsi
- Corpus
CHILDES — Family Persian Corpus
Audio recordings and transcripts of two Tehran-Persian children, collected for L1 acquisition research (TalkBank / Carnegie Mellon).
Dialects: Farsi
- Corpus
Tajik National Corpus
58.4-million-word annotated corpus of Tajik with English and Russian glosses (Russian-Tajik Slavic University).
Dialects: Tajik
- Corpus
Normalized Bijankhan Corpus
Normalized release of the Bijankhan Persian POS-tagged news corpus (~2.6M tokens) from the Database Research Group, University of Tehran (Tihu NLP / University of Tehran).
Dialects: Farsi
- Dictionary
Living Dictionary — Hazaragi
Community-built dictionary of Hazaragi, the Persian variety spoken by the Hazara of Afghanistan, with audio recordings and Arabic-script orthography (Living Tongues Institute for Endangered Languages).
Dialects: Hazaragi