english.csv

English ARPABET

A subset of the CMU Pronouncing Dictionary containing words with CELEX frequencies > 1. Represented in ARPABET. Numbers indicating vowel stress have been removed.
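As an illustration of what removing the stress numbers means, the sketch below strips the trailing digit from each ARPABET symbol. It assumes space-separated symbols with stress marked as a final 0, 1, or 2 on vowels, as in the CMU Pronouncing Dictionary. The function name `strip_stress` is hypothetical and this is not necessarily how the file was prepared.

```python
import re


def strip_stress(pronunciation: str) -> str:
    """Remove a trailing stress digit from each space-separated ARPABET symbol."""
    # Assumption: stress is written as one digit at the end of a vowel symbol.
    symbols = pronunciation.split()
    return " ".join(re.sub(r"\d$", "", symbol) for symbol in symbols)


print(strip_stress("K AE1 T"))      # K AE T
print(strip_stress("AH0 B AW1 T"))  # AH B AW T
```

Symbols without a digit, such as consonants, pass through unchanged.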

french.csv

French IPA

French corpus used in Goldsmith & Xanthos (2009) and Mayer (2020). Represented in IPA.

finnish.csv

Finnish Ortho

From a word list generated by the Institute for the Languages of Finland (http://kaino.kotus.fi/sanat/nykysuomi/). Represented orthographically. See Mayer (2020) for details.

samoan.csv

Samoan IPA

Samoan word list from Milner (1993), compiled by Kie Zuraw. Represented in IPA.

english_freq.csv

English ARPABET Frequencies

A subset of the CMU Pronouncing Dictionary with CELEX frequencies. Represented in ARPABET.

english_onsets.csv

English Onsets ARPABET

55 English onsets and their CELEX type frequencies in ARPABET format from Hayes & Wilson (2008). A subset of the onsets in the CMU Pronouncing Dictionary.

polish_onsets.csv

Polish Onsets IPA

Polish onsets with accompanying type frequencies from Jarosz (2017). Generated from a corpus of child-directed speech consisting of about 43,000 word types (Haman et al. 2011). Represented orthographically.

english_needle.csv

English ARPABET Needle

Data set from Needle et al. (2022). Consists of about 11,000 monomorphemic words from CELEX (Baayen et al. 1995) in ARPABET transcription.

spanish_stress.csv

Spanish IPA Stress

A set of 103,005 word types, including citation and inflected forms, taken from the EsPal database (Duchon et al. 2013). Represented in IPA with stress encoded. Frequencies come from a large collection of Spanish subtitle data.

turkish.csv

Turkish IPA

A set of about 18,000 citation forms from the Turkish Electronic Living Lexicon database (TELL; Inkelas et al. 2000) in IPA.