NGRAM
The AGH n-gram model of the Polish language is a large database of word occurrence statistics. It includes the frequencies of single words (1-grams), word pairs (2-grams) and word triples (3-grams).
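As a rough illustration of what such frequencies are, the sketch below counts 1-, 2- and 3-grams in a toy sentence. It is not the procedure used to build the AGH model: the helper name count_ngrams, the sample sentence and the whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy input, split on whitespace. This is an assumption: the tokenization
# used for the actual model is not described here.
text = "ala ma kota a kot ma ale"
tokens = text.split()

unigrams = count_ngrams(tokens, 1)  # single words
bigrams = count_ngrams(tokens, 2)   # word pairs
trigrams = count_ngrams(tokens, 3)  # word triples

# Show the most frequent entry of each order.
for name, counts in (("1-grams", unigrams), ("2-grams", bigrams), ("3-grams", trigrams)):
    print(name, counts.most_common(1))
```

Keys are tuples of words, so the same lookup pattern works for every order, for example bigrams[("ala", "ma")].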
The model statistics were collected from a large collection of texts from various sources (about 10 GB of text, over a billion words). Semi-manual corrections, which took about 40 days of work, were carried out using Fixgram, a tool written specially for this purpose.
The model contains about 8 million distinct words. Some of them are foreign, and some result from typing errors in the analysed texts. Many proper names are included. The statistical parameters of the model reflect the occurrence of specific words in Polish very well, especially for 1- and 2-grams. The model is still being developed and revised.
More information about the model is available in the documentation.
The authors invite anyone interested in audio processing technologies to contact the spin-off company techmo.pl.