NGRAM

The AGH n-gram model of the Polish language is a large frequency dictionary of words, stored as a database. It records the frequencies of single words (1-grams), word pairs (2-grams) and word triples (3-grams).
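To illustrate what word, pair and triple frequencies mean, here is a minimal Python sketch that counts n-grams in a toy sentence. It is not the code used to build the AGH model. The `count_ngrams` helper, the whitespace tokenisation and the sample text are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy corpus for illustration only; it is not taken from the AGH data.
text = "ala ma kota a kot ma ale ala ma psa"
tokens = text.split()  # naive whitespace tokenisation, assumed for this sketch

unigrams = count_ngrams(tokens, 1)  # single words
bigrams = count_ngrams(tokens, 2)   # word pairs
trigrams = count_ngrams(tokens, 3)  # word triples

# Show the most frequent entry at each level.
print(unigrams.most_common(1))
print(bigrams.most_common(1))
print(trigrams.most_common(1))
```

A real corpus of this size would also need proper tokenisation, normalisation and on-disk storage, none of which this sketch attempts.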

The model statistics were collected from a large corpus of texts from various sources (about 10 GB of text, over a billion words). Semi-manual corrections, which took about 40 days of work, were carried out using Fixgram, a software tool prepared specially for this purpose.

Fixgram window view

This model of the Polish language contains about 8 million distinct words. Some of them are foreign, and some result from typing errors in the analysed texts. Many proper names are also included. The statistical parameters of the model reflect the occurrence of specific words in Polish very well, especially for 1-grams and 2-grams. The model is still being developed and revised.

N-gram database model schema

More information about the model is available in the documentation.

The authors invite anyone interested in audio processing technologies to contact the spin-off company techmo.pl.

Copyright © Zespół Przetwarzania Sygnałów AGH 2011-2014