ABC - Language Identifier

dc.contributor.other	Tufiş, Dan
dc.contributor.other	Ceauşu, Alexandru
dc.date.accessioned	2014-07-30T21:16:05Z
dc.date.available	2014-07-30T21:16:05Z
dc.date.issued	2014-07-30
dc.identifier.uri	http://hdl.handle.net/11372/LRT-198
dc.description	The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages. Using training texts in different languages (approx. 1.5 MB of text for each language), a training module counts the prefixes (the first 3 characters) and the suffixes (the last 4 characters) of all the words in the texts, for each language. For every language, two models are constructed, containing the weights (percentages) of the prefixes and suffixes in the texts representing that language. In the prediction phase, two models are built on the fly for the new text in the same manner. These models are then compared with the stored models of each language for which the application was trained. Using comparison functions, the best-matching model is chosen. More detailed descriptions are available in the following papers (http://www.racai.ro/~tufis/papers): -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association. ISBN 2-9517408-4-0. -- Dan Tufiş and Alexandru Ceauşu (2007). Diacritics Restoration in Romanian Texts. In Elena Paskaleva and Milena Slavcheva (eds.), A Common Natural Language Processing Paradigm for Balkan Languages - RANLP 2007 Workshop Proceedings, pp. 49-56, Borovets, Bulgaria, September 2007. INCOMA Ltd., Shoumen, Bulgaria. ISBN 978-954-91743-8-0. -- Dan Tufiş and Adrian Chiţu (1999). Automatic Insertion of Diacritics in Romanian Texts. In Ferenc Kiefer, Gábor Kiss, and Júlia Pajzs (eds.), Proceedings of the 5th International Workshop on Computational Lexicography (COMPLEX 1999), pp. 185-194, Pecs, Hungary, May 1999. Linguistics Institute, Hungarian Academy of Sciences.
dc.publisher	Research Institute for Artificial Intelligence, Romanian Academy of Sciences
dc.source.uri	http://www.racai.ro/webservices/
dc.title	ABC - Language Identifier
dc.type	toolService
has.files	no
additional.metadata	Nid:3591 Readily Available (field_resource_available):Yes
branding	LRT + Open Submissions
dc.coverage.placeName	Romania
files.size	0
files.count	0
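
The training and prediction steps summarized in dc.description can be sketched as follows. This is a minimal illustration in Python, not the original C# application: the function names (`affix_models`, `distance`, `identify`), the word tokenization, and the comparison function (a sum of absolute weight differences) are all assumptions, since the record does not specify which comparison functions the tool uses.

```python
import re
from collections import Counter


def affix_models(text, prefix_len=3, suffix_len=4):
    """Build the two models for a text: prefix weights and suffix weights.

    Weights are percentages of the total word count. Words shorter than
    the affix length contribute the whole word (an assumption).
    """
    # Assumed tokenization: maximal runs of Unicode letters, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words)
    prefixes = Counter(w[:prefix_len] for w in words)
    suffixes = Counter(w[-suffix_len:] for w in words)

    def to_weights(counts):
        return {affix: 100.0 * n / total for affix, n in counts.items()}

    return to_weights(prefixes), to_weights(suffixes)


def distance(model_a, model_b):
    """Hypothetical comparison function: sum of absolute weight differences."""
    affixes = set(model_a) | set(model_b)
    return sum(abs(model_a.get(a, 0.0) - model_b.get(a, 0.0)) for a in affixes)


def identify(text, trained):
    """Return the language whose stored models are closest to the text's models.

    `trained` maps a language code to the (prefix, suffix) model pair
    produced by affix_models on that language's training text.
    """
    prefix_model, suffix_model = affix_models(text)
    return min(
        trained,
        key=lambda lang: distance(prefix_model, trained[lang][0])
        + distance(suffix_model, trained[lang][1]),
    )


# Toy training data; the real tool uses far larger texts per language.
trained = {
    "en": affix_models("the quick brown fox jumps over the lazy dog and the cat"),
    "ro": affix_models("vulpea maro sare peste câinele leneș și pisica"),
}
guess = identify("the dog and the fox", trained)
```

With training texts this small the result is only illustrative; the approach relies on large samples so that the affix percentages become stable per language.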
