ABC - Language Identifier

 
LRT + Open Submissions
  Authors
Tufiş, Dan ; Ceauşu, Alexandru
  Item identifier
http://hdl.handle.net/11372/LRT-198
 Project URL
http://www.racai.ro/webservices/
 Date issued
2014-07-30
 Type
toolService
 Description
The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages. During training, a module reads texts in each language (approx. 1.5 MB of text per language) and counts the prefixes (the first 3 characters) and the suffixes (the last 4 characters) of every word. Two models are built for each language: one holding the weights (percentages) of the prefixes and one holding the weights of the suffixes found in that language's training texts. In the prediction phase, two models are built on the fly for the new text in the same manner. These models are then compared, using comparison functions, with the stored models of each language the application was trained on, and the best-matching model is chosen.
More detailed descriptions are available in the following papers (http://www.racai.ro/~tufis/papers):
-- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association. ISBN 2-9517408-4-0.
-- Dan Tufiş and Alexandru Ceauşu (2007). Diacritics Restoration in Romanian Texts. In Elena Paskaleva and Milena Slavcheva (eds.), A Common Natural Language Processing Paradigm for Balkan Languages - RANLP 2007 Workshop Proceedings, pp. 49-56, Borovets, Bulgaria, September 2007. INCOMA Ltd., Shoumen, Bulgaria. ISBN 978-954-91743-8-0.
-- Dan Tufiş and Adrian Chiţu (1999). Automatic Insertion of Diacritics in Romanian Texts. In Ferenc Kiefer, Gábor Kiss, and Júlia Pajzs (eds.), Proceedings of the 5th International Workshop on Computational Lexicography (COMPLEX 1999), pp. 185-194, Pecs, Hungary, May 1999. Linguistics Institute, Hungarian Academy of Sciences.
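The training and prediction steps in the description can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the authors' C# implementation. The prefix length (3) and suffix length (4) come from the description. Everything else is hypothetical: the tokenizer, the treatment of words shorter than the prefix or suffix length, the overlap-based `similarity` function (the description does not say which comparison functions are used), and all function names.

```python
import re
from collections import Counter

PREFIX_LEN = 3  # from the description: first 3 characters of each word
SUFFIX_LEN = 4  # from the description: last 4 characters of each word


def tokenize(text):
    # Assumption: words are maximal runs of letters, compared in lowercase.
    return re.findall(r"[^\W\d_]+", text.lower())


def to_percentages(counts):
    # Turn raw counts into weights that sum to 100 (empty input -> empty model).
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: 100.0 * n / total for key, n in counts.items()}


def build_models(text):
    # Return (prefix_model, suffix_model) for a text.
    # Assumption: words shorter than the affix length count as a whole.
    words = tokenize(text)
    prefixes = Counter(w[:PREFIX_LEN] for w in words)
    suffixes = Counter(w[-SUFFIX_LEN:] for w in words)
    return to_percentages(prefixes), to_percentages(suffixes)


def similarity(model_a, model_b):
    # Assumption: shared weight mass, i.e. the sum of per-key minima (0..100).
    return sum(min(w, model_b[k]) for k, w in model_a.items() if k in model_b)


def train(samples):
    # samples maps a language code to its training text.
    return {lang: build_models(text) for lang, text in samples.items()}


def identify(text, trained):
    # Build models for the new text and pick the closest stored language.
    prefix_model, suffix_model = build_models(text)

    def score(lang):
        lang_prefixes, lang_suffixes = trained[lang]
        return (similarity(prefix_model, lang_prefixes)
                + similarity(suffix_model, lang_suffixes))

    return max(trained, key=score)


# Toy training data; the real tool uses about 1.5 MB of text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and the cat sleeps in the house",
    "ro": "vulpea maro sare peste câinele leneș iar pisica doarme în casă",
}
TRAINED = train(SAMPLES)
print(identify("the dog sleeps in the garden", TRAINED))
```

Summing the prefix and suffix similarities is only one way to combine the two models; a weighted sum or a distance measure would fit the description equally well.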
 Publisher
Research Institute for Artificial Intelligence, Romanian Academy of Sciences
 Collection(s)
LRT + Open Submissions Data & Tools