
ABC - Language Identifier

 
LRT + Open Submissions
  Authors
Tufiş, Dan and Ceauşu, Alexandru
  Item identifier
http://hdl.handle.net/11372/LRT-198
 Project URL
http://www.racai.ro/webservices/
 Date issued
2014-07-30
 Type
toolService
 Description
The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages. During training, a module processes texts in each language (approx. 1.5 MB of text per language) and counts the prefixes (the first 3 characters) and suffixes (the last 4 characters) of every word. For each language, two models are built: one holding the weights (percentages) of the prefixes and one holding the weights of the suffixes found in that language's training texts. In the prediction phase, two models are built on the fly for the new text in the same manner and compared with the stored models of each language the application was trained on. Using comparison functions, the best-matching model is chosen.

More detailed descriptions are available in the following papers (http://www.racai.ro/~tufis/papers):
-- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association. ISBN 2-9517408-4-0.
-- Dan Tufiş and Alexandru Ceauşu (2007). Diacritics Restoration in Romanian Texts. In Elena Paskaleva and Milena Slavcheva (eds.), A Common Natural Language Processing Paradigm for Balkan Languages - RANLP 2007 Workshop Proceedings, pp. 49-56, Borovets, Bulgaria, September 2007. INCOMA Ltd., Shoumen, Bulgaria. ISBN 978-954-91743-8-0.
-- Dan Tufiş and Adrian Chiţu (1999). Automatic Insertion of Diacritics in Romanian Texts. In Ferenc Kiefer, Gábor Kiss, and Júlia Pajzs (eds.), Proceedings of the 5th International Workshop on Computational Lexicography (COMPLEX 1999), pp. 185-194, Pecs, Hungary, May 1999. Linguistics Institute, Hungarian Academy of Sciences.
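The training and prediction steps in the description can be illustrated with a minimal sketch. It is written in Python for brevity and is not the original C# code. The prefix and suffix lengths come from the description; everything else is an assumption made for illustration: the function names (`build_model`, `overlap`, `identify`), the tokenization, the handling of words shorter than the affix length, and in particular the comparison function, which here is a simple histogram intersection because the description does not say which comparison functions the application uses.

```python
import re
from collections import Counter

PREFIX_LEN = 3  # from the description: first 3 characters of each word
SUFFIX_LEN = 4  # from the description: last 4 characters of each word


def normalize(counts):
    """Turn raw counts into relative weights that sum to 1."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


def build_model(text):
    """Return (prefix_weights, suffix_weights) for a text.

    Assumption: words are lowercased runs of letters; words shorter than
    the affix length contribute the whole word.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    prefixes = Counter(w[:PREFIX_LEN] for w in words)
    suffixes = Counter(w[-SUFFIX_LEN:] for w in words)
    return normalize(prefixes), normalize(suffixes)


def overlap(a, b):
    """Histogram intersection of two weight tables (assumed comparison)."""
    return sum(min(weight, b[key]) for key, weight in a.items() if key in b)


def identify(text, models):
    """Return the language whose stored models best match the text."""
    prefix_model, suffix_model = build_model(text)
    scores = {
        lang: overlap(prefix_model, stored_prefixes)
        + overlap(suffix_model, stored_suffixes)
        for lang, (stored_prefixes, stored_suffixes) in models.items()
    }
    return max(scores, key=scores.get)


# Toy training data; the real tool uses about 1.5 MB of text per language.
models = {
    "en": build_model("the quick brown fox jumps over the lazy dog "
                      "and the cat sleeps in the house"),
    "ro": build_model("vulpea maro sare peste câinele leneș "
                      "iar pisica doarme în casă"),
}
guess = identify("the dog sleeps in the house", models)
```

With training texts this small the result depends entirely on exact affix matches, so the sketch only shows the data flow: count affixes, convert counts to weights, then score a new text against each stored pair of models.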
 Publisher
Research Institute for Artificial Intelligence, Romanian Academy of Sciences
 Collection(s)
LRT + Open Submissions Data & Tools