YAWA - Yet Another Word Aligner

 
LRT + Open Submissions
  Authors
Tufiş, Dan ; Ion, Radu
  Item identifier
http://hdl.handle.net/11372/LRT-1300
 Date issued
2014-07-30
 Type
toolService
 Language(s)
English , Romanian
 Description
YAWA is a four-stage lexical aligner that aligns the words of a given bitext. Stage 1 uses bilingual translation lexicons produced by TREQ (http://www.clarin.eu/tools/translation-equivalents-extractor) together with phrase boundary detection to produce an initial alignment. Using this alignment, in stage 2 a language-dependent module takes over and aligns the remaining lexical tokens within aligned chunks. Stage 3 specializes in aligning blocks of consecutive unaligned tokens, and stage 4 deletes alignments that are likely to be wrong.

Developed in Perl, YAWA is language independent except for the modules that produce alignments specific to a given language pair. So far it works only for the Romanian-English pair. It requires a parallel corpus in XCES (http://www.xces.org) format, morpho-syntactically annotated and lemmatized using TTL (http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts), and translation dictionaries produced by TREQ. YAWA's individual F-measure is 81.22%. YAWA is currently part of the COWAL (http://www.clarin.eu/tools/cowal-combined-word-aligner) combined lexical alignment platform.

More detailed descriptions are available in the following papers (http://www.racai.ro/~tufis/papers):
-- Radu Ion (2007). Word Sense Disambiguation Methods Applied to English and Romanian (in Romanian). PhD thesis. Romanian Academy, Bucharest.
-- Dan Tufiş (2007). Exploiting Aligned Parallel Corpora in Multilingual Studies and Applications. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Intercultural Collaboration. First International Workshop (IWIC 2007), volume 4568 of Lecture Notes in Computer Science, pp. 103-117. Springer-Verlag, August 2007. ISBN 978-3-540-73999-9.
-- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2006). Improved Lexical Alignment by Combining Multiple Reified Alignments. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Proceedings of the 11th Conference EACL2006, pp. 153-160, Trento, Italy, April 2006. Association for Computational Linguistics. ISBN 1-9324-32-61-2.
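The four-stage flow described above can be illustrated with a minimal Python sketch. This is not YAWA's code (YAWA is written in Perl and works on XCES corpora annotated with TTL). Every function name, the toy lexicon, the equal-length block rule in stage 3, and the distance-based filter in stage 4 are hypothetical stand-ins chosen only to show how the stages hand a set of links to one another. The F-measure helper assumes the usual balanced F-measure over alignment links, which the description does not spell out.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)
Stage2 = Callable[[List[str], List[str], Set[Link]], Set[Link]]


def stage1_lexicon(src: List[str], tgt: List[str],
                   lexicon: Dict[str, Set[str]]) -> Set[Link]:
    """Toy stage 1: link every token pair listed in the translation lexicon."""
    return {(i, j) for i, s in enumerate(src) for j, t in enumerate(tgt)
            if t in lexicon.get(s, set())}


def stage3_blocks(n_src: int, n_tgt: int, links: Set[Link]) -> Set[Link]:
    """Toy stage 3: between two neighbouring links, pair up the unaligned
    tokens one-to-one when both sides have the same number of them."""
    anchors = sorted(links | {(-1, -1), (n_src, n_tgt)})  # sentence-edge sentinels
    aligned_tgt = {j for _, j in links}
    result = set(links)
    for (i1, j1), (i2, j2) in zip(anchors, anchors[1:]):
        gap = i2 - i1
        if gap > 1 and gap == j2 - j1:
            block = [(i1 + k, j1 + k) for k in range(1, gap)]
            if all(j not in aligned_tgt for _, j in block):
                result.update(block)
    return result


def stage4_filter(links: Set[Link], max_jump: int = 2) -> Set[Link]:
    """Toy stage 4: drop links whose positions are suspiciously far apart."""
    return {(i, j) for i, j in links if abs(i - j) <= max_jump}


def align(src: List[str], tgt: List[str], lexicon: Dict[str, Set[str]],
          stage2: Optional[Stage2] = None, max_jump: int = 2) -> Set[Link]:
    """Run the four stages in order; stage 2 is a pluggable, language-pair
    specific callback and is skipped when none is supplied."""
    links = stage1_lexicon(src, tgt, lexicon)
    if stage2 is not None:
        links = stage2(src, tgt, links)
    links = stage3_blocks(len(src), len(tgt), links)
    return stage4_filter(links, max_jump)


def f_measure(predicted: Set[Link], gold: Set[Link]) -> float:
    """Balanced F-measure of predicted links against gold-standard links."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy English-Romanian sentence pair and lexicon (illustrative data only).
en = ["i", "see", "a", "house"]
ro = ["eu", "văd", "o", "casă"]
toy_lexicon = {"i": {"eu"}, "see": {"văd"}, "house": {"casă"}}
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}

links = align(en, ro, toy_lexicon)
print(sorted(links), f_measure(links, gold))
```

In this toy run the lexicon covers three of the four words, and the block stage fills in the remaining pair because it sits between two existing links. A real language-dependent stage 2 would be passed in through the `stage2` argument.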
 Publisher
Research Institute for Artificial Intelligence, Romanian Academy of Sciences
 Subject(s)
word aligner
 Collection(s)
LRT + Open Submissions Data & Tools