LX-Tokenizer

  Authors
Branco, António ; Silva, João
  Item identifier
http://hdl.handle.net/11372/LRT-1230
 Project URL
http://lxsuite.di.fc.ul.pt
 Date issued
2014-07-30
 Type
toolService
 Language(s)
Portuguese
 Description
Automatic segmenter of lexemes of Portuguese. It performs the following operations. In the examples, the | (vertical bar) symbol is used to mark token boundaries more clearly.

Segments text into lexically relevant tokens, using whitespace as the separator:

    um exemplo → |um|exemplo|

Expands contractions. The first element of an expanded contraction is marked with an _ (underscore) symbol:

    do → |de_|o|

Marks spacing around punctuation or symbols. The \* and */ symbols indicate a space to the left and a space to the right, respectively:

    um, dois e três → |um|,*/|dois|e|três|
    5.3 → |5|.|3|
    1. 2 → |1|.*/|2|
    8 . 6 → |8|\*.*/|6|

Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. In mesoclisis, a -CL- mark signals the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:

    dá-se-lho → |dá|-se|-lhe|-o|
    afirmar-se-ia → |afirmar-CL-ia|-se|
    vê-las → |vê#|-las|

Handles ambiguous strings. These are words that, depending on the particular occurrence, can be tokenized in different ways. For instance:

    deste → |deste| when occurring as a Verb
    deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)

This tool achieves an F-score of 99.72%.
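The output notation described above can be illustrated with a short Python sketch. This is not the LX-Tokenizer implementation, only a rough approximation of two of its conventions: the spacing marks around punctuation and the underscore mark on expanded contractions. The names `tokenize`, `render`, and `CONTRACTIONS` are hypothetical, the contraction table holds only the single example given in the description, and the handling of punctuation at the very start or end of the input (no mark) is an assumption. Clitic detachment and ambiguous strings such as "deste" are deliberately left out, since they need lexical and contextual knowledge that the description does not specify; hyphenated forms are simply kept as one token.

```python
import re

# Hypothetical mini-lexicon: only "do" is taken from the description above.
CONTRACTIONS = {"do": ["de_", "o"]}

# A word (hyphenated forms are kept whole) or a single punctuation/symbol character.
TOKEN_RE = re.compile(r"(?P<word>\w+(?:-\w+)*)|(?P<punct>[^\w\s])")


def tokenize(text):
    """Split text into tokens, marking spacing around punctuation."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        piece = match.group()
        if match.lastgroup == "word":
            # Expand known contractions; leave every other word unchanged.
            tokens.extend(CONTRACTIONS.get(piece, [piece]))
            continue
        start, end = match.span()
        # "\*" marks a space to the left, "*/" a space to the right.
        left = "\\*" if start > 0 and text[start - 1].isspace() else ""
        right = "*/" if end < len(text) and text[end].isspace() else ""
        tokens.append(left + piece + right)
    return tokens


def render(tokens):
    """Join tokens with vertical bars, as in the examples above."""
    return "|" + "|".join(tokens) + "|"


print(render(tokenize("um, dois e três")))  # |um|,*/|dois|e|três|
print(render(tokenize("8 . 6")))            # |8|\*.*/|6|
print(render(tokenize("do")))               # |de_|o|
```

Recording the spacing on the punctuation token itself means the original text can be rebuilt from the token sequence, which is presumably why the notation distinguishes "1. 2" from "1.2".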
 Publisher
NLX-Natural Language and Speech Group, University of Lisbon
 Collection(s)
LRT + Open Submissions Data & Tools

THE LINDAT/CLARIN PROJECT (LM2015071 and CZ.02.1.01/0.0/0.0/16_013/0001781; formerly LM2010013) IS FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS
AND YOUTH OF THE CZECH REPUBLIC UNDER THE PROGRAMME LM OF "LARGE INFRASTRUCTURES".

Copyright (c) 2018 UFAL MFF UK. All rights reserved.