2024-03-28T13:56:03Z
http://lindat.mff.cuni.cz/repository/oai/request
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4872-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Arabic Dependency Treebank 1.0
2009-11-02T10:34:20Z
http://hdl.handle.net/11858/00-097C-0000-0001-4872-3
Hajič, Jan
Smrž, Otakar
Zemánek, Petr
Pajas, Petr
Šnaidauf, Jan
Beška, Emanuel
Kracmar, Jakub
Hassanová, Kamila
2011-06-27T11:59:09Z
The PADT project can be summarized as an open-ended activity of the Center for Computational Linguistics, the Institute of Formal and Applied Linguistics, and the Institute of Comparative Linguistics, Charles University in Prague, based on multi-level annotation of Arabic language resources in the light of the theory of Functional Generative Description (Sgall et al., 1986; Hajičová and Sgall, 2003).
http://hdl.handle.net/11858/00-097C-0000-0001-4872-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
corpus
Arabic
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487A-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Lexico-Semantic Annotation of PDT using Czech WordNet
2011-06-27T13:00:08Z
http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
Bejček, Eduard
Hoffmannová, Petra
Holub, Martin
Hučínová, Marie
Pecina, Pavel
Straňák, Pavel
Šidák, Pavel
Hajič, Jan
2011-06-27T13:00:08Z
This dataset contains annotation of PDT using Czech WordNet ontology: http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation.
1ET100300517, 1ET201120505
http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PDT
Czech WordNet
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4916-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
CzEng 0.7
2009-11-02T10:32:27Z
http://hdl.handle.net/11858/00-097C-0000-0001-4916-9
Bojar, Ondřej
Žabokrtský, Zdeněk
Češka, Pavel
Beňa, Peter
Janíček, Miroslav
2011-06-28T16:13:23Z
CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to support Czech-English and English-Czech machine translation research with the necessary data. CzEng 0.7 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction, all of them converted into a uniform XML-based file format and provided with automatic sentence alignment.
http://hdl.handle.net/11858/00-097C-0000-0001-4916-9
http://hdl.handle.net/11234/1-1458
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
parallel corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4908-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
VALLEX 2.5
2009-11-02T11:50:55Z
http://hdl.handle.net/11858/00-097C-0000-0001-4908-9
Lopatková, Markéta
Žabokrtský, Zdeněk
Kettnerová, Václava
2011-06-28T10:07:47Z
The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.5 has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague.
VALLEX 2.5 provides information on the valency structure (combinatorial potential) of verbs in their particular senses - there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses").
LC 536 - Center for Computational Linguistics, 1ET100300517 and 1ET101120503.
http://hdl.handle.net/11858/00-097C-0000-0001-4908-9
http://hdl.handle.net/11234/1-2307
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
valency
Czech
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4880-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Czech WordNet 1.9 PDT
2011-01-24T09:00:29Z
http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Pala, Karel
Čapek, Tomáš
Zajíčková, Barbora
Bartůšková, Dita
Kulková, Kateřina
Hoffmannová, Petra
Bejček, Eduard
Straňák, Pavel
Hajič, Jan
2011-06-27T14:04:01Z
A slightly modified version of the Czech WordNet. This is the version used to annotate "The Lexico-Semantic Annotation of PDT using Czech WordNet": http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic.
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. This version of WordNet was used to annotate word senses in PDT: http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
A more recent version of Czech WordNet is distributed by ELRA: http://catalog.elra.info/product_info.php?products_id=1089
1ET201120505, LM2010013
http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
ontology
wordnet
Czech WordNet
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487E-B 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
CoNLL 2009 Shared Task Czech Trial Set
2009-01-05T00:00:00Z
http://hdl.handle.net/11858/00-097C-0000-0001-487E-B
Hajič, Jan
Straňák, Pavel
Štěpánek, Jan
2011-06-27T13:16:27Z
Czech trial (example) data for CoNLL 2009 Shared Task. The data are generated from PDT 2.0. LDC2009E32B
MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
http://hdl.handle.net/11858/00-097C-0000-0001-487E-B
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
conll-st
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4909-7 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
UMC 0.1: Czech-Russian-English Multilingual Corpus
2008-10-02T00:00:00Z
http://hdl.handle.net/11858/00-097C-0000-0001-4909-7
Klyueva, Natalia
Bojar, Ondřej
2011-06-28T10:42:32Z
UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian and English languages with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the corpus CzEng mainly for the purposes of machine translation.
All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a large collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.
FP6-IST-5-034291-STP (EuroMatrix)
http://hdl.handle.net/11858/00-097C-0000-0001-4909-7
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
multi-language corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B098-5 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.0 (PDT 2.0)
2006-07-21T00:00:00Z
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
Hajič, Jan
Panevová, Jarmila
Hajičová, Eva
Sgall, Petr
Pajas, Petr
Štěpánek, Jan
Havelka, Jiří
Mikulová, Marie
Žabokrtský, Zdeněk
Ševčíková-Razímová, Magda
Urešová, Zdeňka
2011-11-03T21:33:25Z
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (two million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
1ET101120413 (Data a nástroje pro informační systémy) MSM 0021620838 (Moderní metody, struktury a systémy informatiky) 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů) 1P05ME752 (Vícejazyčný valenční a predikátový slovník přirozeného jazyka) LC536 (Centrum komputační lingvistiky)
LDC2006T01
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
corpus
Czech
treebank
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B43E-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.0 - sample data
2006-06-21T00:00:00Z
http://hdl.handle.net/11858/00-097C-0000-0001-B43E-6
Hajič, Jan
Panevová, Jarmila
Sgall, Petr
Pajas, Petr
Štěpánek, Jan
Havelka, Jiří
Mikulová, Marie
Žabokrtský, Zdeněk
Ševčíková-Razímová, Magda
2011-11-04T15:03:18Z
A small subset of PDT 2.0 made available under a permissive license.
Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
* Ministry of Education of the Czech Republic projects No. VS96151, LN00A063, 1P05ME752, MSM0021620838 and LC536,
* Grant Agency of the Czech Republic grants Nos. 405/96/0198, 405/96/K214 and 405/03/0913,
* research funds of the Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic,
* Grant Agency of the Czech Academy of Science, Prague, Czech Republic projects No. 1ET101120503, 1ET101120413, and 1ET201120505,
* Grant Agency of the Charles University No. 489/04, 350/05, 352/05 and 375/05,
* the U.S. NSF Grant #IIS9732388.
http://hdl.handle.net/11858/00-097C-0000-0001-B43E-6
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
treebank
dependency
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4914-D 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank of Spoken Language (PDTSL) 0.5
2009-11-02T10:40:55Z
http://hdl.handle.net/11858/00-097C-0000-0001-4914-D
Hajič, Jan
Pajas, Petr
Mareček, David
Mikulová, Marie
Urešová, Zdeňka
Podveský, Petr
2011-06-28T11:19:19Z
The first edition of a speech corpus with a speech reconstruction layer (edited transcript).
The project of speech reconstruction of Czech and English was started at UFAL together with the PIRE project in 2005 and has gradually grown from ideas to a (first) annotation specification, annotation software and actual annotation. It is part of the Prague Dependency Treebank family of annotated corpus resources and tools, to which it adds the spoken language layer(s).
LC536; MSM0021620838; IST-034344; ME838
http://hdl.handle.net/11858/00-097C-0000-0001-4914-D
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
PDTSL
https://lindat.mff.cuni.cz/repository/xmlui/page/licence-pdtsl
corpus
spoken language
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-C6D1-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
CoNLL 2009 Shared Task - Czech Data
2009-01-19T00:00:00Z
http://hdl.handle.net/11858/00-097C-0000-0001-C6D1-9
Hajič, Jan
Straňák, Pavel
Štěpánek, Jan
2011-11-08T21:34:04Z
Czech data - both train and test+eval sets, as well as the valency dictionary - for the CoNLL 2009 Shared Task. Documentation is included. The data are generated from PDT 2.0. LDC catalog number: LDC2009E34B
MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
LDC2009E34B, LDC2009E35B
http://hdl.handle.net/11858/00-097C-0000-0001-C6D1-9
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
conll-st
treebank
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F3-0 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
XSH
2009-11-02T09:51:39Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F3-0
Pajas, Petr
2011-06-28T09:38:08Z
XSH is a powerful command-line tool for querying, processing and editing XML documents. It features a shell-like interface with auto-completion for comfortable interactive work, but can also be used for off-line (batch) processing of XML data.
http://hdl.handle.net/11858/00-097C-0000-0001-48F3-0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Artistic License (Perl) 1.0
http://opensource.org/licenses/Artistic-Perl-1.0
XML processing
command-line
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F7-8 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
TrEd
2009-10-13T13:11:11Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F7-8
Pajas, Petr
2011-06-28T09:39:07Z
Tree Editor
TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Among other projects, it was used as the main annotation tool for syntactical and tectogrammatical annotations in The Prague Dependency Treebank, as well as for decision-tree based morphological annotation of The Prague Arabic Dependency Treebank.
http://hdl.handle.net/11858/00-097C-0000-0001-48F7-8
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
annotation
tree
editor
XML
PML
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F8-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
MEd
2009-11-02T09:33:08Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F8-6
Pajas, Petr
Mareček, David
2011-06-28T09:39:18Z
MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that can be interconnected by links. MEd can also be used for other purposes, such as word-to-word alignment of parallel corpora.
http://hdl.handle.net/11858/00-097C-0000-0001-48F8-6
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
annotation tool
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F9-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
HMM tagger
2009-11-02T09:25:18Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F9-4
Krbec, Pavel
2011-06-28T09:39:30Z
The HMM-based Tagger is software for morphological disambiguation (tagging) of Czech texts. The algorithm is statistical, based on Hidden Markov Models.
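As a rough illustration of the kind of decoding an HMM tagger performs, the Python sketch below runs the Viterbi algorithm over a toy bigram model. It is not the code of this tool: the two-tag tagset, the probability tables and the `viterbi` function are hypothetical and chosen only to show the technique.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence for a non-empty word list under a bigram HMM."""
    def lp(p):
        # Log-probability with a small floor so unseen events do not yield log(0).
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at position i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for w in words[1:]:
        prev = best[-1]
        col = {}
        for t in tags:
            score, back = max((prev[p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags)
            col[t] = (score + lp(emit_p[t].get(w, 0)), back)
        best.append(col)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical toy model: N = noun, V = verb.
tags = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"pes": 0.9, "štěká": 0.1}, "V": {"pes": 0.1, "štěká": 0.9}}

print(viterbi(["pes", "štěká"], tags, start_p, trans_p, emit_p))  # ['N', 'V']
```

A real tagger for Czech would use a far larger positional tagset and smoothed probabilities estimated from annotated data; the sketch only shows the dynamic-programming step.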
http://hdl.handle.net/11858/00-097C-0000-0001-48F9-4
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
tagger
morphology
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FA-2 2017-04-10T13:34:17Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F2-1 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Dspace modifications for use of EPIC handles
2010-01-13T15:06:26Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
Pajas, Petr
2011-06-28T09:37:08Z
Modifications to DSpace made by Petr Pajas in order to support the pidconsortium.eu PID handle system instead of the default handle.net system used by DSpace.
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
BSD 2-Clause "Simplified" or "FreeBSD" license
http://opensource.org/licenses/BSD-2-Clause
DSpace
handle
EPIC
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FB-F 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
STYX
2009-11-02T09:42:50Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F
Kučera, Ondřej
2011-06-28T09:39:55Z
The STYX system is an electronic exercise book for practising Czech morphology and syntax, consisting of more than 11,000 sentences.
http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public Licence, version 3
http://opensource.org/licenses/GPL-3.0
education
morphology
syntax
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FC-D 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
MMI_clustering
2009-11-02T09:34:32Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FC-D
Klusáček, David
2011-06-28T09:40:13Z
MMI_clustering is a set of command line tools implementing Mercer's maximum mutual information-based clustering technique.
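A minimal sketch of the quantity such clustering tries to maximize, assuming the usual formulation in which words are mapped to classes and the average mutual information between adjacent classes is estimated from bigram counts. The function name, the toy token stream and both class assignments are hypothetical and are not taken from MMI_clustering itself.

```python
import math
from collections import Counter

def average_mutual_information(tokens, word_class):
    """Average mutual information (in bits) between the classes of adjacent tokens."""
    pairs = Counter((word_class[a], word_class[b]) for a, b in zip(tokens, tokens[1:]))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (c1, c2), k in pairs.items():
        left[c1] += k    # how often class c1 occurs as the first element of a bigram
        right[c2] += k   # how often class c2 occurs as the second element
    return sum(k / n * math.log2(k * n / (left[c1] * right[c2]))
               for (c1, c2), k in pairs.items())

# Hypothetical stream in which {a, b} and {x, y} strictly alternate.
tokens = "a x b y a y b x".split()
good = {"a": 0, "b": 0, "x": 1, "y": 1}   # classes follow the alternation
bad = {"a": 0, "x": 0, "b": 1, "y": 1}    # classes cut across it

print(average_mutual_information(tokens, good) > average_mutual_information(tokens, bad))
```

A clustering procedure of this family would search over class assignments (for example by greedy merging) for one with high average mutual information; the sketch only scores two fixed assignments.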
http://hdl.handle.net/11858/00-097C-0000-0001-48FC-D
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
clustering
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FD-B 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Victor
2009-11-02T09:48:39Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FD-B
Marek, Michal
2011-06-28T09:40:25Z
Victor is a web page cleaning tool. It is aimed at removing menus, ads, footers, headers, etc. from HTML web pages, so that only the main web page content remains. Victor is based on a conditional random fields algorithm.
http://hdl.handle.net/11858/00-097C-0000-0001-48FD-B
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
html cleaning
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FE-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Morče
2009-11-02T09:36:29Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
Raab, Jan
2011-06-28T09:40:39Z
The MORČE tagger is software for morphological disambiguation (part-of-speech tagging) of Czech text. The algorithm is statistical, based on the "Averaged Perceptron" idea published by Michael Collins in 2002.
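The averaging idea can be sketched in a few lines of Python. This is a simplified illustration, not the MORČE implementation: it classifies each token independently instead of decoding whole sentences, and the feature strings, the two-tag tagset and both function names are hypothetical.

```python
from collections import defaultdict

def predict(weights, feats, tags):
    """Return the tag with the highest summed feature weight (ties go to the first tag)."""
    return max(tags, key=lambda t: sum(weights.get((f, t), 0.0) for f in feats))

def train_averaged_perceptron(data, tags, epochs=5):
    """data is a list of (feature list, gold tag) pairs; returns averaged weights."""
    w = defaultdict(float)       # current weights, keyed by (feature, tag)
    total = defaultdict(float)   # sum of the weight vectors seen after every example
    steps = 0
    for _ in range(epochs):
        for feats, gold in data:
            guess = predict(w, feats, tags)
            if guess != gold:
                # Standard perceptron update: reward the gold tag, penalize the guess.
                for f in feats:
                    w[(f, gold)] += 1.0
                    w[(f, guess)] -= 1.0
            for key, value in w.items():
                total[key] += value
            steps += 1
    # Averaging over all steps damps the effect of late, noisy updates.
    return {key: value / steps for key, value in total.items()}

# Hypothetical toy data: one word-identity feature per token.
tags = ["N", "V"]
data = [(["w=pes"], "N"), (["w=štěká"], "V"), (["w=kočka"], "N"), (["w=spí"], "V")]
weights = train_averaged_perceptron(data, tags)
print(predict(weights, ["w=štěká"], tags))  # V
```

Summing the full weight vector after every example is the naive way to average; practical implementations usually track timestamps per weight to avoid that cost.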
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
tagger
morphology
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FF-7 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Victoria
2009-11-02T09:50:15Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FF-7
Spousta, Miroslav
2011-06-28T09:40:54Z
Victoria is an on-line HTML web page annotation tool suitable for selecting texts on web pages. It can be used to mark important or interesting parts of web pages for further processing.
http://hdl.handle.net/11858/00-097C-0000-0001-48FF-7
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
web page processing
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4900-A 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
MORFO
2009-11-02T09:37:56Z
http://hdl.handle.net/11858/00-097C-0000-0001-4900-A
Kolovratník, David
2011-06-28T09:41:07Z
The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.
http://hdl.handle.net/11858/00-097C-0000-0001-4900-A
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
morphological analysis
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4901-8 2017-04-10T13:32:37Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4902-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
LAW
2009-11-02T09:27:18Z
http://hdl.handle.net/11858/00-097C-0000-0001-4902-6
Hana, Jiří
2011-06-28T09:41:34Z
Lexical Annotation Workbench (LAW) is an integrated environment for morphological annotation. It supports simple morphological annotation (assigning a lemma and tag to a word), integration and comparison of different annotations of the same text, searching for a particular word or tag, etc.
http://hdl.handle.net/11858/00-097C-0000-0001-4902-6
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
language annotation
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4904-2 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Feature-based tagger
2009-11-02T09:22:59Z
http://hdl.handle.net/11858/00-097C-0000-0001-4904-2
Hajič, Jan
2011-06-28T09:42:24Z
The Feature-based (exponential model) Tagger is a fast implementation of the Czech tagger developed at UFAL and described in the PDT 1.0 documentation (Czech Language Tagging page). To get the best possible results, the tagger requires preprocessing by a Czech morphological module with very high coverage. This module covers a superset of the Czech "FM" morphology. Both the morphological module and the tagger are supplied as binary executables, together with all necessary precompiled Czech data. Input must be in the ISO Latin 2 (iso-8859-2) encoding and follow the csts.dtd definition, and output is produced in the same way (ISO Latin 2 encoding, csts.dtd). (As is the case with many of the tools provided with PDT 1.0, both executables also accept, and then produce, a "simplified SGML", which is not real, valid SGML, but simply contains at least the tags for words, punctuation, and sentence breaks, one item per line.)
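Because the executables expect ISO Latin 2 input, text held in another encoding has to be converted first. The Python sketch below shows one way to do that and to spot characters that ISO-8859-2 cannot represent; the helper names and the sample strings are hypothetical, and wrapping the text in csts.dtd markup is not covered.

```python
def to_latin2(text: str) -> bytes:
    """Encode text as ISO-8859-2; raises UnicodeEncodeError on unsupported characters."""
    return text.encode("iso-8859-2")

def unsupported_chars(text: str) -> list:
    """Return the characters of `text` that have no ISO-8859-2 representation."""
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-2")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

sample = "Příliš žluťoučký kůň"
data = to_latin2(sample)
print(len(data) == len(sample))          # Latin 2 is a single-byte encoding
print(unsupported_chars("cena: 5 €"))    # the euro sign is not part of ISO-8859-2
```

Checking for unsupported characters before conversion makes it possible to decide explicitly how to handle them instead of failing in the middle of a file.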
http://hdl.handle.net/11858/00-097C-0000-0001-4904-2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
morphology
tagger
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4905-F 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Netgraph
2009-11-02T09:41:19Z
http://hdl.handle.net/11858/00-097C-0000-0001-4905-F
Mírovský, Jiří
Ondruška, Roman
2011-06-28T09:42:37Z
Netgraph is a graphically oriented client-server application for searching in linguistically annotated treebanks. The query language of Netgraph is simple and intuitive, yet powerful enough for treebanks with complex annotation schemes. The primary purpose of Netgraph is searching in the Prague Dependency Treebank 2.0, but it can be used for other treebanks as well.
http://hdl.handle.net/11858/00-097C-0000-0001-4905-F
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public Licence, version 3
http://opensource.org/licenses/GPL-3.0
search
treebank
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F4-E 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
ElixirFM
2009-11-02T09:19:05Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F4-E
Smrž, Otakar
Bielický, Viktor
Buckwalter, Tim
2011-06-28T09:38:24Z
ElixirFM is a high-level implementation of Functional Arabic Morphology documented at http://elixir-fm.wiki.sourceforge.net/. The core of ElixirFM is written in Haskell, while interfaces in Perl support lexicon editing and other interactions.
http://hdl.handle.net/11858/00-097C-0000-0001-48F4-E
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://opensource.org/licenses/GPL-3.0
Arabic morphology
ElixirFM
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B08B-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Multiword expressions in the Prague Dependency Treebank 2.0
2011-11-02T19:50:32Z
http://hdl.handle.net/11858/00-097C-0000-0001-B08B-3
Bejček, Eduard
Klyueva, Natalia
Straňák, Pavel
Šidák, Pavel
Šťastná, Eva
Vimmrová, Pavlína
Hajič, Jan
2011-11-02T19:50:32Z
This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0.
grant 1ET201120505 of the Academy of Sciences of the Czech Republic and grant MSM0021620838 of the Ministry of Youth, Education and Sport of The Czech Republic
http://hdl.handle.net/11858/00-097C-0000-0001-B08B-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
MWE
multiword expressions
idiom
phraseme
named entity
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CC1E-B 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Hindi Web Texts
2011-11-23T15:47:18Z
http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B
Bojar, Ondřej
Straňák, Pavel
Zeman, Daniel
2011-11-23T15:47:18Z
A Hindi corpus of texts downloaded mostly from news sites. Contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. 18M sentences, 308M tokens
FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+)
UMC004
http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B
http://hdl.handle.net/11858/00-097C-0000-0023-6260-A
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
news
web texts
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-BD17-1 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
English-Hindi Parallel Corpus
2011-11-07T16:18:29Z
http://hdl.handle.net/11858/00-097C-0000-0001-BD17-1
Bojar, Ondřej
Straňák, Pavel
Zeman, Daniel
Jain, Gaurav
Damani, Om Prakesh
2011-11-07T16:18:29Z
English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus.
FP7-ICT-2007-3-231720 (EuroMatrix Plus) 7E09003 (Czech part of EM+)
UMC002
http://hdl.handle.net/11858/00-097C-0000-0001-BD17-1
http://hdl.handle.net/11858/00-097C-0000-0023-625F-0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
English-Hindi parallel corpus
parallel corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCD-0 2014-05-13T09:21:27Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCA1-0 2022-11-25T16:00:44Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Air Traffic Control Communication
2011-12-15T13:51:07Z
http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0
Šmídl, Luboš
2011-12-15T13:51:07Z
The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours), but we plan to search for additional data next year. The audio data format is: 8 kHz, 16-bit PCM, mono.
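The stated audio format fixes the raw data rate, so the size of the recordings can be estimated directly. The Python sketch below does that arithmetic and checks whether a file matches the stated parameters. It assumes the audio is stored in WAV containers, which the description does not say, and the helper names are hypothetical.

```python
import io
import wave

def pcm_bytes(seconds, rate=8000, sample_bits=16, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(seconds * rate * (sample_bits // 8) * channels)

def has_corpus_format(wav_file):
    """True if a WAV file (path or file-like object) is 8 kHz, 16-bit, mono."""
    with wave.open(wav_file, "rb") as w:
        return (w.getframerate(), w.getsampwidth(), w.getnchannels()) == (8000, 2, 1)

# 20 hours * 3600 s * 8000 samples/s * 2 bytes per sample
print(pcm_bytes(20 * 3600))  # 1152000000

# Synthesize one second of silence in memory to exercise the format check.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000)
buf.seek(0)
print(has_corpus_format(buf))
```

At roughly 1.15 GB for 20 hours, the uncompressed audio is small enough to process on a single machine.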
Technology Agency of the Czech Republic, project No. TA01030476.
ZCU_CZ_ATC
http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0
University of West Bohemia, Department of Cybernetics
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
speech corpus
acoustic model
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCF-C 2022-04-26T13:51:47Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
czes
2011-12-15T16:46:56Z
http://hdl.handle.net/11858/00-097C-0000-0001-CCCF-C
(:unav) Unknown author
2011-12-15T16:46:56Z
First version of the very large Czech corpus Czes created with a new set of tools. It comprises 465,102,710 tokens.
Lexical Computing Ltd.
http://hdl.handle.net/11858/00-097C-0000-0001-CCCF-C
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
Czech corpus large
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCE-E2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Integrated lexicographic platform for Russian
2011-12-15T15:19:41Z
http://hdl.handle.net/11858/00-097C-0000-0001-CCCE-E
Rambousek, Adam
2011-12-15T15:19:41Z
Integrated lexicographic platform for Russian.
http://hdl.handle.net/11858/00-097C-0000-0001-CCCE-E
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
lexicography platform
russian
web dictionary
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCD2-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
IDENTICv1.0-raw
2011-12-16T08:11:47Z
http://hdl.handle.net/11858/00-097C-0000-0001-CCD2-2
Larasati, Septina Dian
2011-12-16T08:11:47Z
Raw Text
http://hdl.handle.net/11858/00-097C-0000-0001-CCD2-2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
Indonesian-English parallel corpus
parallel corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDB-02022-04-26T13:52:15Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
skTenTen
2011-12-16T09:34:39Z
http://hdl.handle.net/11858/00-097C-0000-0001-CCDB-0
(:unav) Unknown author
2011-12-16T09:34:39Z
Slovak large web corpus skTenTen, comprising 876,003,720 tokens.
Lexical Computing Ltd.
http://hdl.handle.net/11858/00-097C-0000-0001-CCDB-0
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
Slovak large corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDF-82022-04-26T13:50:51Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
enTenTen
2011-12-16T09:58:26Z
http://hdl.handle.net/11858/00-097C-0000-0001-CCDF-8
(:unav) Unknown author
2011-12-16T09:58:26Z
Very large English web corpus enTenTen, comprising 3,268,798,627 tokens.
Lexical Computing Ltd.
http://hdl.handle.net/11858/00-097C-0000-0001-CCDF-8
Masaryk University, NLP Centre
NLP Centre Web Corpus License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC
English large corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-D709-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
BushBank
2011-12-16T15:03:23Z
http://hdl.handle.net/11858/00-097C-0000-0001-D709-F
Grác, Marek
2011-12-16T15:03:23Z
Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
http://hdl.handle.net/11858/00-097C-0000-0001-D709-F
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
interannotator agreement
corpus
chunks
phrases
clauses
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BCCF-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Extended Textual Coreference and Bridging Relations in PDT 2.0
2012-02-20T13:56:58Z
http://hdl.handle.net/11858/00-097C-0000-0005-BCCF-3
Nedoluzhko, Anna
Mírovský, Jiří
2012-02-20T13:56:58Z
Annotation of extended textual coreference and bridging relations in the Prague Dependency Treebank 2.0
project LINDAT-Clarin LM2010013, grant GAČR GA405/09/0729
http://hdl.handle.net/11858/00-097C-0000-0005-BCCF-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
bridging anaphora
textual coreference
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF85-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
IDENTICv1.0
2012-03-13T14:34:36Z
http://hdl.handle.net/11858/00-097C-0000-0005-BF85-F
Larasati, Septina Dian
2012-03-13T14:34:36Z
IDENTIC is a bilingual Indonesian-English parallel corpus for research purposes. The aim of this work is to build and provide researchers with a proper Indonesian-English textual data set and to promote research on this language pair. The corpus contains texts from different sources and of different genres.
The research leading to these results has received funding from the European Commission’s 7th Framework Program under grant agreement no. 238405 (CLARA) and from the grant LC536 Centrum Komputacni Lingvistiky of the Czech Ministry of Education.
http://hdl.handle.net/11858/00-097C-0000-0005-BF85-F
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
Indonesian-English parallel corpus
parallel corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF95-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
VPS-30-En
2012-03-19T14:07:13Z
http://hdl.handle.net/11858/00-097C-0000-0005-BF95-B
Cinková, Silvie
Holub, Martin
Rambousek, Adam
Smejkalová, Lenka
2012-03-19T14:07:13Z
VPS-30-En is a small lexical resource that contains the following 30 English verbs: access, ally, arrive, breathe, claim, cool, crush, cry, deny, enlarge, enlist, forge, furnish, hail, halt, part, plough, plug, pour, say, smash, smell, steer, submit, swell, tell, throw, trouble, wake and yield. We have created and have been using VPS-30-En to explore the interannotator agreement potential of the Corpus Pattern Analysis. VPS-30-En is a small snapshot of the Pattern Dictionary of English Verbs (Hanks and Pustejovsky, 2005), which we revised (both the entries and the annotated concordances) and enhanced with additional annotations.
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013, and by the Czech Science Foundation under the projects P103/12/G084, P406/2010/0875 and P401/10/0792.
http://hdl.handle.net/11858/00-097C-0000-0005-BF95-B
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
corpus pattern analysis
clustering
lexical semantics
verbs
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-CF9C-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Parliament Meetings
2012-03-28T14:45:25Z
http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4
Pražák, Aleš
Šmídl, Luboš
2012-03-28T14:45:25Z
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently contains 88 hours of speech data, which corresponds to roughly 0.5 million tokens. The annotation process is semi-automatic: we are able to perform speech recognition on the data with high accuracy (over 90%) and then align the resulting automatic transcripts with the speech. The annotator’s task is to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
The airing date of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning of the day following the actual Parliament session. If a recording is too long to fit in the broadcasting scheme, it is divided into several parts that are aired on consecutive days.
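The filename convention above can be decoded mechanically. The following is a minimal Python sketch, not part of the corpus distribution: the function names, the example filename and the `.wav` suffix are assumptions made for illustration, and the "day before airing" rule only holds for recordings that were not split into several parts.

```python
import re
from datetime import date, datetime, timedelta

# Names of the form SOUND_YYMMDD_*; whatever follows the second underscore is
# not specified in the description, so it is accepted as-is (assumption).
_NAME = re.compile(r"^SOUND_(\d{6})_")


def airing_date(filename: str) -> date:
    """Return the airing date encoded in a SOUND_YYMMDD_* filename."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not a SOUND_YYMMDD_* filename: {filename!r}")
    # %y interprets two-digit years 00-68 as 2000-2068.
    return datetime.strptime(match.group(1), "%y%m%d").date()


def likely_session_date(filename: str) -> date:
    """Guess the session date as the day before airing.

    This is only a heuristic: multi-part recordings are aired on several
    consecutive days, so later parts are further from the session.
    """
    return airing_date(filename) - timedelta(days=1)


# Hypothetical filename, used only to show the call.
print(airing_date("SOUND_120328_001.wav"))          # 2012-03-28
print(likely_session_date("SOUND_120328_001.wav"))  # 2012-03-27
```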
ZCU_CZ_Parliament
http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
speech corpus
acoustic model
speaker identification
speaker verification
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADA-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
WMT 2011 Testing Set
2012-05-15T12:36:59Z
http://hdl.handle.net/11858/00-097C-0000-0006-AADA-9
Galuščáková, Petra
Bojar, Ondřej
2012-05-15T12:36:59Z
Test set from the WMT 2011 competition [1], manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English and is described in [2].
References:
[1] http://www.statmt.org/wmt11/evaluation-task.html
[2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
The work on this project was supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
http://hdl.handle.net/11858/00-097C-0000-0006-AADA-9
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
WMT
test data
Slovak
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADB-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Classified Errors in Cs->Sk Translation
2012-05-15T13:42:49Z
http://hdl.handle.net/11858/00-097C-0000-0006-AADB-7
Galuščáková, Petra
Bojar, Ondřej
2012-05-15T13:42:49Z
Manual classification of errors in Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. The first 50 sentences from the WMT 2010 test set [2] were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups), and MT errors were manually marked and classified. The classification was applied in a comparison of MT systems [3]. A reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://matrix.statmt.org/test_sets/list
[3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
http://hdl.handle.net/11858/00-097C-0000-0006-AADB-7
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
machine translation
errors classification
CS-SK translation
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADC-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Classified Errors in En->Sk Translation
2012-05-15T13:59:24Z
http://hdl.handle.net/11858/00-097C-0000-0006-AADC-5
Galuščáková, Petra
Bojar, Ondřej
2012-05-15T13:59:24Z
Manual classification of errors in English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from the WMT 2011 test set [2] were translated by the 3 MT systems described in [3], and MT errors were manually marked and classified. A reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://www.statmt.org/wmt11/evaluation-task.html
[3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
http://hdl.handle.net/11858/00-097C-0000-0006-AADC-5
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
machine translation
errors classification
EN-SK translation
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADD-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Ranked Translation Outputs
2012-05-15T14:45:32Z
http://hdl.handle.net/11858/00-097C-0000-0006-AADD-3
Bojar, Ondřej
Galuščáková, Petra
2012-05-15T14:45:32Z
Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked the outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from the Acquis corpus and the first 50 sentences from the WMT 2010 test set). The ranking was applied in the comparison of MT systems in [1].
References:
[1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
http://hdl.handle.net/11858/00-097C-0000-0006-AADD-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
machine translation
evaluation
manual ranking
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADF-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech-Slovak Parallel Corpus
2012-05-15T15:54:40Z
http://hdl.handle.net/11858/00-097C-0000-0006-AADF-0
Galuščáková, Petra
Garabík, Radovan
Bojar, Ondřej
2012-05-15T15:54:40Z
Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded website of the European Commission [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
http://hdl.handle.net/11858/00-097C-0000-0006-AADF-0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
parallel corpus
Czech-Slovak corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAE0-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
English-Slovak Parallel Corpus
2012-05-15T16:11:21Z
http://hdl.handle.net/11858/00-097C-0000-0006-AAE0-A
Galuščáková, Petra
Garabík, Radovan
Bojar, Ondřej
2012-05-15T16:11:21Z
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded website of the European Commission [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
http://hdl.handle.net/11858/00-097C-0000-0006-AAE0-A
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
parallel corpus
English-Slovak corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAFE-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Česílko
2012-05-22T16:48:19Z
http://hdl.handle.net/11858/00-097C-0000-0006-AAFE-A
Hajič, Jan
Kuboň, Vladislav
Homola, Petr
2012-05-22T16:48:19Z
Česílko is a tool that enables fast and efficient translation from one source language into many mutually related target languages.
http://hdl.handle.net/11858/00-097C-0000-0006-AAFE-A
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
machine translation
Czech-Slovak translation
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-B847-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
CWC2011
2012-06-21T11:53:56Z
http://hdl.handle.net/11858/00-097C-0000-0006-B847-6
Spoustová, Johanka
Spousta, Miroslav
2012-06-21T11:53:56Z
Web corpus of Czech, created in 2011. Contains newspapers+magazines, discussions, blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details.
GA405/09/0278
http://hdl.handle.net/11858/00-097C-0000-0006-B847-6
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
corpus
Czech
web
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-DB11-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.5
2012-08-09T17:00:20Z
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
Bejček, Eduard
Hajič, Jan
Panevová, Jarmila
Mírovský, Jiří
Spoustová, Johanka
Štěpánek, Jan
Straňák, Pavel
Šidák, Pavel
Vimmrová, Pavlína
Šťastná, Eva
Ševčíková, Magda
Smejkalová, Lenka
Homola, Petr
Popelka, Jan
Lopatková, Markéta
Hrabalová, Lucie
Klyueva, Natalia
Žabokrtský, Zdeněk
2012-08-09T17:00:20Z
The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see Documentation). Moreover, new information was added to the data:
Annotation of multiword expressions
Pair/group meaning
Clause segmentation
Ministry of Education of the Czech Republic projects No.:
LM2010013
LC536
MSM0021620838
Grant Agency of the Czech Republic grants No.:
P406/2010/0875
P202/10/1333
P406/10/P193
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
treebank
multiword expressions
clauses
tectogrammatics
dependency
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0007-70FD-E2022-03-14T14:21:37Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
DZ Interset
2012-10-25T12:42:49Z
http://hdl.handle.net/11858/00-097C-0000-0007-70FD-E
Zeman, Daniel
2012-10-25T12:42:49Z
DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset defines a set of features that are encoded by the various tag sets. The set of features should be as universal as possible. It does not need to encode everything that is encoded by any tag set but it should encode all information that people may want to access and/or port from one tag set to another.
New tag sets are attached by writing a driver for them. Once the driver is ready, you can easily convert tags between the new set and any other set for which you also have a driver. This reusability is an obvious advantage over writing a targeted conversion procedure each time you need to convert between a particular pair of tag sets.
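The driver idea described above can be illustrated with a small Python sketch. This is not the DZ Interset API (the tool itself is written in Perl): the two toy tag sets "A" and "B", the feature names and every function below are invented for illustration. Each driver only knows how to decode its own tags into a shared feature structure and how to encode that structure back, so converting between any two tag sets needs no pair-specific code.

```python
from typing import Callable, Dict, Optional, Tuple

Features = Dict[str, str]


# --- Driver for invented tag set A: "NN", "NNS", "VB" ---
def decode_a(tag: str) -> Features:
    table = {
        "NN": {"pos": "noun", "number": "sing"},
        "NNS": {"pos": "noun", "number": "plur"},
        "VB": {"pos": "verb"},
    }
    return dict(table[tag])


def encode_a(features: Features) -> str:
    if features["pos"] == "verb":
        return "VB"
    return "NNS" if features.get("number") == "plur" else "NN"


# --- Driver for invented tag set B: "NS", "NP", "V-" ---
def decode_b(tag: str) -> Features:
    features: Features = {"pos": {"N": "noun", "V": "verb"}[tag[0]]}
    number: Optional[str] = {"S": "sing", "P": "plur"}.get(tag[1])
    if number is not None:
        features["number"] = number
    return features


def encode_b(features: Features) -> str:
    pos = {"noun": "N", "verb": "V"}[features["pos"]]
    number = {"sing": "S", "plur": "P"}.get(features.get("number", ""), "-")
    return pos + number


# Registry of drivers: one (decode, encode) pair per tag set.
Driver = Tuple[Callable[[str], Features], Callable[[Features], str]]
DRIVERS: Dict[str, Driver] = {"A": (decode_a, encode_a), "B": (decode_b, encode_b)}


def convert(tag: str, source: str, target: str) -> str:
    """Convert a tag by decoding it to shared features and re-encoding it."""
    decode, _ = DRIVERS[source]
    _, encode = DRIVERS[target]
    return encode(decode(tag))


print(convert("NNS", "A", "B"))  # NP
print(convert("V-", "B", "A"))   # VB
```

With n tag sets this design needs n drivers rather than a conversion routine for each of the n(n-1) ordered pairs, which is the reusability argument made in the description.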
grant MSM 0021620838 of the Ministry of Education of the Czech Republic
http://hdl.handle.net/11858/00-097C-0000-0007-70FD-E
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
morphology
NLP
Perl
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-D259-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Additional German-Czech reference translations of the WMT'11 test set
2012-11-13T16:36:01Z
http://hdl.handle.net/11858/00-097C-0000-0008-D259-7
Bojar, Ondřej
Zeman, Daniel
Dušek, Ondřej
Břečková, Jana
Farkačová, Hana
Grošpic, Pavel
Kačenová, Kristýna
Knechtová, Eva
Koubová, Anna
Lukavská, Jana
Nováková, Petra
Petrdlíková, Jana
2012-11-13T16:36:01Z
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.
This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic)
http://hdl.handle.net/11858/00-097C-0000-0008-D259-7
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
reference translation
German-Czech
parallel corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-60D6-12021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
W2C – Web to Corpus – tool
2013-06-25T13:21:15Z
http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Majliš, Martin
2013-06-25T13:21:15Z
A tool used to build multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies the language, etc.
A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
web data
wikipedia
corpus creation
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-E130-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Discourse Treebank 1.0
2012-11-14T08:58:57Z
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
Poláková, Lucie
Jínová, Pavlína
Zikánová, Šárka
Hajičová, Eva
Mírovský, Jiří
Nedoluzhko, Anna
Rysová, Magdaléna
Pavlíková, Veronika
Zdeňková, Jana
Pergler, Jiří
Ocelák, Radek
2012-11-14T08:58:57Z
Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5. It represents a new manually annotated layer of language description, above the existing layers of the PDT, and it portrays linguistic phenomena from the perspective of discourse structure and coherence.
GACR P406/12/0658, GACR P406/2010/0875, GACR 405/09/0729, Ministry of Education ME10018, Ministry of Education LM2010013
PDiT 1.0
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
discourse
treebank
annotation
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2112-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 3
2012-12-12T11:24:11Z
http://hdl.handle.net/11858/00-097C-0000-000C-2112-B
Šebesta, Karel
Bedřichová, Zuzanna
Šormová, Kateřina
Štindlová, Barbora
Hrdlička, Milan
Hrdličková, Tereza
Hana, Jiří
Rosen, Alexandr
Petkevič, Vladimír
Jelínek, Tomáš
Škodová, Svatava
Poláčková, Marie
Janeš, Petr
Lundáková, Kateřina
Skoumalová, Hana
Šťastný, Klement
Sládek, Šimon
Pierscieniak, Piotr
2012-12-12T11:24:11Z
Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora)
ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
http://hdl.handle.net/11858/00-097C-0000-000C-2112-B
Charles University in Prague, ÚČJTK
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
Czech as a foreign language
Czech language acquisition corpora
non-native speakers
AKCES
second language acquisition
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67C-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Korektor
2013-02-02T00:16:12Z
http://hdl.handle.net/11858/00-097C-0000-000D-F67C-5
Richter, Michal
2013-02-02T00:16:12Z
Statistical spell- and (occasional) grammar-checker. There are three versions: a Unix command-line utility; an OS X SpellServer with a System Service that integrates with native OS X GUI applications; and a web service run by Lindat-Clarin that can be used either through a web form in a browser or by web applications through an API.
The LINDAT-CLARIN project (LM2010013), fully supported by the Ministry of Education, Sports and Youth of the Czech Republic under the programme LM of "Large Infrastructures"
http://hdl.handle.net/11858/00-097C-0000-000D-F67C-5
http://hdl.handle.net/11234/1-1469
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
BSD 2-Clause "Simplified" or "FreeBSD" license
http://opensource.org/licenses/BSD-2-Clause
grammar checker
spellchecker
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2293-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 4
2012-12-12T11:45:49Z
http://hdl.handle.net/11858/00-097C-0000-000C-2293-0
Šebesta, Karel
Bedřichová, Zuzanna
Štindlová, Barbora
Hrdlička, Milan
Hrdličková, Tereza
Hana, Jiří
Rosen, Alexandr
Petkevič, Vladimír
Jelínek, Tomáš
Škodová, Svatava
Janeš, Petr
Lundáková, Kateřina
Skoumalová, Hana
Šťastný, Klement
Sládek, Šimon
2012-12-12T11:45:49Z
Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora)
ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
http://hdl.handle.net/11858/00-097C-0000-000C-2293-0
Charles University
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
language of children
Czech language acquisition
adolescents
AKCES
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC91-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech translation of the EBUContentGenre thesaurus
2013-01-01T14:55:41Z
http://hdl.handle.net/11858/00-097C-0000-000D-EC91-2
Ircing, Pavel
2013-01-01T14:55:41Z
EBUContentGenre is a thesaurus containing a hierarchical description of various genres used in the TV broadcasting industry. This thesaurus is part of a complex metadata specification called EBUCore, intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of the European Broadcasting Union, the largest professional association of broadcasters around the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and the subsequent development of systems for automatic cataloguing (topic/genre detection).
Technology Agency of the Czech Republic, project No. TA01011264
ZCU_CZ_ ebu_ContentGenreCS_CZ
http://hdl.handle.net/11858/00-097C-0000-000D-EC91-2
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
thesaurus
metadata annotation
topic detection
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC92-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
ATCC: Pronunciation lexicon and n-gram counts for ASR module
2013-01-01T14:56:06Z
http://hdl.handle.net/11858/00-097C-0000-000D-EC92-F
Šmídl, Luboš
2013-01-01T14:56:06Z
The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0).
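As a rough illustration of how n-gram counts feed a language model, the Python sketch below computes an add-k smoothed bigram probability from count tables. The file format of the archive is not described here, so the counts are given as plain dictionaries; the example words, the counts and the function name are all invented, and add-k smoothing is just one simple choice, not necessarily what the authors used.

```python
from typing import Dict, Tuple


def bigram_probability(
    unigrams: Dict[str, int],
    bigrams: Dict[Tuple[str, str], int],
    history: str,
    word: str,
    vocab_size: int,
    k: float = 1.0,
) -> float:
    """Add-k smoothed estimate of P(word | history) from raw counts.

    The unigram count of `history` is used as the denominator, which is a
    common approximation of the number of bigrams starting with `history`.
    """
    numerator = bigrams.get((history, word), 0) + k
    denominator = unigrams.get(history, 0) + k * vocab_size
    return numerator / denominator


# Invented toy counts, for illustration only.
unigram_counts = {"cleared": 4, "to": 3, "land": 2}
bigram_counts = {("cleared", "to"): 3, ("to", "land"): 2}

p = bigram_probability(unigram_counts, bigram_counts, "cleared", "to", vocab_size=3)
print(round(p, 4))  # (3 + 1) / (4 + 3) = 0.5714
```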
Technology Agency of the Czech Republic, project No. TA01030476
ZCU_CZ_ ATCC-LM4ASR
http://hdl.handle.net/11858/00-097C-0000-000D-EC92-F
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
pronunciation lexicon
n-gram counts
language model
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC98-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
OVM – Otázky Václava Moravce
2013-01-04T13:24:56Z
http://hdl.handle.net/11858/00-097C-0000-000D-EC98-3
Šmídl, Luboš
Pražák, Aleš
2013-01-04T13:24:56Z
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to the corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
ZCU_CZ_OVM
http://hdl.handle.net/11858/00-097C-0000-000D-EC98-3
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
speech corpus
acoustic model
speaker identification
speaker verification
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F696-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
jusText
2013-02-05T12:04:53Z
http://hdl.handle.net/11858/00-097C-0000-000D-F696-9
Pomikálek, Jan
2013-02-05T12:04:53Z
jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open source software (available for download, including the source code, at http://code.google.com/p/justext/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and at its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm is able to remove most non-grammatical sentences from a web page, such as navigation, advertisements, tables, short notes and so on. It has been shown to outperform, or at least keep up with, its competitors (according to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether.
PRESEMT, Lexical Computing Ltd
http://hdl.handle.net/11858/00-097C-0000-000D-F696-9
Masaryk University, NLP Centre
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
boilerplate
web documents
text cleaning
boilerplate removal
text corpora
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67A-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Chared
2013-02-01T16:32:21Z
http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9
Pomikálek, Jan
2013-02-01T16:32:21Z
Chared is a software tool that detects the character encoding of a text document, provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages from a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than character encoding detection tools that have no language constraints. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages, consisting of 70 billion tokens altogether. Chared is open source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this software was published in POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011, pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
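The idea of comparing byte trigram vectors can be sketched as follows. This is not chared's code or API: the function names are hypothetical, cosine similarity is an assumed choice (the description only says "similarity"), and the per-encoding models are built from a single Czech sample sentence rather than from trained language models.

```python
import math
from collections import Counter


def byte_trigram_vector(data: bytes) -> Counter:
    """Count every run of three consecutive bytes in the input."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_encoding(document: bytes, models: dict) -> str:
    """Return the candidate encoding whose model is most similar to the document."""
    doc_vector = byte_trigram_vector(document)
    return max(models, key=lambda enc: cosine_similarity(doc_vector, models[enc]))


# Toy "language model": the same Czech sentence encoded in each candidate encoding.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy."
models = {
    enc: byte_trigram_vector(sample.encode(enc))
    for enc in ("utf-8", "cp1250", "iso-8859-2")
}

detected = detect_encoding("žluťoučký kůň".encode("cp1250"), models)
```

Because the language is fixed in advance, only encodings plausible for that language need to be compared, which is what makes the trigram profiles discriminative.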
PRESEMT, Lexical Computing Ltd
http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9
Masaryk University, NLP Centre
BSD 3-Clause "New" or "Revised" license
http://opensource.org/licenses/BSD-3-Clause
character encoding
character encoding detection
charset
unicode
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
onion
2013-02-01T16:34:32Z
http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7
Pomikálek, Jan
2013-02-01T16:34:32Z
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open source software (available for download, including the source code, at http://code.google.com/p/onion/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and at its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing n-grams of words of text. The author's algorithm has been shown to be more suitable for textual corpora deduplication than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it is able to detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
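A minimal sketch of deduplication by word n-gram comparison is given below. It is an illustration under stated assumptions, not onion's implementation: the function names, the n-gram length of 3, whitespace tokenization and the paragraph-level granularity are all hypothetical; only the idea of comparing word n-grams and the 50 % threshold come from the description above.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a whitespace-tokenized text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph only if fewer than `threshold` of its n-grams were seen before."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph, n)
        duplicate_ratio = len(grams & seen) / len(grams) if grams else 0.0
        if duplicate_ratio < threshold:
            kept.append(paragraph)
        # Remember the n-grams of every paragraph, kept or not.
        seen |= grams
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the sleepy cat",   # shares 5 of 7 trigrams
    "an entirely different sentence about corpus building",
]
kept = deduplicate(paragraphs)
```

Lowering the threshold makes the filter more aggressive: at 0.95 only near-identical paragraphs are dropped, while at 0.5 partially overlapping ones are dropped as well.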
PRESEMT, Lexical Computing Ltd
http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7
Masaryk University, NLP Centre
BSD 3-Clause "New" or "Revised" license
http://opensource.org/licenses/BSD-3-Clause
deduplication
corpus
text deduplication
n-gram deduplication
n-gram model
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-8DAF-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Czech-English Dependency Treebank 2.0
2013-03-28T14:16:10Z
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Hajič, Jan
Hajičová, Eva
Panevová, Jarmila
Sgall, Petr
Cinková, Silvie
Fučíková, Eva
Mikulová, Marie
Pajas, Petr
Popelka, Jan
Semecký, Jiří
Šindlerová, Jana
Štěpánek, Jan
Toman, Josef
Urešová, Zdeňka
Žabokrtský, Zdeněk
2013-03-28T14:16:10Z
Texts
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
Data
The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.
Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
semantic labeling of content words and types of coordinating structures
argument structure, including an argument structure ("valency") lexicon for both languages
ellipsis and anaphora resolution.
This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation.
Annotation of the Czech part
Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.
Annotation of the English part
The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources:
PropBank (LDC2004T14)
VerbNet
NomBank (LDC2008T23)
flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran)
For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.
Ministry of Education of the Czech Republic projects No.:
MSM0021620838
LC536
ME09008
LM2010013
7E09003+7E11051
7E11041
Czech Science Foundation, grants No.:
GAP406/10/0875
GPP406/10/P193
GA405/09/0729
Research funds of the Faculty of Mathematics and Physics, Charles University, Czech Republic, Grant Agency of the Academy of Sciences of the Czech Republic: No. 1ET101120503
Students participating in this project have been running their own student grants from the Grant Agency of the Charles University, which were connected to this project. Only ongoing projects are mentioned: 116310, 158010, 3537/2011
Also, this work was funded in part by the following projects sponsored by the European Commission:
Companions, No. 034434
EuroMatrix, No. 034291
EuroMatrixPlus, No. 231720
Faust, No. 247762
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
http://hdl.handle.net/11234/1-1664
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
CC-BY-NC-SA + LDC99T42
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pcedt2
parallel treebank
PCEDT
parallel corpus
Wall Street Journal
WSJ
Penn Treebank
dependency annotation
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Corpus of contemporary blogs
2013-02-26T13:40:06Z
http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
Grác, Marek
2013-02-26T13:40:06Z
At the NLP Centre, text is currently divided into sentences by
a tool that uses a rule-based system. To create enough training
data for machine learning, annotators manually split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus, and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
corpus
blogs
annotation
annotators
sentences
machine learning
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B2E-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
sholva-0.6
2014-01-09T11:13:28Z
http://hdl.handle.net/11858/00-097C-0000-0023-1B2E-0
Grác, Marek
Čapek, Tomáš
2014-01-09T11:13:28Z
Semantic net `sholva' contains more than 150 000 records for which there was sufficient agreement among annotators. Individual words are labeled with the following categories:
person, person / individual, event and substance.
http://hdl.handle.net/11858/00-097C-0000-0023-1B2E-0
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
semantic net
semantic tagging
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-A780-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
MorfFlex CZ
2013-05-02T14:45:11Z
http://hdl.handle.net/11858/00-097C-0000-0015-A780-9
Hajič, Jan
Hlaváčová, Jaroslava
2013-05-02T14:45:11Z
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
http://hdl.handle.net/11858/00-097C-0000-0015-A780-9
http://hdl.handle.net/11234/1-1673
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
morphological dictionary
morphology
Czech
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0019-89A0-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 2
2013-05-13T09:17:21Z
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
Šebesta, Karel
Goláňová, Hana
2013-05-13T09:17:21Z
Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It contains the same data as the corpus "Schola 2010" (see the link for search), but all proper names have been removed in order to protect the privacy of participants.
MŠMT (MSM0021620825), UK (PRVOUK P 10)
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
Charles University in Prague, ÚČJTK
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
youth language
classroom
language acquisition corpus
AKCES
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-6133-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
W2C – Web to Corpus – Corpora
2013-06-25T15:08:15Z
http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
Majliš, Martin
2013-06-25T15:08:15Z
A set of corpora for 120 languages automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
multilingual corpora
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-AAF5-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
MTMonkey
2013-08-14T10:52:07Z
http://hdl.handle.net/11858/00-097C-0000-0022-AAF5-B
Tamchyna, Aleš
Dušek, Ondřej
Rosa, Rudolf
2013-08-14T10:52:07Z
MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post-processing.
It consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol.
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 257528 (KHRESMOI). This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013). This work has been supported by the AMALACH grant (DF12P01OVV02) of the Ministry of Culture of the Czech Republic.
http://hdl.handle.net/11858/00-097C-0000-0022-AAF5-B
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Apache License 2.0
http://opensource.org/licenses/Apache-2.0
machine translation
distributed computing
web service
infrastructure
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C73C-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 1.0
2013-09-07T11:15:32Z
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
2013-09-07T11:15:32Z
The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification.
1ET101120503 (Integration of language resources for information extraction from natural texts)
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
named entity recognition
named entity corpus
Czech
NER
corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7F6-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
PML Tree Query
2013-09-09T16:04:21Z
http://hdl.handle.net/11858/00-097C-0000-0022-C7F6-3
Pajas, Petr
Štěpánek, Jan
Sedlák, Michal
2013-09-09T16:04:21Z
System for querying annotated treebanks in PML format. The querying uses its own query language with a graphical representation. It has two different implementations (SQL and Perl) and several clients (TrEd, browser-based, command line interface).
http://hdl.handle.net/11858/00-097C-0000-0022-C7F6-3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
treebank
query
search
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7FD-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
PMLTQ::Web
2013-09-10T09:59:26Z
http://hdl.handle.net/11858/00-097C-0000-0022-C7FD-6
Sedlák, Michal
2013-09-10T09:59:26Z
A simple web application built on top of the PML Tree Query service.
http://hdl.handle.net/11858/00-097C-0000-0022-C7FD-6
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Artistic License (Perl) 1.0
http://opensource.org/licenses/Artistic-Perl-1.0
Perl
PML-TQ
PML
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-10B2-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Many Czech References for 50 Sentences Selected from WMT11 Data
2013-12-10T13:41:44Z
http://hdl.handle.net/11858/00-097C-0000-0023-10B2-F
Bojar, Ondřej
Macháček, Matouš
Tamchyna, Aleš
Zeman, Daniel
2013-12-10T13:41:44Z
This dataset contains the complete, very large set of Czech translations for 50 English source sentences taken from the WMT11 test set (http://www.statmt.org/wmt11).
In total, there are 15431447 Czech sentences, i.e. about 300k reference translations per English source sentence on average, but the exact number varies greatly across sentences.
You can find more details in the included README file.
If you use this dataset, please cite the following paper which describes the technique used to construct the Czech translations:
Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel:
Scratching the Surface of Possible Translations.
Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th
International Conference, TSD 2013. Proceedings, Copyright © Springer Verlag,
Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013, DOI: 10.1007/978-3-642-40585-3_59
P406/11/1499 of the Grant Agency of the Czech Republic, FP7-ICT-2011-7-288487 (MosesCore) of the European Union and 1356213 of the Grant Agency of the Charles University
http://hdl.handle.net/11858/00-097C-0000-0023-10B2-F
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
machine translation
automatic machine translation evaluation
reference translation
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-D9BF-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Khresmoi Query Translation Test Data 1.0
2014-04-02T23:00:03Z
http://hdl.handle.net/11858/00-097C-0000-0022-D9BF-5
Pecina, Pavel
Dušek, Ondřej
Hajič, Jan
Urešová, Zdeňka
2013-10-11T07:54:49Z
This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts.
This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013).
We thank the Health on the Net Foundation for granting the license for the English general public queries, the TRIP database for granting the license for the English medical expert queries, and three anonymous translators and three medical experts for translating and revising the data.
Khresmoi-Query-MT-Test-Data-1.0
http://hdl.handle.net/11858/00-097C-0000-0022-D9BF-5
http://hdl.handle.net/11234/1-2121
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
corpus
test data
medical
health
machine translation
Czech
French
German
English
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-EE02-C2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Plain-Moses-Chimera
2013-11-09T22:44:43Z
http://hdl.handle.net/11858/00-097C-0000-0022-EE02-C
Bojar, Ondřej
Tamchyna, Aleš
2013-11-09T22:44:43Z
Statistical component of Chimera, a state-of-the-art MT system.
Project DF12P01OVV022 of the Ministry of Culture of the Czech Republic (NAKI -- Amalach).
http://hdl.handle.net/11858/00-097C-0000-0022-EE02-C
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
moses
machine translation
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FE82-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Facebook Data for Sentiment Analysis
2013-11-29T15:41:00Z
http://hdl.handle.net/11858/00-097C-0000-0022-FE82-7
Habernal, Ivan
Ptáček, Tomáš
Steinberger, Josef
2013-11-29T15:41:00Z
Corpus consisting of 10,000 Facebook posts manually annotated for sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files, one with posts (gold-posts.txt) and one with labels (gols-labels.txt), aligned line by line.
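Since the gold posts and labels sit on corresponding lines of two files, they can be paired by line position. The sketch below assumes one post and one label per line; the function name, the in-memory stand-ins for the two files and the label codes "p" and "n" are hypothetical and are not taken from the archive.

```python
import io
from collections import Counter


def read_gold(posts_file, labels_file):
    """Pair post i with label i; the two inputs are aligned line by line."""
    posts = [line.rstrip("\n") for line in posts_file]
    labels = [line.strip() for line in labels_file]
    if len(posts) != len(labels):
        raise ValueError("posts and labels must have the same number of lines")
    return list(zip(posts, labels))


# In-memory stand-ins for the posts and labels files (hypothetical contents).
posts_file = io.StringIO("post one\npost two\npost three\n")
labels_file = io.StringIO("p\nn\np\n")

pairs = read_gold(posts_file, labels_file)
label_counts = Counter(label for _, label in pairs)
```

With the real files, the same function could be called on two handles opened with open(path, encoding="utf-8"); the encoding is also an assumption.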
http://hdl.handle.net/11858/00-097C-0000-0022-FE82-7
University of West Bohemia
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
sentiment analysis
opinion mining
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FF60-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech SubLex 1.0
2013-12-02T22:10:38Z
http://hdl.handle.net/11858/00-097C-0000-0022-FF60-B
Veselovská, Kateřina
Bojar, Ondřej
2013-12-02T22:10:38Z
Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part-of-speech tags, polarity orientation and source information.
The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used the parallel corpus CzEng 1.0, containing 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator.
The work on this project has been supported by the GAUK 3537/2011 grant and by SVV project number 267 314.
http://hdl.handle.net/11858/00-097C-0000-0022-FF60-B
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
subjectivity lexicon
sentiment analysis
opinion mining
polarity clues
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119C-C2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
ORAL2006: Corpus of informal spoken Czech
2013-12-13T11:55:09Z
http://hdl.handle.net/11858/00-097C-0000-0023-119C-C
Kopřivová, Marie
Waclawičová, Martina
2013-12-13T11:55:09Z
Corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 221 recordings made in 2002–2006 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 754; the metadata include sociolinguistic information about them.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
Research plan MSM0021620823 – Czech National Corpus and Corpora of Other Languages
http://hdl.handle.net/11858/00-097C-0000-0023-119C-C
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
corpus
informal spoken language
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119D-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
ORAL2008: Balanced corpus of informal spoken Czech
2013-12-13T11:56:16Z
http://hdl.handle.net/11858/00-097C-0000-0023-119D-A
Waclawičová, Martina
Kopřivová, Marie
Křen, Michal
Válková, Lucie
2013-12-13T11:56:16Z
Balanced corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 297 recordings made in 2002–2007 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 995; the corpus is balanced in their main sociolinguistic categories (gender, age group, education, region of childhood residence).
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
MSM0021620823 – Czech National Corpus and Corpora of Other Languages
http://hdl.handle.net/11858/00-097C-0000-0023-119D-A
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
informal spoken language
balanced corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119E-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
SYN2005: balanced corpus of written Czech
2013-12-13T15:01:52Z
http://hdl.handle.net/11858/00-097C-0000-0023-119E-8
Čermák, František
Hlaváčová, Jaroslava
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Kopřivová, Marie
Křen, Michal
Novotná, Renata
Petkevič, Vladimír
Schmiedtová, Věra
Skoumalová, Hana
Spoustová, Johanka
Šulc, Michal
Velíšek, Zdeněk
2013-12-13T15:01:52Z
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2000–2004 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
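The shuffling described above (blocks of at most 100 words that respect sentence boundaries, reordered at random within a document) can be sketched as follows. This is an illustration only, not the CNC's actual procedure: the function name, the whitespace word count and the greedy way blocks are filled are assumptions.

```python
import random


def shuffle_document(sentences, max_words=100, seed=None):
    """Group whole sentences into blocks of at most max_words, then shuffle the blocks."""
    blocks, current, current_len = [], [], 0
    for sentence in sentences:
        length = len(sentence.split())
        # Close the current block if this sentence would push it past the limit.
        if current and current_len + length > max_words:
            blocks.append(current)
            current, current_len = [], 0
        # A single sentence longer than max_words ends up in a block of its own.
        current.append(sentence)
        current_len += length
    if current:
        blocks.append(current)
    random.Random(seed).shuffle(blocks)
    return blocks


sentences = ["a b c", "d e", "f g h i", "j"]
blocks = shuffle_document(sentences, max_words=5, seed=0)
```

Sentence order inside each block is preserved; only the order of the blocks within the document changes.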
MSM0021620823 – Czech National Corpus and Corpora of Other Languages
http://hdl.handle.net/11858/00-097C-0000-0023-119E-8
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
balanced corpus
written language
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119F-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
SYN2010: balanced corpus of written Czech
2013-12-13T16:55:38Z
http://hdl.handle.net/11858/00-097C-0000-0023-119F-6
Křen, Michal
Bartoň, Tomáš
Cvrček, Václav
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Novotná, Renata
Petkevič, Vladimír
Procházka, Pavel
Schmiedtová, Věra
Skoumalová, Hana
2013-12-13T16:55:38Z
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2005–2009 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
MSM0021620823 – Czech National Corpus and Corpora of Other Languages
http://hdl.handle.net/11858/00-097C-0000-0023-119F-6
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
balanced corpus
written language
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1358-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
SYN2006PUB: corpus of Czech newspapers
2013-12-18T09:00:57Z
http://hdl.handle.net/11858/00-097C-0000-0023-1358-3
Čermák, František
Hlaváčová, Jaroslava
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Kopřivová, Marie
Křen, Michal
Novotná, Renata
Petkevič, Vladimír
Schmiedtová, Věra
Skoumalová, Hana
Spoustová, Johanka
Šulc, Michal
Velíšek, Zdeněk
2013-12-18T09:00:57Z
Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
MSM0021620823 – Český národní korpus a korpusy dalších jazyků
http://hdl.handle.net/11858/00-097C-0000-0023-1358-3
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
corpus
written language
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1359-12021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
SYN2009PUB: corpus of Czech newspapers
2013-12-18T09:06:37Z
http://hdl.handle.net/11858/00-097C-0000-0023-1359-1
Křen, Michal
Bartoň, Tomáš
Hnátková, Milena
Jelínek, Tomáš
Petkevič, Vladimír
Procházka, Pavel
Skoumalová, Hana
2013-12-18T09:06:37Z
Corpus of contemporary Czech newspapers and magazines sized 700 MW. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
MSM0021620823 – Český národní korpus a korpusy dalších jazyků
http://hdl.handle.net/11858/00-097C-0000-0023-1359-1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
corpus
written language
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1AAF-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 3.0
2014-01-08T20:17:10Z
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
Bejček, Eduard
Hajičová, Eva
Hajič, Jan
Jínová, Pavlína
Kettnerová, Václava
Kolářová, Veronika
Mikulová, Marie
Mírovský, Jiří
Nedoluzhko, Anna
Panevová, Jarmila
Poláková, Lucie
Ševčíková, Magda
Štěpánek, Jan
Zikánová, Šárka
2014-01-08T20:17:10Z
PDT 3.0 is a new version of the Prague Dependency Treebank. It contains a large amount of Czech text with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic (0.8 MW) annotation; in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.
the Grant Agency of the Czech Republic: grants P406/12/0658 "Coreference, discourse relations and information structure in a contrastive perspective", P406/2010/0875 "Computational Linguistics: Explicit description of language and annotated data focused on Czech", 405/09/0729 "From the structure of a sentence to textual relationships", and GPP406/12/P175 (Selected derivational relations for automatic processing of Czech);
the Ministry of Education, Youth and Sports of the Czech Republic: the KONTAKT project ME10018 "Towards a computational analysis of text structure" and the LINDAT-Clarin project LM2010013;
the Grant Agency of Charles University in Prague: GAUK 103609 "Textual (Inter-sentential) Relations and their Representation in a Language Corpus" and GAUK 4383/2009 "Methods of coreference resolution".
PDT 3.0
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
http://hdl.handle.net/11234/1-1905
http://hdl.handle.net/11234/1-2621
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
treebank
dependency
tectogrammatics
topic-focus articulation
multiword expressions
coreference
bridging relations
discourse
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B04-C2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 1.1
2014-01-09T10:03:56Z
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
Straka, Milan
2014-01-09T10:03:56Z
Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, the tmt format is replaced with the treex format, and all formats contain a split into training, development and testing portions of the data.
SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
named entity recognition
corpus
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B22-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 2.0
2014-01-09T10:24:31Z
http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
Straka, Milan
2014-01-09T10:24:31Z
Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named entities.
SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
named entity recognition
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1D76-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Senior COMPANION Expressive Speech Corpus
2014-01-13T10:49:11Z
http://hdl.handle.net/11858/00-097C-0000-0023-1D76-9
Grůber, Martin
2014-01-13T10:49:11Z
The corpus contains Czech expressive speech recorded by a professional female speaker using a scenario-based approach. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors.
European Commission Sixth Framework Programme
Information Society Technologies Integrated Project IST-34434
http://hdl.handle.net/11858/00-097C-0000-0023-1D76-9
University of West Bohemia
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
speech corpus
expressive
text-to-speech synthesis
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3B09-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
SYN2013PUB: corpus of written Czech newspapers
2014-01-29T12:40:44Z
http://hdl.handle.net/11858/00-097C-0000-0023-3B09-4
Křen, Michal
Hnátková, Milena
Jelínek, Tomáš
Petkevič, Vladimír
Procházka, Pavel
Skoumalová, Hana
2014-01-29T12:40:44Z
Corpus of contemporary Czech newspapers and magazines sized 935 MW. It contains various titles published between 2005 and 2009. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
LM2011023 – Český národní korpus
http://wiki.korpus.cz/doku.php/en:cnk:syn2013pub
http://hdl.handle.net/11858/00-097C-0000-0023-3B09-4
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
corpus
written language
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3FBB-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 2 ver. 2
2014-02-06T12:11:46Z
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
Šebesta, Karel
Goláňová, Hana
2014-02-06T12:11:46Z
Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It contains the same data as the corpus "Schola 2010" (see the link for search), but all proper names have been removed in order to protect the privacy of participants.
UK, PRVOUK P10
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
Charles University in Prague, ÚČJTK
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
youth language
classroom
language acquisition corpus
AKCES
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4087-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Linguistic digital repository based on DSpace
2014-02-08T23:10:55Z
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
Pajas, Petr
Vandas, Karel
Mišutka, Jozef
Kamran, Amir
Jawaid, Bushra
Košarko, Ondřej
Sedlák, Michal
Josífko, Michal
Straňák, Pavel
Hajič, Jan
2014-02-08T23:10:55Z
One of the goals of the LINDAT/CLARIN Centre for Language Research Infrastructure is to provide technical background to institutions or researchers who want to share their tools and data used for research in linguistics or related research fields. The digital repository is built on a highly customised DSpace platform.
LM2010013 - FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS AND YOUTH OF THE CZECH REPUBLIC
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
http://hdl.handle.net/11234/1-1481
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
linguistics
digital data
digital repository
language repository
linguistic data
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4336-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Morphological Analyzer v1
2014-02-13T22:01:22Z
http://hdl.handle.net/11858/00-097C-0000-0023-4336-4
Hajič, Jan
2014-02-13T22:01:22Z
One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.
http://hdl.handle.net/11858/00-097C-0000-0023-4336-4
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
morphological analysis
lemmatization
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4337-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
EngVallex - English Valency Lexicon
2014-02-13T22:05:17Z
http://hdl.handle.net/11858/00-097C-0000-0023-4337-2
Cinková, Silvie
Fučíková, Eva
Šindlerová, Jana
Hajič, Jan
2014-02-13T22:05:17Z
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex also contains links to PropBank and VerbNet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
http://hdl.handle.net/11858/00-097C-0000-0023-4337-2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
Annotations
Corpora
Data
Lexicons
Monolingual
Semantics
Valency
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4338-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
PDT-Vallex: Czech Valency lexicon linked to treebanks
2014-02-13T22:05:12Z
http://hdl.handle.net/11858/00-097C-0000-0023-4338-F
Urešová, Zdeňka
Štěpánek, Jan
Hajič, Jan
Panevová, Jarmila
Mikulová, Marie
2014-02-13T22:05:12Z
The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in a more human-readable form including corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
http://hdl.handle.net/11858/00-097C-0000-0023-4338-F
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
annotation
corpora
data
lexicon
semantics
valency
lexicalConceptualResource
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CD-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
MorphoDiTa: Morphological Dictionary and Tagger
2014-02-14T13:50:36Z
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
Straka, Milan
Straková, Jana
2014-02-14T13:50:36Z
MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. In the Czech language, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the LGPL license and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
tagging
morphological analysis
morphological generation
tokenization
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CE-E2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
NameTag
2014-02-14T13:51:18Z
http://hdl.handle.net/11858/00-097C-0000-0023-43CE-E
Straka, Milan
Straková, Jana
2014-02-14T13:51:18Z
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. In the Czech language, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
http://hdl.handle.net/11858/00-097C-0000-0023-43CE-E
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
named entity recognizer
toolService
Software
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4670-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Vystadial 2013 – Czech data
2014-02-21T10:42:18Z
http://hdl.handle.net/11858/00-097C-0000-0023-4670-6
Korvas, Matěj
Plátek, Ondřej
Dušek, Ondřej
Žilka, Lukáš
Jurčíček, Filip
2014-02-21T10:42:18Z
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the Czech data part of the dataset.
This research was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221.
http://hdl.handle.net/11858/00-097C-0000-0023-4670-6
http://hdl.handle.net/11234/1-1740
Charles University, Faculty of Mathematics and Physics
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
acoustic data
speech corpus
spoken corpus
orthographic transcriptions
telephone speech
voip
dialogue system
corpus
Text
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4671-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Vystadial 2013 – English data
2014-02-21T10:45:40Z
http://hdl.handle.net/11858/00-097C-0000-0023-4671-4
Korvas, Matěj
Plátek, Ondřej
Dušek, Ondřej
Žilka, Lukáš
Jurčíček, Filip
2014-02-21T10:45:40Z
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the English data part of the dataset.
This research was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221.
http://hdl.handle.net/11858/00-097C-0000-0023-4671-4
Charles University, Faculty of Mathematics and Physics
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
acoustic data
speech corpus
spoken corpus
orthographic transcriptions
telephone speech
voip
dialogue system
corpus
Text