2024-03-29T06:07:40Z
http://lindat.mff.cuni.cz/repository/oai/request
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4872-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Prague Arabic Dependency Treebank 1.0
Hajič, Jan
Smrž, Otakar
Zemánek, Petr
Pajas, Petr
Šnaidauf, Jan
Beška, Emanuel
Kracmar, Jakub
Hassanová, Kamila
corpus
Arabic
The PADT project can be summarized as an open-ended activity of the Center for Computational Linguistics, the Institute of Formal and Applied Linguistics, and the Institute of Comparative Linguistics, Charles University in Prague, consisting in multi-level annotation of Arabic language resources in light of the theory of Functional Generative Description (Sgall et al., 1986; Hajičová and Sgall, 2003).
2009-11-02T10:34:20Z
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-4872-3
ara
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/html
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/padt
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487A-4
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Lexico-Semantic Annotation of PDT using Czech WordNet
Bejček, Eduard
Hoffmannová, Petra
Holub, Martin
Hučínová, Marie
Pecina, Pavel
Straňák, Pavel
Šidák, Pavel
Hajič, Jan
PDT
Czech WordNet
PDT
This dataset contains annotation of PDT using Czech WordNet ontology: http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation.
2011-01-23
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
ces
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4916-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
CzEng 0.7
Bojar, Ondřej
Žabokrtský, Zdeněk
Češka, Pavel
Beňa, Peter
Janíček, Miroslav
parallel corpus
CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to support Czech-English and English-Czech machine translation research with the necessary data. CzEng 0.7 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction, all of them converted into a uniform XML-based file format and provided with automatic sentence alignment.
2009-11-02
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-4916-9
ces
eng
http://hdl.handle.net/11234/1-1458
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/czeng/czeng07/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4908-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
VALLEX 2.5
Lopatková, Markéta
Žabokrtský, Zdeněk
Kettnerová, Václava
valency
Czech
The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.5 has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague.
VALLEX 2.5 provides information on the valency structure (combinatorial potential) of verbs in their particular senses - there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses").
2009-11-02
lexicalConceptualResource
http://hdl.handle.net/11858/00-097C-0000-0001-4908-9
ces
http://hdl.handle.net/11234/1-2307
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/vallex/2.5/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4880-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Czech WordNet 1.9 PDT
Pala, Karel
Čapek, Tomáš
Zajíčková, Barbora
Bartůšková, Dita
Kulková, Kateřina
Hoffmannová, Petra
Bejček, Eduard
Straňák, Pavel
Hajič, Jan
ontology
wordnet
Czech WordNet
A slightly modified version of the Czech Wordnet. This is the version used to annotate "The Lexico-Semantic Annotation of PDT using Czech WordNet": http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic.
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. This version of WordNet was used to annotate word senses in PDT: http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
A more recent version of Czech WordNet is distributed by ELRA: http://catalog.elra.info/product_info.php?products_id=1089
2011-01-24
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487E-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
CoNLL 2009 Shared Task Czech Trial Set
Hajič, Jan
Straňák, Pavel
Štěpánek, Jan
conll-st
Czech trial (example) data for CoNLL 2009 Shared Task. The data are generated from PDT 2.0. LDC2009E32B
2009-01-05
corpus
http://www.aclweb.org/anthology/W09-1201
http://hdl.handle.net/11858/00-097C-0000-0001-487E-B
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4909-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
UMC 0.1: Czech-Russian-English Multilingual Corpus
Klyueva, Natalia
Bojar, Ondřej
multi-language corpus
UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian, and English with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.
All the texts were downloaded from a single source, The Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a huge collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.
2008-10-02
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-4909-7
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/umc/cer
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B098-5
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.0 (PDT 2.0)
Hajič, Jan
Panevová, Jarmila
Hajičová, Eva
Sgall, Petr
Pajas, Petr
Štěpánek, Jan
Havelka, Jiří
Mikulová, Marie
Žabokrtský, Zdeněk
Ševčíková-Razímová, Magda
Urešová, Zdeňka
corpus
Czech
treebank
PDT
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (two million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
2006-07-21
corpus
LDC2006T01
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
ces
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
ACA
application/zip
application/pdf
application/pdf
application/pdf
application/pdf
application/pdf
application/pdf
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 8
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pdt2.0/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B43E-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.0 - sample data
Hajič, Jan
Panevová, Jarmila
Sgall, Petr
Pajas, Petr
Štěpánek, Jan
Havelka, Jiří
Mikulová, Marie
Žabokrtský, Zdeněk
Ševčíková-Razímová, Magda
treebank
dependency
PDT
A small subset of PDT 2.0 made available under a permissive license.
Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
2006-06-21
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-B43E-6
ces
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch03.html#a-data-sample
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4914-D
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank of Spoken Language (PDTSL) 0.5
Hajič, Jan
Pajas, Petr
Mareček, David
Mikulová, Marie
Urešová, Zdeňka
Podveský, Petr
corpus
spoken language
The first edition of a speech corpus with a speech reconstruction layer (edited transcript).
The project of speech reconstruction of Czech and English was started at UFAL together with the PIRE project in 2005, and has gradually grown from ideas to a (first) annotation specification, annotation software, and actual annotation. It is part of the Prague Dependency Treebank family of annotated corpus resources and tools, to which it adds the spoken language layer(s).
2009-11-02T10:40:55Z
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-4914-D
ces
eng
PDTSL
https://lindat.mff.cuni.cz/repository/xmlui/page/licence-pdtsl
ACA
application/zip
application/zip
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pdtsl
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-C6D1-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
CoNLL 2009 Shared Task - Czech Data
Hajič, Jan
Straňák, Pavel
Štěpánek, Jan
conll-st
treebank
Czech data - both train and test+eval sets, as well as the valency dictionary - for the CoNLL 2009 Shared Task. Documentation is included. The data are generated from PDT 2.0. LDC catalog number: LDC2009E34B
2009-01-19
corpus
LDC2009E34B, LDC2009E35B
http://www.aclweb.org/anthology/W09-1201
http://hdl.handle.net/11858/00-097C-0000-0001-C6D1-9
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F3-0
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
XSH
Pajas, Petr
XML processing
command-line
XSH is a powerful command-line tool for querying, processing and editing XML documents. It features a shell-like interface with auto-completion for comfortable interactive work, but can also be used for off-line (batch) processing of XML data.
2009-11-02T09:51:39Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48F3-0
Artistic License (Perl) 1.0
http://opensource.org/licenses/Artistic-Perl-1.0
text/plain; charset=utf-8
downloadable_files_count: 0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://xsh.sourceforge.net
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F7-8
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
TrEd
Pajas, Petr
annotation
tree
editor
XML
PML
Tree Editor
TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Among other projects, it was used as the main annotation tool for syntactical and tectogrammatical annotations in The Prague Dependency Treebank, as well as for decision-tree based morphological annotation of The Prague Arabic Dependency Treebank.
2009-10-13T13:11:11Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48F7-8
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/octet-stream
application/octet-stream
application/octet-stream
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/tred/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F8-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
MEd
Pajas, Petr
Mareček, David
annotation tool
MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that can be interconnected by links. MEd can also be used for other purposes, such as word-to-word alignment of parallel corpora.
2009-11-02T09:33:08Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48F8-6
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
image/png
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F9-4
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
HMM tagger
Krbec, Pavel
tagger
morphology
The HMM-based Tagger is software for morphological disambiguation (tagging) of Czech texts. The algorithm is statistical, based on Hidden Markov Models.
2009-11-02T09:25:18Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48F9-4
ces
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Tagging/MM_tagger/index.html
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FA-2
2017-04-10T13:34:17Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F2-1
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Dspace modifications for use of EPIC handles
Pajas, Petr
DSpace
handle
EPIC
Modifications to DSpace made by Petr Pajas in order to support the pidconsortium.eu PID handle system instead of the default handle.net system used by DSpace.
2010-01-13T15:06:26Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
BSD 2-Clause "Simplified" or "FreeBSD" license
http://opensource.org/licenses/BSD-2-Clause
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://svn.ms.mff.cuni.cz/redmine/projects/dspace-modifications
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FB-F
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
STYX
Kučera, Ondřej
education
morphology
syntax
The STYX system is an electronic exercise book for practising Czech morphology and syntax, consisting of more than 11,000 sentences.
2009-11-02T09:42:50Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F
ces
GNU General Public Licence, version 3
http://opensource.org/licenses/GPL-3.0
PUB
application/pdf
application/pdf
application/octet-stream
application/x-bzip2
application/x-bzip2
application/zip
application/x-bzip2
text/plain; charset=utf-8
downloadable_files_count: 7
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/styx/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FC-D
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
MMI_clustering
Klusáček, David
clustering
MMI_clustering is a set of command line tools implementing Mercer's maximum mutual information-based clustering technique.
2009-11-02T09:34:32Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48FC-D
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/octet-stream
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/tools/mmic
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FD-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Victor
Marek, Michal
html cleaning
Victor is a web page cleaning tool. It is aimed at removing menus, ads, footers, headers, etc. from HTML web pages, so that only the main web page content remains. Victor is based on a conditional random fields algorithm.
2009-11-02T09:48:39Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48FD-B
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/x-bzip2
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/victor/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FE-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Morče
Raab, Jan
tagger
morphology
The MORČE tagger is software for morphological disambiguation (part-of-speech tagging) of Czech text. The algorithm is statistical, based on the so-called "Averaged Perceptron" idea published by Michael Collins in 2002.
2009-11-02T09:36:29Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
ces
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/x-gzip
application/x-gzip
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FF-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Victoria
Spousta, Miroslav
web page processing
Victoria is an on-line HTML web page annotation tool suitable for selecting texts on web pages. It can be used to mark important or interesting parts of web pages for further processing.
2009-11-02T09:50:15Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48FF-7
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/x-bzip2
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/victor/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4900-A
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
MORFO
Kolovratník, David
morphological analysis
The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.
2009-11-02T09:37:56Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-4900-A
ces
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
ACA
application/x-gzip
application/pdf
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/morfo
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4901-8
2017-04-10T13:32:37Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4902-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
LAW
Hana, Jiří
language annotation
Lexical Annotation Workbench (LAW) is an integrated environment for morphological annotation. It supports simple morphological annotation (assigning a lemma and tag to a word), integration and comparison of different annotations of the same text, searching for a particular word, tag, etc.
2009-11-02T09:27:18Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-4902-6
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
text/html
application/pdf
application/zip
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://purl.org/net/jh/law
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4904-2
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Feature-based tagger
Hajič, Jan
morphology
tagger
The Feature-based (exponential model) Tagger is a fast implementation of the Czech tagger developed at UFAL and described in the PDT 1.0 documentation (Czech Language Tagging page). In order to get the best possible results, the tagger requires preprocessing by a Czech morphological module with a very high coverage. This module covers a superset of the Czech "FM" morphology. Both the morphological module and the tagger are supplied as binary executables, together with all necessary precompiled Czech data. Input must be in the ISO Latin 2 (iso-8859-2) code and follow the csts.dtd definition, and output is produced in the same way (ISO Latin 2 code, csts.dtd). (As is the case with many of the tools provided with PDT 1.0, both executables also accept - and then produce - a "simplified SGML", which is not a real, valid SGML, but simply contains at least the tags for words, punctuation, and sentence breaks, one item per line.)
2009-11-02T09:22:59Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-4904-2
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
ACA
application/x-gzip
application/x-gzip
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pdt2.0/doc/tools/machine-annotation/index.html#a-ma-tagging
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4905-F
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Netgraph
Mírovský, Jiří
Ondruška, Roman
search
treebank
Netgraph is a graphically oriented client-server application for searching in linguistically annotated treebanks. The query language of Netgraph is simple and intuitive, yet powerful enough for treebanks with complex annotation schemes. The primary purpose of Netgraph is searching in the Prague Dependency Treebank 2.0; nevertheless, it can be used for other treebanks as well.
2009-11-02T09:41:19Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-4905-F
GNU General Public Licence, version 3
http://opensource.org/licenses/GPL-3.0
PUB
application/octet-stream
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://quest.ms.mff.cuni.cz/netgraph/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F4-E
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
ElixirFM
Smrž, Otakar
Bielický, Viktor
Buckwalter, Tim
Arabic morphology
ElixirFM
ElixirFM is a high-level implementation of Functional Arabic
Morphology documented at http://elixir-fm.wiki.sourceforge.net/. The
core of ElixirFM is written in Haskell, while interfaces in Perl
support lexicon editing and other interactions.
2009-11-02T09:19:05Z
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-48F4-E
ara
http://opensource.org/licenses/GPL-3.0
text/plain; charset=utf-8
downloadable_files_count: 0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://github.com/otakar-smrz/elixir-fm
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B08B-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Multiword expressions in the Prague Dependency Treebank 2.0
Bejček, Eduard
Klyueva, Natalia
Straňák, Pavel
Šidák, Pavel
Šťastná, Eva
Vimmrová, Pavlína
Hajič, Jan
MWE
multiword expressions
idiom
phraseme
named entity
This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0.
2010
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-B08B-3
ces
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CC1E-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hindi Web Texts
Bojar, Ondřej
Straňák, Pavel
Zeman, Daniel
news
web texts
A Hindi corpus of texts downloaded mostly from news sites. Contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. 18M sentences, 308M tokens
2011-11-23
corpus
UMC004
http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B
hin
info:eu-repo/grantAgreement/EC/FP7/231720
http://hdl.handle.net/11858/00-097C-0000-0023-6260-A
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-BD17-1
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
English-Hindi Parallel Corpus
Bojar, Ondřej
Straňák, Pavel
Zeman, Daniel
Jain, Gaurav
Damani, Om Prakesh
English-Hindi parallel corpus
parallel corpus
English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus.
2010-05-11
corpus
UMC002
http://hdl.handle.net/11858/00-097C-0000-0001-BD17-1
hin
eng
info:eu-repo/grantAgreement/EC/FP7/231720
http://hdl.handle.net/11858/00-097C-0000-0023-625F-0
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCD-0
2014-05-13T09:21:27Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCA1-0
2022-11-25T16:00:44Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Air Traffic Control Communication
Šmídl, Luboš
speech corpus
acoustic model
The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours), but we plan to search for additional data next year. The audio data format is 8 kHz, 16-bit PCM, mono.
2011-12-15
corpus
ZCU_CZ_ATC
http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0
eng
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
application/x-rar-compressed
text/plain; charset=utf-8
downloadable_files_count: 1
University of West Bohemia, Department of Cybernetics
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCF-C
2022-04-26T13:51:47Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
czes
(:unav) Unknown author
Czech corpus large
First version of the very large Czech corpus Czes created with a new set of tools. It comprises 465,102,710 tokens.
2011-12-15
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-CCCF-C
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCE-E2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Integrated lexicographic platform for Russian
Rambousek, Adam
lexicography platform
Russian
web dictionary
Integrated lexicographic platform for Russian.
2011-12-15
toolService
http://hdl.handle.net/11858/00-097C-0000-0001-CCCE-E
rus
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/octet-stream
application/octet-stream
text/xml
text/plain
text/plain; charset=utf-8
downloadable_files_count: 4
Masaryk University, NLP Centre
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCD2-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
IDENTICv1.0-raw
Larasati, Septina Dian
Indonesian-English parallel corpus
parallel corpus
Raw Text
2011-12-16
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-CCD2-2
ind
eng
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDB-02022-04-26T13:52:15Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
skTenTen
(:unav) Unknown author
Slovak large corpus
Slovak large web corpus skTenTen, comprising 876,003,720 tokens.
2011-12-16
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-CCDB-0
slk
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/x-xz
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDF-82022-04-26T13:50:51Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
enTenTen
(:unav) Unknown author
English large corpus
Very large English web corpus enTenTen, comprising 3,268,798,627 tokens.
2011-12-16
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-CCDF-8
eng
NLP Centre Web Corpus License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC
ACA
text/plain; charset=utf-8
application/x-gzip
downloadable_files_count: 1
Masaryk University, NLP Centre
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-D709-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
BushBank
Grác, Marek
interannotator agreement
corpus
chunks
phrases
clauses
Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
2011-12-16
corpus
http://hdl.handle.net/11858/00-097C-0000-0001-D709-F
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BCCF-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Extended Textual Coreference and Bridging Relations in PDT 2.0
Nedoluzhko, Anna
Mírovský, Jiří
bridging anaphora
textual coreference
PDT
Annotation of extended textual coreference and bridging relations in the Prague Dependency Treebank 2.0
2012-02-20
corpus
http://hdl.handle.net/11858/00-097C-0000-0005-BCCF-3
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
text/html
application/zip
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF85-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
IDENTICv1.0
Larasati, Septina Dian
Indonesian-English parallel corpus
parallel corpus
IDENTIC is an Indonesian-English parallel corpus for research purposes. The aim of this work is to build and provide researchers with a proper Indonesian-English textual data set and also to promote research on this language pair. The corpus contains texts from different sources and of different genres.
2012-03-13
corpus
http://hdl.handle.net/11858/00-097C-0000-0005-BF85-F
ind
eng
info:eu-repo/grantAgreement/EC/FP7/238405
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF95-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
VPS-30-En
Cinková, Silvie
Holub, Martin
Rambousek, Adam
Smejkalová, Lenka
corpus pattern analysis
clustering
lexical semantics
verbs
VPS-30-En is a small lexical resource that contains the following 30 English verbs: access, ally, arrive, breathe, claim, cool, crush, cry, deny, enlarge, enlist, forge, furnish, hail, halt, part, plough, plug, pour, say, smash, smell, steer, submit, swell, tell, throw, trouble, wake and yield. We have created and have been using VPS-30-En to explore the interannotator agreement potential of the Corpus Pattern Analysis. VPS-30-En is a small snapshot of the Pattern Dictionary of English Verbs (Hanks and Pustejovsky, 2005), which we revised (both the entries and the annotated concordances) and enhanced with additional annotations.
2012-03-19
lexicalConceptualResource
http://hdl.handle.net/11858/00-097C-0000-0005-BF95-B
eng
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/spr/pdev30verbs.html
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-CF9C-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Parliament Meetings
Pražák, Aleš
Šmídl, Luboš
speech corpus
acoustic model
speaker identification
speaker verification
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently consists of 88 hours of speech data, which corresponds roughly to 0.5 million tokens. The annotation process is semi-automatic, as we are able to perform speech recognition on the data with high accuracy (over 90%) and consequently align the resulting automatic transcripts with the speech. The annotator’s task is then to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
The date of airing of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If a recording is too long to fit in the broadcasting schedule, it is divided into several parts that are aired on consecutive days.
2012-03-28
corpus
ZCU_CZ_Parliament
http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 37
University of West Bohemia, Department of Cybernetics
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADA-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
WMT 2011 Testing Set
Galuščáková, Petra
Bojar, Ondřej
WMT
test data
Slovak
Test set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English. It is described in [2].
References:
[1] http://www.statmt.org/wmt11/evaluation-task.html
[2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
2012-05-15
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-AADA-9
slk
ces
eng
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADB-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Classified Errors in Cs->Sk Translation
Galuščáková, Petra
Bojar, Ondřej
machine translation
errors classification
CS-SK translation
Manual classification of errors in Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. The first 50 sentences from the WMT 2010 test set [2] were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups), and MT errors were manually marked and classified. The classification was applied in the comparison of MT systems in [3]. A reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://matrix.statmt.org/test_sets/list
[3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011.
2012-05-15
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-AADB-7
slk
ces
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
text/plain
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADC-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Classified Errors in En->Sk Translation
Galuščáková, Petra
Bojar, Ondřej
machine translation
errors classification
EN-SK translation
Manual classification of errors in English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from the WMT 2011 test set [2] were translated by the 3 MT systems described in [3], and MT errors were manually marked and classified. A reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://www.statmt.org/wmt11/evaluation-task.html
[3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
2012-05-15
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-AADC-5
slk
eng
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
text/plain
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADD-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Ranked Translation Outputs
Bojar, Ondřej
Galuščáková, Petra
machine translation
evaluation
manual ranking
Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked the outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from the Acquis corpus and the first 50 sentences from the WMT 2010 test set). The ranking was applied in the comparison of MT systems in [1].
References:
[1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011.
2012-05-15
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-AADD-3
slk
ces
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADF-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech-Slovak Parallel Corpus
Galuščáková, Petra
Garabík, Radovan
Bojar, Ondřej
parallel corpus
Czech-Slovak corpus
Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4]: EMEA, EUConst, KDE4 and PHP) and a downloaded copy of the European Commission website [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
2012-05-15
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-AADF-0
slk
ces
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAE0-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
English-Slovak Parallel Corpus
Galuščáková, Petra
Garabík, Radovan
Bojar, Ondřej
parallel corpus
English-Slovak corpus
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4]: EMEA, EUConst, KDE4 and PHP) and a downloaded copy of the European Commission website [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
2012-05-15
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-AAE0-A
slk
eng
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAFE-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Česílko
Hajič, Jan
Kuboň, Vladislav
Homola, Petr
machine translation
Czech-Slovak translation
Česílko is a tool for fast and efficient translation from one source language into many mutually related target languages.
2012-05-22
toolService
http://hdl.handle.net/11858/00-097C-0000-0006-AAFE-A
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
text/plain; charset=utf-8
application/x-gzip
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://quest.ms.mff.cuni.cz/cesilko/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-B847-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
CWC2011
Spoustová, Johanka
Spousta, Miroslav
corpus
Czech
web
Web corpus of Czech, created in 2011. Contains newspapers+magazines, discussions, blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details.
2012-06-21
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-B847-6
ces
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
application/x-bzip2
application/x-bzip2
application/x-bzip2
application/x-bzip2
application/x-bzip2
application/x-bzip2
text/plain; charset=utf-8
downloadable_files_count: 6
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-DB11-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.5
Bejček, Eduard
Hajič, Jan
Panevová, Jarmila
Mírovský, Jiří
Spoustová, Johanka
Štěpánek, Jan
Straňák, Pavel
Šidák, Pavel
Vimmrová, Pavlína
Šťastná, Eva
Ševčíková, Magda
Smejkalová, Lenka
Homola, Petr
Popelka, Jan
Lopatková, Markéta
Hrabalová, Lucie
Klyueva, Natalia
Žabokrtský, Zdeněk
treebank
multiword expressions
clauses
tectogrammatics
dependency
PDT
The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see Documentation). Moreover, new information was added to the data:
Annotation of multiword expressions
Pair/group meaning
Clause segmentation
2011-12-06
corpus
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
ces
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pdt2.5
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0007-70FD-E2022-03-14T14:21:37Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
DZ Interset
Zeman, Daniel
morphology
NLP
Perl
DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset defines a set of features that are encoded by the various tag sets. The set of features should be as universal as possible. It does not need to encode everything that is encoded by any tag set but it should encode all information that people may want to access and/or port from one tag set to another.
New tag sets are attached by writing a driver for them. Once the driver is ready, you can easily convert tags between the new set and any other set for which you also have a driver. This reusability is an obvious advantage over writing a targeted conversion procedure each time you need to convert between a particular pair of tag sets.
2006-06
toolService
http://hdl.handle.net/11858/00-097C-0000-0007-70FD-E
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
https://wiki.ufal.ms.mff.cuni.cz/user:zeman:interset
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-D259-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Additional German-Czech reference translations of the WMT'11 test set
Bojar, Ondřej
Zeman, Daniel
Dušek, Ondřej
Břečková, Jana
Farkačová, Hana
Grošpic, Pavel
Kačenová, Kristýna
Knechtová, Eva
Koubová, Anna
Lukavská, Jana
Nováková, Petra
Petrdlíková, Jana
reference translation
German-Czech
parallel corpus
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.
2012-11-13
corpus
http://hdl.handle.net/11858/00-097C-0000-0008-D259-7
deu
ces
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-60D6-12021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
W2C – Web to Corpus – tool
Majliš, Martin
web data
wikipedia
corpus creation
A tool used to build multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies the language, etc.
A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
2011-12-20
toolService
http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
application/x-gzip
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-E130-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Discourse Treebank 1.0
Poláková, Lucie
Jínová, Pavlína
Zikánová, Šárka
Hajičová, Eva
Mírovský, Jiří
Nedoluzhko, Anna
Rysová, Magdaléna
Pavlíková, Veronika
Zdeňková, Jana
Pergler, Jiří
Ocelák, Radek
discourse
treebank
annotation
Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5. It represents a new manually annotated layer of language description, above the existing layers of the PDT, and it portrays linguistic phenomena from the perspective of discourse structure and coherence.
2012-11-14
corpus
PDiT 1.0
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
ces
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
text/plain; charset=utf-8
application/zip
text/html
application/pdf
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/discourse/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2112-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 3
Šebesta, Karel
Bedřichová, Zuzanna
Šormová, Kateřina
Štindlová, Barbora
Hrdlička, Milan
Hrdličková, Tereza
Hana, Jiří
Rosen, Alexandr
Petkevič, Vladimír
Jelínek, Tomáš
Škodová, Svatava
Poláčková, Marie
Janeš, Petr
Lundáková, Kateřina
Skoumalová, Hana
Šťastný, Klement
Sládek, Šimon
Pierscieniak, Piotr
Czech as a foreign language
Czech language acquisition corpora
non-native speakers
AKCES
second language acquisition
Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora).
2012-12-12
corpus
http://hdl.handle.net/11858/00-097C-0000-000C-2112-B
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/zip
application/pdf
application/pdf
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 4
Charles University in Prague, ÚČJTK
http://utkl.ff.cuni.cz/learncorp/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67C-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Korektor
Richter, Michal
grammar checker
spellchecker
Statistical spell checker and (occasional) grammar checker. There are three versions: a Unix command-line utility; an OS X SpellServer with a System Service that integrates with native OS X GUI applications; and a web service run by Lindat-Clarin that can be used either through a web form in a browser or by web applications using an API.
2013-02-02
toolService
http://hdl.handle.net/11858/00-097C-0000-000D-F67C-5
ces
http://hdl.handle.net/11234/1-1469
BSD 2-Clause "Simplified" or "FreeBSD" license
http://opensource.org/licenses/BSD-2-Clause
PUB
application/zip
application/zip
application/x-bzip2
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
https://redmine.ms.mff.cuni.cz/projects/korektor
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2293-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 4
Šebesta, Karel
Bedřichová, Zuzanna
Štindlová, Barbora
Hrdlička, Milan
Hrdličková, Tereza
Hana, Jiří
Rosen, Alexandr
Petkevič, Vladimír
Jelínek, Tomáš
Škodová, Svatava
Janeš, Petr
Lundáková, Kateřina
Skoumalová, Hana
Šťastný, Klement
Sládek, Šimon
language of children
Czech language acquisition
adolescents
AKCES
Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora).
2012-12-12
corpus
http://hdl.handle.net/11858/00-097C-0000-000C-2293-0
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/zip
application/pdf
application/pdf
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 4
Charles University
http://utkl.ff.cuni.cz/learncorp/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC91-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech translation of the EBUContentGenre thesaurus
Ircing, Pavel
thesaurus
metadata annotation
topic detection
EBUContentGenre is a thesaurus containing a hierarchical description of various genres used in the TV broadcasting industry. This thesaurus is part of a complex metadata specification called EBUCore, intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of the European Broadcasting Union, the largest professional association of broadcasters in the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and the subsequent development of systems for automatic cataloguing (topic/genre detection).
2013-01-01
lexicalConceptualResource
ZCU_CZ_ ebu_ContentGenreCS_CZ
http://hdl.handle.net/11858/00-097C-0000-000D-EC91-2
ces
eng
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/rdf+xml; charset=utf-8
text/plain; charset=utf-8
downloadable_files_count: 1
University of West Bohemia, Department of Cybernetics
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC92-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
ATCC: Pronunciation lexicon and n-gram counts for ASR module
Šmídl, Luboš
pronunciation lexicon
n-gram counts
language model
The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0).
2013-01-01
lexicalConceptualResource
ZCU_CZ_ ATCC-LM4ASR
http://hdl.handle.net/11858/00-097C-0000-000D-EC92-F
eng
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
text/plain
application/octet-stream
application/octet-stream
application/octet-stream
application/octet-stream
application/octet-stream
application/octet-stream
text/plain; charset=utf-8
downloadable_files_count: 7
University of West Bohemia, Department of Cybernetics
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC98-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
OVM – Otázky Václava Moravce
Šmídl, Luboš
Pražák, Aleš
speech corpus
acoustic model
speaker identification
speaker verification
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to the corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
2013-01-04
corpus
ZCU_CZ_OVM
http://hdl.handle.net/11858/00-097C-0000-000D-EC98-3
ces
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/xml
audio/x-wav
text/plain; charset=utf-8
downloadable_files_count: 32
University of West Bohemia, Department of Cybernetics
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F696-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
jusText
Pomikálek, Jan
boilerplate
web documents
text cleaning
boilerplate removal
text corpora
jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open source software (available for download, including the source code, at http://code.google.com/p/justext/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm is able to remove most non-grammatical sentences from a web page, such as navigation, advertisements, tables, short notes and so on. It has been shown to outperform or at least keep up with its competitors (according to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether.
2011
toolService
http://hdl.handle.net/11858/00-097C-0000-000D-F696-9
eng
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
http://code.google.com/p/justext/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67A-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Chared
Pomikálek, Jan
character encoding
character encoding detection
charset
unicode
Chared is a software tool which can detect the character encoding of a text document provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than other character encoding detection tools that do not constrain the language. This is an important advantage allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages consisting of 70 billion tokens altogether. Chared is open source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this software was published in: Pomikálek, Jan and Vít Suchomel. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011, pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
2011
toolService
http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9
eng
BSD 3-Clause "New" or "Revised" license
http://opensource.org/licenses/BSD-3-Clause
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
http://code.google.com/p/chared/
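The Chared description above says that detection rests on the similarity of byte trigram vectors. The following is a minimal sketch of that general idea only, not Chared's actual code or API: the function names, the toy Czech sample sentence, the three candidate encodings, and the choice of cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def byte_trigrams(data: bytes) -> Counter:
    """Count every overlapping 3-byte sequence in the input."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_encoding(document: bytes, models: dict) -> str:
    """Return the encoding whose model vector is most similar to the document."""
    doc_vector = byte_trigrams(document)
    return max(models, key=lambda enc: cosine(doc_vector, models[enc]))


# Hypothetical toy "language model": one Czech sentence encoded three ways.
# A real model would be trained on many pages in the known language.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy. Žluté růže kvetou v září."
models = {
    enc: byte_trigrams(sample.encode(enc))
    for enc in ("utf-8", "iso-8859-2", "cp1250")
}

print(detect_encoding("žluťoučký kůň".encode("cp1250"), models))
```

Because the language is fixed in advance, the candidate set shrinks to the encodings plausible for that language, which is what lets closely related encodings such as ISO-8859-2 and Windows-1250 be told apart by a handful of differing bytes.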
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
onion
Pomikálek, Jan
deduplication
corpus
text deduplication
n-gram deduplication
n-gram model
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open source software (available for download, including the source code, at http://code.google.com/p/onion/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing word n-grams of the text. The author's algorithm has been shown to be more suitable for textual corpora deduplication than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it is able to detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
2011
toolService
http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7
eng
BSD 3-Clause "New" or "Revised" license
http://opensource.org/licenses/BSD-3-Clause
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
http://code.google.com/p/onion/
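The onion description above states that deduplication is based on comparing n-grams of words and that partially similar (50 %) duplicates can be detected. The sketch below illustrates that general technique under stated assumptions; it is not onion's implementation. The function names, the n-gram length of 3, and the rule of dropping a paragraph when more than half of its n-grams were already seen are all hypothetical choices for the example.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) occurring in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless too many of its n-grams were seen earlier.

    Assumption for illustration: only n-grams of kept paragraphs are
    remembered, and paragraphs shorter than n words are always kept.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph, n)
        duplicated = len(grams & seen) / len(grams) if grams else 0.0
        if duplicated > threshold:
            continue  # mostly made of already-seen n-grams: drop it
        kept.append(paragraph)
        seen |= grams
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about corpora",
]
for paragraph in deduplicate(paragraphs):
    print(paragraph)
```

In this toy run the second paragraph shares six of its seven trigrams with the first, so it is dropped even though it is not an exact copy.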
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-8DAF-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Czech-English Dependency Treebank 2.0
Hajič, Jan
Hajičová, Eva
Panevová, Jarmila
Sgall, Petr
Cinková, Silvie
Fučíková, Eva
Mikulová, Marie
Pajas, Petr
Popelka, Jan
Semecký, Jiří
Šindlerová, Jana
Štěpánek, Jan
Toman, Josef
Urešová, Zdeňka
Žabokrtský, Zdeněk
parallel treebank
PCEDT
parallel corpus
Wall Street Journal
WSJ
Penn Treebank
dependency annotation
PDT
Texts
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
Data
The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.
Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
semantic labeling of content words and types of coordinating structures
argument structure, including an argument structure ("valency") lexicon for both languages
ellipsis and anaphora resolution.
This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation.
Annotation of the Czech part
Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.
Annotation of the English part
The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources:
PropBank (LDC2004T14)
VerbNet
NomBank (LDC2008T23)
flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran)
For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.
2012
corpus
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
ces
eng
info:eu-repo/grantAgreement/EC/FP7/231720
info:eu-repo/grantAgreement/EC/FP7/247762
http://hdl.handle.net/11234/1-1664
CC-BY-NC-SA + LDC99T42
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pcedt2
RES
application/zip
application/zip
application/zip
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pcedt2.0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Corpus of contemporary blogs
Grác, Marek
corpus
blogs
annotation
annotators
sentences
machine learning
At the NLP Centre, dividing text into sentences is currently done with
a tool which uses a rule-based system. In order to produce enough training
data for machine learning, annotators manually split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
2011
corpus
http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
http://nlp.fi.muni.cz/projekty/cocb/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B2E-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
sholva-0.6
Grác, Marek
Čapek, Tomáš
semantic net
semantic tagging
The semantic net 'sholva' contains more than 150,000 records for which there was sufficient agreement among annotators. Individual words are labeled with the following categories:
person, person / individual, event and substance.
2011
lexicalConceptualResource
http://hdl.handle.net/11858/00-097C-0000-0023-1B2E-0
ces
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Masaryk University, NLP Centre
https://nlp.fi.muni.cz/projekty/sholva/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-A780-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
MorfFlex CZ
Hajič, Jan
Hlaváčová, Jaroslava
morphological dictionary
morphology
Czech
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
2013
lexicalConceptualResource
http://hdl.handle.net/11858/00-097C-0000-0015-A780-9
ces
http://hdl.handle.net/11234/1-1673
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-xz
application/x-xz
application/x-gzip
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 4
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/morfflex
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0019-89A0-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 2
Šebesta, Karel
Goláňová, Hana
youth language
classroom
language acquisition corpus
AKCES
The AKCES 2 corpus consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It contains the same data as the corpus "Schola 2010" (see the link for search), but all proper names have been removed in order to protect the privacy of participants.
2013-05-11
corpus
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
ces
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University in Prague, ÚČJTK
http://akces.ff.cuni.cz
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-6133-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
W2C – Web to Corpus – Corpora
Majliš, Martin
multilingual corpora
A set of corpora for 120 languages automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
2011-12-20
corpus
http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
afr
als
amh
ara
arg
arz
ast
aze
bel
ben
bos
bpy
bre
bug
bul
cat
ceb
ces
chv
cos
cym
dan
deu
diq
ell
eng
epo
est
eus
fao
fas
fin
fra
fry
gan
gla
gle
glg
glk
guj
hat
hbs
heb
hif
hin
hrv
hsb
hun
hye
ido
ina
ind
isl
ita
jav
jpn
kan
kat
kaz
kor
kur
lat
lav
lim
lit
lmo
ltz
mal
mar
mkd
mlg
mon
mri
msa
mya
nap
nds
nep
new
nld
nno
nor
oci
oss
pam
pms
pol
por
que
ron
rus
sah
scn
sco
slk
slv
spa
sqi
srp
sun
swa
swe
tam
tat
tel
tgk
tgl
tha
tur
ukr
urd
uzb
vec
vie
vol
war
wln
yid
yor
zho
http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 122
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-AAF5-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
MTMonkey
Tamchyna, Aleš
Dušek, Ondřej
Rosa, Rudolf
machine translation
distributed computing
web service
infrastructure
MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post-processing.
It consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol.
2013-08-13
toolService
http://hdl.handle.net/11858/00-097C-0000-0022-AAF5-B
info:eu-repo/grantAgreement/EC/FP7/257528
Apache License 2.0
http://opensource.org/licenses/Apache-2.0
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
https://github.com/ufal/mtmonkey
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C73C-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 1.0
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
named entity recognition
named entity corpus
Czech
NER
corpus
The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification.
2007
corpus
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
ces
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7F6-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
PML Tree Query
Pajas, Petr
Štěpánek, Jan
Sedlák, Michal
treebank
query
search
System for querying annotated treebanks in PML format. The querying uses its own query language with a graphical representation. It has two different implementations (SQL and Perl) and several clients (TrEd, browser-based, command line interface).
2009-01-01
toolService
http://hdl.handle.net/11858/00-097C-0000-0022-C7F6-3
eng
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pmltq
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7FD-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
PMLTQ::Web
Sedlák, Michal
Perl
PML-TQ
PML
Simple web application built on top of the PML Tree Query service.
2013-09-10
toolService
http://hdl.handle.net/11858/00-097C-0000-0022-C7FD-6
Artistic License (Perl) 1.0
http://opensource.org/licenses/Artistic-Perl-1.0
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
https://redmine.ms.mff.cuni.cz/projects/pmltq-web
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-10B2-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Many Czech References for 50 Sentences Selected from WMT11 Data
Bojar, Ondřej
Macháček, Matouš
Tamchyna, Aleš
Zeman, Daniel
machine translation
automatic machine translation evaluation
reference translation
This dataset contains the complete set of Czech reference translations generated for 50 English source sentences from the WMT11 test set (http://www.statmt.org/wmt11).
In total, there are 15,431,447 Czech sentences, i.e. about 300k reference translations per English source sentence on average, but the exact number varies greatly across sentences.
You can find more details in the included README file.
If you use this dataset, please cite the following paper which describes the technique used to construct the Czech translations:
Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel:
Scratching the Surface of Possible Translations.
Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th
International Conference, TSD 2013. Proceedings, Copyright © Springer Verlag,
Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013, DOI: 10.1007/978-3-642-40585-3_59
2013-09-01
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-10B2-F
ces
info:eu-repo/grantAgreement/EC/FP7/288487
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
application/zip
application/octet-stream
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-D9BF-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Khresmoi Query Translation Test Data 1.0
Pecina, Pavel
Dušek, Ondřej
Hajič, Jan
Urešová, Zdeňka
corpus
test data
medical
health
machine translation
Czech
French
German
English
This package contains data sets for the development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts.
2013-10-10
corpus
Khresmoi-Query-MT-Test-Data-1.0
http://hdl.handle.net/11858/00-097C-0000-0022-D9BF-5
eng
fra
deu
ces
info:eu-repo/grantAgreement/EC/FP7/257528
http://hdl.handle.net/11234/1-2121
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://khresmoi.eu
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-F59C-82015-08-13T07:17:06Zhdl_11858_00-097C-0000-0007-710A-Ahdl_11234_3430hdl_11858_00-097C-0000-0007-710B-8
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-EE02-C2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Plain-Moses-Chimera
Bojar, Ondřej
Tamchyna, Aleš
moses
machine translation
Statistical component of Chimera, a state-of-the-art MT system.
2013-11-07
toolService
http://hdl.handle.net/11858/00-097C-0000-0022-EE02-C
eng
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
application/x-tar
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FE82-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Facebook Data for Sentiment Analysis
Habernal, Ivan
Ptáček, Tomáš
Steinberger, Josef
sentiment analysis
opinion mining
Corpus consisting of 10,000 Facebook posts manually annotated for sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files, one with posts (gold-posts.txt) and one with labels (gols-labels.txt), aligned line by line.
2013-07-17
corpus
http://hdl.handle.net/11858/00-097C-0000-0022-FE82-7
ces
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
University of West Bohemia
http://liks.fav.zcu.cz/sentiment/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FF60-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech SubLex 1.0
Veselovská, Kateřina
Bojar, Ondřej
subjectivity lexicon
sentiment analysis
opinion mining
polarity clues
Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part of speech tags, polarity orientation and source information.
The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used the parallel corpus CzEng 1.0, which contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator.
2013-12-02
lexicalConceptualResource
http://hdl.handle.net/11858/00-097C-0000-0022-FF60-B
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/octet-stream
application/octet-stream
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 3
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/seance
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119C-C2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
ORAL2006: Corpus of informal spoken Czech
Kopřivová, Marie
Waclawičová, Martina
corpus
informal spoken language
Corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 221 recordings made in 2002–2006 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 754; the metadata include sociolinguistic information about them.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
2006
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-119C-C
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
https://wiki.korpus.cz/doku.php/cnk:oral2006
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119D-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
ORAL2008: Balanced corpus of informal spoken Czech
Waclawičová, Martina
Kopřivová, Marie
Křen, Michal
Válková, Lucie
informal spoken language
balanced corpus
Balanced corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 297 recordings made in 2002–2007 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 995; the corpus is balanced with respect to their main sociolinguistic categories (gender, age group, education, region of childhood residence).
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
2008
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-119D-A
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
https://wiki.korpus.cz/doku.php/cnk:oral2008
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119E-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
SYN2005: balanced corpus of written Czech
Čermák, František
Hlaváčová, Jaroslava
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Kopřivová, Marie
Křen, Michal
Novotná, Renata
Petkevič, Vladimír
Schmiedtová, Věra
Skoumalová, Hana
Spoustová, Johanka
Šulc, Michal
Velíšek, Zdeněk
balanced corpus
written language
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2000–2004 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
2005
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-119E-8
ces
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
https://wiki.korpus.cz/doku.php/cnk:syn2005
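The SYN2005 description above explains that the distributed data are shuffled: each document is divided into blocks of at most 100 words that respect sentence boundaries, and the blocks are reordered at random within the document. A minimal sketch of such a procedure follows. It is not the Czech National Corpus tooling; the function names, whitespace-based word counting, and the handling of a single sentence longer than the limit (it becomes its own oversized block) are assumptions made for illustration.

```python
import random


def make_blocks(sentences, max_words=100):
    """Group consecutive sentences into blocks of at most max_words words.

    Sentences are never split. Assumption: a sentence that alone exceeds
    max_words forms a block by itself.
    """
    blocks, current, count = [], [], 0
    for sentence in sentences:
        length = len(sentence.split())
        if current and count + length > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += length
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences, max_words=100, seed=None):
    """Return the document's blocks in random order (within this document only)."""
    blocks = make_blocks(sentences, max_words)
    random.Random(seed).shuffle(blocks)
    return blocks


sentences = ["a b c", "d e", "f g h i", "j"]
print(make_blocks(sentences, max_words=5))
print(shuffle_document(sentences, max_words=5, seed=0))
```

Sentence order is preserved inside each block, so local context up to the block size survives while the original running text of the document cannot be read off directly.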
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119F-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
SYN2010: balanced corpus of written Czech
Křen, Michal
Bartoň, Tomáš
Cvrček, Václav
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Novotná, Renata
Petkevič, Vladimír
Procházka, Pavel
Schmiedtová, Věra
Skoumalová, Hana
balanced corpus
written language
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2005–2009 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
2010
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-119F-6
ces
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
https://wiki.korpus.cz/doku.php/cnk:syn2010
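For readers unfamiliar with the vertical format mentioned above, the following Python sketch shows one way to read it. It assumes the usual convention of one token per line with tab-separated attributes and XML-like structural tags on their own lines. The column order (word, lemma, tag) and the sentence tag <s> are assumptions and should be checked against the documentation of the actual release.

```python
def parse_vertical(lines, attributes=("word", "lemma", "tag")):
    """Yield sentences as lists of token dicts from vertical-format lines.

    attributes: assumed column names, in the order they appear on each token line.
    """
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            # Structural tag on its own line; a closing </s> ends the sentence.
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        # Token line: split the tab-separated attribute values.
        sentence.append(dict(zip(attributes, line.split("\t"))))
    if sentence:
        # Emit any trailing tokens not closed by </s>.
        yield sentence
```

The function accepts any iterable of lines, so it can be fed an open file object (for example one returned by gzip.open in text mode) without loading the whole corpus into memory.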
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1358-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
SYN2006PUB: corpus of Czech newspapers
Čermák, František
Hlaváčová, Jaroslava
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Kopřivová, Marie
Křen, Michal
Novotná, Renata
Petkevič, Vladimír
Schmiedtová, Věra
Skoumalová, Hana
Spoustová, Johanka
Šulc, Michal
Velíšek, Zdeněk
corpus
written language
Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
2006
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-1358-3
ces
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
https://wiki.korpus.cz/doku.php/cnk:syn2006pub
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1359-1 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
SYN2009PUB: corpus of Czech newspapers
Křen, Michal
Bartoň, Tomáš
Hnátková, Milena
Jelínek, Tomáš
Petkevič, Vladimír
Procházka, Pavel
Skoumalová, Hana
corpus
written language
Corpus of contemporary Czech newspapers and magazines sized 700 MW. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
2010
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-1359-1
ces
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
https://wiki.korpus.cz/doku.php/cnk:syn2009pub
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1AAF-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 3.0
Bejček, Eduard
Hajičová, Eva
Hajič, Jan
Jínová, Pavlína
Kettnerová, Václava
Kolářová, Veronika
Mikulová, Marie
Mírovský, Jiří
Nedoluzhko, Anna
Panevová, Jarmila
Poláková, Lucie
Ševčíková, Magda
Štěpánek, Jan
Zikánová, Šárka
treebank
dependency
tectogrammatics
topic-focus articulation
multiword expressions
coreference
bridging relations
discourse
PDT
PDT 3.0 is a new version of the Prague Dependency Treebank. It contains a large amount of Czech text with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic annotation (0.8 MW); in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.
2013-12-31
corpus
PDT 3.0
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
ces
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
http://hdl.handle.net/11234/1-1905
http://hdl.handle.net/11234/1-2621
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
text/html
application/zip
text/plain; charset=utf-8
downloadable_files_count: 2
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/pdt3.0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B04-C 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 1.1
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
Straka, Milan
named entity recognition
corpus
Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, the tmt format is replaced with the treex format, and all formats contain a split into training, development and testing portions of the data.
2014-01-09
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
ces
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/cnec/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B22-8 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 2.0
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
Straka, Milan
named entity recognition
Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named entities.
2014-01-09
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/cnec/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1D76-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Czech Senior COMPANION Expressive Speech Corpus
Grůber, Martin
speech corpus
expressive
text-to-speech synthesis
The corpus contains Czech expressive speech recorded using a scenario-based approach by a professional female speaker. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors.
2014-01-10
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-1D76-9
ces
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
application/x-gzip
application/x-tar
application/x-tar
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/plain; charset=utf-8
downloadable_files_count: 4
University of West Bohemia
http://www.companions-project.org/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3B09-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
SYN2013PUB: corpus of written Czech newspapers
Křen, Michal
Hnátková, Milena
Jelínek, Tomáš
Petkevič, Vladimír
Procházka, Pavel
Skoumalová, Hana
corpus
written language
Corpus of contemporary Czech newspapers and magazines sized 935 MW. It contains various titles published between 2005 and 2009. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
2013
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-3B09-4
ces
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
http://wiki.korpus.cz/doku.php/en:cnk:syn2013pub
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3FBB-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
AKCES 2 ver. 2
Šebesta, Karel
Goláňová, Hana
youth language
classroom
language acquisition corpus
AKCES
Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It contains the same data as the corpus "Schola 2010" (see the link for search), but all proper names have been removed in order to protect the privacy of participants.
2013-12-18
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
ces
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
application/zip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University in Prague, ÚČJTK
http://akces.ff.cuni.cz
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4087-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Linguistic digital repository based on DSpace
Pajas, Petr
Vandas, Karel
Mišutka, Jozef
Kamran, Amir
Jawaid, Bushra
Košarko, Ondřej
Sedlák, Michal
Josífko, Michal
Straňák, Pavel
Hajič, Jan
linguistics
digital data
digital repository
language repository
linguistic data
One of the goals of the LINDAT/CLARIN Centre for Language Research Infrastructure is to provide technical background to institutions or researchers who want to share their tools and data used for research in linguistics or related research fields. The digital repository is built on a highly customised DSpace platform.
2014
toolService
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
http://hdl.handle.net/11234/1-1481
downloadable_files_count: 0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://svn.ms.mff.cuni.cz/redmine/projects/dspace-modifications
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4336-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Czech Morphological Analyzer v1
Hajič, Jan
morphological analysis
lemmatization
One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.
2014-02-13
toolService
http://hdl.handle.net/11858/00-097C-0000-0023-4336-4
ces
downloadable_files_count: 0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://lindat.mff.cuni.cz/services/morph/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4337-2 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
EngVallex - English Valency Lexicon
Cinková, Silvie
Fučíková, Eva
Šindlerová, Jana
Hajič, Jan
Annotations
Corpora
Data
Lexicons
Monolingual
Semantics
Valency
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex also contains links to PropBank and VerbNet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
2014-02-13
lexicalConceptualResource
http://hdl.handle.net/11858/00-097C-0000-0023-4337-2
eng
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
text/plain; charset=utf-8
application/zip
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://lindat.mff.cuni.cz/services/EngVallex/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4338-F 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
PDT-Vallex: Czech Valency lexicon linked to treebanks
Urešová, Zdeňka
Štěpánek, Jan
Hajič, Jan
Panevová, Jarmila
Mikulová, Marie
annotation
corpora
data
lexicon
semantics
valency
PDT
The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in a more human-readable form including corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
2014-02-13
lexicalConceptualResource
http://hdl.handle.net/11858/00-097C-0000-0023-4338-F
ces
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
text/plain; charset=utf-8
application/zip
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://lindat.mff.cuni.cz/services/PDT-Vallex/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CD-0 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
MorphoDiTa: Morphological Dictionary and Tagger
Straka, Milan
Straková, Jana
tagging
morphological analysis
morphological generation
tokenization
MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the LGPL license, and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
2014-02-14
toolService
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
eng
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
downloadable_files_count: 0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/morphodita
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CE-E 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
NameTag
Straka, Milan
Straková, Jana
named entity recognizer
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license, and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
2014-02-14
toolService
http://hdl.handle.net/11858/00-097C-0000-0023-43CE-E
eng
downloadable_files_count: 0
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/nametag
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4670-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Vystadial 2013 – Czech data
Korvas, Matěj
Plátek, Ondřej
Dušek, Ondřej
Žilka, Lukáš
Jurčíček, Filip
acoustic data
speech corpus
spoken corpus
orthographic transcriptions
telephone speech
voip
dialogue system
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the Czech data part of the dataset.
2014-02-21
corpus
http://hdl.handle.net/11858/00-097C-0000-0023-4670-6
ces
http://hdl.handle.net/11234/1-1740
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
application/x-gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Charles University, Faculty of Mathematics and Physics
https://ufal.mff.cuni.cz/grants/vystadial