2024-03-29T10:34:17Z http://lindat.mff.cuni.cz/repository/oai/request
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4872-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Arabic Dependency Treebank 1.0
Dependency treebank
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4872-3
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
DVD-R
Hajic
Jan
hajic@ufal.mff.cuni.cz
Charles University in Prague, UFAL
hajic@ufal.mff.cuni.cz
2021-06-29
false
Prague Arabic Dependency Treebank
nationalFunds
corpus
text
monolingual
ara
Arabic
113500
tokens
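In the records of this listing, the handle URL field repeats the suffix of the record's OAI identifier. The following is a minimal sketch of that mapping, assuming the `oai:lindat.mff.cuni.cz:` prefix seen in the record headers; the function name `oai_to_handle_url` is hypothetical and not part of any repository API.

```python
# Assumed prefix, taken from the record headers in this listing.
OAI_PREFIX = "oai:lindat.mff.cuni.cz:"
# Resolver base, taken from the handle URL fields in this listing.
HANDLE_BASE = "http://hdl.handle.net/"


def oai_to_handle_url(oai_id: str) -> str:
    """Turn an OAI identifier into the handle URL shown in the same record."""
    if not oai_id.startswith(OAI_PREFIX):
        raise ValueError(f"unexpected OAI identifier: {oai_id!r}")
    return HANDLE_BASE + oai_id[len(OAI_PREFIX):]


print(oai_to_handle_url("oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4872-3"))
```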
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487A-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Lexikálně-sémantická anotace PDT pomocí Českého WordNetu (Lexical-semantic annotation of PDT using the Czech WordNet)
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
1ET201120505 - Od jazyka ke znalostem a sémantickému webu
nationalFunds
corpus
text
monolingual
ces
Czech
2.5
mb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4916-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
CzEng 0.7
An English-Czech parallel corpus
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4916-9
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
EuroMatrix
euFunds
corpus
text
bilingual
ces
Czech
eng
English
1375908
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4908-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
VALLEX 2.5
The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.5 has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague.
VALLEX 2.5 provides information on the valency structure (combinatorial potential) of verbs in their particular senses - there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses").
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4908-9
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
Vallex
nationalFunds
lexicalConceptualResource
lexicon
text
monolingual
ces
Czech
6460
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4880-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Czech WordNet 1.9 PDT
The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic.
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. This version of WordNet was used to annotate word senses in PDT: http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
A more recent version of Czech WordNet is distributed by ELRA: http://catalog.elra.info/product_info.php?products_id=1089
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
1ET201120505 - Od jazyka ke znalostem a sémantickému webu
nationalFunds
corpus
text
monolingual
ces
Czech
23094
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487E-B 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
CoNLL 2009 Shared Task Czech Trial Set
Czech trial (example) data for CoNLL 2009 Shared Task.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-487E-B
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
none
ownFunds
corpus
text
monolingual
ces
Czech
194
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4909-7 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
UMC 0.1: Czech-Russian-English Multilingual Corpus
UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian and English with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.
All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a huge collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4909-7
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
EuroMatrix
euFunds
corpus
text
monolingual
ces
Czech
1800000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B098-5 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.0 (PDT 2.0)
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (two million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
available-restrictedUse
other
other
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
stranak@ufal.mff.cuni.cz
2021-06-29
Data a nástroje pro informační systémy
nationalFunds
Moderní metody, struktury a systémy informatiky
nationalFunds
Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů
nationalFunds
Vícejazyčný valenční a predikátový slovník přirozeného jazyka
nationalFunds
Centrum komputační lingvistiky
nationalFunds
corpus
text
monolingual
ces
Czech
2000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B43E-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.0 - sample data
A small subset of PDT 2.0 made available under a permissive license.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-B43E-6
available-restrictedUse
CC-BY
attribution
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
Laboratoř počítačového zpracování jazykových dat
nationalFunds
Centrum komputační lingvistiky
nationalFunds
Vícejazyčný valenční a predikátový slovník přirozeného jazyka
nationalFunds
Moderní metody, struktury a systémy informatiky
nationalFunds
Centrum komputační lingvistiky
nationalFunds
Formální reprezentace jazykových struktur
nationalFunds
Čeština ve věku počítačů
nationalFunds
Velké jazykové korpusy a jejich automatická analýza
nationalFunds
Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů
nationalFunds
Data a nástroje pro informační systémy
nationalFunds
Od jazyka ke znalostem a sémantickému webu
nationalFunds
Tektogramatická reprezentace angličtiny - aplikace funkčního generativního popisu (FGP) na hloubkovou syntax cizích jazyků v PZK
nationalFunds
Faktory koherence textu a jejich zpracování v syntakticky anotovaném korpusu textů
nationalFunds
Pražský závislostní korpus: Analýza vybraných jevů z české funkční onomatologie a syntaxe
nationalFunds
Automatická hloubková analýza mluvené češtiny: od akustického signálu k významu
nationalFunds
Data preparation for Workshop 1998, JHU, Baltimore, MD, USA
other
corpus
text
monolingual
ces
Czech
549.2
kb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4914-D 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank of Spoken Language (PDTSL) 0.5
First edition of speech corpus with speech reconstruction layer (edited transcript).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4914-D
available-restrictedUse
other
other
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
Center for Computational Linguistics
nationalFunds
corpus
audio
bilingual
ces
Czech
eng
English
120000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-C6D1-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
CoNLL 2009 Shared Task - Czech Data
Czech data - both train and test+eval sets, as well as the valency dictionary - for the CoNLL 2009 Shared Task. Documentation is included. The data are generated from PDT 2.0. LDC catalog number: LDC2009E34B
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-C6D1-9
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
stranak@ufal.mff.cuni.cz
2021-06-29
Moderní metody, struktury a systémy informatiky
nationalFunds
corpus
text
monolingual
ces
Czech
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F3-0 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
XSH
XSH is a powerful command-line tool for querying, processing and editing XML documents. It features a shell-like interface with auto-completion for comfortable interactive work, but can also be used for off-line (batch) processing of XML data.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48F3-0
available-restrictedUse
other
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F7-8 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
TrEd
Tree Editor
TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Among other projects, it was used as the main annotation tool for syntactical and tectogrammatical annotations in The Prague Dependency Treebank, as well as for decision-tree based morphological annotation of The Prague Arabic Dependency Treebank.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48F7-8
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F8-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
MEd
MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that can be interconnected by links. MEd can also be used for other purposes, such as word-to-word alignment of parallel corpora.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48F8-6
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F9-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
HMM tagger
The HMM-based Tagger is software for morphological disambiguation (tagging) of Czech texts. The algorithm is statistical, based on Hidden Markov Models.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48F9-4
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
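The HMM tagger record above names Hidden Markov Models as the underlying technique. As a rough illustration of how HMM-based tagging picks a tag sequence, here is a toy Viterbi decoder. It is not the tagger's actual code: the two-tag tagset, the probabilities, and the function name `viterbi` are all made up for the example, and the input is assumed to be non-empty.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               tag chosen for word i-1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the best final tag to the first word.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model with invented numbers: N = noun, V = verb.
TAGS = ["N", "V"]
START = {"N": 0.8, "V": 0.2}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"pes": 0.9, "běží": 0.1}, "V": {"pes": 0.1, "běží": 0.9}}

print(viterbi(["pes", "běží"], TAGS, START, TRANS, EMIT))  # ['N', 'V']
```

A practical tagger would work with log probabilities and smoothed estimates; plain products are used here only to keep the sketch short.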
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FA-2 2017-04-10T13:34:17Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F2-1 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
DSpace modifications for use of EPIC handles
Modifications to DSpace made by Petr Pajas in order to support pidconsortium.eu PID handle system instead of the default handle.com system used by DSpace.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
available-restrictedUse
BSD
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FB-F 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
STYX
The STYX system is an electronic exercise book for practising Czech morphology and syntax, consisting of more than 11,000 sentences.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FC-D 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
MMI_clustering
MMI_clustering is a set of command line tools implementing Mercer's maximum mutual information-based clustering technique.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48FC-D
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FD-B 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Victor
Victor is a web page cleaning tool. It is aimed at removing menus, ads, footers, headers, etc. from HTML web pages, so that only the main web page content remains. Victor is based on a conditional random fields algorithm.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48FD-B
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FE-9 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Morče
The MORČE tagger is software for morphological disambiguation (part-of-speech tagging) of Czech text. The algorithm is statistical, based on the so-called "Averaged Perceptron" idea published by Michael Collins in 2002.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
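The Morče record above refers to the averaged perceptron. The sketch below shows the core idea on a deliberately simplified task: each token is classified independently, and the final weights are the average of the weight vectors seen after every training step. It is not Morče's implementation (Collins' method decodes whole tag sequences); the feature strings, the tagset, and both function names are hypothetical.

```python
from collections import defaultdict


def predict(weights, features, tags):
    """Pick the tag whose summed feature weights are highest (first tag wins ties)."""
    return max(tags, key=lambda t: sum(weights.get((f, t), 0.0) for f in features))


def train_averaged_perceptron(examples, tags, epochs=5):
    """examples: list of (feature list, gold tag). Returns averaged weights."""
    weights = defaultdict(float)  # current weights, keyed by (feature, tag)
    totals = defaultdict(float)   # sum of the weights over all training steps
    steps = 0
    for _ in range(epochs):
        for features, gold in examples:
            steps += 1
            guess = predict(weights, features, tags)
            if guess != gold:
                # Standard perceptron update: reward gold, penalise the wrong guess.
                for f in features:
                    weights[(f, gold)] += 1.0
                    weights[(f, guess)] -= 1.0
            # Naive averaging: accumulate the full weight vector after every step.
            for key, value in weights.items():
                totals[key] += value
    return {key: total / steps for key, total in totals.items()}


# Invented toy data: a shared bias feature plus one suffix feature per token.
TAGS = ["N", "V"]
DATA = [(["bias", "suf=es"], "N"), (["bias", "suf=ží"], "V")]

model = train_averaged_perceptron(DATA, TAGS)
print(predict(model, ["bias", "suf=ží"], TAGS))  # V
```

Summing the whole weight vector at every step is slow for large feature sets; real implementations usually keep per-weight timestamps instead, which gives the same averages.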
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FF-7 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Victoria
Victoria is an on-line HTML web page annotation tool suitable for selecting texts on web pages. It can be used to mark important or interesting parts of web pages for further processing.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48FF-7
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4900-A 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
MORFO
The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4900-A
available-restrictedUse
other
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4901-8 2017-04-10T13:32:37Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4902-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
LAW
Lexical Annotation Workbench (LAW) is an integrated environment for morphological annotation. It supports simple morphological annotation (assigning a lemma and tag to a word), integration and comparison of different annotations of the same text, searching for a particular word or tag, etc.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4902-6
available-restrictedUse
CC-BY
attribution
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4904-2 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Feature-based tagger
The Feature-based (exponential model) Tagger is a fast implementation of the Czech tagger developed at UFAL and described in the PDT 1.0 documentation (Czech Language Tagging page). In order to get the best possible results, the tagger requires preprocessing by a Czech morphological module with a very high coverage. This module covers a superset of the Czech "FM" morphology. Both the morphological module and the tagger are supplied as binary executables, together with all necessary precompiled Czech data. Input must be in the ISO Latin 2 (iso-8859-2) code and follow the csts.dtd definition, and output is produced in the same way (ISO Latin 2 code, csts.dtd). (As is the case with many of the tools provided with PDT 1.0, both executables also accept - and then produce - a "simplified SGML", which is not a real, valid SGML, but simply contains at least the tags for words, punctuation, and sentence breaks, one item per line.)
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4904-2
available-restrictedUse
other
other
downloadable
No value given
No value given
No value given
2021-06-29
http://lindat.mff.cuni.cz/services/morph/
toolService
No value given
No value given
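The Feature-based tagger record above states that input must be encoded in ISO Latin 2 (iso-8859-2). Text that is held as Unicode therefore needs converting before it is handed to the tool and converting back afterwards. A minimal sketch using Python's built-in codec follows; the helper names are hypothetical, and the csts.dtd markup itself is not modelled here.

```python
def to_latin2(text: str) -> bytes:
    """Encode text as ISO-8859-2; raises UnicodeEncodeError for characters outside it."""
    return text.encode("iso-8859-2")


def from_latin2(data: bytes) -> str:
    """Decode ISO-8859-2 bytes (for example, tagger output) back to a Python string."""
    return data.decode("iso-8859-2")


sample = "Příliš žluťoučký kůň"
encoded = to_latin2(sample)
print(len(sample), len(encoded))  # one byte per character in Latin 2
print(from_latin2(encoded) == sample)
```

Failing loudly on unencodable characters is a deliberate choice in this sketch: silently replacing them would change the tokens the tagger sees.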
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4905-F 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Netgraph
Netgraph is a graphically oriented client-server application for searching in linguistically annotated treebanks. The query language of Netgraph is simple and intuitive, yet powerful enough for treebanks with complex annotation schemes. The primary purpose of Netgraph is searching in the Prague Dependency Treebank 2.0; nevertheless, it can be used for other treebanks as well.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-4905-F
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F4-E 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
ElixirFM
ElixirFM is a high-level implementation of Functional Arabic Morphology documented at http://elixir-fm.wiki.sourceforge.net/. The core of ElixirFM is written in Haskell, while interfaces in Perl support lexicon editing and other interactions.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-48F4-E
available-restrictedUse
GPL
other
downloadable
No value given
No value given
No value given
2021-06-29
http://lindat.mff.cuni.cz/services/elixirfm/demo.php
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B08B-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Multiword expressions in the Prague Dependency Treebank 2.0
This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-B08B-3
available-restrictedUse
CC-BY
attribution
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
stranak@ufal.mff.cuni.cz
2021-06-29
Od jazyka ke znalostem a sémantickému webu
nationalFunds
Moderní metody, struktury a systémy informatiky
nationalFunds
corpus
text
monolingual
ces
Czech
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CC1E-B 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Hindi Web Texts
A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. Size: 18M sentences, 308M tokens.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B
available-restrictedUse
CC-BY-NC
attribution
academic-nonCommercialUse
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
stranak@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
monolingual
hin
Hindi
308000000
token
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-BD17-1 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
English-Hindi Parallel Corpus
English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-BD17-1
available-restrictedUse
CC-BY
attribution
downloadable
Straňák
Pavel
stranak@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
stranak@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
bilingual
hin
Hindi
eng
English
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCD-0 2014-05-13T09:21:27Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCA1-0 2022-11-25T16:00:44Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Air Traffic Control Communication
The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours) but we plan to search for additional data next year. The audio data format is: 8 kHz, 16-bit PCM, mono.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0
available-restrictedUse
other
other
downloadable
Ircing
Pavel
ircing@kky.zcu.cz
University of West Bohemia, Department of Cybernetics
ircing@kky.zcu.cz
2022-11-25
Inteligentní technologie pro zvýšení bezpečnosti letového provozu
nationalFunds
corpus
audio
monolingual
eng
English
20
hours
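The Air Traffic Control Communication record above gives the audio format (8 kHz, 16-bit PCM, mono) and the corpus length (20 hours). From these figures the raw audio volume can be estimated. The helper below is a hypothetical sketch that assumes uncompressed samples and ignores any container headers or transcript files.

```python
def pcm_size_bytes(duration_s, sample_rate_hz=8000, bits_per_sample=16, channels=1):
    """Size of uncompressed PCM audio: seconds x samples/s x bytes/sample x channels."""
    bytes_per_sample = bits_per_sample // 8
    return duration_s * sample_rate_hz * bytes_per_sample * channels


twenty_hours_s = 20 * 3600
print(pcm_size_bytes(twenty_hours_s))  # 1152000000 bytes, roughly 1.15 GB
```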
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCF-C 2022-04-26T13:51:47Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
czes
First version of the very large Czech corpus Czes created with a new set of tools. It comprises 465,102,710 tokens.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-CCCF-C
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Němčík
Václav
xnemcik@fi.muni.cz
Masaryk University, NLP Centre
xnemcik@fi.muni.cz
2022-04-26
corpus
text
monolingual
ces
Czech
465102710
tokens
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCE-E 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Integrated lexicographic platform for Russian
Integrated lexicographic platform for Russian.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-CCCE-E
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
No value given
No value given
No value given
2021-06-29
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCD2-2 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
IDENTICv1.0-raw
Raw Text
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-CCD2-2
available-restrictedUse
CC-BY
attribution
downloadable
Larasati
Septina Dian
septina.larasati@gmail.com
Charles University in Prague, UFAL
septina.larasati@gmail.com
2021-06-29
corpus
text
bilingual
ind
Indonesian
eng
English
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDB-0 2022-04-26T13:52:15Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
skTenTen
Slovak large web corpus skTenTen, comprising 876,003,720 tokens.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-CCDB-0
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Němčík
Václav
xnemcik@fi.muni.cz
Masaryk University, NLP Centre
xnemcik@fi.muni.cz
2022-04-26
corpus
text
monolingual
slk
Slovak
876003720
tokens
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDF-8 2022-04-26T13:50:51Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
enTenTen
Very large English web corpus enTenTen, comprising 3,268,798,627 tokens.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-CCDF-8
available-restrictedUse
other
other
downloadable
Němčík
Václav
xnemcik@fi.muni.cz
Masaryk University, NLP Centre
xnemcik@fi.muni.cz
2022-04-26
corpus
text
monolingual
eng
English
3268798627
tokens
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-D709-F 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
BushBank
Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0001-D709-F
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Němčík
Václav
xnemcik@fi.muni.cz
Masaryk University, NLP Centre
xnemcik@fi.muni.cz
2021-06-29
corpus
text
monolingual
ces
Czech
10000
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BCCF-3 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Extended Textual Coreference and Bridging Relations in PDT 2.0
Annotation of extended textual coreference and bridging relations in the Prague Dependency Treebank 2.0
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0005-BCCF-3
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Mírovský
Jiří
mirovsky@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
mirovsky@ufal.mff.cuni.cz
2021-06-29
LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
nationalFunds
Od struktury věty k textovým vztahům
nationalFunds
corpus
text
monolingual
ces
Czech
2
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF85-F 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
IDENTICv1.0
IDENTIC is an Indonesian-English parallel corpus for research purposes. The aim of this work is to build and provide researchers with a proper Indonesian-English textual data set and also to promote research in this language pair. The corpus contains texts from different sources and genres.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0005-BF85-F
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Larasati
Septina Dian
septina.larasati@gmail.com
Charles University in Prague, UFAL
septina.larasati@gmail.com
2021-06-29
CLARA (Common Language Resources and their Applications)
euFunds
Centrum komputační lingvistiky
nationalFunds
corpus
text
bilingual
ind
Indonesian
eng
English
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF95-B 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
VPS-30-En
VPS-30-En is a small lexical resource that contains the following 30 English verbs: access, ally, arrive, breathe, claim, cool, crush, cry, deny, enlarge, enlist, forge, furnish, hail, halt, part, plough, plug, pour, say, smash, smell, steer, submit, swell, tell, throw, trouble, wake and yield. We have created and have been using VPS-30-En to explore the interannotator agreement potential of the Corpus Pattern Analysis. VPS-30-En is a small snapshot of the Pattern Dictionary of English Verbs (Hanks and Pustejovsky, 2005), which we revised (both the entries and the annotated concordances) and enhanced with additional annotations.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0005-BF95-B
available-restrictedUse
CC-BY
attribution
downloadable
Cinková
Silvie
cinkova@ufal.mff.cuni.cz
Charles University in Prague, UFAL
cinkova@ufal.mff.cuni.cz
2021-06-29
LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
nationalFunds
Centrum pro multi-modální interpretaci dat velkého rozsahu
nationalFunds
Komputační lingvistika: Explicitní popis jazyka a anotovaná data se zřetelem na češtinu
nationalFunds
Temporální aspekty znalostí a informací
nationalFunds
lexicalConceptualResource
lexicon
text
monolingual
eng
English
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-CF9C-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Czech Parliament Meetings
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently consists of 88 hours of speech data, which corresponds roughly to 0.5 million tokens. The annotation process is semi-automatic, as we are able to perform the speech recognition on the data with high accuracy (over 90%) and consequently align the resulting automatic transcripts with the speech. The annotator’s task is then to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable for both acoustic model training for ASR purposes and training of speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in XML-based standard Transcriber format (http://trans.sourceforge.net)
The date of airing of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If the recording is too long to fit in the broadcasting scheme, it is divided into several parts and aired on the consecutive days.
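The filename convention described above can be read back with a few lines of Python. This is only a minimal sketch: the helper names and the example filename are hypothetical, the two-digit year is assumed to fall in the 2000s, and the "session date" is merely a guess based on the note that recordings are usually aired the following morning (it will be wrong for recordings split across several days).

```python
import re
from datetime import date, timedelta

# Matches the SOUND_YYMMDD_* naming scheme described in the record.
_SOUND_NAME = re.compile(r"^SOUND_(\d{2})(\d{2})(\d{2})_")


def airing_date(filename: str) -> date:
    """Return the airing date encoded in a SOUND_YYMMDD_* filename."""
    match = _SOUND_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a SOUND_YYMMDD_* filename: {filename!r}")
    yy, mm, dd = (int(group) for group in match.groups())
    # Assumption: all recordings date from the 2000s.
    return date(2000 + yy, mm, dd)


def likely_session_date(filename: str) -> date:
    """Guess the session date as the day before airing (a heuristic only)."""
    return airing_date(filename) - timedelta(days=1)
```

For a hypothetical file named `SOUND_110301_part1.wav`, `airing_date` yields 1 March 2011 and `likely_session_date` yields the preceding day.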
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Ircing
Pavel
ircing@kky.zcu.cz
University of West Bohemia, Department of Cybernetics
ircing@kky.zcu.cz
2021-06-29
corpus
audio
monolingual
ces
Czech
37
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADA-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
WMT 2011 Testing Set
Test set from the WMT 2011 competition [1], manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English and is described in [2].
References:
[1] http://www.statmt.org/wmt11/evaluation-task.html
[2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-AADA-9
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Galuščáková
Petra
galuscakova@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
galuscakova@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
multilingual
slk
Slovak
ces
Czech
eng
English
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADB-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Classified Errors in Cs->Sk Translation
Manual classification of errors in Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. The first 50 sentences from the WMT 2010 test set [2] were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups), and the MT errors were manually marked and classified. The classification was applied in the comparison of MT systems in [3]. The reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://matrix.statmt.org/test_sets/list
[3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-AADB-7
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Galuščáková
Petra
galuscakova@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
galuscakova@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
bilingual
slk
Slovak
ces
Czech
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADC-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Classified Errors in En->Sk Translation
Manual classification of errors in English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from the WMT 2011 test set [2] were translated by the 3 MT systems described in [3], and the MT errors were manually marked and classified. The reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://www.statmt.org/wmt11/evaluation-task.html
[3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-AADC-5
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Galuščáková
Petra
galuscakova@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
galuscakova@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
bilingual
slk
Slovak
eng
English
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADD-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Manually Ranked Translation Outputs
Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked the outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from the Acquis corpus and the first 50 sentences from the WMT 2010 test set). The ranking was applied in the comparison of MT systems in [1].
References:
[1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-AADD-3
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Galuščáková
Petra
galuscakova@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
galuscakova@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
bilingual
slk
Slovak
ces
Czech
1
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADF-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech-Slovak Parallel Corpus
Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded copy of the European Commission website [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-AADF-0
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Galuščáková
Petra
galuscakova@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
galuscakova@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
bilingual
slk
Slovak
ces
Czech
5700000
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAE0-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
English-Slovak Parallel Corpus
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded copy of the European Commission website [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-AAE0-A
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Galuščáková
Petra
galuscakova@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
galuscakova@ufal.mff.cuni.cz
2021-06-29
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
bilingual
slk
Slovak
eng
English
2
files
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAFE-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Česílko
Česílko is a tool enabling fast and efficient translation from one source language into many mutually related target languages.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-AAFE-A
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
No value given
No value given
No value given
2021-06-29
http://lindat.mff.cuni.cz/services/cesilko/demo.php
toolService
No value given
No value given
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-B847-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
CWC2011
Web corpus of Czech, created in 2011. Contains newspapers and magazines, discussions, and blogs.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-B847-6
available-restrictedUse
CC-BY
attribution
downloadable
Spoustová
Johanka
johanka@ucw.cz
Charles University in Prague, UFAL
johanka@ucw.cz
2021-06-29
true
Internet as a Language Corpus
nationalFunds
corpus
text
monolingual
ces
Czech
2650000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-DB11-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 2.5
The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see Documentation). Moreover, new information was added to the data:
Annotation of multiword expressions
Pair/group meaning
Clause segmentation
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Bejček
Eduard
bejcek@ufal.mff.cuni.cz
Charles University in Prague, UFAL
bejcek@ufal.mff.cuni.cz
2021-06-29
true
Prague Dependency Treebank 2.5
nationalFunds
corpus
text
monolingual
ces
Czech
2000000
tokens
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0007-70FD-E2022-03-14T14:21:37Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
DZ Interset
DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset defines a set of features that are encoded by the various tag sets. The set of features should be as universal as possible. It does not need to encode everything that is encoded by any tag set but it should encode all information that people may want to access and/or port from one tag set to another.
New tag sets are attached by writing a driver for them. Once the driver is ready, you can easily convert tags between the new set and any other set for which you also have a driver. This reusability is an obvious advantage over writing a targeted conversion procedure each time you need to convert between a particular pair of tag sets.
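The driver idea can be illustrated with a toy sketch in Python. Everything here is invented for illustration: the two tag sets ("A" and "B"), the feature names and the function names are hypothetical and are not the DZ Interset API or its feature inventory. Each driver only needs to map its own tags to and from the shared feature set, so converting between any two tag sets is a decode followed by an encode.

```python
from typing import Callable, Dict

Features = Dict[str, str]

# Hypothetical driver for invented tag set "A": tag -> universal features.
_A_TABLE: Dict[str, Features] = {
    "NN": {"pos": "noun", "number": "sing"},
    "NNS": {"pos": "noun", "number": "plur"},
    "VB": {"pos": "verb"},
}


def decode_a(tag: str) -> Features:
    """Decode a tag of set A into a feature dictionary."""
    return dict(_A_TABLE[tag])


def encode_b(features: Features) -> str:
    """Encode a feature dictionary as a tag of invented tag set B."""
    pos = features.get("pos")
    if pos == "noun":
        number = {"sing": "S", "plur": "P"}.get(features.get("number", ""), "X")
        return "N-" + number
    if pos == "verb":
        return "V"
    return "?"  # features that set B cannot express


def convert(tag: str,
            decode: Callable[[str], Features],
            encode: Callable[[Features], str]) -> str:
    """Convert a tag via the shared feature set (the 'interlingua')."""
    return encode(decode(tag))
```

With N tag sets this needs N drivers rather than a separate converter for each of the N·(N−1) ordered pairs, which is the reusability argument made above.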
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0007-70FD-E
available-restrictedUse
GPL
other
downloadable
Zeman
Daniel
zeman@ufal.mff.cuni.cz
Charles University in Prague, UFAL
zeman@ufal.mff.cuni.cz
2022-03-14
true
http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl
Výzkumný záměr
nationalFunds
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-D259-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Additional German-Czech reference translations of the WMT'11 test set
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0008-D259-7
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Dušek
Ondřej
odusek@ufal.mff.cuni.cz
Charles University in Prague, UFAL
odusek@ufal.mff.cuni.cz
2021-06-29
true
Čeština ve věku strojového překladu
nationalFunds
EuroMatrix Plus
euFunds
EuroMatrixPlus – Bringing Machine Translation for European Languages to the User
nationalFunds
EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User
nationalFunds
corpus
text
bilingual
deu
German
ces
Czech
527
kb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-60D6-12021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
W2C – Web to Corpus – tool
A tool used to build multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies the language, etc.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
available-restrictedUse
CC-BY-SA
attribution
shareAlike
downloadable
Popel
Martin
popel@ufal.mff.cuni.cz
Charles University in Prague, UFAL
popel@ufal.mff.cuni.cz
2021-06-29
true
toolService
suiteOfTools
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-E130-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Discourse Treebank 1.0
Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5. It represents a new layer of manual annotation, above the existing layers of the PDT and it portrays linguistic phenomena from the perspective of discourse structure and coherence.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
CD-ROM
downloadable
Mírovský
Jiří
mirovsky@ufal.mff.cuni.cz
Charles University in Prague, UFAL
mirovsky@ufal.mff.cuni.cz
2021-06-29
true
http://ufal.mff.cuni.cz/discourse/data.php
From the structure of a sentence to textual relationships
nationalFunds
Computational Linguistics: Explicit description of language and annotated data focused on Czech
nationalFunds
Coreference, discourse relations and information structure in a contrastive perspective
nationalFunds
corpus
text
monolingual
ces
Czech
49431
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2112-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 3
Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000C-2112-B
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
webExecutable
Šebesta
KS
sebesta@ff.cuni.cz
Charles University in Prague, ÚČJTK
sebesta@ff.cuni.cz
2021-06-29
Inovace vzdělávání v oboru čeština jako druhý jazyk; Jazyk jako lidská činnost, její produkt a faktor; Lingvistika
nationalFunds
corpus
text
monolingual
ces
Czech
11.32
mb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67C-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Korektor
Statistical spell-checker and (occasional) grammar-checker. There are three versions: a Unix command-line utility, an OS X SpellServer with a System Service that integrates with native OS X GUI applications, and a web service run by LINDAT-CLARIN that can be used either through a web form in a browser or by web applications via an API.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000D-F67C-5
available-restrictedUse
BSD
other
webExecutable
downloadable
accessibleThroughInterface
Richter
Michal
stranak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
stranak@ufal.mff.cuni.cz
2021-06-29
true
http://lindat.mff.cuni.cz/services/korektor
LINDAT-CLARIN project (LM2010013)
nationalFunds
toolService
tool
true
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2293-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
AKCES 4
Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000C-2293-0
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
webExecutable
Šebesta
KS
sebesta@ff.cuni.cz
Charles University in Prague, ÚČJTK
sebesta@ff.cuni.cz
2021-06-29
Inovace vzdělávání v oboru čeština jako druhý jazyk; Jazyk jako lidská činnost, její produkt a faktor; Lingvistika
nationalFunds
corpus
text
monolingual
ces
Czech
4.502
mb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC91-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Czech translation of the EBUContentGenre thesaurus
The EBUContentGenre is a thesaurus containing the hierarchical description of various genres utilized in the TV broadcasting industry. This thesaurus is a part of a complex metadata specification called EBUCore intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of European Broadcasting Union, the largest professional association of broadcasters around the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and consequent development of systems for automatic cataloguing (topic/genre detection).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000D-EC91-2
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Ircing
Pavel
ircing@kky.zcu.cz
University of West Bohemia
ircing@kky.zcu.cz
2021-06-29
true
Eliminace jazykových bariér handicapovaných diváků České televize II
nationalFunds
lexicalConceptualResource
thesaurus
text
bilingual
ces
Czech
eng
English
1266
keywords
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC92-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
ATCC: Pronunciation lexicon and n-gram counts for ASR module
The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000D-EC92-F
available-restrictedUse
CC-BY-NC
attribution
academic-nonCommercialUse
downloadable
Šmídl
Luboš
ircing@kky.zcu.cz
University of West Bohemia
ircing@kky.zcu.cz
2021-06-29
true
Inteligentní technologie pro zvýšení bezpečnosti letového provozu
nationalFunds
lexicalConceptualResource
other
text
monolingual
eng
English
236500
other
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC98-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
OVM – Otázky Václava Moravce
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000D-EC98-3
available-restrictedUse
CC-BY-NC
attribution
academic-nonCommercialUse
downloadable
Ircing
Pavel
ircing@kky.zcu.cz
University of West Bohemia
ircing@kky.zcu.cz
2021-06-29
true
corpus
audio
monolingual
ces
Czech
35
hours
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F696-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
jusText
jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License and released as open source software (available for download, including the source code, at http://code.google.com/p/justext/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm is able to remove most non-grammatical text from a web page, such as navigation, advertisements, tables and short notes. It has been shown to outperform, or at least keep up with, its competitors (according to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000D-F696-9
available-restrictedUse
CC-BY-SA
attribution
shareAlike
downloadable
Pomikálek
Jan
jan.pomikalek@gmail.com
Natural Language Processing Centre, Faculty of Informatics Masaryk University
jan.pomikalek@gmail.com
2021-06-29
true
PRESEMT
euFunds
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67A-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Chared
Chared is a software tool that can detect the character encoding of a text document provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than character encoding detection tools that have no language constraint. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik and six Turkic languages consisting of 70 billion tokens altogether. Chared is open source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this software was published in POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011, pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
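The idea of comparing byte trigram vectors can be sketched as follows. This is an illustrative approximation, not chared's actual code or API: the function names are hypothetical, cosine similarity is an assumed choice of similarity measure, and the "models" are built here simply by encoding a sample text of the known language in each candidate encoding.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Iterable


def byte_trigrams(data: bytes) -> Counter:
    """Count all overlapping three-byte sequences in the data."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (sqrt(sum(c * c for c in a.values()))
            * sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def train_models(sample_text: str, encodings: Iterable[str]) -> Dict[str, Counter]:
    """Build one trigram vector per candidate encoding from sample text."""
    return {enc: byte_trigrams(sample_text.encode(enc)) for enc in encodings}


def detect(data: bytes, models: Dict[str, Counter]) -> str:
    """Return the encoding whose model is most similar to the input bytes."""
    vector = byte_trigrams(data)
    return max(models, key=lambda enc: cosine(vector, models[enc]))
```

Knowing the language in advance is what makes this workable: the per-encoding models only have to separate encodings of one language, for example ISO-8859-2 from Windows-1250 for Czech, which differ in just a few letters.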
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9
available-restrictedUse
BSD
other
downloadable
Pomikálek
Jan
jan.pomikalek@gmail.com
Natural Language Processing Centre, Faculty of Informatics Masaryk University
jan.pomikalek@gmail.com
2021-06-29
true
PRESEMT
euFunds
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
onion
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License and released as open source software (available for download, including the source code, at http://code.google.com/p/onion/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing n-grams of words in the text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it is able to detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
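Deduplication by comparing word n-grams can be sketched roughly as below. This is a simplified illustration under stated assumptions, not onion's implementation: the function names, the n-gram length and the whole-document granularity are hypothetical, and the 0.5 threshold merely echoes the "partially similar (50 %)" figure mentioned above.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word n-grams of a whitespace-tokenized text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(docs: Iterable[str], n: int = 3, threshold: float = 0.5) -> List[str]:
    """Keep a document only if less than `threshold` of its n-grams were
    already seen in previously kept documents."""
    seen: Set[NGram] = set()
    kept: List[str] = []
    for doc in docs:
        grams = word_ngrams(doc, n)
        duplicated = len(grams & seen) / len(grams) if grams else 0.0
        if duplicated >= threshold:
            continue  # mostly made of already-seen n-grams: drop it
        kept.append(doc)
        seen |= grams
    return kept
```

Because the check is a set lookup per n-gram, a single pass over the collection suffices, at the cost of keeping the seen n-grams in memory.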
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7
available-restrictedUse
BSD
other
downloadable
Pomikálek
Jan
jan.pomikalek@gmail.com
Natural Language Processing Centre, Faculty of Informatics Masaryk University
jan.pomikalek@gmail.com
2021-06-29
true
http://code.google.com/p/onion/
PRESEMT
euFunds
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-8DAF-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Prague Czech-English Dependency Treebank 2.0
Texts
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
Data
The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.
Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
semantic labeling of content words and types of coordinating structures
argument structure, including an argument structure ("valency") lexicon for both languages
ellipsis and anaphora resolution.
This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation.
Annotation of the Czech part
Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.
Annotation of the English part
The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources:
PropBank (LDC2004T14)
VerbNet
NomBank (LDC2008T23)
flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran)
For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
available-restrictedUse
other
other
downloadable
Hajič
Jan
hajic@ufal.mff.cuni.cz
Charles University in Prague, UFAL
hajic@ufal.mff.cuni.cz
2021-06-29
true
http://ufal.mff.cuni.cz/pcedt2.0/trees/00/01/wsj_0001_1.xhtml
MSM0021620838 - Moderní metody, struktury a systémy informatiky
nationalFunds
LC536 - Integrated center for natural language processing
nationalFunds
ME09008 - Mnohojazyčná univerzální anotace lingvistických dat
nationalFunds
LM2010013 - LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
nationalFunds
7E09003 - EuroMatrixPlus—Bringing Machine Translation for European Languages to the User
nationalFunds
7E11051 - EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User
nationalFunds
7E11041 - Feedback Analysis for User Adaptive Statistical Translation
nationalFunds
GAP406/10/0875 - Computational Linguistics: Explicit description of language and annotated data focused on Czech
nationalFunds
GPP406/10/P193 - Tools for Revision and Tectogrammatical Annotation of a Czech Dependency Treebank
nationalFunds
GA405/09/0729 - From the structure of a sentence to textual relationships
nationalFunds
Companions, No. 034434
euFunds
EuroMatrix, No. 034291
euFunds
EuroMatrixPlus, No. 231720
euFunds
Faust, No. 247762
euFunds
corpus
text
bilingual
ces
Czech
eng
English
49208
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Corpus of contemporary blogs
At the NLP Centre, text is currently divided into sentences with
a tool that uses a rule-based system. To produce enough training
data for machine learning, annotators split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus, and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Grác
Marek
grac@fi.muni.cz
Masaryk University, NLP Centre
grac@fi.muni.cz
2021-06-29
true
corpus
text
monolingual
ces
Czech
10
mb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B2E-0
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
sholva-0.6
Semantic net 'sholva' contains more than 150 000 records for which there was sufficient agreement among annotators. Individual words are labeled in the following categories:
person, person / individual, event and substance.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-1B2E-0
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Grác
Marek
grac@fi.muni.cz
Masaryk University, NLP Centre
grac@fi.muni.cz
2021-06-29
true
lexicalConceptualResource
wordnet
text
monolingual
ces
Czech
3
mb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-A780-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
MorfFlex CZ
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0015-A780-9
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
hardDisk
Hajič
Jan
hajic@ufal.mff.cuni.cz
Charles University in Prague, UFAL
hajic@ufal.mff.cuni.cz
2021-06-29
true
N/A
ownFunds
lexicalConceptualResource
computationalLexicon
text
monolingual
ces
Czech
113537915
lexicalTypes
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0019-89A0-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
AKCES 2
Corpus AKCES 2 includes transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Šebesta
KS
sebesta@ff.cuni.cz
Charles University in Prague, ÚČJTK
sebesta@ff.cuni.cz
2021-06-29
true
https://wiki.korpus.cz/doku.php/cnk:schola2010
Jazyk jako lidská činnost, její produkt a faktor
nationalFunds
Program rozvoje vědních oblastí na Univerzitě Karlově P10 – Lingvistika, modul Osvojování a vývoj jazykové a komunikační kompetence u populace ČR, řešeno od r. 2012
nationalFunds
corpus
text
monolingual
ces
Czech
792764
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-6133-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
W2C – Web to Corpus – Corpora
A set of corpora for 120 languages automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
available-restrictedUse
CC-BY-SA
attribution
shareAlike
downloadable
accessibleThroughInterface
Popel
Martin
popel@ufal.mff.cuni.cz
Charles University in Prague, UFAL
popel@ufal.mff.cuni.cz
2021-06-29
true
corpus
text
multilingual
afr
Afrikaans
als
Tosk Albanian
amh
Amharic
ara
Arabic
arg
Aragonese
arz
Egyptian Arabic
ast
Asturian
aze
Azerbaijani
bel
Belarusian
ben
Bengali
bos
Bosnian
bpy
Bishnupriya
bre
Breton
bug
Buginese
bul
Bulgarian
cat
Catalan
ceb
Cebuano
ces
Czech
chv
Chuvash
cos
Corsican
cym
Welsh
dan
Danish
deu
German
diq
Dimli (individual language)
ell
Modern Greek (1453-)
eng
English
epo
Esperanto
est
Estonian
eus
Basque
fao
Faroese
fas
Persian
fin
Finnish
fra
French
fry
Western Frisian
gan
Gan Chinese
gla
Scottish Gaelic
gle
Irish
glg
Galician
glk
Gilaki
guj
Gujarati
hat
Haitian
hbs
Serbo-Croatian
heb
Hebrew
hif
Fiji Hindi
hin
Hindi
hrv
Croatian
hsb
Upper Sorbian
hun
Hungarian
hye
Armenian
ido
Ido
ina
Interlingua (International Auxiliary Language Association)
ind
Indonesian
isl
Icelandic
ita
Italian
jav
Javanese
jpn
Japanese
kan
Kannada
kat
Georgian
kaz
Kazakh
kor
Korean
kur
Kurdish
lat
Latin
lav
Latvian
lim
Limburgan
lit
Lithuanian
lmo
Lombard
ltz
Luxembourgish
mal
Malayalam
mar
Marathi
mkd
Macedonian
mlg
Malagasy
mon
Mongolian
mri
Maori
msa
Malay (macrolanguage)
mya
Burmese
nap
Neapolitan
nds
Low German
nep
Nepali
new
Newari
nld
Dutch
nno
Norwegian Nynorsk
nor
Norwegian
oci
Occitan (post 1500)
oss
Ossetian
pam
Pampanga
pms
Piemontese
pol
Polish
por
Portuguese
que
Quechua
ron
Romanian
rus
Russian
sah
Yakut
scn
Sicilian
sco
Scots
slk
Slovak
slv
Slovenian
spa
Spanish
sqi
Albanian
srp
Serbian
sun
Sundanese
swa
Swahili (macrolanguage)
swe
Swedish
tam
Tamil
tat
Tatar
tel
Telugu
tgk
Tajik
tgl
Tagalog
tha
Thai
tur
Turkish
ukr
Ukrainian
urd
Urdu
uzb
Uzbek
vec
Venetian
vie
Vietnamese
vol
Volapük
war
Waray (Philippines)
wln
Walloon
yid
Yiddish
yor
Yoruba
zho
Chinese
55
gb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-AAF5-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
MTMonkey
MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post-processing.
It consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-AAF5-B
available-restrictedUse
other
other
downloadable
Dušek
Ondřej
odusek@ufal.mff.cuni.cz
Charles University in Prague, UFAL
odusek@ufal.mff.cuni.cz
2021-06-29
true
The KHRESMOI Project (EU 7th Framework Programme grant agreement no. 257528)
euFunds
LINDAT-CLARIN project (LM2010013)
nationalFunds
AMALACH project (DF12P01OVV02 of the Ministry of Culture of Czech Republic)
nationalFunds
toolService
infrastructure
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C73C-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 1.0
The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straková
Jana
strakova@ufal.mff.cuni.cz
Charles University in Prague, UFAL
strakova@ufal.mff.cuni.cz
2021-06-29
1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů)
nationalFunds
corpus
text
monolingual
ces
Czech
6000
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7F6-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
PML Tree Query
System for querying annotated treebanks in PML format. The querying uses its own query language with graphical representation. It has two different implementations (SQL and Perl) and several clients (TrEd, browser-based, command line interface).
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-C7F6-3
available-restrictedUse
GPL
other
downloadable
Štěpánek
Jan
jan.stepanek@matfyz.cz
Charles University in Prague, UFAL
jan.stepanek@matfyz.cz
2021-06-29
https://lindat.mff.cuni.cz/services/pmltq/
Integration of language resources for information extraction from natural texts
nationalFunds
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7FD-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
PMLTQ::Web
Simple web front end built on top of the PML Tree Query service.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-C7FD-6
available-restrictedUse
other
other
accessibleThroughInterface
Sedlák
Michal
sedlak@ufal.mff.cuni.cz
Charles University in Prague, UFAL
sedlak@ufal.mff.cuni.cz
2021-06-29
https://lindat.mff.cuni.cz/services/pmltq/
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-10B2-F
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Many Czech References for 50 Sentences Selected from WMT11 Data
This dataset contains the complete, very large set of Czech translations for 50 English source sentences taken from the WMT11 test set (http://www.statmt.org/wmt11).
In total, there are 15431447 Czech sentences, i.e. about 300k reference translations per source English sentence on average, but the exact number varies greatly across sentences.
You can find more details in the included README file.
If you use this dataset, please cite the following paper which describes the technique used to construct the Czech translations:
Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel:
Scratching the Surface of Possible Translations.
Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th
International Conference, TSD 2013. Proceedings, Copyright © Springer Verlag,
Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013
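The average quoted in the description can be checked with a few lines of Python. This is only an arithmetic sketch using the two figures stated above; the variable names are illustrative.

```python
# Figures taken from the dataset description above.
total_translations = 15_431_447  # Czech sentences in total
source_sentences = 50            # English source sentences from WMT11

# Mean number of reference translations per source sentence.
average = total_translations / source_sentences
print(f"{average:,.0f} references per source sentence on average")
```

The result is slightly above the rounded "300k" figure, which is consistent with the description giving an approximate value.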
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-10B2-F
available-restrictedUse
CC-BY-SA
attribution
shareAlike
downloadable
Macháček
Matouš
machacekmatous@gmail.com
Charles University in Prague, UFAL
machacekmatous@gmail.com
2021-06-29
Čeština ve věku strojového překladu
nationalFunds
MosesCore
euFunds
Využití mnohonásobných referencí ve strojovém překladu
nationalFunds
corpus
text
monolingual
ces
Czech
15431447
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-D9BF-5
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Khresmoi Query Translation Test Data 1.0
This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-D9BF-5
available-restrictedUse
CC-BY-NC
attribution
academic-nonCommercialUse
downloadable
Hajič
Jan
hajic@ufal.mff.cuni.cz
Charles University in Prague, UFAL
hajic@ufal.mff.cuni.cz
2021-06-29
true
KHRESMOI - KNOWLEDGE HELPER FOR MEDICAL AND OTHER INFORMATION USERS, EU NO. 257528
euFunds
euFunds
corpus
text
multilingual
eng
English
fra
French
deu
German
ces
Czech
1508
terms
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-EE02-C
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Plain-Moses-Chimera
Statistical component of Chimera, a state-of-the-art MT system.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-EE02-C
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Bojar
Ondřej
bojar@ufal.mff.cuni.cz
Charles University in Prague, UFAL
bojar@ufal.mff.cuni.cz
2021-06-29
Zpřístupnění rozsáhlého video archivu kulturního dědictví pomocí metod automatického rozpoznávání mluvené řeči a strojového překladu. (AMALACH)
nationalFunds
toolService
suiteOfTools
true
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FE82-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Facebook Data for Sentiment Analysis
The corpus consists of 10,000 Facebook posts manually annotated for sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
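Because posts and labels sit on corresponding lines of two files, they can be paired line by line. The following Python sketch illustrates the idea on in-memory lines; the function name `read_gold` and the label strings in the example are hypothetical, since the description does not say how labels are encoded.

```python
from itertools import zip_longest


def read_gold(posts_lines, labels_lines):
    """Pair each post with the label on the same line number."""
    pairs = []
    for post, label in zip_longest(posts_lines, labels_lines):
        if post is None or label is None:
            # The two files must have the same number of lines.
            raise ValueError("posts and labels differ in length")
        pairs.append((post.rstrip("\n"), label.strip()))
    return pairs


# Class sizes as stated in the description; they should add up to 10,000.
stated = {"positive": 2587, "neutral": 5174, "negative": 1991, "bipolar": 248}
total = sum(stated.values())

# Tiny illustrative input (hypothetical label strings, not real corpus data).
example = read_gold(["first post\n", "second post\n"], ["p\n", "n\n"])
```

With real data, the two line lists would come from opening the posts file and the labels file from the archive with UTF-8 encoding.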
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-FE82-7
available-restrictedUse
CC-BY-SA
attribution
shareAlike
downloadable
Habernal
Ivan
habernal@kiv.zcu.cz
University of West Bohemia in Pilsen, KIV
habernal@kiv.zcu.cz
2021-06-29
http://liks.fav.zcu.cz/sentiment/
corpus
text
monolingual
ces
Czech
1084
kb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FF60-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Czech SubLex 1.0
Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part of speech tag, polarity orientation and source information.
The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used the parallel corpus CzEng 1.0, containing 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0022-FF60-B
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Veselovská
Kateřina
veselovska@ufal.mff.cuni.cz
Charles University in Prague, UFAL
veselovska@ufal.mff.cuni.cz
2021-06-29
GAUK 3537/2011 grant and SVV project number 267 314.
nationalFunds
lexicalConceptualResource
wordList
text
monolingual
ces
Czech
207
kb
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119C-C
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
ORAL2006: Corpus of informal spoken Czech
Corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 221 recordings made in 2002–2006 throughout Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 754; the metadata include sociolinguistic information about them. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
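A vertical file conventionally holds one token per line, with XML-like structural tags on lines of their own. The Python sketch below reads such input under two assumptions that are not stated in the record: token attributes are tab-separated, and any line that starts with "<" and ends with ">" is a structural tag. The tag names in the sample are hypothetical.

```python
def parse_vertical(lines):
    """Split vertical-format lines into token rows and structural tags."""
    tokens, structures = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("<") and line.endswith(">") and len(line) > 2:
            structures.append(line)  # e.g. a document or speaker boundary
        else:
            # Assumed: attributes of one token are separated by tabs.
            tokens.append(tuple(line.split("\t")))
    return tokens, structures


# Hypothetical sample, not taken from the corpus.
sample = ['<doc id="x">\n', "<sp>\n", "no\n", "jo\n", "</sp>\n", "</doc>\n"]
tokens, structures = parse_vertical(sample)
```

The actual attribute columns and structure names are defined by the corpus itself and should be taken from its documentation.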
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-119C-C
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Křen
Michal
michal.kren@ff.cuni.cz
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
michal.kren@ff.cuni.cz
2021-06-29
true
Český národní korpus a korpusy dalších jazyků
nationalFunds
corpus
text
monolingual
ces
Czech
1000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119D-A
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
ORAL2008: Balanced corpus of informal spoken Czech
Balanced corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 297 recordings made in 2002–2007 throughout Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 995; the corpus is balanced in their main sociolinguistic categories (gender, age group, education, region of childhood residence).
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-119D-A
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Křen
Michal
michal.kren@ff.cuni.cz
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
michal.kren@ff.cuni.cz
2021-06-29
true
Český národní korpus a korpusy dalších jazyků
nationalFunds
corpus
text
monolingual
ces
Czech
1000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119E-8
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
SYN2005: balanced corpus of written Czech
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2000–2004 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
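The shuffling described above (blocks of at most 100 words that respect sentence boundaries, reordered at random within a document) can be sketched in Python as follows. This is an illustration of the idea, not the procedure actually used by the CNC; the function name and the handling of sentences longer than the limit (each becomes its own block) are assumptions.

```python
import random


def shuffle_document(sentences, max_words=100, seed=None):
    """Group whole sentences into blocks of at most max_words, then shuffle.

    sentences: list of sentences, each a list of tokens.
    """
    blocks, current, count = [], [], 0
    for sent in sentences:
        # Close the current block if this sentence would exceed the limit.
        if current and count + len(sent) > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sent)
        count += len(sent)
    if current:
        blocks.append(current)
    # Randomize block order within this one document only.
    random.Random(seed).shuffle(blocks)
    return blocks


# Five 40-word sentences form three blocks (2 + 2 + 1 sentences).
demo = shuffle_document([["w"] * 40 for _ in range(5)], seed=0)
```

Sentences are never split, so the text inside each block stays readable while the original document order cannot be reconstructed.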
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-119E-8
available-restrictedUse
other
other
downloadable
Křen
Michal
michal.kren@ff.cuni.cz
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
michal.kren@ff.cuni.cz
2021-06-29
true
Český národní korpus a korpusy dalších jazyků
nationalFunds
corpus
text
monolingual
ces
Czech
100000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119F-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
SYN2010: balanced corpus of written Czech
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2005–2009 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-119F-6
available-restrictedUse
other
other
downloadable
Křen
Michal
michal.kren@ff.cuni.cz
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
michal.kren@ff.cuni.cz
2021-06-29
true
Český národní korpus a korpusy dalších jazyků
nationalFunds
corpus
text
monolingual
ces
Czech
100000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1358-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
SYN2006PUB: corpus of Czech newspapers
Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-1358-3
available-restrictedUse
other
other
downloadable
Křen
Michal
michal.kren@ff.cuni.cz
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
michal.kren@ff.cuni.cz
2021-06-29
true
Český národní korpus a korpusy dalších jazyků
nationalFunds
corpus
text
monolingual
ces
Czech
300000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1359-1
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
SYN2009PUB: corpus of Czech newspapers
Corpus of contemporary Czech newspapers and magazines sized 700 MW. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-1359-1
available-restrictedUse
other
other
downloadable
Křen
Michal
michal.kren@ff.cuni.cz
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
michal.kren@ff.cuni.cz
2021-06-29
true
Český národní korpus a korpusy dalších jazyků
nationalFunds
corpus
text
monolingual
ces
Czech
700000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1AAF-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Prague Dependency Treebank 3.0
PDT 3.0 is a new version of Prague Dependency Treebank. It contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic annotation (0.8 MW); in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Mírovský
Jiří
mirovsky@ufal.mff.cuni.cz
Charles University in Prague, UFAL
mirovsky@ufal.mff.cuni.cz
2021-06-29
true
http://ufal.mff.cuni.cz/pdt3.0
Computational Linguistics: Explicit description of language and annotated data focused on Czech
nationalFunds
corpus
text
monolingual
ces
Czech
49431
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B04-C
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 1.1
Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, the tmt format is replaced with the treex format, and all formats are split into training, development and testing portions of the data.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straková
Jana
strakova@ufal.mff.cuni.cz
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics in Prague
strakova@ufal.mff.cuni.cz
2021-06-29
Teoretické základy informatiky a výpočetní lingvistiky
nationalFunds
LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
nationalFunds
Vybrané derivační vztahy pro automatické zpracování češtiny
nationalFunds
PRVOUK
nationalFunds
corpus
text
monolingual
ces
Czech
5868
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B22-8
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Czech Named Entity Corpus 2.0
Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named entities.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Straková
Jana
strakova@ufal.mff.cuni.cz
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics in Prague
strakova@ufal.mff.cuni.cz
2021-06-29
Teoretické základy informatiky a výpočetní lingvistiky
nationalFunds
LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
nationalFunds
Vybrané derivační vztahy pro automatické zpracování češtiny
nationalFunds
PRVOUK
nationalFunds
corpus
text
monolingual
ces
Czech
8993
sentences
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1D76-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Czech Senior COMPANION Expressive Speech Corpus
The corpus contains Czech expressive speech recorded using a scenario-based approach by a professional female speaker. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-1D76-9
available-restrictedUse
CC-BY-NC-SA
attribution
academic-nonCommercialUse
shareAlike
downloadable
Ircing
Pavel
ircing@kky.zcu.cz
University of West Bohemia, Dept. of Cybernetics
ircing@kky.zcu.cz
2021-06-29
true
COMPANIONS - Intelligent, Persistent, Personalised Multimodal Interfaces to the Internet
euFunds
corpus
audio
monolingual
ces
Czech
6508
utterances
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3B09-4
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
SYN2013PUB: corpus of written Czech newspapers
Corpus of contemporary Czech newspapers and magazines sized 935 MW. It contains various titles published between 2005 and 2009. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-3B09-4
available-restrictedUse
other
other
downloadable
Křen
Michal
michal.kren@ff.cuni.cz
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
michal.kren@ff.cuni.cz
2021-06-29
true
https://kontext.korpus.cz/first_form?corpname=syn2013pub
Český národní korpus
nationalFunds
corpus
text
monolingual
ces
Czech
935000000
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3FBB-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
AKCES 2 ver. 2
Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data as the corpus "Schola 2010" (see the link for search), but all the proper names have been removed in order to protect the privacy of participants.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
available-restrictedUse
CC-BY-NC-ND
attribution
academic-nonCommercialUse
noDerivatives
downloadable
Šebesta
Karel
sebesta@ff.cuni.cz
Charles University in Prague, ÚČJTK
sebesta@ff.cuni.cz
2021-06-29
true
http://ames.ff.cuni.cz/
Program rozvoje vědních oblasti na Univerzitě Karlově, Program P10 - Lingvistika
nationalFunds
corpus
text
monolingual
ces
Czech
792764
words
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4087-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Linguistic digital repository based on DSpace
One of the goals of the LINDAT/CLARIN Centre for Language Research Infrastructure is to provide technical background to institutions or researchers who want to share their tools and data used for research in linguistics or related research fields. The digital repository is built on a highly customised DSpace platform.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
available-restrictedUse
other
other
downloadable
Mišutka
Jozef
misutka@ufal.mff.cuni.cz
Charles University in Prague, UFAL
misutka@ufal.mff.cuni.cz
2021-06-29
true
LM2010013
nationalFunds
toolService
infrastructure
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4336-4
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Czech Morphological Analyzer v1
One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-4336-4
available-restrictedUse
other
other
downloadable
Hajič
Jan
jan.hajic@mff.cuni.cz
Charles University in Prague, UFAL
jan.hajic@mff.cuni.cz
2021-06-29
https://lindat.mff.cuni.cz/services/morph/index.html
toolService
service
true
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4337-2
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
EngVallex - English Valency Lexicon
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of a surface form of verbal arguments. EngVallex also contains links to PropBank and VerbNet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-4337-2
available-restrictedUse
other
other
downloadable
Hajič
Jan
jan.hajic@mff.cuni.cz
Charles University in Prague, UFAL
jan.hajic@mff.cuni.cz
2021-06-29
http://lindat.mff.cuni.cz/services/EngVallex/
lexicalConceptualResource
computationalLexicon
text
monolingual
eng
English
4337
entries
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4338-F
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
PDT-Vallex: Czech Valency lexicon linked to treebanks
The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in a more human-readable form including corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-4338-F
available-restrictedUse
other
other
downloadable
Hajič
Jan
jan.hajic@mff.cuni.cz
Charles University in Prague, UFAL
jan.hajic@mff.cuni.cz
2021-06-29
http://lindat.mff.cuni.cz/services/PDT-Vallex/
lexicalConceptualResource
lexicon
text
monolingual
ces
Czech
7121
entries
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CD-0
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
MorphoDiTa: Morphological Dictionary and Tagger
MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging, and tokenization, and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
available-restrictedUse
other
other
downloadable
Straka
Milan
straka@ufal.mff.cuni.cz
Charles University in Prague, UFAL
straka@ufal.mff.cuni.cz
2021-06-29
http://lindat.mff.cuni.cz/services/morphodita/
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CE-E 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
NameTag
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, and organizations. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-43CE-E
available-restrictedUse
other
other
downloadable
Straka
Milan
straka@ufal.mff.cuni.cz
Charles University in Prague, UFAL
straka@ufal.mff.cuni.cz
2021-06-29
http://lindat.mff.cuni.cz/services/nametag/
toolService
tool
false
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4670-6 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Vystadial 2013 – Czech data
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the Czech data part of the dataset.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-4670-6
available-restrictedUse
CC-BY-SA
attribution
shareAlike
downloadable
Korvas
Matěj
korvas@ufal.mff.cuni.cz
Faculty of Mathematics and Physics, Charles University in Prague, UFAL
korvas@ufal.mff.cuni.cz
2021-06-29
true
MŠMT LK11221 (Vývoj metod pro návrh statistických mluvených dialogových systémů)
nationalFunds
corpus
audio
monolingual
ces
Czech
18
hours
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4671-4 2021-06-29T08:45:33Z hdl_11858_00-097C-0000-0001-486F-D hdl_11234_3430 hdl_11858_00-097C-0000-0001-4877-A
Vystadial 2013 – English data
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the English data part of the dataset.
NOT_DEFINED_FOR_V2
http://hdl.handle.net/11858/00-097C-0000-0023-4671-4
available-restrictedUse
CC-BY-SA
attribution
shareAlike
downloadable
Korvas
Matěj
korvas@ufal.mff.cuni.cz
Faculty of Mathematics and Physics, Charles University in Prague, UFAL
korvas@ufal.mff.cuni.cz
2021-06-29
true
MŠMT LK11221 (Vývoj metod pro návrh statistických mluvených dialogových systémů)
nationalFunds
corpus
audio
monolingual
eng
English
45
hours
oai_metasharev2///hdl_11858_00-097C-0000-0001-4877-A/100