2024-03-28T19:51:47Z
http://lindat.mff.cuni.cz/repository/oai/request
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4872-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Smrž, Otakar
Zemánek, Petr
Pajas, Petr
Šnaidauf, Jan
Beška, Emanuel
Kracmar, Jakub
Hassanová, Kamila
2011-06-27T11:59:09Z
2009-11-02T10:34:20Z
2009-11-02T10:34:20Z
http://hdl.handle.net/11858/00-097C-0000-0001-4872-3
The PADT project might be summarized as an open-ended activity of the Center for Computational Linguistics, the Institute of Formal and Applied Linguistics, and the Institute of Comparative Linguistics, Charles University in Prague, consisting of multi-level annotation of Arabic language resources in the light of the theory of Functional Generative Description (Sgall et al., 1986; Hajičová and Sgall, 2003).
ara
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/padt
corpus
Arabic
Prague Arabic Dependency Treebank 1.0
Pražský arabský závislostní korpus 1.0
corpus
Prague Arabic Dependency Treebank 1.0
Hajič
Jan
Charles University in Prague, UFAL
restrictedUse
own
academicUse/nonCommercialUse
DVD-R
False
Prague Arabic Dependency Treebank
nationalFunds
Dependency treebank
text
ces
113500
tokens
hajic@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
113500@@tokens
130383574
2
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487A-4
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Bejček, Eduard
Hoffmannová, Petra
Holub, Martin
Hučínová, Marie
Pecina, Pavel
Straňák, Pavel
Šidák, Pavel
Hajič, Jan
2011-06-27T13:00:08Z
2011-06-27T13:00:08Z
2011-01-23
http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
This dataset contains annotation of PDT using the Czech WordNet ontology: http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation.
1ET100300517, 1ET201120505
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
PDT
Czech WordNet
PDT
Lexico-Semantic Annotation of PDT using Czech WordNet
Lexikálně-sémantická anotace PDT pomocí Českého WordNetu
corpus
Lexikálně-sémantická anotace PDT pomocí Českého WordNetu
Straňák
Pavel
Charles University in Prague, UFAL
unrestrictedUse
CC_BY-NC-SA_3.0
attribution
download
True
1ET201120505 - Od jazyka ke znalostem a sémantickému webu
nationalFunds
Data is stored in PML format. This is a stand-off annotation and for most use cases it requires PDT 2.0 and the Czech WordNet 1.9 PDT that we have used for annotation.
text
ces
2.5
mb
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Grantová agentura Akademie věd České republiky@@1ET201120505@@Od jazyka ke znalostem a sémantickému webu@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET100300517@@Metody inteligentních systémů a jejich aplikace při dobývání znalostí a zpracování přirozeného jazyka@@nationalFunds@@
2.5@@mb
2560526
1
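Several fields in the records above pack multiple values into one line using `@@` as a separator (for example the funding lines and `2.5@@mb`). A minimal sketch of how such a funding line might be split follows; the field names `organization`, `code`, `project`, and `funding_type` are assumptions inferred from the values shown, not names defined by the repository.

```python
from typing import NamedTuple


class Funding(NamedTuple):
    # Field names are hypothetical labels chosen for this sketch.
    organization: str
    code: str
    project: str
    funding_type: str


def parse_funding(line: str) -> Funding:
    """Split one '@@'-delimited funding line into its four visible parts."""
    parts = line.split("@@")
    # The funding lines shown end with a trailing '@@', which produces
    # one empty final element; drop it before checking the field count.
    if parts and parts[-1] == "":
        parts = parts[:-1]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    return Funding(*parts)


example = (
    "Grantová agentura Akademie věd České republiky@@1ET201120505@@"
    "Od jazyka ke znalostem a sémantickému webu@@nationalFunds@@"
)
record = parse_funding(example)
```

Lines with a different shape, such as the size field `2.5@@mb`, are rejected by this sketch rather than guessed at.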
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4916-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Bojar, Ondřej
Žabokrtský, Zdeněk
Češka, Pavel
Beňa, Peter
Janíček, Miroslav
2011-06-28T16:13:23Z
2009-11-02T10:32:27Z
2009-11-02
http://hdl.handle.net/11858/00-097C-0000-0001-4916-9
CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to support Czech-English and English-Czech machine translation research with the necessary data. CzEng 0.7 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction, all converted into a uniform XML-based file format and provided with automatic sentence alignment.
ces
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11234/1-1458
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/czeng/czeng07/
parallel corpus
CzEng 0.7
corpus
CzEng 0.7
Straňák
Pavel
Charles University in Prague, UFAL
unrestrictedUse
CC_BY-NC-SA_3.0
academicUse/nonCommercialUse
download
True
EuroMatrix
EU
An English-Czech parallel corpus
text
ces
1375908
sentences
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
1375908@@sentences
373531879
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4908-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Lopatková, Markéta
Žabokrtský, Zdeněk
Kettnerová, Václava
2011-06-28T10:07:47Z
2009-11-02T11:50:55Z
2009-11-02
http://hdl.handle.net/11858/00-097C-0000-0001-4908-9
The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.5 has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague.
VALLEX 2.5 provides information on the valency structure (combinatorial potential) of verbs in their particular senses - there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses").
LC 536 - Center for Computational Linguistics, 1ET100300517 and 1ET101120503.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11234/1-2307
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/vallex/2.5/
valency
Czech
VALLEX 2.5
lexicalConceptualResource
VALLEX 2.5
Straňák
Pavel
Charles University in Prague, UFAL
unrestrictedUse
own
academicUse/nonCommercialUse
download
True
Vallex
nationalFunds
The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.5 has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague.
VALLEX 2.5 provides information on the valency structure (combinatorial potential) of verbs in their particular senses - there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses").
text
lexicon
ces
6460
words
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LC536@@Centrum komputační lingvistiky@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET100300517@@Metody inteligentních systémů a jejich aplikace při dobývání znalostí a zpracování přirozeného jazyka@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET101120503@@Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů@@nationalFunds@@
6460@@words
16102889
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4880-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pala, Karel
Čapek, Tomáš
Zajíčková, Barbora
Bartůšková, Dita
Kulková, Kateřina
Hoffmannová, Petra
Bejček, Eduard
Straňák, Pavel
Hajič, Jan
2011-06-27T14:04:01Z
2011-01-24T09:00:29Z
2011-01-24
http://hdl.handle.net/11858/00-097C-0000-0001-4880-3
A slightly modified version of the Czech Wordnet. This is the version used to annotate "The Lexico-Semantic Annotation of PDT using Czech WordNet": http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic.
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. This version of WordNet was used to annotate word senses in PDT: http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
A more recent version of Czech WordNet is distributed by ELRA: http://catalog.elra.info/product_info.php?products_id=1089
1ET201120505, LM2010013
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
ontology
wordnet
Czech WordNet
Czech WordNet 1.9 PDT
corpus
Czech WordNet 1.9 PDT
Straňák
Pavel
Charles University in Prague, UFAL
unrestrictedUse
CC_BY-NC-SA_3.0
academicUse/nonCommercialUse
download
True
1ET201120505 - Od jazyka ke znalostem a sémantickému webu
nationalFunds
The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic.
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. This version of WordNet was used to annotate word senses in PDT: http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
A more recent version of Czech WordNet is distributed by ELRA: http://catalog.elra.info/product_info.php?products_id=1089
text
ces
23094
words
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Grantová agentura Akademie věd České republiky@@1ET201120505@@Od jazyka ke znalostem a sémantickému webu@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
23094@@words
451431
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-487E-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Straňák, Pavel
Štěpánek, Jan
2011-06-27T13:16:27Z
2009-01-05T00:00:00Z
2009-01-05
http://www.aclweb.org/anthology/W09-1201
http://hdl.handle.net/11858/00-097C-0000-0001-487E-B
Czech trial (example) data for the CoNLL 2009 Shared Task. The data are generated from PDT 2.0. LDC catalog number: LDC2009E32B
MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
conll-st
CoNLL 2009 Shared Task Czech Trial Set
corpus
CoNLL 2009 Shared Task Czech Trial Set
Straňák
Pavel
Charles University in Prague, UFAL
unrestrictedUse
CC_BY-NC-SA_3.0
attribution
download
True
none
ownFunds
Czech trial (example) data for CoNLL 2009 Shared Task.
text
ces
194
sentences
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
194@@sentences
59154
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4909-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Klyueva, Natalia
Bojar, Ondřej
2011-06-28T10:42:32Z
2008-10-02T00:00:00Z
2008-10-02
http://hdl.handle.net/11858/00-097C-0000-0001-4909-7
UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian, and English with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.
All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a huge collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.
FP6-IST-5-034291-STP (EuroMatrix)
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
http://ufal.mff.cuni.cz/umc/cer
multi-language corpus
UMC 0.1: Czech-Russian-English Multilingual Corpus
corpus
UMC 0.1: Czech-Russian-English Multilingual Corpus
Straňák
Pavel
Charles University in Prague, UFAL
unrestrictedUse
CC_BY-NC-ND
academicUse/nonCommercialUse
download
True
EuroMatrix
EU
UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian, and English with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.
All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a huge collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.
text
ces
1800000
words
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
European Union@@FP6-IST-5-034291-STP@@Euromatrix@@euFunds@@
1800000@@words
25537868
1
Czech-Russian|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=umc_01_cs_m
English-Russian|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=umc_01_enru_en_m
Russian-Czech|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=umc_01_ru_m
Russian-English|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=umc_01_enru_ru_m
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B098-5
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Panevová, Jarmila
Hajičová, Eva
Sgall, Petr
Pajas, Petr
Štěpánek, Jan
Havelka, Jiří
Mikulová, Marie
Žabokrtský, Zdeněk
Ševčíková-Razímová, Magda
Urešová, Zdeňka
2011-11-03T21:33:25Z
2006-07-21T00:00:00Z
2006-07-21
LDC2006T01
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (two million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
1ET101120413 (Data a nástroje pro informační systémy); MSM 0021620838 (Moderní metody, struktury a systémy informatiky); 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů); 1P05ME752 (Vícejazyčný valenční a predikátový slovník přirozeného jazyka); LC536 (Centrum komputační lingvistiky)
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
ACA
http://ufal.mff.cuni.cz/pdt2.0/
corpus
Czech
treebank
PDT
Prague Dependency Treebank 2.0 (PDT 2.0)
Pražský závislostní korpus 2.0 (PZK 2.0)
corpus
text
yes
LINDAT / CLARIAH-CZ
Pavel@@Straňák@@stranak@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Grantová agentura Akademie věd České republiky@@1ET101120413@@Data a nástroje pro informační systémy@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET101120503@@Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@1P05ME752@@Vícejazyčný valenční a predikátový slovník přirozeného jazyka@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LC536@@Centrum komputační lingvistiky@@nationalFunds@@
2000000@@words
281012478
8
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B43E-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Panevová, Jarmila
Sgall, Petr
Pajas, Petr
Štěpánek, Jan
Havelka, Jiří
Mikulová, Marie
Žabokrtský, Zdeněk
Ševčíková-Razímová, Magda
2011-11-04T15:03:18Z
2006-06-21T00:00:00Z
2006-06-21
http://hdl.handle.net/11858/00-097C-0000-0001-B43E-6
A small subset of PDT 2.0 made available under a permissive license.
Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current Computational Linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
* Ministry of Education of the Czech Republic projects Nos. VS96151, LN00A063, 1P05ME752, MSM0021620838 and LC536,
* Grant Agency of the Czech Republic grants Nos. 405/96/0198, 405/96/K214 and 405/03/0913,
* research funds of the Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic,
* Grant Agency of the Czech Academy of Sciences, Prague, Czech Republic, projects Nos. 1ET101120503, 1ET101120413 and 1ET201120505,
* Grant Agency of Charles University grants Nos. 489/04, 350/05, 352/05 and 375/05,
* the U.S. NSF Grant #IIS9732388.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch03.html#a-data-sample
treebank
dependency
PDT
Prague Dependency Treebank 2.0 - sample data
corpus
Prague Dependency Treebank 2.0 - sample data
Straňák
Pavel
Charles University in Prague, UFAL
unrestrictedUse
CC_BY
academicUse/nonCommercialUse
download
True
nationalFunds
A small subset of PDT 2.0 made available under a permissive license.
text
ces
549.2
kb
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@VS96151@@Laboratoř počítačového zpracování jazykových dat@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LN00A063@@Centrum komputační lingvistiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@1P05ME752@@Vícejazyčný valenční a predikátový slovník přirozeného jazyka@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LC536@@Centrum komputační lingvistiky@@nationalFunds@@
Grantová agentura České republiky@@GA405/96/0198@@Formální reprezentace jazykových struktur@@nationalFunds@@
Grantová agentura České republiky@@GA405/96/K214@@Čeština ve věku počítačů@@nationalFunds@@
Grantová agentura České republiky@@GA405/03/0913@@Velké jazykové korpusy a jejich automatická analýza@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET101120503@@Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET101120413@@Data a nástroje pro informační systémy@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET201120505@@Od jazyka ke znalostem a sémantickému webu@@nationalFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 489/2004@@Tektogramatická reprezentace angličtiny - aplikace funkčního generativního popisu (FGP) na hloubkovou syntax cizích jazyků v PZK@@nationalFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 350/2005@@Faktory koherence textu a jejich zpracování v syntakticky anotovaném korpusu textů@@nationalFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 352/2005@@Pražský závislostní korpus: Analýza vybraných jevů z české funkční onomatologie a syntaxe@@nationalFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 375/2005@@Automatická hloubková analýza mluvené češtiny: od akustického signálu k významu@@nationalFunds@@
National Science Foundation (USA)@@NSF IIS-9732388@@Data preparation for Workshop 1998, JHU, Baltimore, MD, USA@@other@@
549.2@@kb
549252
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4914-D
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Pajas, Petr
Mareček, David
Mikulová, Marie
Urešová, Zdeňka
Podveský, Petr
2011-06-28T11:19:19Z
2009-11-02T10:40:55Z
2009-11-02T10:40:55Z
http://hdl.handle.net/11858/00-097C-0000-0001-4914-D
The first edition of a speech corpus with a speech reconstruction layer (edited transcript).
The project of speech reconstruction of Czech and English was started at UFAL together with the PIRE project in 2005 and has gradually grown from ideas to a (first) annotation specification, annotation software, and actual annotation. It is part of the Prague Dependency Treebank family of annotated corpus resources and tools, to which it adds the spoken language layer(s).
LC536; MSM0021620838; IST-034434; ME838
ces
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
PDTSL
https://lindat.mff.cuni.cz/repository/xmlui/page/licence-pdtsl
ACA
http://ufal.mff.cuni.cz/pdtsl
corpus
spoken language
Prague Dependency Treebank of Spoken Language (PDTSL) 0.5
Pražský závislostní korpus mluvené řeči 0.5
corpus
Prague Dependency Treebank of Spoken Language (PDTSL) 0.5
Straňák
Pavel
Charles University in Prague, UFAL
restrictedUse
CC
academicUse/nonCommercialUse
download
True
Center for Computational Linguistics
nationalFunds
First edition of speech corpus with speech reconstruction layer (edited transcript).
audio
ces
120000
words
stranak@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LC536@@Centrum komputační lingvistiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@ME 838@@Reprezentace významu a automatické porozumění přirozenému jazyku@@nationalFunds@@
European Union@@FP6-IST-5-034434-IP@@Companions IP@@euFunds@@
120000@@words
6912312
2
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-C6D1-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Straňák, Pavel
Štěpánek, Jan
2011-11-08T21:34:04Z
2009-01-19T00:00:00Z
2009-01-19
LDC2009E34B, LDC2009E35B
http://www.aclweb.org/anthology/W09-1201
http://hdl.handle.net/11858/00-097C-0000-0001-C6D1-9
Czech data - both train and test+eval sets, as well as the valency dictionary - for the CoNLL 2009 Shared Task. Documentation is included. The data are generated from PDT 2.0. LDC catalog number: LDC2009E34B
MSM 0021620838 (http://ufal.mff.cuni.cz:8080/bib/?section=grant&id=116488695895567&mode=view)
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
conll-st
treebank
CoNLL 2009 Shared Task - Czech Data
corpus
text
yes
LINDAT / CLARIAH-CZ
Pavel@@Straňák@@stranak@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
15565682
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F3-0
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pajas, Petr
2011-06-28T09:38:08Z
2009-11-02T09:51:39Z
2009-11-02T09:51:39Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F3-0
XSH is a powerful command-line tool for querying, processing and editing XML documents. It features a shell-like interface with auto-completion for comfortable interactive work, but can also be used for off-line (batch) processing of XML data.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Artistic License (Perl) 1.0
http://opensource.org/licenses/Artistic-Perl-1.0
http://xsh.sourceforge.net
XML processing
command-line
XSH
toolService
no
LINDAT / CLARIAH-CZ
0
0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F7-8
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pajas, Petr
2011-06-28T09:39:07Z
2009-10-13T13:11:11Z
2009-10-13T13:11:11Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F7-8
Tree Editor
TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Among other projects, it was used as the main annotation tool for syntactical and tectogrammatical annotations in The Prague Dependency Treebank, as well as for decision-tree based morphological annotation of The Prague Arabic Dependency Treebank.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
http://ufal.mff.cuni.cz/tred/
annotation
tree
editor
XML
PML
TrEd
toolService
yes
LINDAT / CLARIAH-CZ
96617072
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F8-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pajas, Petr
Mareček, David
2011-06-28T09:39:18Z
2009-11-02T09:33:08Z
2009-11-02T09:33:08Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F8-6
MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that can be interconnected by links. MEd can also be used for other purposes, such as word-to-word alignment of parallel corpora.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
annotation tool
MEd
toolService
yes
LINDAT / CLARIAH-CZ
68114
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F9-4
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Krbec, Pavel
2011-06-28T09:39:30Z
2009-11-02T09:25:18Z
2009-11-02T09:25:18Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F9-4
The HMM-based Tagger is software for morphological disambiguation (tagging) of Czech texts. The algorithm is statistical, based on Hidden Markov Models.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Tagging/MM_tagger/index.html
tagger
morphology
HMM tagger
toolService
yes
LINDAT / CLARIAH-CZ
2322728
1
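The HMM tagger record above says only that the algorithm is statistical and based on Hidden Markov Models. As an illustration of that general idea, here is a minimal Viterbi decoding sketch for a toy bigram HMM. It assumes a non-empty input sentence; the two-tag tagset, the probabilities, and the function name `viterbi` are all hypothetical, and the sketch does not reproduce the tagger's actual model, tagset, or smoothing.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for a non-empty word list."""

    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at position i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(w, 0)), prev)
        best.append(column)

    # Backtrack from the best final tag through the stored predecessors.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model with invented numbers: N = noun, V = verb.
tags = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"psi": 0.9, "štěkají": 0.1}, "V": {"psi": 0.1, "štěkají": 0.9}}
result = viterbi(["psi", "štěkají"], tags, start_p, trans_p, emit_p)
```

Working in log space avoids numeric underflow on longer sentences, which is why the sketch sums log-probabilities instead of multiplying raw probabilities.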
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FA-2
2017-04-10T13:34:17Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F2-1
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pajas, Petr
2011-06-28T09:37:08Z
2010-01-13T15:06:26Z
2010-01-13T15:06:26Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
Modifications to DSpace made by Petr Pajas to support the pidconsortium.eu PID handle system instead of the default handle.net system used by DSpace.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
BSD 2-Clause "Simplified" or "FreeBSD" license
http://opensource.org/licenses/BSD-2-Clause
PUB
http://svn.ms.mff.cuni.cz/redmine/projects/dspace-modifications
DSpace
handle
EPIC
DSpace modifications for use of EPIC handles
toolService
yes
LINDAT / CLARIAH-CZ
27994
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FB-F
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Kučera, Ondřej
2011-06-28T09:39:55Z
2009-11-02T09:42:50Z
2009-11-02T09:42:50Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F
The STYX system is an electronic exercise book for practising Czech morphology and syntax, consisting of more than 11,000 sentences.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 3
http://opensource.org/licenses/GPL-3.0
PUB
http://ufal.mff.cuni.cz/styx/
education
morphology
syntax
STYX
toolService
yes
LINDAT / CLARIAH-CZ
42477544
7
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FC-D
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Klusáček, David
2011-06-28T09:40:13Z
2009-11-02T09:34:32Z
2009-11-02T09:34:32Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FC-D
MMI_clustering is a set of command line tools implementing Mercer's maximum mutual information-based clustering technique.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
http://ufal.mff.cuni.cz/tools/mmic
clustering
MMI_clustering
toolService
yes
LINDAT / CLARIAH-CZ
192029
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FD-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Marek, Michal
2011-06-28T09:40:25Z
2009-11-02T09:48:39Z
2009-11-02T09:48:39Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FD-B
Victor is a web page cleaning tool. It aims to remove menus, ads, footers, headers, etc. from HTML web pages so that only the main page content remains. Victor is based on a conditional random fields algorithm.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
http://ufal.mff.cuni.cz/victor/
html cleaning
Victor
toolService
yes
LINDAT / CLARIAH-CZ
1877749
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FE-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Raab, Jan
2011-06-28T09:40:39Z
2009-11-02T09:36:29Z
2009-11-02T09:36:29Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
The MORČE tagger is software for morphological disambiguation (part-of-speech tagging) of Czech text. The algorithm is statistical, based on the so-called "Averaged Perceptron" idea published by Michael Collins in 2002.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
tagger
morphology
Morče
toolService
yes
LINDAT / CLARIAH-CZ
55586532
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48FF-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Spousta, Miroslav
2011-06-28T09:40:54Z
2009-11-02T09:50:15Z
2009-11-02T09:50:15Z
http://hdl.handle.net/11858/00-097C-0000-0001-48FF-7
Victoria is an online HTML web page annotation tool for selecting text on web pages. It can be used to mark important or interesting parts of web pages for further processing.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
http://ufal.mff.cuni.cz/victor/
web page processing
Victoria
toolService
yes
LINDAT / CLARIAH-CZ
994346
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4900-A2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Kolovratník, David
2011-06-28T09:41:07Z
2009-11-02T09:37:56Z
2009-11-02T09:37:56Z
http://hdl.handle.net/11858/00-097C-0000-0001-4900-A
The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
ACA
http://ufal.mff.cuni.cz/morfo
morphological analysis
MORFO
toolService
yes
LINDAT / CLARIAH-CZ
10319099
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4901-82017-04-10T13:32:37Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4902-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Hana, Jiří
2011-06-28T09:41:34Z
2009-11-02T09:27:18Z
2009-11-02T09:27:18Z
http://hdl.handle.net/11858/00-097C-0000-0001-4902-6
Lexical Annotation Workbench (LAW) is an integrated environment for morphological annotation. It supports simple morphological annotation (assigning a lemma and tag to a word), integration and comparison of different annotations of the same text, searching for a particular word or tag, etc.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
http://purl.org/net/jh/law
language annotation
LAW
toolService
yes
LINDAT / CLARIAH-CZ
3461273
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4904-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
2011-06-28T09:42:24Z
2009-11-02T09:22:59Z
2009-11-02T09:22:59Z
http://hdl.handle.net/11858/00-097C-0000-0001-4904-2
The Feature-based (exponential model) Tagger is a fast implementation of the Czech tagger developed at UFAL and described in the PDT 1.0 documentation (Czech Language Tagging page). To get the best possible results, the tagger requires preprocessing by a Czech morphological module with very high coverage. This module covers a superset of the Czech "FM" morphology. Both the morphological module and the tagger are supplied as binary executables, together with all necessary precompiled Czech data. Input must be in the ISO Latin 2 (iso-8859-2) encoding and follow the csts.dtd definition; output is produced in the same encoding and format. (As is the case with many of the tools provided with PDT 1.0, both executables also accept, and then produce, a "simplified SGML", which is not real, valid SGML but simply contains at least the tags for words, punctuation, and sentence breaks, one item per line.)
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
PDT 2.0 License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pdt2
ACA
http://ufal.mff.cuni.cz/pdt2.0/doc/tools/machine-annotation/index.html#a-ma-tagging
morphology
tagger
Feature-based tagger
toolService
yes
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/morph/
6843519
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-4905-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Mírovský, Jiří
Ondruška, Roman
2011-06-28T09:42:37Z
2009-11-02T09:41:19Z
2009-11-02T09:41:19Z
http://hdl.handle.net/11858/00-097C-0000-0001-4905-F
Netgraph is a graphically oriented client-server application for searching in linguistically annotated treebanks. The query language of Netgraph is simple and intuitive, yet powerful enough for treebanks with complex annotation schemes. The primary purpose of Netgraph is searching in the Prague Dependency Treebank 2.0; nevertheless, it can be used for other treebanks as well.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public Licence, version 3
http://opensource.org/licenses/GPL-3.0
PUB
http://quest.ms.mff.cuni.cz/netgraph/
search
treebank
Netgraph
toolService
yes
LINDAT / CLARIAH-CZ
2416955
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-48F4-E2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Smrž, Otakar
Bielický, Viktor
Buckwalter, Tim
2011-06-28T09:38:24Z
2009-11-02T09:19:05Z
2009-11-02T09:19:05Z
http://hdl.handle.net/11858/00-097C-0000-0001-48F4-E
ElixirFM is a high-level implementation of Functional Arabic Morphology documented at http://elixir-fm.wiki.sourceforge.net/. The core of ElixirFM is written in Haskell, while interfaces in Perl support lexicon editing and other interactions.
ara
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://opensource.org/licenses/GPL-3.0
http://github.com/otakar-smrz/elixir-fm
Arabic morphology
ElixirFM
ElixirFM
toolService
no
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/elixirfm/demo.php
0
0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-B08B-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Bejček, Eduard
Klyueva, Natalia
Straňák, Pavel
Šidák, Pavel
Šťastná, Eva
Vimmrová, Pavlína
Hajič, Jan
2011-11-02T19:50:32Z
2011-11-02T19:50:32Z
2010
http://hdl.handle.net/11858/00-097C-0000-0001-B08B-3
This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0.
grant 1ET201120505 of the Academy of Sciences of the Czech Republic and grant MSM0021620838 of the Ministry of Youth, Education and Sport of the Czech Republic
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
MWE
multiword expressions
idiom
phraseme
named entity
Multiword expressions in the Prague Dependency Treebank 2.0
corpus
text
yes
LINDAT / CLARIAH-CZ
Pavel@@Straňák@@stranak@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Grantová agentura Akademie věd České republiky@@1ET201120505@@Od jazyka ke znalostem a sémantickému webu@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
3563780
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CC1E-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Bojar, Ondřej
Straňák, Pavel
Zeman, Daniel
2011-11-23T15:47:18Z
2011-11-23T15:47:18Z
2011-11-23
UMC004
http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B
A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. Size: 18M sentences, 308M tokens.
FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+)
hin
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
http://hdl.handle.net/11858/00-097C-0000-0023-6260-A
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
news
web texts
Hindi Web Texts
corpus
text
308000000
token
yes
LINDAT / CLARIAH-CZ
Pavel@@Straňák@@stranak@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
308000000@@tokens
18000000@@sentences
1440728552
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-BD17-12021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Bojar, Ondřej
Straňák, Pavel
Zeman, Daniel
Jain, Gaurav
Damani, Om Prakesh
2011-11-07T16:18:29Z
2011-11-07T16:18:29Z
2010-05-11
UMC002
http://hdl.handle.net/11858/00-097C-0000-0001-BD17-1
English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus.
FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+)
hin
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
http://hdl.handle.net/11858/00-097C-0000-0023-625F-0
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
English-Hindi parallel corpus
parallel corpus
English-Hindi Parallel Corpus
corpus
text
yes
LINDAT / CLARIAH-CZ
Pavel@@Straňák@@stranak@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
12749739
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCD-02014-05-13T09:21:27Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCA1-02022-11-25T16:00:44Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Šmídl, Luboš
2011-12-15T13:51:07Z
2011-12-15T13:51:07Z
2011-12-15
ZCU_CZ_ATC
http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0
The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours) but we plan to search for additional data next year. The audio data format is 8 kHz, 16-bit PCM, mono.
Technology Agency of the Czech Republic, project No. TA01030476.
eng
University of West Bohemia, Department of Cybernetics
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
speech corpus
acoustic model
Air Traffic Control Communication
corpus
audio
yes
LINDAT / CLARIAH-CZ
Pavel@@Ircing@@ircing@kky.zcu.cz@@University of West Bohemia, Department of Cybernetics
Technologická agentura České republiky@@TA01030476@@Inteligentní technologie pro zvýšení bezpečnosti letového provozu@@nationalFunds@@
20@@hours
584245376
1
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=airtraffic_en_w
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCF-C2022-04-26T13:51:47Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
(:unav) Unknown author
2011-12-15T16:46:56Z
2011-12-15T16:46:56Z
2011-12-15
http://hdl.handle.net/11858/00-097C-0000-0001-CCCF-C
First version of the very large Czech corpus Czes created with a new set of tools. It comprises 465,102,710 tokens.
Lexical Computing Ltd.
ces
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
Czech corpus large
czes
corpus
text
yes
LINDAT / CLARIAH-CZ
Václav@@Němčík@@xnemcik@fi.muni.cz@@Masaryk University, NLP Centre
Lexical Computing Ltd.@@@@@@@@
465102710@@tokens
1402268778
1
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=czes_cz_w
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCCE-E2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Rambousek, Adam
2011-12-15T15:19:41Z
2011-12-15T15:19:41Z
2011-12-15
http://hdl.handle.net/11858/00-097C-0000-0001-CCCE-E
Integrated lexicographic platform for Russian.
rus
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
lexicography platform
russian
web dictionary
Integrated lexicographic platform for Russian
toolService
yes
LINDAT / CLARIAH-CZ
18354328
4
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCD2-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Larasati, Septina Dian
2011-12-16T08:11:47Z
2011-12-16T08:11:47Z
2011-12-16
http://hdl.handle.net/11858/00-097C-0000-0001-CCD2-2
Raw Text
ind
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
Indonesian-English parallel corpus
parallel corpus
IDENTICv1.0-raw
corpus
text
yes
LINDAT / CLARIAH-CZ
Septina Dian@@Larasati@@septina.larasati@gmail.com@@Charles University in Prague, UFAL
2698146
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDB-02022-04-26T13:52:15Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
(:unav) Unknown author
2011-12-16T09:34:39Z
2011-12-16T09:34:39Z
2011-12-16
http://hdl.handle.net/11858/00-097C-0000-0001-CCDB-0
Slovak large web corpus skTenTen, comprising 876,003,720 tokens.
Lexical Computing Ltd.
slk
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
Slovak large corpus
skTenTen
corpus
text
yes
LINDAT / CLARIAH-CZ
Václav@@Němčík@@xnemcik@fi.muni.cz@@Masaryk University, NLP Centre
Lexical Computing Ltd.@@@@@@@@
876003720@@tokens
1847547412
1
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=sktenten_2011_12_16
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-CCDF-82022-04-26T13:50:51Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
(:unav) Unknown author
2011-12-16T09:58:26Z
2011-12-16T09:58:26Z
2011-12-16
http://hdl.handle.net/11858/00-097C-0000-0001-CCDF-8
Very large English web corpus enTenTen, comprising 3,268,798,627 tokens.
Lexical Computing Ltd.
eng
Masaryk University, NLP Centre
NLP Centre Web Corpus License
https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC
ACA
English large corpus
enTenTen
corpus
text
yes
LINDAT / CLARIAH-CZ
Václav@@Němčík@@xnemcik@fi.muni.cz@@Masaryk University, NLP Centre
Lexical Computing Ltd.@@@@@@@@
3268798627@@tokens
7467307355
1
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=ententen_2011_12_16_en_w
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0001-D709-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Grác, Marek
2011-12-16T15:03:23Z
2011-12-16T15:03:23Z
2011-12-16
http://hdl.handle.net/11858/00-097C-0000-0001-D709-F
Czech corpus annotated for NP and clause chunks by 3-11 annotators (with average inter-annotator agreement at 88%). It consists of 10,000 sentences.
ces
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
interannotator agreement
corpus
chunks
phrases
clauses
BushBank
corpus
text
yes
LINDAT / CLARIAH-CZ
Václav@@Němčík@@xnemcik@fi.muni.cz@@Masaryk University, NLP Centre
10000@@sentences
88173871
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BCCF-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Nedoluzhko, Anna
Mírovský, Jiří
2012-02-20T13:56:58Z
2012-02-20T13:56:58Z
2012-02-20
http://hdl.handle.net/11858/00-097C-0000-0005-BCCF-3
Annotation of extended textual coreference and bridging relations in the Prague Dependency Treebank 2.0
project LINDAT-Clarin LM2010013, grant GAČR GA405/09/0729
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
bridging anaphora
textual coreference
PDT
Extended Textual Coreference and Bridging Relations in PDT 2.0
corpus
text
yes
LINDAT / CLARIAH-CZ
Jiří@@Mírovský@@mirovsky@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Grantová agentura České republiky@@GA405/09/0729@@Od struktury věty k textovým vztahům@@nationalFunds@@
77790580
2
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF85-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Larasati, Septina Dian
2012-03-13T14:34:36Z
2012-03-13T14:34:36Z
2012-03-13
http://hdl.handle.net/11858/00-097C-0000-0005-BF85-F
IDENTIC is an Indonesian-English parallel corpus for research purposes. It is a bilingual corpus in which Indonesian text is paired with English. The aim of this work is to build and provide researchers with a proper Indonesian-English textual data set and to promote research on this language pair. The corpus contains texts from different sources and genres.
The research leading to these results has received funding from the European Commission’s 7th Framework Program under grant agreement no 238405 (CLARA) and by the grant LC536 Centrum Komputacni Lingvistiky of the Czech Ministry of Education.
ind
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/238405
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
Indonesian-English parallel corpus
parallel corpus
IDENTICv1.0
corpus
text
yes
LINDAT / CLARIAH-CZ
Septina Dian@@Larasati@@septina.larasati@gmail.com@@Charles University in Prague, UFAL
European Union@@FP7-238405@@CLARA (Common Language Resources and their Applications)@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/238405
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LC536@@Centrum komputační lingvistiky@@nationalFunds@@
16615187
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-BF95-B2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Cinková, Silvie
Holub, Martin
Rambousek, Adam
Smejkalová, Lenka
2012-03-19T14:07:13Z
2012-03-19T14:07:13Z
2012-03-19
http://hdl.handle.net/11858/00-097C-0000-0005-BF95-B
VPS-30-En is a small lexical resource that contains the following 30 English verbs: access, ally, arrive, breathe, claim, cool, crush, cry, deny, enlarge, enlist, forge, furnish, hail, halt, part, plough, plug, pour, say, smash, smell, steer, submit, swell, tell, throw, trouble, wake and yield. We have created and have been using VPS-30-En to explore the interannotator agreement potential of the Corpus Pattern Analysis. VPS-30-En is a small snapshot of the Pattern Dictionary of English Verbs (Hanks and Pustejovsky, 2005), which we revised (both the entries and the annotated concordances) and enhanced with additional annotations.
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013, and by the Czech Science Foundation under the projects P103/12/G084, P406/2010/0875 and P401/10/0792.
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
http://ufal.mff.cuni.cz/spr/pdev30verbs.html
corpus pattern analysis
clustering
lexical semantics
verbs
VPS-30-En
lexicalConceptualResource
text
lexicon
yes
LINDAT / CLARIAH-CZ
Silvie@@Cinková@@cinkova@ufal.mff.cuni.cz@@Charles University in Prague, UFAL
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Grantová agentura České republiky@@GAP103/12/G084@@Centrum pro multi-modální interpretaci dat velkého rozsahu@@nationalFunds@@
Grantová agentura České republiky@@GAP406/10/0875@@Komputační lingvistika: Explicitní popis jazyka a anotovaná data se zřetelem na češtinu@@nationalFunds@@
Grantová agentura České republiky@@GAP401/10/0792@@Temporální aspekty znalostí a informací@@nationalFunds@@
1594389
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0005-CF9C-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Pražák, Aleš
Šmídl, Luboš
2012-03-28T14:45:25Z
2012-03-28T14:45:25Z
2012-03-28
ZCU_CZ_Parliament
http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently contains 88 hours of speech data, which corresponds roughly to 0.5 million tokens. The annotation process is semi-automatic: we are able to perform speech recognition on the data with high accuracy (over 90%) and then align the resulting automatic transcripts with the speech. The annotator's task is then to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net).
The date of airing of a particular recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If a recording is too long to fit in the broadcasting scheme, it is divided into several parts and aired on consecutive days.
ces
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
speech corpus
acoustic model
speaker identification
speaker verification
Czech Parliament Meetings
corpus
audio
yes
LINDAT / CLARIAH-CZ
Pavel@@Ircing@@ircing@kky.zcu.cz@@University of West Bohemia, Department of Cybernetics
28212817896
37
search|http://lindat.mff.cuni.cz/services/kontext/first_form?corpname=czechparl_2012_03_28_cs_w
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADA-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Galuščáková, Petra
Bojar, Ondřej
2012-05-15T12:36:59Z
2012-05-15T12:36:59Z
2012-05-15
http://hdl.handle.net/11858/00-097C-0000-0006-AADA-9
Test set from the WMT 2011 [1] competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English and is described in [2].
References:
[1] http://www.statmt.org/wmt11/evaluation-task.html
[2] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
The work on this project was supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
slk
ces
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
WMT
test data
Slovak
WMT 2011 Testing Set
corpus
text
yes
LINDAT / CLARIAH-CZ
Petra@@Galuščáková@@galuscakova@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
472910
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADB-72021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Galuščáková, Petra
Bojar, Ondřej
2012-05-15T13:42:49Z
2012-05-15T13:42:49Z
2012-05-15
http://hdl.handle.net/11858/00-097C-0000-0006-AADB-7
Manual classification of errors in Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. The first 50 sentences from the WMT 2010 test set [2] were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups), and MT errors were manually marked and classified. The classification was applied in the comparison of MT systems in [3]. Reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://matrix.statmt.org/test_sets/list
[3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011
This work has been supported by the grants EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
slk
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
machine translation
errors classification
CS-SK translation
Manually Classified Errors in Cs->Sk Translation
corpus
text
yes
LINDAT / CLARIAH-CZ
Petra@@Galuščáková@@galuscakova@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
9371
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADC-52021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Galuščáková, Petra
Bojar, Ondřej
2012-05-15T13:59:24Z
2012-05-15T13:59:24Z
2012-05-15
http://hdl.handle.net/11858/00-097C-0000-0006-AADC-5
Manual classification of errors in English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from the WMT 2011 test set [2] were translated by the 3 MT systems described in [3], and MT errors were manually marked and classified. Reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://www.statmt.org/wmt11/evaluation-task.html
[3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
slk
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
machine translation
errors classification
EN-SK translation
Manually Classified Errors in En->Sk Translation
corpus
text
yes
LINDAT / CLARIAH-CZ
Petra@@Galuščáková@@galuscakova@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
17723
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADD-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Bojar, Ondřej
Galuščáková, Petra
2012-05-15T14:45:32Z
2012-05-15T14:45:32Z
2012-05-15
http://hdl.handle.net/11858/00-097C-0000-0006-AADD-3
Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked the outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from the Acquis corpus and the first 50 sentences from the WMT 2010 test set). The ranking was applied in the comparison of MT systems in [1].
References:
[1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
slk
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
machine translation
evaluation
manual ranking
Manually Ranked Translation Outputs
corpus
text
yes
LINDAT / CLARIAH-CZ
Petra@@Galuščáková@@galuscakova@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
2156841
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AADF-0
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Galuščáková, Petra
Garabík, Radovan
Bojar, Ondřej
2012-05-15T15:54:40Z
2012-05-15T15:54:40Z
2012-05-15
http://hdl.handle.net/11858/00-097C-0000-0006-AADF-0
Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and the downloaded website of the European Commission [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
slk
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
parallel corpus
Czech-Slovak corpus
Czech-Slovak Parallel Corpus
corpus
text
yes
LINDAT / CLARIAH-CZ
Petra@@Galuščáková@@galuscakova@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
5700000@@sentences
1192222551
2
Czech-Slovak|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=czeslo_cs_m
Slovak-Czech|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=czeslo_sk_m
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAE0-A
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Galuščáková, Petra
Garabík, Radovan
Bojar, Ondřej
2012-05-15T16:11:21Z
2012-05-15T16:11:21Z
2012-05-15
http://hdl.handle.net/11858/00-097C-0000-0006-AAE0-A
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and the downloaded website of the European Commission [5]. The corpus is published both in plaintext format and with automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
This work has been supported by the grant EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic)
slk
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
parallel corpus
English-Slovak corpus
English-Slovak Parallel Corpus
corpus
text
yes
LINDAT / CLARIAH-CZ
Petra@@Galuščáková@@galuscakova@ufal.mff.cuni.cz@@Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
1172350203
2
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-AAFE-A
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Kuboň, Vladislav
Homola, Petr
2012-05-22T16:48:19Z
2012-05-22T16:48:19Z
2012-05-22
http://hdl.handle.net/11858/00-097C-0000-0006-AAFE-A
Česílko is a tool enabling fast and efficient translation from one source language into many mutually related target languages.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
http://quest.ms.mff.cuni.cz/cesilko/
machine translation
Czech-Slovak translation
Česílko
toolService
yes
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/cesilko/demo.php
26145345
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-B847-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Spoustová, Johanka
Spousta, Miroslav
2012-06-21T11:53:56Z
2012-06-21T11:53:56Z
2012-06-21
http://hdl.handle.net/11858/00-097C-0000-0006-B847-6
Web corpus of Czech, created in 2011. Contains newspapers+magazines, discussions, blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details.
GA405/09/0278
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
http://creativecommons.org/licenses/by/3.0/
PUB
corpus
Czech
web
CWC2011
corpus
CWC2011
Spoustová
Johanka
Charles University in Prague, UFAL
unrestrictedUse
Creative Commons - Attribution 3.0 Unported (CC BY 3.0)
download
True
#1-Internet as a Language Corpus
#1-National
Web corpus of Czech, created in 2011. Contains newspapers+magazines, discussions, blogs.
text
corpus
ces
2650000000
words
johanka@ucw.cz
yes
LINDAT / CLARIAH-CZ
Grantová agentura České republiky@@GA405/09/0278@@Internet jako jazykový korpus@@nationalFunds@@
2650000000@@words
6074441470
6
basic|https://lindat.mff.cuni.cz/services/kontext/first_form?corpname=cwc_11_cs_w
with syntactic annotation|https://lindat.mff.cuni.cz/services/kontext/first_form?corpname=cwc_parsed_cs_a
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0006-DB11-8
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Bejček, Eduard
Hajič, Jan
Panevová, Jarmila
Mírovský, Jiří
Spoustová, Johanka
Štěpánek, Jan
Straňák, Pavel
Šidák, Pavel
Vimmrová, Pavlína
Šťastná, Eva
Ševčíková, Magda
Smejkalová, Lenka
Homola, Petr
Popelka, Jan
Lopatková, Markéta
Hrabalová, Lucie
Klyueva, Natalia
Žabokrtský, Zdeněk
2012-08-09T17:00:20Z
2012-08-09T17:00:20Z
2011-12-06
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see Documentation). Moreover, new information was added to the data:
Annotation of multiword expressions
Pair/group meaning
Clause segmentation
Ministry of Education of the Czech Republic projects No.:
LM2010013
LC536
MSM0021620838
Grant Agency of the Czech Republic grants No.:
P406/2010/0875
P202/10/1333
P406/10/P193
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0001-B098-5
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/pdt2.5
treebank
multiword expressions
clauses
tectogrammatics
dependency
PDT
Prague Dependency Treebank 2.5
corpus
Prague Dependency Treebank 2.5
Bejček
Eduard
Charles University in Prague, UFAL
unrestrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
download
True
Prague Dependency Treebank 2.5
National
The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see Documentation). Moreover, new information was added to the data:
Annotation of multiword expressions
Pair/group meaning
Clause segmentation
text
corpus
ces
2000000
tokens
bejcek@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LC536@@Centrum komputační lingvistiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
Grantová agentura České republiky@@GAP406/10/0875@@Komputační lingvistika: Explicitní popis jazyka a anotovaná data se zřetelem na češtinu@@nationalFunds@@
Grantová agentura České republiky@@GAP202/10/1333@@NoSCoM: nestandardní výpočetní modely a jejich aplikace ve složitosti, lingvistice a učení@@nationalFunds@@
Grantová agentura České republiky@@GPP406/10/P193@@Nástroje pro revizi a tektogramatickou anotaci českého závislostního korpusu@@nationalFunds@@
2000000@@tokens
709707346
1
search|https://lindat.mff.cuni.cz/services/pmltq/pdt25/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0007-70FD-E
2022-03-14T14:21:37Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Zeman, Daniel
2012-10-25T12:42:49Z
2012-10-25T12:42:49Z
2006-06
http://hdl.handle.net/11858/00-097C-0000-0007-70FD-E
DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset defines a set of features that are encoded by the various tag sets. The set of features should be as universal as possible. It does not need to encode everything that is encoded by any tag set but it should encode all information that people may want to access and/or port from one tag set to another.
New tag sets are attached by writing a driver for them. Once the driver is ready, you can easily convert tags between the new set and any other set for which you also have a driver. This reusability is an obvious advantage over writing a targeted conversion procedure each time you need to convert between a particular pair of tag sets.
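The driver idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two toy tag sets, the feature names, and the Driver class are hypothetical and are not the actual DZ Interset drivers or API (the tool itself is written in Perl). Each driver decodes a tag into a shared feature structure and encodes a feature structure back into the closest tag, so converting between any two tag sets needs only their two drivers.

```python
# Hypothetical toy tag sets; not the real DZ Interset drivers or feature inventory.
TOY_A = {
    "NN": {"pos": "noun", "number": "sing"},
    "NNS": {"pos": "noun", "number": "plur"},
    "VB": {"pos": "verb"},
}
TOY_B = {
    "Ns": {"pos": "noun", "number": "sing"},
    "Np": {"pos": "noun", "number": "plur"},
    "V": {"pos": "verb"},
}


class Driver:
    """Maps the tags of one tag set to shared features and back."""

    def __init__(self, table):
        self.table = table

    def decode(self, tag):
        # Tag -> shared features (copied so callers cannot mutate the table).
        return dict(self.table[tag])

    def encode(self, features):
        # Shared features -> the tag that agrees most and conflicts least.
        def score(item):
            _, tag_features = item
            shared = sum(1 for k, v in tag_features.items() if features.get(k) == v)
            conflicts = sum(
                1 for k, v in tag_features.items() if k in features and features[k] != v
            )
            return shared - conflicts

        return max(self.table.items(), key=score)[0]


DRIVERS = {"toy_a": Driver(TOY_A), "toy_b": Driver(TOY_B)}


def convert(tag, source, target):
    """Convert a tag by going through the shared feature structure."""
    return DRIVERS[target].encode(DRIVERS[source].decode(tag))


print(convert("NNS", "toy_a", "toy_b"))
print(convert("V", "toy_b", "toy_a"))
```

With n tag sets, this design needs n drivers rather than a separate converter for each of the n(n-1) ordered pairs, which is the reusability advantage the description refers to.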
grant MSM 0021620838 of the Ministry of Education of the Czech Republic
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
https://wiki.ufal.ms.mff.cuni.cz/user:zeman:interset
morphology
NLP
Perl
DZ Interset
toolService
DZ Interset
Zeman
Daniel
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
academic-nonCommercialUse
attribution
downloadable
True
#1-Výzkumný záměr
#1-nationalFunds
DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset defines a set of features that are encoded by the various tag sets. The set of features should be as universal as possible. It does not need to encode everything that is encoded by any tag set but it should encode all information that people may want to access and/or port from one tag set to another.
New tag sets are attached by writing a driver for them. Once the driver is ready, you can easily convert tags between the new set and any other set for which you also have a driver. This reusability is an obvious advantage over writing a targeted conversion procedure each time you need to convert between a particular pair of tag sets.
toolService
tool
1
mb
zeman@ufal.mff.cuni.cz
false
yes
LINDAT / CLARIAH-CZ
http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
1@@mb
2203707
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-D259-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Bojar, Ondřej
Zeman, Daniel
Dušek, Ondřej
Břečková, Jana
Farkačová, Hana
Grošpic, Pavel
Kačenová, Kristýna
Knechtová, Eva
Koubová, Anna
Lukavská, Jana
Nováková, Petra
Petrdlíková, Jana
2012-11-13T16:36:01Z
2012-11-13T16:36:01Z
2012-11-13
http://hdl.handle.net/11858/00-097C-0000-0008-D259-7
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.
This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic)
deu
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
reference translation
German-Czech
parallel corpus
Additional German-Czech reference translations of the WMT'11 test set
corpus
Additional German-Czech reference translations of the WMT'11 test set
Dušek
Ondřej
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
True
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.
text
corpus
527
kb
odusek@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Grantová agentura České republiky@@GAP406/11/1499@@Čeština ve věku strojového překladu@@nationalFunds@@
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E11051@@EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User@@nationalFunds@@
527@@kb
540096
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-60D6-1
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Majliš, Martin
2013-06-25T13:21:15Z
2013-06-25T13:21:15Z
2011-12-20
http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
A tool used to build multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies their language, etc.
A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
web data
wikipedia
corpus creation
W2C – Web to Corpus – tool
toolService
W2C – Web to Corpus – tool
Popel
Martin
Charles University in Prague, UFAL
restrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
attribution
shareAlike
downloadable
True
A tool used to build multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies their language, etc.
toolService
suiteOfTools
popel@ufal.mff.cuni.cz
false
yes
LINDAT / CLARIAH-CZ
750549
2
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0008-E130-A
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Poláková, Lucie
Jínová, Pavlína
Zikánová, Šárka
Hajičová, Eva
Mírovský, Jiří
Nedoluzhko, Anna
Rysová, Magdaléna
Pavlíková, Veronika
Zdeňková, Jana
Pergler, Jiří
Ocelák, Radek
2012-11-14T08:58:57Z
2012-11-14T08:58:57Z
2012-11-14
PDiT 1.0
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5. It represents a new manually annotated layer of language description, above the existing layers of the PDT, and it portrays linguistic phenomena from the perspective of discourse structure and coherence.
GACR P406/12/0658, GACR P406/2010/0875, GACR 405/09/0729, Ministry of Education ME10018, Ministry of Education LM2010013
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/discourse/
discourse
treebank
annotation
Prague Discourse Treebank 1.0
corpus
Prague Discourse Treebank 1.0
Mírovský
Jiří
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
CD-ROM
downloadable
True
#1-From the structure of a sentence to textual relationships
#2-Computational Linguistics: Explicit description of language and annotated data focused on Czech
#3-Coreference, discourse relations and information structure in a contrastive perspective
#1-nationalFunds
#2-nationalFunds
#3-nationalFunds
Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5. It represents a new layer of manual annotation, above the existing layers of the PDT, and it portrays linguistic phenomena from the perspective of discourse structure and coherence.
text
corpus
49431
sentences
mirovsky@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
http://ufal.mff.cuni.cz/discourse/data.php
Grantová agentura České republiky@@GAP406/12/0658@@Koreference, diskurs a aktuální členění v kontrastivním pohledu@@nationalFunds@@
Grantová agentura České republiky@@GAP406/10/0875@@Komputační lingvistika: Explicitní popis jazyka a anotovaná data se zřetelem na češtinu@@nationalFunds@@
Grantová agentura České republiky@@GA405/09/0729@@Od struktury věty k textovým vztahům@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@ME10018@@K počítačové analýze struktury textu@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
49431@@sentences
100229403
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2112-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Šebesta, Karel
Bedřichová, Zuzanna
Šormová, Kateřina
Štindlová, Barbora
Hrdlička, Milan
Hrdličková, Tereza
Hana, Jiří
Rosen, Alexandr
Petkevič, Vladimír
Jelínek, Tomáš
Škodová, Svatava
Poláčková, Marie
Janeš, Petr
Lundáková, Kateřina
Skoumalová, Hana
Šťastný, Klement
Sládek, Šimon
Pierscieniak, Piotr
2012-12-12T11:24:11Z
2012-12-12T11:24:11Z
2012-12-12
http://hdl.handle.net/11858/00-097C-0000-000C-2112-B
Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora)
ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
ces
Charles University in Prague, ÚČJTK
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
http://utkl.ff.cuni.cz/learncorp/
Czech as a foreign language
Czech language acquisition corpora
non-native speakers
AKCES
second language acquisition
AKCES 3
corpus
AKCES 3
Šebesta
KS
Charles University in Prague, ÚČJTK
restrictedUse
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
academic-nonCommercialUse
webExecutable
Inovace vzdělávání v oboru čeština jako druhý jazyk; Jazyk jako lidská činnost, její produkt a faktor; Lingvistika
nationalFunds
Corpus AKCES 3 includes texts written in Czech by non-native speakers (AKCES/CLAC - Czech Language Acquisition Corpora)
text
corpus
11.32
mb
sebesta@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@CZ.1.07/2.2.00/07.0259@@Innovation in Education in the Field of Czech as a Second Language@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620825@@Jazyk jako lidská činnost, její produkt a faktor@@nationalFunds@@
Univerzita Karlova v Praze@@P10 – Lingvistika@@Program rozvoje vědních oblastí na Univerzitě Karlově P10 – Lingvistika, modul Osvojování a vývoj jazykové a komunikační kompetence u populace ČR, řešeno od r. 2012@@nationalFunds@@
11.32@@mb
17846670
4
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67C-5
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Richter, Michal
2013-02-02T00:16:12Z
2013-02-02T00:16:12Z
2013-02-02
http://hdl.handle.net/11858/00-097C-0000-000D-F67C-5
Statistical spell- and (occasional) grammar-checker. There are three versions: a Unix command-line utility, an OS X SpellServer with a System Service that integrates with native OS X GUI applications, and a web service run by LINDAT-CLARIN that can be used either through a web form in a browser or by web applications through an API.
The LINDAT-CLARIN project (LM2010013), fully supported by the Ministry of Education, Youth and Sports of the Czech Republic under the programme LM of "Large Infrastructures"
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11234/1-1469
BSD 2-Clause "Simplified" or "FreeBSD" license
http://opensource.org/licenses/BSD-2-Clause
PUB
https://redmine.ms.mff.cuni.cz/projects/korektor
grammar checker
spellchecker
Korektor
toolService
Korektor
Richter
Michal
Charles University in Prague, UFAL
restrictedUse
BSD 2-Clause "Simplified" or "FreeBSD" license
webExecutable
downloadable
accessibleThroughInterface
True
#1-LINDAT-CLARIN project (LM2010013)
#1-nationalFunds
Statistical spell- and (occasional) grammar-checker. There are three versions: a Unix command-line utility, an OS X SpellServer with a System Service that integrates with native OS X GUI applications, and a web service run by LINDAT-CLARIN that can be used either through a web form in a browser or by web applications through an API.
toolService
tool
70
mb
stranak@ufal.mff.cuni.cz
true
yes
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/korektor
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
70@@mb
574411350
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000C-2293-0
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Šebesta, Karel
Bedřichová, Zuzanna
Štindlová, Barbora
Hrdlička, Milan
Hrdličková, Tereza
Hana, Jiří
Rosen, Alexandr
Petkevič, Vladimír
Jelínek, Tomáš
Škodová, Svatava
Janeš, Petr
Lundáková, Kateřina
Skoumalová, Hana
Šťastný, Klement
Sládek, Šimon
2012-12-12T11:45:49Z
2012-12-12T11:45:49Z
2012-12-12
http://hdl.handle.net/11858/00-097C-0000-000C-2293-0
Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora)
ESF (OPVK CZ.1.07/2.2.00/07.0259), MŠMT (MSM0021620825), UK (P10)
ces
Charles University
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
http://utkl.ff.cuni.cz/learncorp/
language of children
Czech language acquisition
adolescents
AKCES
AKCES 4
corpus
AKCES 4
Šebesta
KS
Charles University in Prague, ÚČJTK
restrictedUse
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
academic-nonCommercialUse
attribution
noDerivatives
webExecutable
Inovace vzdělávání v oboru čeština jako druhý jazyk; Jazyk jako lidská činnost, její produkt a faktor; Lingvistika
nationalFunds
Corpus AKCES 4 includes texts written in Czech by youth growing up in locations at risk of social exclusion (AKCES/CLAC - Czech Language Acquisition Corpora)
text
corpus
4.502
mb
sebesta@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@CZ.1.07/2.2.00/07.0259@@Innovation in Education in the Field of Czech as a Second Language@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620825@@Jazyk jako lidská činnost, její produkt a faktor@@nationalFunds@@
Univerzita Karlova v Praze@@P10 – Lingvistika@@Program rozvoje vědních oblastí na Univerzitě Karlově P10 – Lingvistika, modul Osvojování a vývoj jazykové a komunikační kompetence u populace ČR, řešeno od r. 2012@@nationalFunds@@
4.502@@mb
10687872
4
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC91-2
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Ircing, Pavel
2013-01-01T14:55:41Z
2013-01-01T14:55:41Z
2013-01-01
ZCU_CZ_ ebu_ContentGenreCS_CZ
http://hdl.handle.net/11858/00-097C-0000-000D-EC91-2
The EBUContentGenre is a thesaurus containing the hierarchical description of various genres utilized in the TV broadcasting industry. This thesaurus is a part of a complex metadata specification called EBUCore intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of the European Broadcasting Union, the largest professional association of broadcasters in the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and consequent development of systems for automatic cataloguing (topic/genre detection).
Technology Agency of the Czech Republic, project No. TA01011264
ces
eng
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
thesaurus
metadata annotation
topic detection
Czech translation of the EBUContentGenre thesaurus
lexicalConceptualResource
Czech translation of the EBUContentGenre thesaurus
Ircing
Pavel
University of West Bohemia
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
downloadable
True
Eliminace jazykových bariér handicapovaných diváků České televize II
nationalFunds
The EBUContentGenre is a thesaurus containing the hierarchical description of various genres utilized in the TV broadcasting industry. This thesaurus is a part of a complex metadata specification called EBUCore intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of the European Broadcasting Union, the largest professional association of broadcasters in the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and consequent development of systems for automatic cataloguing (topic/genre detection).
text
lexicalConceptualResource
thesaurus
1266
keywords
ircing@kky.zcu.cz
yes
LINDAT / CLARIAH-CZ
Technologická agentura České republiky@@TA01011264@@Eliminace jazykových bariér handicapovaných diváků České televize II@@nationalFunds@@
1266@@keywords
1174002
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC92-F
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Šmídl, Luboš
2013-01-01T14:56:06Z
2013-01-01T14:56:06Z
2013-01-01
ZCU_CZ_ ATCC-LM4ASR
http://hdl.handle.net/11858/00-097C-0000-000D-EC92-F
The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0).
Technology Agency of the Czech Republic, project No. TA01030476
eng
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
pronunciation lexicon
n-gram counts
language model
ATCC: Pronunciation lexicon and n-gram counts for ASR module
lexicalConceptualResource
ATCC: Pronunciation lexicon and n-gram counts for ASR module
Šmídl
Luboš
University of West Bohemia
restrictedUse
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
academic-nonCommercialUse
attribution
downloadable
True
Inteligentní technologie pro zvýšení bezpečnosti letového provozu
nationalFunds
The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used for constructing a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0).
text
lexicalConceptualResource
other
236500
other
ircing@kky.zcu.cz
yes
LINDAT / CLARIAH-CZ
Technologická agentura České republiky@@TA01030476@@Inteligentní technologie pro zvýšení bezpečnosti letového provozu@@nationalFunds@@
236500@@other
7896750
7
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-EC98-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Šmídl, Luboš
Pražák, Aleš
2013-01-04T13:24:56Z
2013-01-04T13:24:56Z
2013-01-04
ZCU_CZ_OVM
http://hdl.handle.net/11858/00-097C-0000-000D-EC98-3
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to the corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net)
ces
University of West Bohemia, Department of Cybernetics
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
speech corpus
acoustic model
speaker identification
speaker verification
OVM – Otázky Václava Moravce
corpus
OVM – Otázky Václava Moravce
Ircing
Pavel
University of West Bohemia
restrictedUse
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
academic-nonCommercialUse
attribution
downloadable
True
ownFunds
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to the corresponding segments. The resulting corpus is suitable both for acoustic model training for ASR purposes and for training speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in the XML-based standard Transcriber format (http://trans.sourceforge.net)
audio
corpus
35
hours
ircing@kky.zcu.cz
yes
LINDAT / CLARIAH-CZ
35@@hours
12118689370
32
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=ovm_cs_w
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F696-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pomikálek, Jan
2013-02-05T12:04:53Z
2013-02-05T12:04:53Z
2011
http://hdl.handle.net/11858/00-097C-0000-000D-F696-9
jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/justext/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this piece of software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm is able to remove most non-grammatical sentences from a web page, such as navigation, advertisements, tables, short notes and so on. It has been shown to outperform, or at least keep up with, its competitors (according to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether.
PRESEMT, Lexical Computing Ltd
eng
Masaryk University, NLP Centre
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
http://code.google.com/p/justext/
boilerplate
web documents
text cleaning
boilerplate removal
text corpora
jusText
toolService
jusText
Pomikálek
Jan
Natural Language Processing Centre, Faculty of Informatics Masaryk University
restrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
attribution
shareAlike
downloadable
True
PRESEMT
euFunds
jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/justext/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this piece of software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm is able to remove most non-grammatical sentences from a web page, such as navigation, advertisements, tables, short notes and so on. It has been shown to outperform, or at least keep up with, its competitors (according to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether.
toolService
tool
732
kb
jan.pomikalek@gmail.com
false
yes
LINDAT / CLARIAH-CZ
PRESEMT@@@@@@@@
Lexical Computing Ltd.@@@@@@@@
732@@kb
750175
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67A-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pomikálek, Jan
2013-02-01T16:32:21Z
2013-02-01T16:32:21Z
2011
http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9
Chared is a software tool which can detect the character encoding of a text document provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than other character encoding detection tools with no language constraints. This is an important advantage allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages consisting of 70 billion tokens altogether. Chared is open-source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011. pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
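The idea of comparing byte trigram vectors can be illustrated with a minimal Python sketch. This is not chared's actual code or API: the function names, the use of cosine similarity as the similarity measure, and the toy "models" built from a single Czech sentence are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def byte_trigram_vector(data: bytes) -> Counter:
    """Count every overlapping 3-byte sequence in the input."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (an assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_encoding(document: bytes, models: dict) -> str:
    """Return the encoding whose model vector is most similar to the document."""
    doc_vector = byte_trigram_vector(document)
    return max(models, key=lambda enc: cosine_similarity(doc_vector, models[enc]))


# Hypothetical toy model for one known language (Czech): the same sample
# text encoded in each candidate encoding. A real model would be trained
# on many pages in the given language.
SAMPLE = "Příliš žluťoučký kůň úpěl ďábelské ódy. Žluťoučký kůň příliš úpěl."
MODELS = {enc: byte_trigram_vector(SAMPLE.encode(enc))
          for enc in ("utf-8", "iso-8859-2", "cp1250")}

guess = detect_encoding("žluťoučký kůň".encode("iso-8859-2"), MODELS)
```

Because the language is fixed in advance, the candidate encodings only have to be told apart from one another on text of that language, which is what makes closely related encodings such as ISO-8859-2 and Windows-1250 distinguishable here.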
PRESEMT, Lexical Computing Ltd
eng
Masaryk University, NLP Centre
BSD 3-Clause "New" or "Revised" license
http://opensource.org/licenses/BSD-3-Clause
PUB
http://code.google.com/p/chared/
character encoding
character encoding detection
charset
unicode
Chared
toolService
Chared
Pomikálek
Jan
Natural Language Processing Centre, Faculty of Informatics Masaryk University
restrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
attribution
shareAlike
downloadable
True
PRESEMT
euFunds
Chared is a software tool which can detect the character encoding of a text document provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than other character encoding detection tools with no language constraints. This is an important advantage allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages consisting of 70 billion tokens altogether. Chared is open-source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011. pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
toolService
tool
23
mb
jan.pomikalek@gmail.com
false
yes
LINDAT / CLARIAH-CZ
PRESEMT@@@@@@@@
Lexical Computing Ltd.@@@@@@@@
23@@mb
24156936
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pomikálek, Jan
2013-02-01T16:34:32Z
2013-02-01T16:34:32Z
2011
http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this piece of software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing word n-grams of the text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it is able to detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
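A minimal Python sketch of deduplication by comparing word n-grams follows. It is only an illustration under stated assumptions, not onion's implementation: the function names, whitespace tokenization, the default n-gram length of 3, and the rule "drop a document when the share of its n-grams already seen exceeds a threshold" are all hypothetical choices.

```python
def word_ngrams(text, n):
    """Return the list of overlapping word n-grams of a text."""
    words = text.split()  # naive whitespace tokenization (assumption)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def deduplicate(documents, n=3, threshold=0.5):
    """Keep a document only if at most `threshold` of its n-grams were seen before."""
    seen = set()   # n-grams of all documents kept so far
    kept = []
    for doc in documents:
        grams = word_ngrams(doc, n)
        if grams:
            duplicated = sum(1 for g in grams if g in seen) / len(grams)
        else:
            duplicated = 0.0  # too short to judge, so keep it
        if duplicated > threshold:
            continue          # mostly seen before: treat as a duplicate
        kept.append(doc)
        seen.update(grams)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about corpora and tokens",
]
unique_docs = deduplicate(docs, n=3, threshold=0.5)
```

Lowering the threshold toward 0.5 catches partially similar texts, while a threshold near 0.95 removes only near-identical ones, which mirrors the two similarity levels mentioned in the description.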
PRESEMT, Lexical Computing Ltd
eng
Masaryk University, NLP Centre
BSD 3-Clause "New" or "Revised" license
http://opensource.org/licenses/BSD-3-Clause
PUB
http://code.google.com/p/onion/
deduplication
corpus
text deduplication
n-gram deduplication
n-gram model
onion
toolService
onion
Pomikálek
Jan
Natural Language Processing Centre, Faculty of Informatics Masaryk University
restrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
attribution
shareAlike
downloadable
True
PRESEMT
euFunds
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this piece of software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing word n-grams of the text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it is able to detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
toolService
tool
17
kb
jan.pomikalek@gmail.com
false
yes
LINDAT / CLARIAH-CZ
http://code.google.com/p/onion/
PRESEMT@@@@@@@@
Lexical Computing Ltd.@@@@@@@@
17@@kb
17127
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-8DAF-4
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Hajičová, Eva
Panevová, Jarmila
Sgall, Petr
Cinková, Silvie
Fučíková, Eva
Mikulová, Marie
Pajas, Petr
Popelka, Jan
Semecký, Jiří
Šindlerová, Jana
Štěpánek, Jan
Toman, Josef
Urešová, Zdeňka
Žabokrtský, Zdeněk
2013-03-28T14:16:10Z
2013-03-28T14:16:10Z
2012
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Texts
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
Data
The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.
Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
semantic labeling of content words and types of coordinating structures
argument structure, including an argument structure ("valency") lexicon for both languages
ellipsis and anaphora resolution.
This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation.
Annotation of the Czech part
Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.
Annotation of the English part
The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources:
PropBank (LDC2004T14)
VerbNet
NomBank (LDC2008T23)
flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran)
For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.
Ministry of Education of the Czech Republic projects No.:
MSM0021620838
LC536
ME09008
LM2010013
7E09003+7E11051
7E11041
Czech Science Foundation, grants No.:
GAP406/10/0875
GPP406/10/P193
GA405/09/0729
Research funds of the Faculty of Mathematics and Physics, Charles University, Czech Republic, Grant Agency of the Academy of Sciences of the Czech Republic: No. 1ET101120503
Students participating in this project have been running their own student grants from the Grant Agency of the Charles University, which were connected to this project. Only ongoing projects are mentioned: 116310, 158010, 3537/2011
Also, this work was funded in part by the following projects sponsored by the European Commission:
Companions, No. 034434
EuroMatrix, No. 034291
EuroMatrixPlus, No. 231720
Faust, No. 247762
ces
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/231720
info:eu-repo/grantAgreement/EC/FP7/247762
http://hdl.handle.net/11234/1-1664
CC-BY-NC-SA + LDC99T42
https://lindat.mff.cuni.cz/repository/xmlui/page/license-pcedt2
RES
http://ufal.mff.cuni.cz/pcedt2.0
parallel treebank
PCEDT
parallel corpus
Wall Street Journal
WSJ
Penn Treebank
dependency annotation
PDT
Prague Czech-English Dependency Treebank 2.0
corpus
Prague Czech-English Dependency Treebank 2.0
Hajič
Jan
Charles University in Prague, UFAL
restrictedUse
LDC + CC-BY-NC-SA
academic-nonCommercialUse
shareAlike
other
downloadable
True
#1-MSM0021620838 - Moderní metody, struktury a systémy informatiky
#2-LC536 - Integrated center for natural language processing
#3-ME09008 - Mnohojazyčná univerzální anotace lingvistických dat
#4-LM2010013 - LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
#5-7E09003 - EuroMatrixPlus—Bringing Machine Translation for European Languages to the User
#6-7E11051 - EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User
#7-7E11041 - Feedback Analysis for User Adaptive Statistical Translation
#8-GAP406/10/0875 - Computational Linguistics: Explicit description of language and annotated data focused on Czech
#9-GPP406/10/P193 - Tools for Revision and Tectogrammatical Annotation of a Czech Dependency Treebank
#10-GA405/09/0729 - From the structure of a sentence to textual relationships
#11-Companions, No. 034434
#12-EuroMatrix, No. 034291
#13-EuroMatrixPlus, No. 231720
#14-Faust, No. 247762
#1-nationalFunds
#2-nationalFunds
#3-nationalFunds
#4-nationalFunds
#5-nationalFunds
#6-nationalFunds
#7-nationalFunds
#8-nationalFunds
#9-nationalFunds
#10-nationalFunds
#11-euFunds
#12-euFunds
#13-euFunds
#14-euFunds
Texts
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
Data
The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.
Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
semantic labeling of content words and types of coordinating structures
argument structure, including an argument structure ("valency") lexicon for both languages
ellipsis and anaphora resolution.
This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation.
Annotation of the Czech part
Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.
Annotation of the English part
The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources:
PropBank (LDC2004T14)
VerbNet
NomBank (LDC2008T23)
flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran)
For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.
text
corpus
49208
sentences
hajic@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
http://ufal.mff.cuni.cz/pcedt2.0/trees/00/01/wsj_0001_1.xhtml
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620838@@Moderní metody, struktury a systémy informatiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LC536@@Centrum komputační lingvistiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@ME09008@@Mnohojazyčná univerzální anotace lingvistických dat@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E09003@@EuroMatrixPlus – Bringing Machine Translation for European Languages to the User@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E11051@@EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@7E11041@@Feedback Analysis for User Adaptive Statistical Translation@@nationalFunds@@
Grantová agentura České republiky@@GAP406/10/0875@@Komputační lingvistika: Explicitní popis jazyka a anotovaná data se zřetelem na češtinu@@nationalFunds@@
Grantová agentura České republiky@@GPP406/10/P193@@Nástroje pro revizi a tektogramatickou anotaci českého závislostního korpusu@@nationalFunds@@
Grantová agentura České republiky@@GA405/09/0729@@Od struktury věty k textovým vztahům@@nationalFunds@@
Grantová agentura Akademie věd České republiky@@1ET101120503@@Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů@@nationalFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 116310/2010@@Anglicko-český strojový překlad s využitím hloubkové syntaxe@@nationalFunds@@
European Union@@FP6-IST-5-034434-IP@@Companions IP@@euFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 3537/2011@@Detekce větné polarity v počítačovém korpusu@@nationalFunds@@
European Union@@FP6-IST-5-034291-STP@@Euromatrix@@euFunds@@
European Union@@FP7-ICT-2007-3-231720@@EuroMatrix Plus@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/231720
European Union@@FP7-ICT-2009-4-247762@@Faust@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/247762
Grantová agentura Univerzity Karlovy v Praze@@GAUK 1580/2010@@Značkování aktuálního členění věty v paralelním anglicko-českém závislostním korpusu@@nationalFunds@@
49208@@sentences
2069118389
3
Czech-English|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=pcedt_20_cs_a
English-Czech|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=pcedt_20_en_a
Czech part only|https://lindat.mff.cuni.cz/services/pmltq/pcedt20_cz/
parallel (login)|https://lindat.mff.cuni.cz/services/pmltq/pcedt20/
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Grác, Marek
2013-02-26T13:40:06Z
2013-02-26T13:40:06Z
2011
http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
In the NLP Centre, dividing text into sentences is currently done with
a tool which uses a rule-based system. In order to produce enough training
data for machine learning, annotators manually split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
ces
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
http://nlp.fi.muni.cz/projekty/cocb/
corpus
blogs
annotation
annotators
sentences
machine learning
Corpus of contemporary blogs
corpus
Corpus of contemporary blogs
Grác
Marek
Masaryk university, NLP Centre
restrictedUse
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
academic-nonCommercialUse
attribution
noDerivatives
downloadable
True
In the NLP Centre, dividing text into sentences is currently done with
a tool which uses a rule-based system. In order to produce enough training
data for machine learning, annotators manually split the corpus of contemporary text
CBB.blog (1 million tokens) into sentences.
Each file contains one hundredth of the whole corpus and all data were
processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
text
corpus
10
mb
grac@fi.muni.cz
yes
LINDAT / CLARIAH-CZ
10@@mb
4174388
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B2E-0
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Grác, Marek
Čapek, Tomáš
2014-01-09T11:13:28Z
2014-01-09T11:13:28Z
2011
http://hdl.handle.net/11858/00-097C-0000-0023-1B2E-0
Semantic net `sholva' contains more than 150 000 records for which there was sufficient agreement among annotators. Individual words are labeled in the following categories:
person, person / individual, event and substance.
ces
Masaryk University, NLP Centre
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
https://nlp.fi.muni.cz/projekty/sholva/
semantic net
semantic tagging
sholva-0.6
lexicalConceptualResource
sholva-0.6
Grác
Marek
Masaryk university, NLP Centre
restrictedUse
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
academic-nonCommercialUse
attribution
noDerivatives
downloadable
True
Semantic net `sholva' contains more than 150 000 records for which there was sufficient agreement among annotators. Individual words are labeled in the following categories:
person, person / individual, event and substance.
text
lexicalConceptualResource
wordnet
3
mb
grac@fi.muni.cz
yes
LINDAT / CLARIAH-CZ
3@@mb
523067
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-A780-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
Hlaváčová, Jaroslava
2013-05-02T14:45:11Z
2013-05-02T14:45:11Z
2013
http://hdl.handle.net/11858/00-097C-0000-0015-A780-9
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11234/1-1673
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/morfflex
morphological dictionary
morphology
Czech
MorfFlex CZ
lexicalConceptualResource
MorfFlex CZ
Hajič
Jan
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
hardDisk
True
N/A
ownFunds
Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
text
lexicalConceptualResource
computationalLexicon
113537915
lexicalTypes
hajic@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
113537915@@lexicalTypes
1447439713
4
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0019-89A0-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Šebesta, Karel
Goláňová, Hana
2013-05-13T09:17:21Z
2013-05-13T09:17:21Z
2013-05-11
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It contains the same data as the corpus "Schola 2010" (see the link for search), but all proper names have been removed in order to protect the privacy of participants.
MŠMT (MSM0021620825), UK (PRVOUK P 10)
ces
Charles University in Prague, ÚČJTK
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
http://akces.ff.cuni.cz
youth language
classroom
language acquisition corpus
AKCES
AKCES 2
corpus
AKCES 2
Šebesta
KS
Charles University in Prague, ÚČJTK
restrictedUse
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
academic-nonCommercialUse
attribution
noDerivatives
downloadable
True
nationalFunds
Corpus AKCES 2 includes transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora).
text
corpus
792764
words
sebesta@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
https://wiki.korpus.cz/doku.php/cnk:schola2010
https://www.korpus.cz/corpora/run.cgi/first?reload=1&corpname=omezeni%2Fschola2010
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620825@@Jazyk jako lidská činnost, její produkt a faktor@@nationalFunds@@
Univerzita Karlova v Praze@@P10 – Lingvistika@@Program rozvoje vědních oblastí na Univerzitě Karlově P10 – Lingvistika, modul Osvojování a vývoj jazykové a komunikační kompetence u populace ČR, řešeno od r. 2012@@nationalFunds@@
792764@@words
2246888
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-6133-9
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Majliš, Martin
2013-06-25T15:08:15Z
2013-06-25T15:08:15Z
2011-12-20
http://hdl.handle.net/11858/00-097C-0000-0022-6133-9
A set of corpora for 120 languages automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
afr
als
amh
ara
arg
arz
ast
aze
bel
ben
bos
bpy
bre
bug
bul
cat
ceb
ces
chv
cos
cym
dan
deu
diq
ell
eng
epo
est
eus
fao
fas
fin
fra
fry
gan
gla
gle
glg
glk
guj
hat
hbs
heb
hif
hin
hrv
hsb
hun
hye
ido
ina
ind
isl
ita
jav
jpn
kan
kat
kaz
kor
kur
lat
lav
lim
lit
lmo
ltz
mal
mar
mkd
mlg
mon
mri
msa
mya
nap
nds
nep
new
nld
nno
nor
oci
oss
pam
pms
pol
por
que
ron
rus
sah
scn
sco
slk
slv
spa
sqi
srp
sun
swa
swe
tam
tat
tel
tgk
tgl
tha
tur
ukr
urd
uzb
vec
vie
vol
war
wln
yid
yor
zho
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
multilingual corpora
W2C – Web to Corpus – Corpora
corpus
W2C – Web to Corpus – Corpora
Popel
Martin
Charles University in Prague, UFAL
restrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
attribution
shareAlike
downloadable
accessibleThroughInterface
True
A set of corpora for 120 languages automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
text
corpus
55
gb
popel@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
55@@gb
20309577059
122
English|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=w2c_en_a
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-AAF5-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Tamchyna, Aleš
Dušek, Ondřej
Rosa, Rudolf
2013-08-14T10:52:07Z
2013-08-14T10:52:07Z
2013-08-13
http://hdl.handle.net/11858/00-097C-0000-0022-AAF5-B
MTMonkey is a web service which handles JSON-encoded HTTP requests for machine translation (MT) and distributes them among multiple machines running an MT system, including text pre- and post-processing.
It consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol.
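The division of labour between the application server and the workers can be sketched in a few lines of Python. This is a simplified illustration only, not MTMonkey's code: the class and function names, the JSON field names "text" and "translation", the round-robin scheduling, and the upper-casing stand-in for an MT system are all hypothetical, and in-process function calls stand in for the XML-RPC communication.

```python
import json
from itertools import cycle


def make_worker(name):
    """Build a toy worker: pre-process, 'translate', post-process."""
    def worker(text):
        tokens = text.strip().split()            # pre-processing (tokenization)
        translated = [t.upper() for t in tokens]  # stand-in for a real MT system
        return name + ":" + " ".join(translated)  # post-processing (detokenization)
    return worker


class AppServer:
    """Toy application server that hands each JSON request to the next worker."""

    def __init__(self, workers):
        # Round-robin scheduling is an assumption made for this sketch.
        self._workers = cycle(workers)

    def handle(self, raw_request):
        request = json.loads(raw_request)       # hypothetical field: "text"
        worker = next(self._workers)
        translation = worker(request["text"])    # would be an XML-RPC call in practice
        return json.dumps({"translation": translation})


server = AppServer([make_worker("w1"), make_worker("w2")])
first = json.loads(server.handle('{"text": " hello world "}'))
second = json.loads(server.handle('{"text": "good morning"}'))
```

Keeping text processing inside the workers means the application server only parses requests and routes them, so more worker machines can be added without changing the server logic.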
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 257528 (KHRESMOI). This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013). This work has been supported by the AMALACH grant (DF12P01OVV02) of the Ministry of Culture of the Czech Republic.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/257528
Apache License 2.0
http://opensource.org/licenses/Apache-2.0
PUB
https://github.com/ufal/mtmonkey
machine translation
distributed computing
web service
infrastructure
MTMonkey
toolService
MTMonkey
Dušek
Ondřej
Charles University in Prague, UFAL
restrictedUse
Apache License 2.0
attribution
downloadable
True
#1-The KHRESMOI Project (EU 7th Framework Programme grant agreement no. 257528)
#2-LINDAT-CLARIN project (LM2010013)
#3-AMALACH project (DF12P01OVV02 of the Ministry of Culture of Czech Republic)
#1-euFunds
#2-nationalFunds
#3-nationalFunds
MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post-processing.
It consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol.
toolService
infrastructure
odusek@ufal.mff.cuni.cz
false
yes
LINDAT / CLARIAH-CZ
European Union@@FP7-ICT-2010-6-257528@@Khresmoi@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/257528
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Ministerstvo kultury České republiky@@DF12P01OVV022@@Zpřístupnění rozsáhlého video archivu kulturního dědictví pomocí metod automatického rozpoznávání mluvené řeči a strojového překladu. (AMALACH)@@nationalFunds@@
118200
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C73C-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
2013-09-07T11:15:32Z
2013-09-07T11:15:32Z
2007
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification.
1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů)
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
named entity recognition
named entity corpus
Czech
NER
corpus
Czech Named Entity Corpus 1.0
corpus
Czech Named Entity Corpus 1.0
Straková
Jana
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
downloadable
1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů)
nationalFunds
The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification.
text
corpus
6000
sentences
strakova@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Grantová agentura Akademie věd České republiky@@1ET101120503@@Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů@@nationalFunds@@
6000@@sentences
9571474
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7F6-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pajas, Petr
Štěpánek, Jan
Sedlák, Michal
2013-09-09T16:04:21Z
2013-09-09T16:04:21Z
2009-01-01
http://hdl.handle.net/11858/00-097C-0000-0022-C7F6-3
System for querying annotated treebanks in the PML format. It uses its own query language, which has a graphical representation. It has two different implementations (SQL and Perl) and several clients (TrEd, browser-based, and a command-line interface).
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
GNU General Public License, version 2
http://www.gnu.org/licenses/gpl-2.0.html
PUB
http://ufal.mff.cuni.cz/pmltq
treebank
query
search
PML Tree Query
toolService
PML Tree Query
Štěpánek
Jan
Charles University in Prague, UFAL
restrictedUse
GNU General Public License, version 2
downloadable
Integration of language resources for information extraction from natural texts
nationalFunds
System for querying annotated treebanks in the PML format. It uses its own query language, which has a graphical representation. It has two different implementations (SQL and Perl) and several clients (TrEd, browser-based, and a command-line interface).
toolService
tool
jan.stepanek@matfyz.cz
false
yes
LINDAT / CLARIAH-CZ
https://lindat.mff.cuni.cz/services/pmltq/
2370042
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-C7FD-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Sedlák, Michal
2013-09-10T09:59:26Z
2013-09-10T09:59:26Z
2013-09-10
http://hdl.handle.net/11858/00-097C-0000-0022-C7FD-6
Simple web interface built on top of the PML Tree Query service.
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Artistic License (Perl) 1.0
http://opensource.org/licenses/Artistic-Perl-1.0
PUB
https://redmine.ms.mff.cuni.cz/projects/pmltq-web
Perl
PML-TQ
PML
PMLTQ::Web
toolService
PMLTQ::Web
Sedlák
Michal
Charles University in Prague, UFAL
restrictedUse
Artistic License (Perl) 1.0
accessibleThroughInterface
Simple web interface built on top of the PML Tree Query service.
toolService
tool
sedlak@ufal.mff.cuni.cz
false
yes
LINDAT / CLARIAH-CZ
https://lindat.mff.cuni.cz/services/pmltq/
248524
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-10B2-F
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Bojar, Ondřej
Macháček, Matouš
Tamchyna, Aleš
Zeman, Daniel
2013-12-10T13:41:44Z
2013-12-10T13:41:44Z
2013-09-01
http://hdl.handle.net/11858/00-097C-0000-0023-10B2-F
This dataset contains the complete, very large set of Czech translations for 50 English source sentences taken from the WMT11 test set (http://www.statmt.org/wmt11).
In total, there are 15431447 Czech sentences, i.e. about 300k reference translations per English source sentence on average, though the exact number varies greatly across sentences.
You can find more details in the included README file.
If you use this dataset, please cite the following paper which describes the technique used to construct the Czech translations:
Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel:
Scratching the Surface of Possible Translations.
Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th
International Conference, TSD 2013. Proceedings, Copyright © Springer Verlag,
Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013, DOI: 10.1007/978-3-642-40585-3_59
P406/11/1499 of the Grant Agency of the Czech Republic, FP7-ICT-2011-7-288487 (MosesCore) of the European Union and 1356213 of the Grant Agency of the Charles University
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/288487
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
machine translation
automatic machine translation evaluation
reference translation
Many Czech References for 50 Sentences Selected from WMT11 Data
corpus
Many Czech References for 50 Sentences Selected from WMT11 Data
Macháček
Matouš
Charles University in Prague, UFAL
restrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
attribution
shareAlike
downloadable
This dataset contains the complete, very large set of Czech translations for 50 English source sentences taken from the WMT11 test set (http://www.statmt.org/wmt11).
In total, there are 15431447 Czech sentences, i.e. about 300k reference translations per English source sentence on average, though the exact number varies greatly across sentences.
You can find more details in the included README file.
If you use this dataset, please cite the following paper which describes the technique used to construct the Czech translations:
Bojar Ondřej, Macháček Matouš, Tamchyna Aleš, Zeman Daniel:
Scratching the Surface of Possible Translations.
Lecture Notes in Computer Science, Vol. 8082, Text, Speech and Dialogue: 16th
International Conference, TSD 2013. Proceedings, Copyright © Springer Verlag,
Berlin / Heidelberg, ISBN 978-3-642-40584-6, ISSN 0302-9743, pp. 465-474, 2013
text
corpus
15431447
sentences
machacekmatous@gmail.com
yes
LINDAT / CLARIAH-CZ
Grantová agentura České republiky@@GAP406/11/1499@@Čeština ve věku strojového překladu@@nationalFunds@@
European Union@@FP7-ICT-2011-7-288487@@MosesCore@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/288487
Grantová agentura Univerzity Karlovy v Praze@@GAUK 13562/2013@@Využití mnohonásobných referencí ve strojovém překladu@@nationalFunds@@
15431447@@sentences
122537300
2
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-D9BF-5
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Pecina, Pavel
Dušek, Ondřej
Hajič, Jan
Urešová, Zdeňka
2013-10-11T07:54:49Z
2014-04-02T23:00:03Z
2013-10-10
Khresmoi-Query-MT-Test-Data-1.0
http://hdl.handle.net/11858/00-097C-0000-0022-D9BF-5
This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts.
This work was supported by the EU FP7 project Khresmoi (European Commission contract No. 257528). The language resources are distributed by the LINDAT/Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2010013).
We thank Health on the Net Foundation for granting the license for the English general public queries, TRIP database for granting the license for the English medical expert queries, and three anonymous translators and three medical experts for translating and revising the data.
eng
fra
deu
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
info:eu-repo/grantAgreement/EC/FP7/257528
http://hdl.handle.net/11234/1-2121
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
http://creativecommons.org/licenses/by-nc/3.0/
PUB
http://khresmoi.eu
corpus
test data
medical
health
machine translation
Czech
French
German
English
Khresmoi Query Translation Test Data 1.0
corpus
Khresmoi Query Translation Test Data 1.0
Hajič
Jan
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
academic-nonCommercialUse
attribution
downloadable
True
#1-KHRESMOI - KNOWLEDGE HELPER FOR MEDICAL AND OTHER INFORMATION USERS, EU NO. 257528
LINDAT/CLARIN, MSMT CR, LM2010013
#1-euFunds
nationalFunds
This package contains data sets for development and testing of machine translation of short medical search queries between Czech, English, French, and German. The queries come from the general public and medical experts.
text
corpus
1508
terms
hajic@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
European Union@@FP7-ICT-2010-6-257528@@Khresmoi@@euFunds@@info:eu-repo/grantAgreement/EC/FP7/257528
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
1508@@terms
56681
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-EE02-C
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Bojar, Ondřej
Tamchyna, Aleš
2013-11-09T22:44:43Z
2013-11-09T22:44:43Z
2013-11-07
http://hdl.handle.net/11858/00-097C-0000-0022-EE02-C
Statistical component of Chimera, a state-of-the-art MT system.
Project DF12P01OVV022 of the Ministry of Culture of the Czech Republic (NAKI -- Amalach).
eng
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
moses
machine translation
Plain-Moses-Chimera
toolService
Plain-Moses-Chimera
Bojar
Ondřej
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
Statistical component of Chimera, a state-of-the-art MT system.
toolService
suiteOfTools
bojar@ufal.mff.cuni.cz
true
yes
LINDAT / CLARIAH-CZ
Ministerstvo kultury České republiky@@DF12P01OVV022@@Zpřístupnění rozsáhlého video archivu kulturního dědictví pomocí metod automatického rozpoznávání mluvené řeči a strojového překladu. (AMALACH)@@nationalFunds@@
3263723520
2
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FE82-7
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Habernal, Ivan
Ptáček, Tomáš
Steinberger, Josef
2013-11-29T15:41:00Z
2013-11-29T15:41:00Z
2013-07-17
http://hdl.handle.net/11858/00-097C-0000-0022-FE82-7
Corpus consisting of 10,000 Facebook posts manually annotated on sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
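The line-aligned layout of the two gold files described above can be read with a few lines of Python. This is only a minimal sketch: the function names `read_gold` and `label_counts` are hypothetical, UTF-8 encoding is an assumption, and the file paths are passed in by the caller rather than hard-coded, since the exact file names and label values should be checked against the archive itself.

```python
from collections import Counter


def read_gold(posts_path, labels_path, encoding="utf-8"):
    """Pair each post with the label found on the same line number.

    Assumes one post per line in the posts file and one label per line
    in the labels file (the encoding is an assumption, not documented).
    """
    with open(posts_path, encoding=encoding) as posts_file:
        posts = [line.rstrip("\n") for line in posts_file]
    with open(labels_path, encoding=encoding) as labels_file:
        labels = [line.strip() for line in labels_file]
    # The files are only usable together if they have the same length.
    if len(posts) != len(labels):
        raise ValueError(
            f"line count mismatch: {len(posts)} posts vs {len(labels)} labels"
        )
    return list(zip(posts, labels))


def label_counts(pairs):
    """Count how many posts carry each label."""
    return Counter(label for _post, label in pairs)
```

Comparing the result of `label_counts` with the class sizes quoted in the description (2,587 positive, 5,174 neutral, 1,991 negative, 248 bipolar) would be a quick sanity check after download.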
ces
University of West Bohemia
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
http://liks.fav.zcu.cz/sentiment/
sentiment analysis
opinion mining
Facebook Data for Sentiment Analysis
corpus
Facebook Data for Sentiment Analysis
Habernal
Ivan
University of West Bohemia in Pilsen, KIV
restrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
attribution
shareAlike
downloadable
Corpus consists of 10,000 Facebook posts manually annotated on sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
text
corpus
1084
kb
habernal@kiv.zcu.cz
yes
LINDAT / CLARIAH-CZ
http://liks.fav.zcu.cz/sentiment/
1084@@kb
1109729
1
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=facebook_cs_m
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0022-FF60-B
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Veselovská, Kateřina
Bojar, Ondřej
2013-12-02T22:10:38Z
2013-12-02T22:10:38Z
2013-12-02
http://hdl.handle.net/11858/00-097C-0000-0022-FF60-B
Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part of speech tags, polarity orientation and source information.
The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used the parallel corpus CzEng 1.0, which contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at the surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator.
The work on this project has been supported by the GAUK 3537/2011 grant and by SVV project number 267 314.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/seance
subjectivity lexicon
sentiment analysis
opinion mining
polarity clues
Czech SubLex 1.0
lexicalConceptualResource
Czech SubLex 1.0
Veselovská
Kateřina
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
downloadable
GAUK 3537/2011 grant and SVV project number 267 314.
nationalFunds
Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part of speech tag, polarity orientation and source information.
The core of the Czech subjectivity lexicon was obtained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used the parallel corpus CzEng 1.0, which contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at the surface and deep layers of syntactic representation. Afterwards, the lexicon was manually refined by an experienced annotator.
text
lexicalConceptualResource
wordList
207
kb
veselovska@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Grantová agentura Univerzity Karlovy v Praze@@GAUK 3537/2011@@Detekce větné polarity v počítačovém korpusu@@nationalFunds@@
Univerzita Karlova v Praze (mimo GAUK)@@SVV 267 314@@Teoretické základy informatiky a výpočetní lingvistiky@@nationalFunds@@
207@@kb
381830
3
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119C-C
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Kopřivová, Marie
Waclawičová, Martina
2013-12-13T11:55:09Z
2013-12-13T11:55:09Z
2006
http://hdl.handle.net/11858/00-097C-0000-0023-119C-C
Corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 221 recordings made in 2002–2006 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 754; the metadata include sociolinguistic information about them.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
Výzkumný záměr MSM0021620823 – Český národní korpus a korpusy dalších jazyků
ces
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
https://wiki.korpus.cz/doku.php/cnk:oral2006
corpus
informal spoken language
ORAL2006: Corpus of informal spoken Czech
corpus
ORAL2006: Corpus of informal spoken Czech
Křen
Michal
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
downloadable
True
nationalFunds
Corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 221 recordings made in 2002–2006 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 754; the metadata include sociolinguistic information about them. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
text
corpus
1000000
words
michal.kren@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620823@@Český národní korpus a korpusy dalších jazyků@@nationalFunds@@
1000000@@words
2634065
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119D-A
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Waclawičová, Martina
Kopřivová, Marie
Křen, Michal
Válková, Lucie
2013-12-13T11:56:16Z
2013-12-13T11:56:16Z
2008
http://hdl.handle.net/11858/00-097C-0000-0023-119D-A
Balanced corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 297 recordings made in 2002–2007 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 995; the corpus is balanced with respect to their main sociolinguistic categories (gender, age group, education, region of childhood residence).
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
MSM0021620823 – Český národní korpus a korpusy dalších jazyků
ces
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
https://wiki.korpus.cz/doku.php/cnk:oral2008
informal spoken language
balanced corpus
ORAL2008: Balanced corpus of informal spoken Czech
corpus
ORAL2008: Balanced corpus of informal spoken Czech
Křen
Michal
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
downloadable
True
nationalFunds
Balanced corpus of informal spoken Czech sized 1 MW. It contains transcriptions of 297 recordings made in 2002–2007 in the whole of Bohemia. All the recordings were made in informal situations to ensure prototypically spontaneous spoken language. This means a private environment, physical presence of speakers who know each other, unscripted speech and a topic not given in advance. The total number of speakers is 995; the corpus is balanced with respect to their main sociolinguistic categories (gender, age group, education, region of childhood residence).
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus exactly correspond to the corpus available via query interface to registered users of the CNC.
text
corpus
1000000
words
michal.kren@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620823@@Český národní korpus a korpusy dalších jazyků@@nationalFunds@@
1000000@@words
2707529
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119E-8
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Čermák, František
Hlaváčová, Jaroslava
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Kopřivová, Marie
Křen, Michal
Novotná, Renata
Petkevič, Vladimír
Schmiedtová, Věra
Skoumalová, Hana
Spoustová, Johanka
Šulc, Michal
Velíšek, Zdeněk
2013-12-13T15:01:52Z
2013-12-13T15:01:52Z
2005
http://hdl.handle.net/11858/00-097C-0000-0023-119E-8
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2000–2004 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
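The shuffling described above can be illustrated with a short Python sketch. It is not the procedure actually used by the CNC: the function name `shuffle_document`, whitespace-based word counting, and the greedy grouping of consecutive sentences are all assumptions made for illustration. A single sentence longer than the limit is kept whole in its own block, which is also an assumption.

```python
import random


def shuffle_document(sentences, max_words=100, seed=None):
    """Split one document into blocks of whole sentences, then shuffle the blocks.

    Each block holds consecutive sentences whose combined word count does
    not exceed max_words, so sentence boundaries are respected. Only the
    order of blocks within this document is randomized.
    """
    blocks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())  # assumption: whitespace tokenization
        # Close the current block if this sentence would push it over the limit.
        if current and count + n_words > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        blocks.append(current)
    random.Random(seed).shuffle(blocks)
    return blocks
```

Because blocks never cross document boundaries, document-level metadata stays valid, while the original running order of the text cannot be recovered from the distributed data.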
MSM0021620823 – Český národní korpus a korpusy dalších jazyků
ces
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
https://wiki.korpus.cz/doku.php/cnk:syn2005
balanced corpus
written language
SYN2005: balanced corpus of written Czech
corpus
SYN2005: balanced corpus of written Czech
Křen
Michal
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
restrictedUse
Czech National Corpus (Shuffled Corpus Data)
academic-nonCommercialUse
attribution
noRedistribution
downloadable
True
nationalFunds
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2000–2004 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
text
corpus
100000000
words
michal.kren@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620823@@Český národní korpus a korpusy dalších jazyků@@nationalFunds@@
100000000@@words
754725795
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119F-6
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Křen, Michal
Bartoň, Tomáš
Cvrček, Václav
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Novotná, Renata
Petkevič, Vladimír
Procházka, Pavel
Schmiedtová, Věra
Skoumalová, Hana
2013-12-13T16:55:38Z
2013-12-13T16:55:38Z
2010
http://hdl.handle.net/11858/00-097C-0000-0023-119F-6
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2005–2009 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
MSM0021620823 – Český národní korpus a korpusy dalších jazyků
ces
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
https://wiki.korpus.cz/doku.php/cnk:syn2010
balanced corpus
written language
SYN2010: balanced corpus of written Czech
corpus
SYN2010: balanced corpus of written Czech
Křen
Michal
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
restrictedUse
Czech National Corpus (Shuffled Corpus Data)
academic-nonCommercialUse
attribution
noRedistribution
downloadable
True
nationalFunds
Balanced corpus of contemporary written Czech sized 100 MW. It was created as a representation of written language from 2005–2009 and thus it contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
text
corpus
100000000
words
michal.kren@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620823@@Český národní korpus a korpusy dalších jazyků@@nationalFunds@@
100000000@@words
757967700
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1358-3
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Čermák, František
Hlaváčová, Jaroslava
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Kopřivová, Marie
Křen, Michal
Novotná, Renata
Petkevič, Vladimír
Schmiedtová, Věra
Skoumalová, Hana
Spoustová, Johanka
Šulc, Michal
Velíšek, Zdeněk
2013-12-18T09:00:57Z
2013-12-18T09:00:57Z
2006
http://hdl.handle.net/11858/00-097C-0000-0023-1358-3
Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
MSM0021620823 – Český národní korpus a korpusy dalších jazyků
ces
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
https://wiki.korpus.cz/doku.php/cnk:syn2006pub
corpus
written language
SYN2006PUB: corpus of Czech newspapers
corpus
SYN2006PUB: corpus of Czech newspapers
Křen
Michal
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
restrictedUse
Czech National Corpus (Shuffled Corpus Data)
academic-nonCommercialUse
attribution
noRedistribution
downloadable
True
nationalFunds
Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
text
corpus
300000000
words
michal.kren@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620823@@Český národní korpus a korpusy dalších jazyků@@nationalFunds@@
300000000@@words
2409355640
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1359-1
2021-06-29T08:45:33Z
hdl_11858_00-097C-0000-0001-486F-D
hdl_11234_3430
hdl_11858_00-097C-0000-0001-4877-A
Křen, Michal
Bartoň, Tomáš
Hnátková, Milena
Jelínek, Tomáš
Petkevič, Vladimír
Procházka, Pavel
Skoumalová, Hana
2013-12-18T09:06:37Z
2013-12-18T09:06:37Z
2010
http://hdl.handle.net/11858/00-097C-0000-0023-1359-1
Corpus of contemporary Czech newspapers and magazines comprising 700 million words. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
MSM0021620823 – Český národní korpus a korpusy dalších jazyků
ces
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
https://wiki.korpus.cz/doku.php/cnk:syn2009pub
corpus
written language
SYN2009PUB: corpus of Czech newspapers
corpus
SYN2009PUB: corpus of Czech newspapers
Křen
Michal
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
restrictedUse
Czech National Corpus (Shuffled Corpus Data)
academic-nonCommercialUse
attribution
noRedistribution
downloadable
True
nationalFunds
Corpus of contemporary Czech newspapers and magazines comprising 700 million words. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
text
corpus
700000000
words
michal.kren@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@MSM 0021620823@@Český národní korpus a korpusy dalších jazyků@@nationalFunds@@
700000000@@words
5683158234
1
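The shuffling described in the SYN2006PUB and SYN2009PUB records above (blocks of at most 100 words that respect sentence boundaries, randomly reordered within a document) can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used by the Czech National Corpus: the function name `shuffle_document`, the representation of a document as a list of tokenized sentences, the seeded generator, and the handling of sentences longer than the limit are all hypothetical.

```python
import random
from typing import List

Sentence = List[str]


def shuffle_document(
    sentences: List[Sentence], max_words: int = 100, seed: int = 0
) -> List[List[Sentence]]:
    """Group whole sentences into blocks of at most max_words tokens,
    then randomize the order of the blocks within the document.

    Assumption: a single sentence longer than max_words is kept whole
    and forms its own (oversized) block, since sentences are never split.
    """
    blocks: List[List[Sentence]] = []
    current: List[Sentence] = []
    count = 0
    for sent in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and count + len(sent) > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sent)
        count += len(sent)
    if current:
        blocks.append(current)

    # Randomize block order only; sentence order inside a block is preserved.
    rng = random.Random(seed)
    rng.shuffle(blocks)
    return blocks


if __name__ == "__main__":
    doc = [["w"] * n for n in (40, 50, 30, 100, 10)]
    for block in shuffle_document(doc, seed=1):
        print([len(s) for s in block])
```

Because only the block order is randomized, local context up to roughly 100 words survives, while the original document as a whole cannot be read in sequence.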
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1AAF-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Bejček, Eduard
Hajičová, Eva
Hajič, Jan
Jínová, Pavlína
Kettnerová, Václava
Kolářová, Veronika
Mikulová, Marie
Mírovský, Jiří
Nedoluzhko, Anna
Panevová, Jarmila
Poláková, Lucie
Ševčíková, Magda
Štěpánek, Jan
Zikánová, Šárka
2014-01-08T20:17:10Z
2014-01-08T20:17:10Z
2013-12-31
PDT 3.0
http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3
PDT 3.0 is a new version of Prague Dependency Treebank. It contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic annotation (0.8 MW); in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.
the Grant Agency of the Czech Republic: grants P406/12/0658 "Coreference, discourse relations and information structure in a contrastive perspective", P406/2010/0875 "Computational Linguistics: Explicit description of language and annotated data focused on Czech", 405/09/0729 "From the structure of a sentence to textual relationships", and GPP406/12/P175 (Selected derivational relations for automatic processing of Czech);
the Ministry of Education, Youth and Sports of the Czech Republic: the KONTAKT project ME10018 "Towards a computational analysis of text structure" and the LINDAT-Clarin project LM2010013;
the Grant Agency of Charles University in Prague: GAUK 103609 "Textual (Inter-sentential) Relations and their Representation in a Language Corpus" and GAUK 4383/2009 "Methods of coreference resolution".
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
http://hdl.handle.net/11858/00-097C-0000-0008-E130-A
http://hdl.handle.net/11234/1-1905
http://hdl.handle.net/11234/1-2621
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/pdt3.0
treebank
dependency
tectogrammatics
topic-focus articulation
multiword expressions
coreference
bridging relations
discourse
PDT
Prague Dependency Treebank 3.0
corpus
Prague Dependency Treebank 3.0
Mírovský
Jiří
Charles University in Prague, UFAL
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
attribution
shareAlike
downloadable
True
#1-Computational Linguistics: Explicit description of language and annotated data focused on Czech
#1-nationalFunds
PDT 3.0 is a new version of Prague Dependency Treebank. It contains a large amount of Czech texts with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and semantic annotation (0.8 MW); in addition, certain properties of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations are annotated at the semantic level.
text
corpus
49431
sentences
mirovsky@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
http://ufal.mff.cuni.cz/pdt3.0
Grantová agentura České republiky@@GAP406/12/0658@@Koreference, diskurs a aktuální členění v kontrastivním pohledu@@nationalFunds@@
Grantová agentura České republiky@@GAP406/10/0875@@Komputační lingvistika: Explicitní popis jazyka a anotovaná data se zřetelem na češtinu@@nationalFunds@@
Grantová agentura České republiky@@GA405/09/0729@@Od struktury věty k textovým vztahům@@nationalFunds@@
Grantová agentura České republiky@@GPP406/12/P175@@Vybrané derivační vztahy pro automatické zpracování češtiny@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@ME10018@@K počítačové analýze struktury textu@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 1036/2009@@Textové (mezivětné) vztahy a jejich zachycení v jazykovém korpusu@@nationalFunds@@
Grantová agentura Univerzity Karlovy v Praze@@GAUK 4383/2009@@Methods of coreference resolution@@nationalFunds@@
49431@@sentences
128176198
2
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=pdt_30_cs_a
search|https://lindat.mff.cuni.cz/services/pmltq/pdt30/
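The funding lines in these records pack several values into one field separated by `@@`. A minimal sketch of splitting such a line is shown below. The field order (funder, grant code, project title, funding type) is inferred from the entries in this listing rather than taken from a published schema, and the names `Funding` and `parse_funding` are hypothetical.

```python
from typing import NamedTuple


class Funding(NamedTuple):
    funder: str
    code: str
    project: str
    funds_type: str


def parse_funding(line: str) -> Funding:
    """Split one '@@'-delimited funding line into its four parts.

    Assumption: the line has exactly four values and may end with a
    trailing '@@', as the entries in this listing do.
    """
    parts = line.strip().split("@@")
    # A trailing '@@' yields one empty final element; drop it.
    if parts and parts[-1] == "":
        parts = parts[:-1]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return Funding(*(p.strip() for p in parts))


if __name__ == "__main__":
    example = (
        "Ministerstvo školství, mládeže a tělovýchovy České republiky"
        "@@ME10018@@K počítačové analýze struktury textu@@nationalFunds@@"
    )
    print(parse_funding(example))
```

Size lines such as `49431@@sentences` use the same delimiter with only two values, so they would need a separate, shorter parser.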
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B04-C2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
Straka, Milan
2014-01-09T10:03:56Z
2014-01-09T10:03:56Z
2014-01-09
http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C
Czech Named Entity Corpus 1.1 fixes several issues in Czech Named Entity Corpus 1.0: misannotated entities are corrected, all formats contain the same data, the tmt format is replaced with the treex format, and all formats are split into training, development and testing portions.
SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0022-C73C-7
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/cnec/
named entity recognition
corpus
Czech Named Entity Corpus 1.1
corpus
Czech Named Entity Corpus 1.1
Straková
Jana
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics in Prague
unrestrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky)
LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat)
GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny)
PRVOUK (PRVOUK)
Czech Named Entity Corpus 1.1 fixes several issues in Czech Named Entity Corpus 1.0: misannotated entities are corrected, all formats contain the same data, the tmt format is replaced with the treex format, and all formats are split into training, development and testing portions.
text
corpus
5868
sentences
strakova@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Univerzita Karlova v Praze (mimo GAUK)@@SVV 267 314@@Teoretické základy informatiky a výpočetní lingvistiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Grantová agentura České republiky@@GPP406/12/P175@@Vybrané derivační vztahy pro automatické zpracování češtiny@@nationalFunds@@
Univerzita Karlova v Praze (mimo GAUK)@@PRVOUK@@PRVOUK@@nationalFunds@@
5868@@sentences
10987946
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1B22-82021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Ševčíková, Magda
Žabokrtský, Zdeněk
Straková, Jana
Straka, Milan
2014-01-09T10:24:31Z
2014-01-09T10:24:31Z
2014-01-09
http://hdl.handle.net/11858/00-097C-0000-0023-1B22-8
Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named entity types.
SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://ufal.mff.cuni.cz/cnec/
named entity recognition
Czech Named Entity Corpus 2.0
corpus
Czech Named Entity Corpus 2.0
Straková
Jana
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics in Prague
unrestrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky)
LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat)
GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny)
PRVOUK (PRVOUK)
Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities, classified according to a two-level hierarchy of 46 named entity types.
text
corpus
8993
sentences
strakova@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Univerzita Karlova v Praze (mimo GAUK)@@SVV 267 314@@Teoretické základy informatiky a výpočetní lingvistiky@@nationalFunds@@
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
Grantová agentura České republiky@@GPP406/12/P175@@Vybrané derivační vztahy pro automatické zpracování češtiny@@nationalFunds@@
Univerzita Karlova v Praze (mimo GAUK)@@PRVOUK@@PRVOUK@@nationalFunds@@
8993@@sentences
13931704
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-1D76-92021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Grůber, Martin
2014-01-13T10:49:11Z
2014-01-13T10:49:11Z
2014-01-10
http://hdl.handle.net/11858/00-097C-0000-0023-1D76-9
The corpus contains Czech expressive speech recorded by a professional female speaker using a scenario-based approach. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors.
European Commission Sixth Framework Programme
Information Society Technologies Integrated Project IST-34434
ces
University of West Bohemia
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
PUB
http://www.companions-project.org/
speech corpus
expressive
text-to-speech synthesis
Czech Senior COMPANION Expressive Speech Corpus
corpus
Czech Senior COMPANION Expressive Speech Corpus
Ircing
Pavel
University of West Bohemia, Dept. of Cybernetics
restrictedUse
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
academic-nonCommercialUse
downloadable
True
COMPANIONS - Intelligent, Persistent, Personalised Multimodal Interfaces to the Internet
EU
The corpus contains Czech expressive speech recorded by a professional female speaker using a scenario-based approach. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors.
audio
corpus
6508
utterances
ircing@kky.zcu.cz
yes
LINDAT / CLARIAH-CZ
European Union@@FP6-IST-5-034434-IP@@Companions IP@@euFunds@@
6508@@utterances
1166695251
4
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=companions_cs_w
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3B09-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Křen, Michal
Hnátková, Milena
Jelínek, Tomáš
Petkevič, Vladimír
Procházka, Pavel
Skoumalová, Hana
2014-01-29T12:40:44Z
2014-01-29T12:40:44Z
2013
http://hdl.handle.net/11858/00-097C-0000-0023-3B09-4
Corpus of contemporary Czech newspapers and magazines comprising 935 million words. It contains various titles published between 2005 and 2009. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
LM2011023 – Český národní korpus
http://wiki.korpus.cz/doku.php/en:cnk:syn2013pub
ces
Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
ACA
http://wiki.korpus.cz/doku.php/en:cnk:syn2013pub
corpus
written language
SYN2013PUB: corpus of written Czech newspapers
corpus
SYN2013PUB: corpus of written Czech newspapers
Křen
Michal
Charles University in Prague, Faculty of Arts, Institute of the Czech National Corpus
restrictedUse
Czech National Corpus (Shuffled Corpus Data)
academic-nonCommercialUse
attribution
noRedistribution
downloadable
True
nationalFunds
Corpus of contemporary Czech newspapers and magazines comprising 935 million words. It contains various titles published between 2005 and 2009. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods. The corpus is provided in a (semi-XML) vertical format used as an input to the Manatee query engine. The data thus correspond to the corpus available via query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) whose ordering was randomized within the given document.
text
corpus
935000000
words
michal.kren@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
https://kontext.korpus.cz/first_form?corpname=syn2013pub
Ministerstvo školství, mládeže a tělovýchovy@@LM2011023@@Český národní korpus@@nationalFunds@@
935000000@@words
7482024899
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-3FBB-32021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Šebesta, Karel
Goláňová, Hana
2014-02-06T12:11:46Z
2014-02-06T12:11:46Z
2013-12-18
http://hdl.handle.net/11858/00-097C-0000-0023-3FBB-3
Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It contains the same data as the corpus "Schola 2010" (see the link for search), but all proper names have been removed in order to protect the privacy of participants.
UK, PRVOUK P10
ces
Charles University in Prague, ÚČJTK
http://hdl.handle.net/11858/00-097C-0000-0019-89A0-9
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
PUB
http://akces.ff.cuni.cz
youth language
classroom
language acquisition corpus
AKCES
AKCES 2 ver. 2
corpus
AKCES 2 ver. 2
Šebesta
Karel
Charles University in Prague, ÚČJTK
notAvailable
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
academic-nonCommercialUse
downloadable
True
Program rozvoje vědních oblasti na Univerzitě Karlově, Program P10 - Lingvistika
National
Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It contains the same data as the corpus "Schola 2010" (see the link for search), but all proper names have been removed in order to protect the privacy of participants.
text
corpus
792764
words
sebesta@ff.cuni.cz
yes
LINDAT / CLARIAH-CZ
http://ames.ff.cuni.cz/
Univerzita Karlova v Praze@@P10 – Lingvistika@@Program rozvoje vědních oblastí na Univerzitě Karlově P10 – Lingvistika, modul Osvojování a vývoj jazykové a komunikační kompetence u populace ČR, řešeno od r. 2012@@nationalFunds@@
792764@@words
4040419
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4087-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Pajas, Petr
Vandas, Karel
Mišutka, Jozef
Kamran, Amir
Jawaid, Bushra
Košarko, Ondřej
Sedlák, Michal
Josífko, Michal
Straňák, Pavel
Hajič, Jan
2014-02-08T23:10:55Z
2014-02-08T23:10:55Z
2014
http://hdl.handle.net/11858/00-097C-0000-0023-4087-6
One of the goals of the LINDAT/CLARIN Centre for Language Research Infrastructure is to provide technical background to institutions or researchers who want to share the tools and data they use for research in linguistics or related fields. The digital repository is built on a highly customised DSpace platform.
LM2010013 - FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS AND YOUTH OF THE CZECH REPUBLIC
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-48F2-1
http://hdl.handle.net/11234/1-1481
http://svn.ms.mff.cuni.cz/redmine/projects/dspace-modifications
linguistics
digital data
digital repository
language repository
linguistic data
Linguistic digital repository based on DSpace
toolService
Linguistic digital repository based on DSpace
Mišutka
Jozef
Charles University in Prague, UFAL
unrestrictedUse
BSD-style
downloadable
True
#1-LM2010013
#1-nationalFunds
One of the goals of the LINDAT/CLARIN Centre for Language Research Infrastructure is to provide technical background to institutions or researchers who want to share the tools and data they use for research in linguistics or related fields. The digital repository is built on a highly customised DSpace platform.
toolService
infrastructure
misutka@ufal.mff.cuni.cz
false
no
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2010013@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@
0
0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4336-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Hajič, Jan
2014-02-13T22:01:22Z
2014-02-13T22:01:22Z
2014-02-13
http://hdl.handle.net/11858/00-097C-0000-0023-4336-4
One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://lindat.mff.cuni.cz/services/morph/
morphological analysis
lemmatization
Czech Morphological Analyzer v1
toolService
Czech Morphological Analyzer v1
Hajič
Jan
Charles University in Prague, UFAL
notAvailable
One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.
toolService
service
jan.hajic@mff.cuni.cz
true
no
LINDAT / CLARIAH-CZ
https://lindat.mff.cuni.cz/services/morph/index.html
0
0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4337-22021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Cinková, Silvie
Fučíková, Eva
Šindlerová, Jana
Hajič, Jan
2014-02-13T22:05:17Z
2014-02-13T22:05:17Z
2014-02-13
http://hdl.handle.net/11858/00-097C-0000-0023-4337-2
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of the surface form of verbal arguments. EngVallex also contains links to PropBank and VerbNet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated using the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form with examples from the PCEDT.
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
http://lindat.mff.cuni.cz/services/EngVallex/
Annotations
Corpora
Data
Lexicons
Monolingual
Semantics
Valency
EngVallex - English Valency Lexicon
lexicalConceptualResource
text
computationalLexicon
yes
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/EngVallex/
Jan@@Hajič@@jan.hajic@mff.cuni.cz@@Charles University in Prague, UFAL
4337@@entries
7148@@frames
1240084
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4338-F2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Urešová, Zdeňka
Štěpánek, Jan
Hajič, Jan
Panevová, Jarmila
Mikulová, Marie
2014-02-13T22:05:12Z
2014-02-13T22:05:12Z
2014-02-13
http://hdl.handle.net/11858/00-097C-0000-0023-4338-F
The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in more human readable form including corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
ces
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
http://lindat.mff.cuni.cz/services/PDT-Vallex/
annotation
corpora
data
lexicon
semantics
valency
PDT
PDT-Vallex: Czech Valency lexicon linked to treebanks
lexicalConceptualResource
text
lexicon
yes
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/PDT-Vallex/
Jan@@Hajič@@jan.hajic@mff.cuni.cz@@Charles University in Prague, UFAL
7121@@entries
11933@@frames
1302880
1
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CD-02021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Straka, Milan
Straková, Jana
2014-02-14T13:50:36Z
2014-02-14T13:50:36Z
2014-02-14
http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0
MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://hdl.handle.net/11858/00-097C-0000-0001-48FE-9
http://ufal.mff.cuni.cz/morphodita
tagging
morphological analysis
morphological generation
tokenization
MorphoDiTa: Morphological Dictionary and Tagger
toolService
MorphoDiTa: Morphological Dictionary and Tagger
Straka
Milan
Charles University in Prague, UFAL
unrestrictedUse
LGPL
attribution
shareAlike
downloadable
MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
toolService
tool
straka@ufal.mff.cuni.cz
false
no
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/morphodita/
0
0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-43CE-E2021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Straka, Milan
Straková, Jana
2014-02-14T13:51:18Z
2014-02-14T13:51:18Z
2014-02-14
http://hdl.handle.net/11858/00-097C-0000-0023-43CE-E
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
eng
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
http://ufal.mff.cuni.cz/nametag
named entity recognizer
NameTag
toolService
NameTag
Straka
Milan
Charles University in Prague, UFAL
unrestrictedUse
LGPL
attribution
shareAlike
downloadable
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
toolService
tool
straka@ufal.mff.cuni.cz
false
no
LINDAT / CLARIAH-CZ
http://lindat.mff.cuni.cz/services/nametag/
0
0
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4670-62021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Korvas, Matěj
Plátek, Ondřej
Dušek, Ondřej
Žilka, Lukáš
Jurčíček, Filip
2014-02-21T10:42:18Z
2014-02-21T10:42:18Z
2014-02-21
http://hdl.handle.net/11858/00-097C-0000-0023-4670-6
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the Czech data part of the dataset.
This research was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221.
ces
Charles University, Faculty of Mathematics and Physics
http://hdl.handle.net/11234/1-1740
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
https://ufal.mff.cuni.cz/grants/vystadial
acoustic data
speech corpus
spoken corpus
orthographic transcriptions
telephone speech
voip
dialogue system
Vystadial 2013 – Czech data
corpus
Vystadial 2013 – Czech data
Korvas
Matěj
Faculty of Mathematics and Physics, Charles University in Prague, UFAL
unrestrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
evaluationUse
commercialUse
attribution
shareAlike
downloadable
True
MŠMT LK11221 (Vývoj metod pro návrh statistických mluvených dialogových systémů)
National
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the Czech data part of the dataset.
audio
corpus
18
hours
korvas@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LK11221@@Vývoj metod pro návrh statistických mluvených dialogových systémů@@nationalFunds@@
18@@hours
1580742931
1
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=vystadial_2013_cs_w
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-4671-42021-06-29T08:45:33Zhdl_11858_00-097C-0000-0001-486F-Dhdl_11234_3430hdl_11858_00-097C-0000-0001-4877-A
Korvas, Matěj
Plátek, Ondřej
Dušek, Ondřej
Žilka, Lukáš
Jurčíček, Filip
2014-02-21T10:45:40Z
2014-02-21T10:45:40Z
2014-02-21
http://hdl.handle.net/11858/00-097C-0000-0023-4671-4
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the English data part of the dataset.
This research was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221.
eng
Charles University, Faculty of Mathematics and Physics
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
PUB
https://ufal.mff.cuni.cz/grants/vystadial
acoustic data
speech corpus
spoken corpus
orthographic transcriptions
telephone speech
voip
dialogue system
Vystadial 2013 – English data
corpus
Vystadial 2013 – English data
Korvas
Matěj
Faculty of Mathematics and Physics, Charles University in Prague, UFAL
unrestrictedUse
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
evaluationUse
commercialUse
attribution
shareAlike
downloadable
True
MŠMT LK11221 (Vývoj metod pro návrh statistických mluvených dialogových systémů)
National
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems. It ships in three parts: Czech data, English data, and scripts.
The data comprise over 41 hours of speech in English and over 15 hours in Czech, plus orthographic transcriptions. The scripts implement data pre-processing and building acoustic models using the HTK and Kaldi toolkits.
This is the English data part of the dataset.
audio
corpus
45
hours
korvas@ufal.mff.cuni.cz
yes
LINDAT / CLARIAH-CZ
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LK11221@@Vývoj metod pro návrh statistických mluvených dialogových systémů@@nationalFunds@@
45@@hours
2793418303
1
search|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=vystadial_2013_en_w