dc.contributor.author Hajič, Jan
dc.contributor.author Hajičová, Eva
dc.contributor.author Panevová, Jarmila
dc.contributor.author Sgall, Petr
dc.contributor.author Cinková, Silvie
dc.contributor.author Fučíková, Eva
dc.contributor.author Mikulová, Marie
dc.contributor.author Pajas, Petr
dc.contributor.author Popelka, Jan
dc.contributor.author Semecký, Jiří
dc.contributor.author Šindlerová, Jana
dc.contributor.author Štěpánek, Jan
dc.contributor.author Toman, Josef
dc.contributor.author Urešová, Zdeňka
dc.contributor.author Žabokrtský, Zdeněk
dc.date.accessioned 2013-03-28T14:16:10Z
dc.date.available 2013-03-28T14:16:10Z
dc.date.issued 2012
dc.identifier.uri http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
dc.description Texts
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.

Data
The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.

Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are: dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values); semantic labeling of content words and types of coordinating structures; argument structure, including an argument structure ("valency") lexicon for both languages; ellipsis and anaphora resolution. This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation.

Annotation of the Czech part
Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.

Annotation of the English part
The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources: PropBank (LDC2004T14); VerbNet; NomBank (LDC2008T23); flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran). For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.
dc.description.sponsorship Ministry of Education of the Czech Republic, projects No.: MSM0021620838, LC536, ME09008, LM2010013, 7E09003+7E11051, 7E11041; Czech Science Foundation, grants No.: GAP406/10/0875, GPP406/10/P193, GA405/09/0729; Research funds of the Faculty of Mathematics and Physics, Charles University, Czech Republic; Grant Agency of the Academy of Sciences of the Czech Republic: No. 1ET101120503. Students participating in this project have been running their own student grants from the Grant Agency of the Charles University, which were connected to this project. Only ongoing projects are mentioned: 116310, 158010, 3537/2011. Also, this work was funded in part by the following projects sponsored by the European Commission: Companions, No. 034434; EuroMatrix, No. 034291; EuroMatrixPlus, No. 231720; Faust, No. 247762.
dc.language.iso ces
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/FP7/231720
dc.relation info:eu-repo/grantAgreement/EC/FP7/247762
dc.relation.isreplacedby http://hdl.handle.net/11234/1-1664
dc.rights CC-BY-NC-SA + LDC99T42
dc.rights.uri https://lindat.mff.cuni.cz/repository/xmlui/page/license-pcedt2
dc.source.uri http://ufal.mff.cuni.cz/pcedt2.0
dc.subject parallel treebank
dc.subject PCEDT
dc.subject parallel corpus
dc.subject Wall Street Journal
dc.subject WSJ
dc.subject Penn Treebank
dc.subject dependency annotation
dc.subject.other PDT
dc.title Prague Czech-English Dependency Treebank 2.0
dc.type corpus
metashare.ResourceInfo#ContactInfo#PersonInfo.surname Hajič
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName Jan
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName Charles University in Prague, UFAL
metashare.ResourceInfo#DistributionInfo.availability restrictedUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse academic-nonCommercialUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse shareAlike
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse other
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium downloadable
metashare.ResourceInfo#ValidationInfo.validated True
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #1-MSM0021620838 - Modern methods, structures and systems of computer science
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #2-LC536 - Integrated center for natural language processing
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #3-ME09008 - Multilingual universal annotation of linguistic data
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #4-LM2010013 - LINDAT-CLARIN: Institute for analysis, processing and distribution of linguistic data
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #5-7E09003 - EuroMatrixPlus—Bringing Machine Translation for European Languages to the User
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #6-7E11051 - EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #7-7E11041 - Feedback Analysis for User Adaptive Statistical Translation
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #8-GAP406/10/0875 - Computational Linguistics: Explicit description of language and annotated data focused on Czech
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #9-GPP406/10/P193 - Tools for Revision and Tectogrammatical Annotation of a Czech Dependency Treebank
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #10-GA405/09/0729 - From the structure of a sentence to textual relationships
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #11-Companions, No. 034434
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #12-EuroMatrix, No. 034291
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #13-EuroMatrixPlus, No. 231720
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName #14-Faust, No. 247762
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #1-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #2-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #3-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #4-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #5-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #6-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #7-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #8-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #9-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #10-nationalFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #11-euFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #12-euFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #13-euFunds
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType #14-euFunds
metashare.ResourceInfo#ContentInfo.mediaType text
metashare.ResourceInfo#TextInfo#SizeInfo.size 49208
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit sentences
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email hajic@ufal.mff.cuni.cz
dc.rights.label RES
has.files yes
branding LINDAT / CLARIAH-CZ
demo.uri http://ufal.mff.cuni.cz/pcedt2.0/trees/00/01/wsj_0001_1.xhtml
sponsor Ministry of Education, Youth and Sports of the Czech Republic MSM 0021620838 Modern methods, structures and systems of computer science nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic LC536 Center for Computational Linguistics nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic ME09008 Multilingual universal annotation of linguistic data nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic LM2010013 LINDAT/CLARIN: Institute for analysis, processing and distribution of linguistic data nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic 7E09003 EuroMatrixPlus - Bringing Machine Translation for European Languages to the User nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic 7E11051 EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic 7E11041 Feedback Analysis for User Adaptive Statistical Translation nationalFunds
sponsor Czech Science Foundation GAP406/10/0875 Computational Linguistics: Explicit description of language and annotated data focused on Czech nationalFunds
sponsor Czech Science Foundation GPP406/10/P193 Tools for Revision and Tectogrammatical Annotation of a Czech Dependency Treebank nationalFunds
sponsor Czech Science Foundation GA405/09/0729 From the structure of a sentence to textual relationships nationalFunds
sponsor Grant Agency of the Academy of Sciences of the Czech Republic 1ET101120503 Integration of language resources for information extraction from natural texts nationalFunds
sponsor Grant Agency of Charles University in Prague GAUK 116310/2010 English-Czech machine translation using deep syntax nationalFunds
sponsor European Union FP6-IST-5-034434-IP Companions IP euFunds
sponsor Grant Agency of Charles University in Prague GAUK 3537/2011 Detection of sentence polarity in a computer corpus nationalFunds
sponsor European Union FP6-IST-5-034291-STP Euromatrix euFunds
sponsor European Union FP7-ICT-2007-3-231720 EuroMatrix Plus euFunds info:eu-repo/grantAgreement/EC/FP7/231720
sponsor European Union FP7-ICT-2009-4-247762 Faust euFunds info:eu-repo/grantAgreement/EC/FP7/247762
sponsor Grant Agency of Charles University in Prague GAUK 1580/2010 Annotation of topic-focus articulation in a parallel English-Czech dependency corpus nationalFunds
size.info 49208 sentences
files.size 2069118389
files.count 3
featuredService.kontext Czech-English|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=pcedt_20_cs_a
featuredService.kontext English-Czech|http://lindat.mff.cuni.cz/services/kontext/run.cgi/first_form?corpname=pcedt_20_en_a
featuredService.pmltq Czech part only|https://lindat.mff.cuni.cz/services/pmltq/pcedt20_cz/
featuredService.pmltq parallel (login)|https://lindat.mff.cuni.cz/services/pmltq/pcedt20/


 Files in this item

This item is Restricted Use and licensed under: CC-BY-NC-SA + LDC99T42
Distributed under Creative Commons: Attribution Required, Noncommercial, Share Alike
Name PCEDT-2-full-DVD.zip
Size 1.06 GB
Format application/zip
Description Data + doc + tools
MD5 fe874b94a84ffea87a788592169b343f

Name PCEDT-2-doc-with-trees.zip
Size 625.61 MB
Format application/zip
Description Documentation, including visualisation of all trees
MD5 489c19ab2ce4cec967919bb0e12e3c58

Name PCEDT-2-data-only.zip
Size 263.41 MB
Format application/zip
Description Only the data files (incl. valency lexicons)
MD5 7a7d3395752cd1074826685c62bc35aa
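
The MD5 values listed for the three archives can be used to check that a download completed intact. The following is a minimal sketch in Python, assuming the archives were saved locally under the file names shown in this record; the helper names `md5_of_file` and `verify_download` are hypothetical and not part of the PCEDT distribution or its tools.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "PCEDT-2-full-DVD.zip": "fe874b94a84ffea87a788592169b343f",
    "PCEDT-2-doc-with-trees.zip": "489c19ab2ce4cec967919bb0e12e3c58",
    "PCEDT-2-data-only.zip": "7a7d3395752cd1074826685c62bc35aa",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Compare a local archive against the MD5 published for its file name.

    Raises KeyError when the file name is not one of the three listed archives.
    """
    path = Path(path)
    expected = EXPECTED_MD5[path.name]
    return md5_of_file(path) == expected
```

A call such as `verify_download("PCEDT-2-data-only.zip")` returns `True` only when the local file matches the published checksum. A mismatch usually points to an incomplete or corrupted download. MD5 is suitable here as an integrity check only, not as protection against deliberate tampering.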
