Prague Arabic Dependency Treebank 1.0

Introduction

This page contains information on Prague Arabic Dependency Treebank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2004T23 and ISBN 1-58563-319-4. Authors: Jan Hajič, Otakar Smrž, Petr Zemánek, Petr Pajas, Jan Šnaidauf, Emanuel Beška, Jakub Kráčmar, Kamila Hassanová.

Prague Arabic Dependency Treebank (PADT) consists of multi-level linguistic annotations of Modern Standard Arabic and also provides a variety of unique software implementations designed for general use in Natural Language Processing (NLP). This page gives an overview of the recent and most interesting results, findings and innovations within the project.

The PADT project can be summarized as an open-ended activity of the Center for Computational Linguistics, the Institute of Formal and Applied Linguistics, and the Institute of Comparative Linguistics, Charles University in Prague, consisting in multi-level annotation of Arabic language resources in the light of the theory of Functional Generative Description (Sgall et al., 1986; Hajičová and Sgall, 2003). The project is a younger sibling of the Prague Dependency Treebank for Czech (Hajič et al., 2001), and is maintained in cooperation with the Linguistic Data Consortium, University of Pennsylvania, which releases non-annotated corpora of Arabic newswire and develops an independent Penn Arabic Treebank (Maamouri et al., 2004).

Levels of Description

The PADT scenario of annotations employs the upper three levels of the Functional Generative Description (FGD), intending to infer linguistic meaning from the orthographical or phonological realization of the language, and skipping the lower two levels that decompose it down to phonetics. Morphological annotations identify the textual forms of a discourse lexically and recognize the grammatical categories they assume. Processing on the analytical level describes the superficial syntactic structures present in the discourse, whereas the tectogrammatical level reveals the underlying ones and restores the linguistic meaning.

The morphological level of PADT has long been the same as that available in Penn Arabic Treebank, Part 2. PADT adopted the approach of the Buckwalter Arabic Morphological Analyzer, and the annotators used the SelectPOS disambiguation tool written in Python by Kazuaki Maeda.

As argued in Smrž (in prep), confronting this and numerous other implementations of Arabic morphology, all of which turned out to describe morphs rather than morphemes, with the grammatical rules and syntactic behavior of the language (Fischer, 2001, inter alia) led us to review the system and introduce the Functional Arabic Morphology. The growing need for the new type of annotations required a different disambiguation tool, and so the general idea of MorphoTrees came into existence, implemented as an annotation context for TrEd, the general annotation environment written in Perl by Petr Pajas.

Annotations on the analytical level were treated earlier in (Žabokrtský and Smrž, 2003), which studied the relations between the PADT dependency analytical trees and the phrase-structure trees of the Penn Arabic Treebank. Here, we explain the principles of analytical annotation proper, expanding on the types of predicates and discussing their representation. We formulate a hypothesis on using the analytical data to supplement the lexicons of Arabic morphological analyzers with important grammatical categories such as humanness, logical gender, etc.

The third, tectogrammatical level has not yet been outlined for Arabic in enough detail to let PADT annotations commence. The power and success of tectogrammatics in the Prague Dependency Treebank for Czech is, however, more than promising and motivating (Čmejrek et al., 2003; Hajič et al., 2003).

Data Survey

Please see file.tbl for the directory structure of this publication, as well as a complete list of files.

Please go to the data directory for a listing of data files. The software tools may be found in the tools directory.

The corpus of PADT 1.0 consists of morphologically and analytically annotated newswire texts of Modern Standard Arabic, which originate from the Arabic Gigaword and the plain data of Penn Arabic Treebank, Part 1 and Penn Arabic Treebank, Part 2.

The PADT 1.0 distribution comprises over 113 500 tokens of data annotated analytically and provided with the disambiguated morphological information. In addition, the release includes complete annotations of MorphoTrees resulting in more than 148 000 tokens, 49 000 of which have received the analytical processing. The contents are further divided into data sets as indicated in the Table.

Data Set  [A] Tokens  [M] Tokens  Tokens/Para   Tokens/Doc  Original Data Provider  News Period     Related Corpora
AFP       13 000      N/A         34.6 [N/A]    260 [N/A]   Agence France Presse    July 2000       Penn ATB Part 1
UMH       38 500      N/A         43.6 [N/A]    290 [N/A]   Ummah Press Service     Spring 2002     Penn ATB Part 2
XIN       13 500      N/A         31.2 [N/A]    155 [N/A]   Xinhua News Agency      May 2003        Arabic Gigaword
ALH       10 000      73 500      47.0 [47.8]   405 [405]   Al Hayat News Agency    September 2001  Arabic Gigaword
ANN       12 500      25 500      60.3 [50.3]   740 [630]   An Nahar News Agency    November 2002   Arabic Gigaword
XIA       26 500      49 500      29.7 [25.9]   235 [205]   Xinhua News Agency      May 2003        Arabic Gigaword

In the Table, the token columns give the number of syntactic units annotated [A] analytically and [M] within MorphoTrees. The next two columns give approximate ratios of tokens per paragraph and tokens per document, with the figure for MorphoTrees in square brackets. The selected documents may cover only a couple of days of the specified period of time.
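
The totals quoted in the text can be cross-checked against the per-set figures in the Table. The following is a minimal sketch in Python, not part of the PADT tools: the names TOKENS and totals are hypothetical, and the counts are the rounded figures copied from the Table, so the sums are approximate as well.

```python
# Rounded token counts copied from the Table: (analytical [A], MorphoTrees [M]).
# None stands for the N/A entries, i.e. sets without MorphoTrees annotation.
TOKENS = {
    "AFP": (13_000, None),
    "UMH": (38_500, None),
    "XIN": (13_500, None),
    "ALH": (10_000, 73_500),
    "ANN": (12_500, 25_500),
    "XIA": (26_500, 49_500),
}


def totals(table):
    """Return three sums over the data sets:
    all analytical tokens, all MorphoTrees tokens, and the analytical
    tokens of those sets that also carry MorphoTrees annotation."""
    analytical = sum(a for a, _ in table.values())
    morpho = sum(m for _, m in table.values() if m is not None)
    both = sum(a for a, m in table.values() if m is not None)
    return analytical, morpho, both


analytical, morpho, both = totals(TOKENS)
print(analytical, morpho, both)
```

With the rounded figures, the three sums come to 114 000, 148 500 and 49 000, which agrees with "over 113 500", "more than 148 000" and "49 000" as stated in the description of the distribution.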

Documentation

Please follow the link to the complete documentation of the data and tools provided in this release.

To list all the documentation files, go to the docs directory.

Support

PADT 1.0 was supported by the Ministry of Education of the Czech Republic, projects LN00A063 and MSM113200006, and by the Grant Agency of the Czech Republic, project 405/02/0823.

Updates

Updates or bug fixes may be available in the LDC catalog entry for this corpus, or at the PADT website.

Your questions and suggestions are welcome at padt (at) ckl (dot) mff (dot) cuni (dot) cz.

References

Martin Čmejrek, Jan Cuřín, and Jiří Havelka. 2003.
Czech-English Dependency-based Machine Translation. In EACL 2003 Proceedings of the Conference, pages 83-90, Budapest, Hungary, April 2003.
Wolfdietrich Fischer. 2001.
A Grammar of Classical Arabic. Yale Language Series. Yale University Press, third revised edition. Translated by Jonathan Rodgers.
Jan Hajič, Barbora Hladká, and Petr Pajas. 2001.
The Prague Dependency Treebank: Annotation Structure and Support. In Proceedings of the IRCS Workshop on Linguistic Databases, pages 105-114, Philadelphia, December 2001. University of Pennsylvania.
Jan Hajič, Jarmila Panevová, Zdeňka Urešová, Alevtina Bémová, Veronika Kolářová, and Petr Pajas. 2003.
PDT-VALLEX: Creating a Large-coverage Valency Lexicon for Treebank Annotation. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories, pages 57-68, Växjö, Sweden, November 2003.
Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004.
Prague Arabic Dependency Treebank: Development in Data and Tools. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, pages 110-117, Cairo, Egypt, September 2004.
Eva Hajičová and Petr Sgall. 2003.
Dependency Syntax in Functional Generative Description. In Dependenz und Valenz - Dependency and Valency, volume I, pages 570-592. Walter de Gruyter.
Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004.
The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, pages 102-109, Cairo, Egypt, September 2004.
Petr Sgall, Eva Hajičová, and Jarmila Panevová. 1986.
The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. D. Reidel & Academia, Dordrecht & Prague.
Otakar Smrž and Petr Pajas. 2004.
MorphoTrees of Arabic and Their Annotation in the TrEd Environment. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, pages 38-41, Cairo, Egypt, September 2004.
Otakar Smrž and Petr Zemánek. 2002.
Sherds from an Arabic Treebanking Mosaic. Prague Bulletin of Mathematical Linguistics, (78):63-76.
Otakar Smrž, Jan Šnaidauf, and Petr Zemánek. 2002.
Prague Dependency Treebank for Arabic: Multi-Level Annotation of Arabic Corpus. In Proceedings of the International Symposium on Processing of Arabic, pages 147-155, Manouba, Tunisia, April 2002.
Otakar Smrž. in prep.
Functional Arabic Morphology. Formal System and Implementation. Ph.D. thesis, Charles University in Prague.
Zdeněk Žabokrtský and Otakar Smrž. 2003.
Arabic Syntactic Trees: from Constituency to Dependency. In EACL 2003 Conference Companion, pages 183-186, Budapest, Hungary, April 2003.

Content Copyright

Portions © 2002-2004 Trustees of the University of Pennsylvania, © 2000 Agence France Presse, © 2001 Al Hayat News Agency, © 2002 Ummah Press Service, © 2002 An Nahar News Agency, © 2003 Xinhua News Agency, © 2002-2004 Center for Computational Linguistics & Institute of Formal and Applied Linguistics & Institute of Comparative Linguistics, Charles University in Prague.

Please proceed to the Research-Usage License Agreement for the Prague Arabic Dependency Treebank 1.0, or to its on-line version.


Contact ldc@ldc.upenn.edu.
© 2004 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.