A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general web crawl to date, with about 2 billion crawled URLs.
Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical entry represents a lemma followed by all its possible stems, and each stem is enriched with its morphological features, in particular the root and the POS.
It is composed of 164,845 lemmas representing 7,200,918 stems, detailed as follows:
757 Arabic particles
2,464,631 verbal stems
4,735,587 nominal stems
The lexicon is provided as an LMF-conformant XML file in UTF-8 encoding, amounting to about 1.22 GB of data.
Citation:
– Namly Driss, Karim Bouzoubaa, Abdelhamid El Jihad, and Si Lhoussain Aouragh. “Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique.” In Recent Advances in NLP: The Case of Arabic Language, pp. 81-100. Springer, Cham, 2020.
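Because the lexicon is a single XML file of more than a gigabyte, it is usually read as a stream rather than loaded whole. The following is a minimal sketch only: the element and attribute names (LexicalEntry, Lemma, WordForm, feat with att/val, writtenForm) follow common LMF serializations and are assumptions, not taken from the actual file, and the function name iter_entries is hypothetical. It also assumes the file uses no XML namespaces.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source):
    """Yield (lemma, [stems]) pairs from an LMF-style XML stream.

    Assumed layout (not verified against the real file):
      <LexicalEntry>
        <Lemma><feat att="writtenForm" val="..."/></Lemma>
        <WordForm><feat att="writtenForm" val="..."/></WordForm>
      </LexicalEntry>
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "LexicalEntry":
            continue
        lemma = elem.find("Lemma/feat[@att='writtenForm']")
        stems = [
            feat.get("val")
            for feat in elem.findall("WordForm/feat[@att='writtenForm']")
        ]
        yield (lemma.get("val") if lemma is not None else None, stems)
        # Release the children of the processed entry to limit memory use.
        elem.clear()


# Tiny invented sample in the assumed layout, for illustration only.
SAMPLE = (
    b"<LexicalResource><Lexicon><LexicalEntry>"
    b'<Lemma><feat att="writtenForm" val="kataba"/></Lemma>'
    b'<WordForm><feat att="writtenForm" val="katab"/></WordForm>'
    b'<WordForm><feat att="writtenForm" val="yaktub"/></WordForm>'
    b"</LexicalEntry></Lexicon></LexicalResource>"
)

entries = list(iter_entries(io.BytesIO(SAMPLE)))
```

For the real file, a path or an open binary file object can be passed in place of the in-memory sample, after adjusting the tag and attribute names to the ones actually used.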
This corpus was originally created for performance testing of the CorpusExplorer server infrastructure (see: diskurslinguistik.net / diskursmonitor.de). It contains the German-only subset of CommonCrawl (as of March 2018). First, the URLs were filtered by top-level domain (de, at, ch). The texts were then classified using NTextCat, and only texts identified unambiguously as German were included in the corpus. Finally, the texts were annotated using TreeTagger (token, lemma, part-of-speech). The corpus comprises 2.58 million documents, 232.87 million sentences and 3.021 billion tokens. You can use CorpusExplorer (http://hdl.handle.net/11234/1-2634) to convert this data into various other corpus formats (XML, JSON, Weblicht, TXM and many more).
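The first filtering step described above, keeping only URLs under the de, at and ch top-level domains, can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the corpus: the function names has_german_tld and filter_urls are hypothetical, and URLs are assumed to carry a scheme such as http:// so that the host can be parsed. The later NTextCat and TreeTagger steps are not reproduced here.

```python
from urllib.parse import urlsplit

# Top-level domains named in the corpus description.
GERMAN_TLDS = {"de", "at", "ch"}


def has_german_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed TLDs."""
    host = urlsplit(url).hostname  # None when the URL has no scheme/host
    if not host:
        return False
    # Take the last dot-separated label of the host as the TLD.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in GERMAN_TLDS


def filter_urls(urls):
    """Keep only the URLs whose host is under a listed TLD."""
    return [u for u in urls if has_german_tld(u)]


# Invented example URLs, for illustration only.
kept = filter_urls([
    "https://www.example.de/artikel",
    "http://example.com/seite.de",
    "https://news.example.at/",
    "https://example.ch",
    "example.de/no-scheme",
])
```

Checking the parsed host rather than the raw string avoids matching paths that merely end in ".de", as in the second example URL.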
Report from the celebration of the fourteenth anniversary of the Czechoslovak Republic held in front of the Municipal House in Prague on 28 October 1932. The gathering was attended by troops and legionnaires. A Philips Radio broadcast vehicle stands in front of the entrance. The segment includes a silent recording of a speech given by František Soukup, former Secretary of the National Committee and current Chairman of the Senate.
Segment from the celebration of the fifteenth anniversary of the Czechoslovak Republic held on Old Town Square in Mladá Boleslav. A ceremonial line-up of the local military garrison. General Šípek delivers a speech before the municipal council, soldiers and inhabitants of the town. This is followed by a parade through the town, attended by the troops, representatives of the Sokol community, the local fire brigade, the gendarmerie, nurses and members of other town associations. Footage from a footrace through the streets of Mladá Boleslav to the Monument to the Fallen, won by Mr Mlejnek. Footage from the celebration in Kosmonosy, the place linked with the ground-breaking ceremony for the Resistance Memorial. The ceremony was attended by representatives of local associations and corporations as well as the county association of Czechoslovak legionnaires.
The segment captures a military parade of new army recruits, held in the third courtyard of Prague Castle on 28 October 1930 as part of the celebration of the twelfth anniversary of the Czechoslovak Republic. Prime Minister František Udržal and Minister of Defence Karel Viškovský attend the parade, standing in for the absent President Masaryk.