Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, and deduplicated. Tagged by TreeTagger trained on the Amharic WIC corpus.
This paper compares the state of the Czech legal order before and after the reform of private law. The analysis is based on a linguistic investigation of corpora containing legal texts. We analyze two corpora of Czech legal texts and show the relation between changes (amendments and modifications) in the wording of law acts and their transparency. Our work explores changes at the level of words, collocations, and legal terms, as well as their counts and frequencies. The tools used in this research are the corpus manager Manatee/Bonito with the integrated Word Sketch Engine and the Czech morphological analyzer Majka. The results thus obtained lead us to conclude that the functionality of the Czech legal system is under threat from its own opacity and obfuscation.
Hindi monolingual corpus. It is based primarily on web crawls performed using various tools at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments obtained by concatenating the individual sources (each source deduplicated on its own) with the number of segments obtained by deduplicating all sources together. The difference is only around 1%, confirming that the various web crawls (or their subsequent processings) differ significantly.
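The overlap estimate described above can be sketched as follows. This is an illustrative reconstruction, not the original pipeline: the function names, toy segment lists, and the exact deduplication strategy (exact-match on whole segments) are assumptions for the sake of the example.

```python
# Sketch (assumed, not the original pipeline): estimate crawl overlap by
# comparing per-source deduplication with global deduplication of segments.

def dedup(segments):
    """Remove exact-duplicate segments, keeping first occurrences."""
    seen = set()
    out = []
    for s in segments:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def overlap_estimate(sources):
    """Fraction of segments lost when deduplicating all sources together
    instead of each source on its own."""
    per_source_total = sum(len(dedup(src)) for src in sources)
    global_total = len(dedup([s for src in sources for s in src]))
    return (per_source_total - global_total) / per_source_total

# Illustrative toy data: two "crawls" sharing one segment.
crawl_a = ["seg1", "seg2", "seg3", "seg3"]
crawl_b = ["seg3", "seg4", "seg5"]
print(overlap_estimate([crawl_a, crawl_b]))
```

A small overlap estimate (as with the roughly 1% reported above) indicates that the individual crawls contribute largely distinct material.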
HindMonoCorp contains data from:
Hindi web texts, a monolingual corpus containing mainly Hindi news articles, was collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
Hindi corpora in W2C were collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two Hindi corpora available: one from a web harvest (W2C Web) and one from Wikipedia (W2C Wiki).
SpiderLing is a web crawl carried out during November and December 2013 using the SpiderLing crawler (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents; see below.
CommonCrawl is a non-profit organization that regularly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain-text Hindi segments from the 2012 and 2013-fall crawls for us.
Intercorp – 7 books with their translations, scanned and manually aligned per paragraph.
RSS feeds from Webdunia.com and the Hindi version of BBC International, followed by our custom crawler from September 2013 till January 2014.