jusText

Pomikálek, Jan

dc.contributor.author	Pomikálek, Jan
dc.date.accessioned	2013-02-05T12:04:53Z
dc.date.available	2013-02-05T12:04:53Z
dc.date.issued	2011
dc.identifier.uri	http://hdl.handle.net/11858/00-097C-0000-000D-F696-9
dc.description	jusText is a heuristic-based boilerplate removal tool for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/justext/). It is used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis, "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm removes most non-grammatical content from a web page, such as navigation, advertisements, tables, and short notes. According to the comparison with participants of the Cleaneval competition in the author's Ph.D. thesis, it outperforms or at least keeps up with its competitors. The precision of the removal of unwanted content and the scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether.
dc.description.sponsorship	PRESEMT, Lexical Computing Ltd
dc.language.iso	eng
dc.publisher	Masaryk University, NLP Centre
dc.rights	Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
dc.rights.uri	http://creativecommons.org/licenses/by-sa/3.0/
dc.source.uri	http://code.google.com/p/justext/
dc.subject	boilerplate
dc.subject	web documents
dc.subject	text cleaning
dc.subject	boilerplate removal
dc.subject	text corpora
dc.title	jusText
dc.type	toolService
metashare.ResourceInfo#ContactInfo#PersonInfo.surname	Pomikálek
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName	Jan
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName	Natural Language Processing Centre, Faculty of Informatics Masaryk University
metashare.ResourceInfo#DistributionInfo.availability	restrictedUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse	attribution
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse	shareAlike
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium	downloadable
metashare.ResourceInfo#ValidationInfo.validated	True
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName	PRESEMT
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType	euFunds
metashare.ResourceInfo#TextInfo#SizeInfo.size	732
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit	kb
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email	jan.pomikalek@gmail.com
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	false
metashare.ResourceInfo#ContentInfo.detailedType	tool
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
sponsor	PRESEMT
sponsor	Lexical Computing Ltd.
size.info	732 kb
files.size	750175
files.count	1

Files in this item

License category:

Publicly Available

License: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)

Name: justext-1.2.tar.gz
Size: 732.59 KB
Format: application/x-gzip
Description: jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, and it is therefore well suited for creating linguistic resources such as Web corpora.
MD5: ea88bcd10f5b88f3badd6fb74f45e852
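
The MD5 value above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, assuming the archive was saved locally as justext-1.2.tar.gz; the function names `md5_of_file` and `matches_record` are illustrative and not part of jusText.

```python
import hashlib

# MD5 digest copied from this record.
EXPECTED_MD5 = "ea88bcd10f5b88f3badd6fb74f45e852"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_record("justext-1.2.tar.gz")` returns `True` only if the local file's digest equals the value listed in this record. MD5 guards against accidental corruption during download; it is not a defence against deliberate tampering.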


File preview

justext-1.2
- justext
  - stoplists
    - Albanian.txt (3 kB)
    - Vietnamese.txt (1 kB)
    - Romanian.txt (6 kB)
    - Russian.txt (75 kB)
    - Armenian.txt (44 kB)
    - Dutch.txt (1 kB)
    - Tagalog.txt (980 B)
    - Nepali.txt (25 kB)
    - Occitan.txt (3 kB)
    - Icelandic.txt (4 kB)
    - Serbo_Croatian.txt (16 kB)
    - Belarusian.txt (64 kB)
    - Volapuk.txt (533 B)
    - English.txt (3 kB)
    - Italian.txt (3 kB)
    - Finnish.txt (64 kB)
    - Waray_Waray.txt (47 B)
    - Ido.txt (249 B)
    - Bosnian.txt (15 kB)
    - Slovenian.txt (16 kB)
    - Haitian.txt (175 B)
    - Belarusian_Taraskievica.txt (60 kB)
    - Bulgarian.txt (14 kB)
    - Georgian.txt (125 kB)
    - Norwegian_Bokmal.txt (2 kB)
    - French.txt (2 kB)
    - Quechua.txt (2 kB)
    - Latin.txt (13 kB)
    - Bengali.txt (18 kB)
    - Persian.txt (4 kB)
    - Kannada.txt (157 kB)
    - Samogitian.txt (14 kB)
    - Gujarati.txt (5 kB)
    - Urdu.txt (1 kB)
    - Sicilian.txt (3 kB)
    - Esperanto.txt (3 kB)
    - Hindi.txt (4 kB)
    - West_Frisian.txt (1 kB)
    - Macedonian.txt (11 kB)
    - Irish.txt (1 kB)
    - Hungarian.txt (28 kB)
    - Swahili.txt (2 kB)
    - Igbo.txt (733 B)
    - Welsh.txt (3 kB)
    - Croatian.txt (17 kB)
    - Norwegian_Nynorsk.txt (1 kB)
    - Serbian.txt (20 kB)
    - German.txt (4 kB)
    - Asturian.txt (4 kB)
    - Turkish.txt (25 kB)
    - Javanese.txt (4 kB)
    - Aromanian.txt (3 kB)
    - Greek.txt (7 kB)
    - Lithuanian.txt (42 kB)
    - Hebrew.txt (44 kB)
    - Basque.txt (17 kB)
    - Newar.txt (1 kB)
    - Aragonese.txt (918 B)
    - Marathi.txt (38 kB)
    - Lombard.txt (738 B)
    - Czech.txt (24 kB)
    - Indonesian.txt (6 kB)
    - Kurdish.txt (3 kB)
    - Tamil.txt (150 kB)
    - Piedmontese.txt (419 B)
    - Ukrainian.txt (73 kB)
    - Slovak.txt (28 kB)
    - Korean.txt (128 kB)
    - Sundanese.txt (5 kB)
    - Maltese.txt (10 kB)
    - Bishnupriya_Manipuri.txt (950 B)
    - Estonian.txt (49 kB)
    - Telugu.txt (93 kB)
    - Malayalam.txt (301 kB)
    - Swedish.txt (4 kB)
    - Arabic.txt (29 kB)
    - Breton.txt (1 kB)
    - Danish.txt (3 kB)
    - Azerbaijani.txt (32 kB)
    - Cebuano.txt (57 B)
    - Chuvash.txt (34 kB)
    - Polish.txt (28 kB)
    - Luxembourgish.txt (2 kB)
    - Neapolitan.txt (1 kB)
    - Latvian.txt (29 kB)
    - Afrikaans.txt (1 kB)
    - Western_Panjabi.txt (2 kB)
    - Walloon.txt (910 B)
    - Yoruba.txt (1 kB)
    - Spanish.txt (1 kB)
    - Malay.txt (6 kB)
    - Portuguese.txt (3 kB)
    - Simple_English.txt (1 kB)
    - Catalan.txt (1 kB)
    - Low_Saxon.txt (1 kB)
    - Galician.txt (2 kB)
  - __init__.py (372 B)
  - core.py (28 kB)
- bin
  - justext (240 B)
- setup.py (939 B)
- README (238 B)
- COPYING (1 kB)
- ._COPYING (208 B)
- PKG-INFO (589 B)
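
The archive ships per-language stoplists, and the description says the tool keeps mainly text made of full sentences. As a rough illustration of how a stoplist can support that kind of heuristic, the sketch below flags a text block as boilerplate when it is very short or contains few stop words. This is a simplified toy, not the algorithm implemented in core.py: the tiny stoplist, the function names, and both thresholds are assumptions made for this example.

```python
import re

# Tiny illustrative stoplist (assumption); the archive ships much larger,
# per-language lists in justext/stoplists.
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is",
    "it", "that", "for", "on", "with", "as", "was",
}


def stopword_density(text, stopwords=STOPWORDS):
    """Return the fraction of words in text that appear in the stoplist."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


def looks_like_boilerplate(text, min_length=40, min_density=0.2):
    """Flag short or stop-word-poor blocks (hypothetical thresholds)."""
    return len(text) < min_length or stopword_density(text) < min_density
```

Under these assumed thresholds, a navigation string such as "Home | About | Contact" is flagged because it is short, while an ordinary English sentence passes because function words make up a sizeable share of it.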
