dc.contributor.author |
Pomikálek, Jan |
dc.date.accessioned |
2013-02-05T12:04:53Z |
dc.date.available |
2013-02-05T12:04:53Z |
dc.date.issued |
2011 |
dc.identifier.uri |
http://hdl.handle.net/11858/00-097C-0000-000D-F696-9 |
dc.description |
jusText is a heuristic-based boilerplate removal tool for cleaning documents in large textual corpora. The tool is implemented in Python and released as open-source software under the New BSD License (available for download, including the source code, at http://code.google.com/p/justext/). It is used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm removes most non-grammatical content from a web page, such as navigation, advertisements, tables, and short notes. It has been shown to outperform, or at least keep pace with, its competitors (based on a comparison with the participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages; over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether. |
dc.description.sponsorship |
PRESEMT, Lexical Computing Ltd |
dc.language.iso |
eng |
dc.publisher |
Masaryk University, NLP Centre |
dc.rights |
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) |
dc.rights.uri |
http://creativecommons.org/licenses/by-sa/3.0/ |
dc.source.uri |
http://code.google.com/p/justext/ |
dc.subject |
boilerplate |
dc.subject |
web documents |
dc.subject |
text cleaning |
dc.subject |
boilerplate removal |
dc.subject |
text corpora |
dc.title |
jusText |
dc.type |
toolService |
metashare.ResourceInfo#ContactInfo#PersonInfo.surname |
Pomikálek |
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName |
Jan |
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName |
Natural Language Processing Centre, Faculty of Informatics, Masaryk University |
metashare.ResourceInfo#DistributionInfo.availability |
restrictedUse |
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse |
attribution |
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse |
shareAlike |
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium |
downloadable |
metashare.ResourceInfo#ValidationInfo.validated |
True |
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName |
PRESEMT |
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType |
euFunds |
metashare.ResourceInfo#TextInfo#SizeInfo.size |
732 |
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit |
kb |
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email |
jan.pomikalek@gmail.com |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent |
false |
metashare.ResourceInfo#ContentInfo.detailedType |
tool |
dc.rights.label |
PUB |
has.files |
yes |
branding |
LINDAT / CLARIAH-CZ |
sponsor |
PRESEMT |
sponsor |
Lexical Computing Ltd. |
size.info |
732 kb |
files.size |
750175 |
files.count |
1 |