dc.contributor.author Pomikálek, Jan
dc.date.accessioned 2013-02-01T16:34:32Z
dc.date.available 2013-02-01T16:34:32Z
dc.date.issued 2011
dc.identifier.uri http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7
dc.description onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool has been implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this piece of software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing word n-grams of the text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it is able to detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
dc.description.sponsorship PRESEMT, Lexical Computing Ltd
dc.language.iso eng
dc.publisher Masaryk University, NLP Centre
dc.rights BSD 3-Clause "New" or "Revised" license
dc.rights.uri http://opensource.org/licenses/BSD-3-Clause
dc.source.uri http://code.google.com/p/onion/
dc.subject deduplication
dc.subject corpus
dc.subject text deduplication
dc.subject n-gram deduplication
dc.subject n-gram model
dc.title onion
dc.type toolService
metashare.ResourceInfo#ContactInfo#PersonInfo.surname Pomikálek
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName Jan
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName Natural Language Processing Centre, Faculty of Informatics Masaryk University
metashare.ResourceInfo#DistributionInfo.availability restrictedUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse attribution
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse shareAlike
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium downloadable
metashare.ResourceInfo#ValidationInfo.validated True
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName PRESEMT
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType euFunds
metashare.ResourceInfo#TextInfo#SizeInfo.size 17
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit kb
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email jan.pomikalek@gmail.com
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent false
metashare.ResourceInfo#ContentInfo.detailedType tool
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
demo.uri http://code.google.com/p/onion/
sponsor PRESEMT
sponsor Lexical Computing Ltd.
size.info 17 kb
files.size 17127
files.count 1
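
The dc.description field states that deduplication is based on comparing word n-grams of the text and that partially similar duplicates (around 50 %) can be detected. The following is a minimal sketch of that general idea only, not the onion implementation: the function names, the whitespace tokenization, the n-gram length of 3, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> List[NGram]:
    """Return all word n-grams of the text (naive whitespace tokenization)."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def deduplicate(docs: Iterable[str], n: int = 3, threshold: float = 0.5) -> List[str]:
    """Keep a document unless more than `threshold` of its n-grams were already seen.

    Hypothetical helper for illustration; onion itself may differ in every detail.
    """
    seen: Set[NGram] = set()
    kept: List[str] = []
    for doc in docs:
        grams = word_ngrams(doc, n)
        if grams:
            seen_ratio = sum(g in seen for g in grams) / len(grams)
            if seen_ratio > threshold:
                continue  # mostly seen before: drop as a (partial) duplicate
        # Documents shorter than n words have no n-grams and are always kept.
        kept.append(doc)
        seen.update(grams)  # only n-grams of kept documents are remembered
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about text corpora",
]
unique_docs = deduplicate(docs)
```

In this sketch the second document shares six of its seven trigrams with the first, so it is dropped, while the third shares none and is kept. Lowering the threshold makes the filter more aggressive about partial overlaps.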


Files in this item

This item is Publicly Available and licensed under: BSD 3-Clause "New" or "Revised" license
BSD Attribution Required

Name onion-1.1.tar.gz
Size 16.73 KB
Format application/x-gzip
Description Onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts
MD5 8d97874f25abc72015e8ef63fca62f64
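
The MD5 value listed for onion-1.1.tar.gz can be used to check a downloaded copy for corruption. The sketch below shows one way to do that with Python's standard hashlib module; the function names and the local file path in the usage note are assumptions, and MD5 is suitable only as an integrity check here, not as a security guarantee.

```python
import hashlib

# Checksum as listed in this record for onion-1.1.tar.gz.
EXPECTED_MD5 = "8d97874f25abc72015e8ef63fca62f64"


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved in the current directory, `verify_download("onion-1.1.tar.gz")` would return True for an intact download.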