onion

Pomikálek, Jan

dc.contributor.author	Pomikálek, Jan
dc.date.accessioned	2013-02-01T16:34:32Z
dc.date.available	2013-02-01T16:34:32Z
dc.date.issued	2011
dc.identifier.uri	http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7
dc.description	onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in C, licensed under the New BSD License, and released as open source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing n-grams of words of text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it can detect even partially similar duplicates (50 %) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
dc.description.sponsorship	PRESEMT, Lexical Computing Ltd
dc.language.iso	eng
dc.publisher	Masaryk University, NLP Centre
dc.rights	BSD 3-Clause "New" or "Revised" license
dc.rights.uri	http://opensource.org/licenses/BSD-3-Clause
dc.source.uri	http://code.google.com/p/onion/
dc.subject	deduplication
dc.subject	corpus
dc.subject	text deduplication
dc.subject	n-gram deduplication
dc.subject	n-gram model
dc.title	onion
dc.type	toolService
metashare.ResourceInfo#ContactInfo#PersonInfo.surname	Pomikálek
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName	Jan
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName	Natural Language Processing Centre, Faculty of Informatics Masaryk University
metashare.ResourceInfo#DistributionInfo.availability	restrictedUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse	attribution
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse	shareAlike
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium	downloadable
metashare.ResourceInfo#ValidationInfo.validated	True
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName	PRESEMT
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType	euFunds
metashare.ResourceInfo#TextInfo#SizeInfo.size	17
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit	kb
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email	jan.pomikalek@gmail.com
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	false
metashare.ResourceInfo#ContentInfo.detailedType	tool
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
demo.uri	http://code.google.com/p/onion/
sponsor	PRESEMT
sponsor	Lexical Computing Ltd.
size.info	17 kb
files.size	17127
files.count	1
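
The description above states that the deduplication algorithm is based on comparing n-grams of words and that it can detect partially similar duplicates (50 %). The following is a minimal Python sketch of that general idea only; it is not the author's implementation. The function names, the default n-gram length of 5, and the choice to record n-grams only from kept documents are assumptions made for illustration; the 0.5 threshold mirrors the 50 % figure in the description.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(documents, n=5, threshold=0.5):
    """Keep a document only if fewer than `threshold` of its n-grams
    were already seen in previously kept documents.

    Hypothetical helper for illustration; the parameter defaults are assumptions.
    """
    seen = set()
    kept = []
    for doc in documents:
        grams = word_ngrams(doc, n)
        if not grams:
            # Too short to form a single n-gram: nothing to compare, so keep it.
            kept.append(doc)
            continue
        duplicate_ratio = len(grams & seen) / len(grams)
        if duplicate_ratio < threshold:
            kept.append(doc)
            seen |= grams  # only kept documents contribute n-grams
    return kept
```

Holding all seen n-grams in an in-memory set is only practical for small inputs; the thesis cited in the description is the place to look for how the actual tool scales to terabytes of text.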

Files in this item

This item is Publicly Available and licensed under: BSD 3-Clause "New" or "Revised" license

Name: onion-1.1.tar.gz
Size: 16.73 KB
Format: application/x-gzip
Description: Onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts
MD5: 8d97874f25abc72015e8ef63fca62f64
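
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. A minimal sketch using Python's standard `hashlib` module follows; the function names and the chunk size are illustrative assumptions, and only the expected checksum is taken from this record.

```python
import hashlib

# Checksum as listed in this record for onion-1.1.tar.gz.
EXPECTED_MD5 = "8d97874f25abc72015e8ef63fca62f64"


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("onion-1.1.tar.gz")` should return `True` for an unmodified download, assuming the file was saved under that name in the current directory.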


File Preview

onion-1.1
- Makefile (584 B)
- README (680 B)
- Makefile.config (131 B)
- COPYING (1 kB)
- doc
  - .svn
    - text-base
    - entries (208 B)
    - all-wcprops (67 B)
    - tmp
    - props
    - prop-base
  - man
    - .svn
      - text-base
      - entries (213 B)
      - all-wcprops (71 B)
      - tmp
        - text-base
        - props
        - prop-base
      - props
      - prop-base
    - man1
      - .svn
        - text-base
          - hashdup.1.svn-base (1 kB)
          - onion.1.svn-base (5 kB)
          - hashgen.1.svn-base (1 kB)
        - entries (692 B)
        - all-wcprops (358 B)
        - tmp
          - text-base
          - props
          - prop-base
        - props
        - prop-base
      - hashgen.1 (1 kB)
      - hashdup.1 (1 kB)
      - onion.1 (5 kB)
- src
  - buzhash.h (1 kB)
  - Makefile (386 B)
  - hashgen.c (6 kB)
  - version.h (563 B)
  - buzhash.c (9 kB)
  - onion.c (17 kB)
  - version.c (713 B)
  - onion-binsearch.c (18 kB)
  - hashdup.c (4 kB)
