Files in this item
This item is Publicly Available and licensed under: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
- Name: justext-1.2.tar.gz
- Size: 732.59 KB
- Format: application/x-gzip
- Description: jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.
- MD5: ea88bcd10f5b88f3badd6fb74f45e852
- justext-1.2
- justext
- stoplists
- Albanian.txt (3 kB)
- Vietnamese.txt (1 kB)
- Romanian.txt (6 kB)
- Russian.txt (75 kB)
- Armenian.txt (44 kB)
- Dutch.txt (1 kB)
- Tagalog.txt (980 B)
- Nepali.txt (25 kB)
- Occitan.txt (3 kB)
- Icelandic.txt (4 kB)
- Serbo_Croatian.txt (16 kB)
- Belarusian.txt (64 kB)
- Volapuk.txt (533 B)
- English.txt (3 kB)
- Italian.txt (3 kB)
- Finnish.txt (64 kB)
- Waray_Waray.txt (47 B)
- Ido.txt (249 B)
- Bosnian.txt (15 kB)
- Slovenian.txt (16 kB)
- Haitian.txt (175 B)
- Belarusian_Taraskievica.txt (60 kB)
- Bulgarian.txt (14 kB)
- Georgian.txt (125 kB)
- Norwegian_Bokmal.txt (2 kB)
- French.txt (2 kB)
- Quechua.txt (2 kB)
- Latin.txt (13 kB)
- Bengali.txt (18 kB)
- Persian.txt (4 kB)
- Kannada.txt (157 kB)
- Samogitian.txt (14 kB)
- Gujarati.txt (5 kB)
- Urdu.txt (1 kB)
- Sicilian.txt (3 kB)
- Esperanto.txt (3 kB)
- Hindi.txt (4 kB)
- West_Frisian.txt (1 kB)
- Macedonian.txt (11 kB)
- Irish.txt (1 kB)
- Hungarian.txt (28 kB)
- Swahili.txt (2 kB)
- Igbo.txt (733 B)
- Welsh.txt (3 kB)
- Croatian.txt (17 kB)
- Norwegian_Nynorsk.txt (1 kB)
- Serbian.txt (20 kB)
- German.txt (4 kB)
- Asturian.txt (4 kB)
- Turkish.txt (25 kB)
- Javanese.txt (4 kB)
- Aromanian.txt (3 kB)
- Greek.txt (7 kB)
- Lithuanian.txt (42 kB)
- Hebrew.txt (44 kB)
- Basque.txt (17 kB)
- Newar.txt (1 kB)
- Aragonese.txt (918 B)
- Marathi.txt (38 kB)
- Lombard.txt (738 B)
- Czech.txt (24 kB)
- Indonesian.txt (6 kB)
- Kurdish.txt (3 kB)
- Tamil.txt (150 kB)
- Piedmontese.txt (419 B)
- Ukrainian.txt (73 kB)
- Slovak.txt (28 kB)
- Korean.txt (128 kB)
- Sundanese.txt (5 kB)
- Maltese.txt (10 kB)
- Bishnupriya_Manipuri.txt (950 B)
- Estonian.txt (49 kB)
- Telugu.txt (93 kB)
- Malayalam.txt (301 kB)
- Swedish.txt (4 kB)
- Arabic.txt (29 kB)
- Breton.txt (1 kB)
- Danish.txt (3 kB)
- Azerbaijani.txt (32 kB)
- Cebuano.txt (57 B)
- Chuvash.txt (34 kB)
- Polish.txt (28 kB)
- Luxembourgish.txt (2 kB)
- Neapolitan.txt (1 kB)
- Latvian.txt (29 kB)
- Afrikaans.txt (1 kB)
- Western_Panjabi.txt (2 kB)
- Walloon.txt (910 B)
- Yoruba.txt (1 kB)
- Spanish.txt (1 kB)
- Malay.txt (6 kB)
- Portuguese.txt (3 kB)
- Simple_English.txt (1 kB)
- Catalan.txt (1 kB)
- Low_Saxon.txt (1 kB)
- Galician.txt (2 kB)
- __init__.py (372 B)
- core.py (28 kB)
- stoplists
- bin
- justext (240 B)
- setup.py (939 B)
- README (238 B)
- COPYING (1 kB)
- ._COPYING (208 B)
- PKG-INFO (589 B)
- justext
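
The MD5 value listed for justext-1.2.tar.gz can be used to check that a downloaded copy is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are illustrative and not part of jusText, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed for justext-1.2.tar.gz on this page.
EXPECTED_MD5 = "ea88bcd10f5b88f3badd6fb74f45e852"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location of the downloaded archive (hypothetical path).
    archive = Path("justext-1.2.tar.gz")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually points to an incomplete or corrupted download. MD5 is adequate for that kind of integrity check, but it is not a safeguard against deliberate tampering.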