CorpusCatcher

Inactive

Commits : Listings

Analyzed 1 day ago, based on code collected 1 day ago.

Commit Message	Contributor	Date
consistently use <http://url.example.net> with <> but not (<>)	Alexander Dupuy	about 13 years ago
rename README -> README.rst for GitHub formatting	Alexander Dupuy	about 13 years ago
Reformat README as Markdown/reStructuredText	Alexander Dupuy	about 13 years ago
Added .gitignore	Julen Ruiz Aizpuru	about 13 years ago
Removed the feature that counts downloaded files	Laurette Pretorius	almost 14 years ago
svn path=/src/trunk/corpuscatcher/; revision=17530	Laurette Pretorius	about 14 years ago
Tries to align files in two folders - src and tgt language - using html structure, numbers and url correspondence	Laurette Pretorius	about 14 years ago
Improved pattern matching for urls by using regexes	Laurette Pretorius	about 14 years ago
Added an option (-e) to specify a pattern to be matched in the URLs to be downloaded.	Laurette Pretorius	about 14 years ago
Fixed a bug related to selecting encodings for html files	Laurette Pretorius	about 14 years ago
Assume that immediately consecutive lines are part of the same paragraph and join them. Split paragraphs in our outputs by two newlines.	Friedel Wolff	about 14 years ago
Some cleanup, simplification, reordering	Friedel Wolff	about 14 years ago
Better support for non-list output (output as running text)	Friedel Wolff	about 14 years ago
Suppress unnecessary warning about having the browser handle gzipped data	Walter Leibbrandt	over 16 years ago
Don't convert pages if there's nothing to convert.	Walter Leibbrandt	over 16 years ago
- Moved browser object initialization to a separate method (so that it's available to importing clients). - Added a "browser" parameter to download_url().	Walter Leibbrandt	over 16 years ago
Fixed a bug where only the last crawled URL (and its connections) are converted to text.	Walter Leibbrandt	over 16 years ago
Make corpuscatcher an importable module.	Walter Leibbrandt	over 16 years ago
Added support for handling more encodings.	Walter Leibbrandt	almost 17 years ago
- Added -V/--version command-line argument - Added more specific settings to the mechanize.Browser object used for crawling	Walter Leibbrandt	almost 17 years ago
Added -V/--version command-line argument.	Walter Leibbrandt	almost 17 years ago
Documentation updated: - Added LICENSE and __version__.py - README points the reader to the README on the wiki.	Walter Leibbrandt	almost 17 years ago
Fix copyright date	Friedel Wolff	almost 17 years ago
Correct copyright dates	Friedel Wolff	almost 17 years ago
Initial version of CorpusCatcher tools.	Walter Leibbrandt	almost 17 years ago
