
Commits : Listings

Analyzed about 20 hours ago, based on code collected 1 day ago.
Apr 17, 2023 — Apr 17, 2024
Commit Message | Date
consistently use <http://url.example.net> with <> but not (<>) | about 12 years ago
rename README -> README.rst for GitHub formatting | about 12 years ago
Reformat README as Markdown/reStructuredText | about 12 years ago
Added .gitignore | about 12 years ago
Removed the feature that counts downloaded files | almost 13 years ago
svn path=/src/trunk/corpuscatcher/; revision=17530 | almost 13 years ago
Tries to align files in two folders (src and tgt language) using HTML structure, numbers and URL correspondence | almost 13 years ago
Improved pattern matching for URLs by using regexes | almost 13 years ago
Added an option (-e) to specify a pattern to be matched in the URLs to be downloaded. | almost 13 years ago
Fixed a bug related to selecting encodings for HTML files | almost 13 years ago
Assume that immediately consecutive lines are part of the same paragraph and join them. Split paragraphs in our outputs by two newlines. | about 13 years ago
Some cleanup, simplification, reordering | about 13 years ago
Better support for non-list output (output as running text) | about 13 years ago
Suppress unnecessary warning about having the browser handle gzipped data | over 15 years ago
Don't convert pages if there's nothing to convert. | over 15 years ago
- Moved browser object initialization to a separate method (so that it's available to importing clients). - Added a "browser" parameter to download_url(). | over 15 years ago
Fixed a bug where only the last crawled URL (and its connections) was converted to text. | over 15 years ago
Make corpuscatcher an importable module. | over 15 years ago
Added support for handling more encodings. | over 15 years ago
- Added -V/--version command-line argument - Added more specific settings to the mechanize.Browser object used for crawling | almost 16 years ago
Added -V/--version command-line argument. | almost 16 years ago
Documentation updated: - Added LICENSE and __version__.py - README points the reader to the README on the wiki. | almost 16 years ago
Fix copyright date | almost 16 years ago
Correct copyright dates | almost 16 years ago
Initial version of CorpusCatcher tools. | almost 16 years ago