Black Duck Open Hub
CorpusCatcher
Inactive
Commits: Listings
Analyzed about 20 hours ago, based on code collected 1 day ago.
Apr 17, 2023 — Apr 17, 2024
Commit Message
Contributor
Date
consistently use <http://url.example.net> with <> but not (<>)
Alexander Dupuy
about 12 years ago
rename README -> README.rst for GitHub formatting
Alexander Dupuy
about 12 years ago
Reformat README as Markdown/reStructuredText
Alexander Dupuy
about 12 years ago
Added .gitignore
Julen Ruiz Aizpuru
about 12 years ago
Removed the feature that counts downloaded files
Laurette Pretorius
almost 13 years ago
svn path=/src/trunk/corpuscatcher/; revision=17530
Laurette Pretorius
almost 13 years ago
Tries to align files in two folders - src and tgt language - using html structure, numbers and url correspondence
Laurette Pretorius
almost 13 years ago
Improved pattern matching for URLs by using regexes
Laurette Pretorius
almost 13 years ago
Added an option (-e) to specify a pattern to be matched in the URLs to be downloaded.
Laurette Pretorius
almost 13 years ago
Fixed a bug related to selecting encodings for html files
Laurette Pretorius
almost 13 years ago
Assume that immediately consecutive lines are part of the same paragraph and join them. Split paragraphs in our outputs by two newlines.
Friedel Wolff
about 13 years ago
Some cleanup, simplification, reordering
Friedel Wolff
about 13 years ago
Better support for non-list output (output as running text)
Friedel Wolff
about 13 years ago
Suppress unnecessary warning about having the browser handle gzipped data
Walter Leibbrandt
over 15 years ago
Don't convert pages if there's nothing to convert.
Walter Leibbrandt
over 15 years ago
- Moved browser object initialization to a separate method (so that it's available to importing clients).
- Added a "browser" parameter to download_url().
Walter Leibbrandt
over 15 years ago
Fixed a bug where only the last crawled URL (and its connections) is converted to text.
Walter Leibbrandt
over 15 years ago
Make corpuscatcher an importable module.
Walter Leibbrandt
over 15 years ago
Added support for handling more encodings.
Walter Leibbrandt
over 15 years ago
- Added -V/--version command-line argument
- Added more specific settings to the mechanize.Browser object used for crawling
Walter Leibbrandt
almost 16 years ago
Added -V/--version command-line argument.
Walter Leibbrandt
almost 16 years ago
Documentation updated:
- Added LICENSE and __version__.py
- README points the reader to the README on the wiki.
Walter Leibbrandt
almost 16 years ago
Fix copyright date
Friedel Wolff
almost 16 years ago
Correct copyright dates
Friedel Wolff
almost 16 years ago
Initial version of CorpusCatcher tools.
Walter Leibbrandt
almost 16 years ago
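
One commit in the listing above describes joining immediately consecutive lines into a single paragraph and separating paragraphs in the output by two newlines. The sketch below illustrates that idea only; it is not taken from the CorpusCatcher source. The function name `join_paragraphs` is hypothetical, and the choices to treat blank lines as paragraph boundaries and to join lines with a single space are assumptions.

```python
def join_paragraphs(text: str) -> str:
    """Join runs of consecutive non-blank lines into paragraphs.

    Hypothetical helper, not CorpusCatcher's actual code. Assumes a blank
    (or whitespace-only) line marks a paragraph boundary.
    """
    paragraphs = []
    current = []  # lines collected for the paragraph being built
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    # Paragraphs are separated by two newlines in the output.
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line\nstill the same paragraph\n\nSecond paragraph\n"
    print(join_paragraphs(sample))
```

Joining with a space keeps words from adjacent lines apart; a tool handling languages written without spaces might prefer a different joiner.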