Tags: Browse Projects


Heritrix: Internet Archive Web Crawler

  Analyzed 23 minutes ago

The archive-crawler project is building a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

79K lines of code

11 current contributors

8 days since last commit

10 users on Open Hub

Low Activity
Tags: webcrawler

crawler4j

  Analyzed about 12 hours ago

Crawler4j is an open-source Java crawler which provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in five minutes.

Sample usage: first, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles each downloaded page. The following is a sample implementation:

    import java.util.ArrayList;
    import java.util.regex.Pattern;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        // The class body was truncated in the original listing; the two
        // methods below are reconstructed from the crawler4j WebCrawler API.
        public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            // Skip binary/media files and stay on one (hypothetical) host.
            return !filters.matcher(href).matches()
                    && href.startsWith("http://www.example.com/");
        }

        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }
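The extension-filter regex in the sample above can be exercised on its own, without crawler4j. This standalone sketch (class name and URLs are illustrative) shows which URLs the pattern rejects:

```java
import java.util.regex.Pattern;

// Standalone demo of the extension-filter regex from the MyCrawler sample.
public class FilterDemo {
    static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public static void main(String[] args) {
        // Binary/media URLs match the filter, so the crawler would skip them.
        System.out.println(FILTERS.matcher("http://example.com/report.pdf").matches());
        System.out.println(FILTERS.matcher("http://example.com/logo.jpeg").matches());
        // HTML pages do not match, so the crawler would visit them.
        System.out.println(FILTERS.matcher("http://example.com/index.html").matches());
    }
}
```

Because the pattern is anchored with `$`, only the URL's trailing extension matters; a URL such as `http://example.com/pdf/list.html` is still crawled.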

8.29K lines of code

5 current contributors

over 3 years since last commit

4 users on Open Hub

Inactive

LinkChecker

  Analyzed 4 months ago

Check websites and HTML documents for broken links.
* recursive and multithreaded checking
* output in colored or normal text, HTML, SQL, CSV, XML, or a sitemap graph in different formats
* support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet, and local file links
* restriction of link checking with regular expression filters for URLs
* proxy support
* username/password authorization for HTTP, FTP, and Telnet

45.2K lines of code

10 current contributors

4 months since last commit

3 users on Open Hub

Activity Not Available

Smart Cache Loader

  Analyzed less than a minute ago

Smart Cache Loader is a highly configurable web batch downloader. If you have very specific needs to grab portions of a web site, this is the right tool for you. The program can also be used as a web crawler if you need to crawl defined parts of one or more sites.

5.28K lines of code

0 current contributors

almost 4 years since last commit

2 users on Open Hub

Inactive

Spidr

  Analyzed about 13 hours ago

Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

4.39K lines of code

1 current contributor

3 months since last commit

1 user on Open Hub

Very Low Activity

Anemone

  Analyzed about 18 hours ago

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

2.15K lines of code

0 current contributors

almost 12 years since last commit

1 user on Open Hub

Inactive

mediacloud_backend

  Analyzed about 10 hours ago

MediaCloud backend repository

23K lines of code

0 current contributors

over 8 years since last commit

0 users on Open Hub

Inactive

Tachyon_project

  Analyzed about 7 hours ago

Tachyon is a fast web application security reconnaissance tool. It is specifically meant to crawl a web application and look for leftover or non-indexed files, and it reports pages or scripts that leak internal data.

2.32K lines of code

0 current contributors

15 days since last commit

0 users on Open Hub

Very Low Activity

heritrix-crawl-filter

  Analyzed 40 minutes ago

A candidate-chain processor for applying regular expression filters to URIs coming from defined seeds.

211 lines of code

0 current contributors

over 13 years since last commit

0 users on Open Hub

Inactive
Licenses: No declared licenses

Zeitcrawler

  Analyzed about 14 hours ago

A specialized crawler for the German newspaper 'Die Zeit'. Starting from the front page or from a given list of links, the crawler retrieves newspaper articles and gathers new links to explore as it goes, stripping the text of each article out of the HTML formatting and saving it into a raw text file. The project includes scripts to convert the output into XML format for further use with natural language processing tools.

1.64K lines of code

0 current contributors

about 10 years since last commit

0 users on Open Hub

Inactive