Crawler4j is an open-source Java crawler that provides a simple interface for crawling the web. With it, you can set up a multi-threaded web crawler in 5 minutes!
Sample Usage
First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:
import java.util.ArrayList;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Matches URLs that end in a stylesheet, script, image, audio, video,
    // document, or archive extension.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public My
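The sample listing is cut off above, so the following is only a minimal sketch of how the extension filter it defines behaves when applied to URLs. It uses java.util.regex alone and does not call Crawler4j. The class name UrlFilterDemo, the helper isFiltered, and the example.com URLs are assumptions made for illustration, as is the choice to lower-case each URL before matching.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class UrlFilterDemo {

    // Same extension pattern as in the sample crawler class above.
    static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Hypothetical helper (not part of Crawler4j): true when the URL ends in
    // one of the listed extensions. Lower-casing first is an assumption so
    // that ".JPEG" is treated the same as ".jpeg".
    public static boolean isFiltered(String url) {
        return FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        // Illustrative URLs only.
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/style.css",
            "http://example.com/photo.JPEG",
            "http://example.com/archive.tar.gz",
            "http://example.com/app.js?v=2"
        };
        for (String url : urls) {
            System.out.println((isFiltered(url) ? "skip  " : "crawl ") + url);
        }
    }
}
```

Because the pattern is anchored at the end of the string, a URL such as http://example.com/app.js?v=2 is not filtered: the query string follows the extension. A crawler that wants to skip such URLs would need to match against the path only.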
License
Permitted: Commercial Use, Modify, Distribute, Place Warranty, Sub-License, Private Use, Use Patent Claims
Forbidden: Hold Liable, Use Trademarks
Required: Include Copyright, State Changes, Include License, Include Notice
These details are provided for information only. Nothing here is legal advice, and it should not be used as such.
There are no reported vulnerabilities.