Inactive
Analyzed 1 day ago, based on code collected 1 day ago.

Project Summary

A specialized crawler for the German newspaper 'Die Zeit'.

Starting from the front page or from a given list of links, the crawler retrieves newspaper articles and gathers new links to explore as it goes. It strips the HTML formatting from each article and saves the remaining text to a raw text file.
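The crawl loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's own code: the names `ArticleParser` and `crawl`, the `fetch` callable, the `allowed_host` filter, and the `max_pages` limit are all assumptions made for the example, and the text extraction is deliberately naive (it keeps every visible text node instead of isolating the article body).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ArticleParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # content that is never article text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def crawl(seeds, fetch, allowed_host, max_pages=10):
    """Breadth-first crawl: returns a dict mapping each visited URL to its text.

    `fetch` is a hypothetical callable taking a URL and returning HTML as a
    string; passing it in keeps the sketch free of network code.
    """
    frontier = deque(seeds)   # links still to explore
    seen = set(seeds)         # links already queued, to avoid revisiting
    articles = {}
    while frontier and len(articles) < max_pages:
        url = frontier.popleft()
        parser = ArticleParser(url)
        parser.feed(fetch(url))
        parser.close()
        articles[url] = "\n".join(parser.text_parts)
        for link in parser.links:
            # Stay on the newspaper's own host and skip known links.
            if urlparse(link).netloc == allowed_host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return articles
```

In a real run, `fetch` could wrap `urllib.request.urlopen`, and each entry of the returned dictionary could be written to its own raw text file.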

The project includes scripts to convert the raw text into XML for further use with natural language processing tools.
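As a rough idea of what such a conversion step might look like, here is a short Python sketch. The XML layout is an assumption made purely for illustration (an `article` element with a `url` attribute and one `p` element per non-empty line); the project's actual schema is not described here, and `article_to_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def article_to_xml(url, raw_text):
    """Wrap one article's raw text in a simple, assumed XML structure."""
    root = ET.Element("article", {"url": url})
    for paragraph in raw_text.split("\n"):
        if paragraph.strip():
            # One <p> element per non-empty line; special characters
            # such as & are escaped by ElementTree on serialization.
            ET.SubElement(root, "p").text = paragraph.strip()
    return ET.tostring(root, encoding="unicode")
```

Building the tree with `xml.etree.ElementTree` instead of concatenating strings keeps the output well-formed even when the article text contains characters such as `&` or `<`.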

Tags

academic computational_linguistics corpus corpus_linguistics crawler digital_humanities natural_language_processing nlp perl unix webcrawler xml

In a Nutshell, Zeitcrawler...

GNU General Public License v3.0 only

Permitted: Commercial Use, Modify, Distribute, Place Warranty, Use Patent Claims

Forbidden: Sub-License, Hold Liable

Required: Distribute Original, Disclose Source, Include Copyright, State Changes, Include License, Include Install Instructions

These details are provided for information only. Nothing here is legal advice, and it should not be used as such.

This project has no vulnerabilities reported against it.


Languages

Perl: 37%
Java: 31%
Python: 21%
2 Other: 11%

