A specialized crawler for the German newspaper 'Die Zeit'.
Starting from the front page or from a given list of links, the crawler retrieves newspaper articles and gathers new links to explore as it goes. It strips the text of each article out of the HTML markup and saves it to a raw text file.
The project includes scripts to convert the raw text into XML for further use with natural language processing tools.
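The crawl-and-convert flow described above can be illustrated with a minimal Python sketch. This is not the project's actual code: the names `ArticleParser`, `crawl`, and `to_xml`, the `<corpus>`/`<article>`/`<p>` XML layout, the `www.zeit.de` host filter, and the choice to treat every `<p>` element as article text are all assumptions made for illustration. The sketch also keeps the extracted text in memory rather than writing raw text files.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET


class ArticleParser(HTMLParser):
    """Collects paragraph text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def crawl(seeds, fetch, max_pages=10, host="www.zeit.de"):
    """Breadth-first crawl; `fetch` is any callable mapping a URL to HTML."""
    queue, seen, articles = deque(seeds), set(seeds), {}
    while queue and len(articles) < max_pages:
        url = queue.popleft()
        parser = ArticleParser(url)
        parser.feed(fetch(url))
        articles[url] = "\n".join(parser.paragraphs)
        for link in parser.links:
            # Assumed policy: only follow unseen links on the same host.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return articles


def to_xml(articles):
    """Wrap the raw text in a simple, hypothetical XML layout."""
    root = ET.Element("corpus")
    for url, text in articles.items():
        doc = ET.SubElement(root, "article", url=url)
        for para in text.split("\n"):
            if para:
                ET.SubElement(doc, "p").text = para
    return ET.tostring(root, encoding="unicode")
```

Passing `fetch` in as a parameter keeps the sketch testable without network access. For a live run, a function built on `urllib.request.urlopen` could be supplied instead, ideally with rate limiting and respect for the site's robots.txt.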
Commercial Use
Modify
Distribute
Place Warranty
Use Patent Claims
Sub-License
Hold Liable
Distribute Original
Disclose Source
Include Copyright
State Changes
Include License
Include Install Instructions
These details are provided for information only. Nothing here is legal advice, and it should not be used as such.