NlpTools is a library for natural language processing written in PHP. Its development is driven by the author's needs for text classification, clustering, tokenizing, stemming, etc.
A specialized crawler for the French sport newspaper L'Équipe.
Starting from the front page or from a given list of links, the crawler retrieves newspaper articles and gathers new links to explore as it goes, stripping the text of each article out of the HTML formatting and saving it into a raw text file.
The project includes scripts to convert it into the XML format for further use with natural language processing tools.
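The crawl loop described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the project's actual code: the names `PageParser` and `crawl`, the `fetch` callback and the `max_pages` limit are all assumptions made for the example. Writing each extracted text to a raw text file is left to the caller.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Hypothetical helper: collects visible text and href targets from one page."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside <script> and <style> elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def crawl(seeds, fetch, max_pages=10):
    """Breadth-first crawl from the seed URLs; returns {url: plain_text}.

    `fetch` is an assumed callback taking a URL and returning the HTML
    as a string, or None if the page could not be retrieved.
    """
    queue = deque(seeds)
    seen = set(seeds)
    texts = {}
    while queue and len(texts) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        texts[url] = "\n".join(parser.text_parts)
        # Gather new links to explore, resolving relative URLs.
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return texts
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client. A real crawler for a specific newspaper would also restrict links to article pages and select only the article body, which this sketch does not attempt.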
Tools to crawl German official speeches repositories in order to gather a corpus.
More information to come.
A complete version of the corpus including a visualization tool is available here: http://purl.org/corpus/german-speeches
A specialized crawler for the German newspaper 'Die Zeit'.
Starting from the front page or from a given list of links, the crawler retrieves newspaper articles and gathers new links to explore as it goes, stripping the text of each article out of the HTML formatting and saving it into a raw text file.
The project includes scripts to convert it into the XML format for further use with natural language processing tools.
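As an illustration of the conversion step, the sketch below wraps a raw text article in a small XML document using Python's standard library. The XML schema used by the project is not described here, so the `article` and `p` element names, the `source` attribute and the function name `text_to_xml` are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


def text_to_xml(raw_text, source_url):
    """Wrap plain text in XML, one <p> per non-empty line (hypothetical schema)."""
    root = ET.Element("article", {"source": source_url})
    for line in raw_text.splitlines():
        line = line.strip()
        if line:
            ET.SubElement(root, "p").text = line
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")
```

Building the tree with `ElementTree` rather than by string concatenation means that characters such as `&` and `<` in the article text are escaped automatically.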