WebScraper uses the Integrity v6 engine to quickly scan a website, and can output the data (currently) as CSV or JSON.
- Easy to scan a site – just enter the starting URL and press “Go”
- Easy to export – choose the columns you want
- Plenty of extraction options, including HTML elements with certain classes or IDs, regular expressions, or entire content in a number of formats (html, plain text, markdown)
- Configuration of various limits on the crawl and the output file size
- Adds option to simply add a column for 'h1' through to 'h4' (under 'Content' in the 'Add a column' dialog. (Useful for extracting info within a heading that doesn't have a class or id).
- Enhances the 'list of urls' functionality. If your starting point is a local list of urls in plain text, and if they are 'deep links' rather than domains, then the 'down but not up' rule will apply unless the 'crawl above starting directory' checkbox is ticked.
- Small interface glitch corrected. If scan was run to completion, small changes made to the configuration, Go pressed again, the scan would proceed, clearing previous data but the counter within the url / progress field (to the right) would not be reset.
- Inherits any recent changes within the Integrity crawling engine
OS X 10.8 or later, 64-bit processor