WebScraper uses the Integrity v6 engine to quickly scan a website, and can currently output the data as CSV or JSON.
- Easy to scan a site – just enter the starting URL and press “Go”
- Easy to export – choose the columns you want
- Plenty of extraction options, including HTML elements with certain classes or IDs, regular expressions, or the entire content in a number of formats (HTML, plain text, Markdown)
- Configuration of various limits on the crawl and the output file size
- Adds the ability to download images to a folder during the scan. See Complex setup > Output file columns > Also download images to folder.
- Images can optionally be downloaded only if they match a pattern, either a partial URL or a regex match (leave the box blank to download all images discovered)
- Adds option to filter the output file, i.e. only include data from certain pages (e.g. information pages or product pages). This is done by matching the URL of the page being scraped, either by partial URL (e.g. /product/) or by a regex match
- Fixes an issue with saving a project. Note that saving a project does not save data, only settings and configuration. Save data separately using Export from the Results screen or File > Export
- Incorporates the version 8 crawling engine which has many improvements
- Adds 'limit requests to X per minute' control
- Updates pre-defined user-agent strings
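
The difference between a partial-URL match and a regex match, as used by the image and output filters above, can be illustrated with a short Python sketch. This is not WebScraper's own code: the function name `url_matches`, the `use_regex` flag, and the example.com URLs are all hypothetical, and treating a blank pattern as "match everything" for both filters is an assumption based on the image-download note.

```python
import re


def url_matches(url: str, pattern: str, use_regex: bool = False) -> bool:
    """Return True if the URL passes the filter (hypothetical helper).

    Assumption: a blank pattern lets every URL through.
    """
    if not pattern:
        return True
    if use_regex:
        # Regex match: the pattern may match anywhere in the URL.
        return re.search(pattern, url) is not None
    # Partial match: a plain substring test.
    return pattern in url


# Hypothetical pages discovered during a crawl.
pages = [
    "https://example.com/product/123",
    "https://example.com/about",
    "https://example.com/product/456/reviews",
]

# A partial URL keeps every page whose URL contains "/product/".
partial = [u for u in pages if url_matches(u, "/product/")]

# A regex can be stricter, here keeping only URLs that end in a numeric product ID.
regex = [u for u in pages if url_matches(u, r"/product/\d+$", use_regex=True)]
```

In this sketch the partial match keeps both product URLs, while the regex keeps only the one ending in a numeric ID, which is the kind of case where a regex is worth the extra effort.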
OS X 10.8 or later, 64-bit processor