WebScraper uses the Integrity v6 engine to scan a website quickly, and can currently output the data as CSV or JSON.
- Easy to scan a site – just enter the starting URL and press “Go”
- Easy to export – choose the columns you want
- Plenty of extraction options, including HTML elements with certain classes or IDs, regular expressions, or the entire content in a number of formats (HTML, plain text, Markdown)
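WebScraper is a GUI app, so you never write this yourself, but the class-based extraction it describes can be sketched with Python's standard-library `html.parser` (the parser class and its names here are illustrative, not the app's internals; the sketch assumes well-formed markup without void tags inside matches):

```python
from html.parser import HTMLParser

class ClassExtractor(HTMLParser):
    """Collect the text of elements carrying a target class attribute."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # > 0 while inside a matching element
        self.results = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.target_class in classes:
            if self.depth == 0:
                self._buf = []
            self.depth += 1
        elif self.depth:
            self.depth += 1   # nested tag inside a match

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)

html = '<div class="price">9.99</div><p>ignored</p><span class="price">4.50</span>'
parser = ClassExtractor("price")
parser.feed(html)
print(parser.results)  # ['9.99', '4.50']
```

Note that the match is on the class attribute, not the tag name, which is what lets the same mechanism work for divs, spans and (as of this release) headings.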
- Configurable limits on the crawl and on the output file size
- Improves the 'output file column builder' table: columns now appear as columns rather than rows as before, which should be easier to use. You can drag the columns to reorder them, edit their headings, edit a column's configuration, or delete a column.
- Improves the output file filter (previously called 'information page contains'). It can now be regarded as a 'select where' clause and supports a number of rules, combined with AND or OR. Each rule can use a 'contains' partial match or a regular expression, can be set to contains / doesn't contain, and can be applied to the URL or to the entire content.
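To make the 'select where' idea concrete, here is a minimal Python sketch of such a rule engine. The rule dictionary keys (`field`, `kind`, `pattern`, `negate`) and function names are invented for illustration and are not the app's actual configuration format:

```python
import re

def rule_matches(rule, url, content):
    """Evaluate one rule against either the URL or the page content."""
    subject = url if rule["field"] == "url" else content
    if rule["kind"] == "regex":
        hit = re.search(rule["pattern"], subject) is not None
    else:  # 'contains' partial match
        hit = rule["pattern"] in subject
    # 'negate' models the contains / doesn't-contain toggle
    return not hit if rule.get("negate") else hit

def select_where(rules, mode, url, content):
    """Combine a list of rules with AND or OR, like an output filter."""
    results = (rule_matches(r, url, content) for r in rules)
    return all(results) if mode == "AND" else any(results)

rules = [
    {"field": "url", "kind": "contains", "pattern": "/blog/"},
    {"field": "content", "kind": "regex", "pattern": r"\d{4}-\d{2}-\d{2}"},
]
keep = select_where(rules, "AND", "https://example.com/blog/post",
                    "Published 2024-01-15")
print(keep)  # True: the URL contains /blog/ AND the content matches the date regex
```

A page only becomes a row in the output file when the combined rules evaluate true, which is also why the page count described below can differ from the number of pages discovered.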
- Adds a proper links table, which can be used to collect and list all link target URLs discovered along the way, and optionally image URLs too. The list can be filtered to show just links, just images, internal, external, redirected, or PDF documents.
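The internal / external / PDF distinctions in that filter can be sketched with the standard library's `urllib.parse` (the function and labels here are an illustrative sketch, not WebScraper's own logic; redirect detection would need live HTTP responses, so it is omitted):

```python
from urllib.parse import urlparse

def classify_link(link_url, site_url):
    """Label a discovered link as internal or external, and flag PDFs."""
    link_host = urlparse(link_url).netloc
    site_host = urlparse(site_url).netloc
    labels = {"internal" if link_host == site_host else "external"}
    if urlparse(link_url).path.lower().endswith(".pdf"):
        labels.add("pdf")
    return labels

print(classify_link("https://example.com/report.pdf", "https://example.com/"))
print(classify_link("https://other.org/page", "https://example.com/"))
```

Comparing hostnames is the simplest internal/external test; a real crawler might also treat subdomains of the starting host as internal.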
- Adds the ability to easily extract headings (h1-h6) with a particular class or ID (previously the class/ID method was limited to divs, spans, p's and dd's)
- Alters the 'count' displayed at the right of the address bar. It now shows the number of pages scraped, which equals the number of rows in the output table. Previously it was a count of pages discovered, which may not be the same number now that rules can act as an output filter.
- Fixes a recently-introduced bug that prevented output columns from saving properly in a saved project
OS X 10.8 or later, 64-bit processor