WebScraper 4.6.0 – Scan and output website data as CSV or JSON.

Appked / Utilities, on 2018-10-13 08:18

WebScraper uses the Integrity v6 engine to scan a website quickly, and can currently output the data as CSV or JSON.

  • Easy to scan a site – just enter the starting URL and press “Go”
  • Easy to export – choose the columns you want
  • Plenty of extraction options, including HTML elements with certain classes or IDs, regular expressions, or the entire content in a number of formats (HTML, plain text, Markdown)
  • Configuration of various limits on the crawl and the output file size
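
As a rough illustration of the "choose the columns you want" export step, here is a minimal Python sketch. It is not WebScraper's code: the function name `export`, the `pages` list of dictionaries and the column names are all hypothetical, chosen only to show how the same scanned rows could be written as either CSV or JSON.

```python
import csv
import io
import json


def export(pages, columns, fmt):
    """Return the chosen columns of each scanned page as CSV or JSON text.

    `pages` is a list of dicts (one per page); `columns` is the ordered
    list of keys to keep. Both are hypothetical stand-ins for scan results.
    """
    # Keep only the requested columns; missing values become empty strings.
    rows = [{c: page.get(c, "") for c in columns} for page in pages]
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError("fmt must be 'csv' or 'json'")
```

For example, `export(pages, ["url", "title"], "csv")` would produce a header line followed by one line per page, dropping any other fields the scan collected.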


Version 4.6.0:

  • Class and Regex helpers are more helpful:
    • Able to select text in Class helper to see which classes apply and choose the most appropriate one
    • Able to press 'Use this' in Regex helper to insert the current expression into the 'add column' dialog and return to that dialog
    • Replaces 'press return to test' with a 'test' button, so that pressing return no longer has an unexpected effect
    • Related to the above, fixes a problem that has existed since the class and regex helper windows became one. If the helper is switched from class to regex or vice versa before an expression or class is chosen, then the 'add column' dialog will now show the appropriate tab with the expression or class filled in.
  • Helps you to write the regular expression:
    • Copy and paste a suitable chunk from the source code, select the part that you actually want to collect, then press the new "(xyz)" button
    • If the part you want to collect is a decimal number, press the "(123)" button
    • Press the "XYZ" button to replace selected parts of the pasted code that you don't want to collect but that may be different on each page
    • Press the "↵" button to replace all whitespace with a suitable expression fragment. This makes the expression more robust by allowing for invisible space to vary between pages
  • Expression field within Regex helper is smarter:
    • Automatically trims whitespace and return characters from each end of pasted source code. In a single-line text field, such stray characters are often a cause of frustration and confusion
    • Automatically replaces return characters within multi-line pasted code to make it more reliable and make sure that everything is visible in the single-line text field
  • Adds user preferences for many of these things
  • Adds a preference to save multiple files when exporting to CSV format if the output is bigger than 64k rows. Older versions of Excel (and current versions of Numbers) have a limit of 64k rows. This preference is off by default, and the decision needs to be made before running the scan because the output is split while scanning.
  • Small fixes:
    • When a project was saved, the output filter switch ("all of the following / one of the following") wasn't being saved. This is now fixed
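
The regex-writing steps in the notes above (trim the pasted code, mark the part to collect, mark parts that vary between pages, loosen the whitespace) can be sketched in Python. This is an illustration only, not WebScraper's implementation: the function name `build_pattern`, its parameters, and the fragments `(.+?)`, `(\d+(?:\.\d+)?)`, `.+?` and `\s+` are assumptions, since the release notes do not say which fragments the buttons actually insert.

```python
import re


def build_pattern(snippet, collect, varying=(), number=False):
    """Turn a pasted source snippet into a regex that captures `collect`.

    collect -- the selected text to capture (the "(xyz)" / "(123)" step)
    varying -- selected texts that differ per page (the "XYZ" step)
    number  -- capture a decimal number instead of arbitrary text
    """
    # Trim whitespace and returns from each end of the pasted code.
    snippet = snippet.strip()

    # Assumed replacement fragments for the marked parts.
    marks = {collect: r"(\d+(?:\.\d+)?)" if number else r"(.+?)"}
    for text in varying:
        marks[text] = r".+?"

    # Split the snippet on the marked parts, keeping them in the result.
    # Longer marks go first so they win over any shorter overlapping mark.
    alternatives = sorted(marks, key=len, reverse=True)
    splitter = "(" + "|".join(re.escape(m) for m in alternatives) + ")"

    parts = []
    for piece in re.split(splitter, snippet):
        if piece in marks:
            parts.append(marks[piece])
        else:
            # Escape literal text and let every whitespace run (including
            # returns inside multi-line pastes) match any whitespace.
            words = re.split(r"\s+", piece)
            parts.append(r"\s+".join(re.escape(w) for w in words))
    return "".join(parts)


pattern = build_pattern(
    '  <span class="price">\n   $19.99</span>  ', "19.99", number=True
)
match = re.search(pattern, '<span  class="price"> $7.5</span>')
```

Here `match.group(1)` holds the number found on the second page even though its spacing differs from the pasted sample. Note that every occurrence of a marked text in the snippet is replaced, which a real helper working from a text selection would avoid.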

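The notes above say the CSV output is split while scanning, which is why the preference has to be set before the scan starts. A minimal Python sketch of that idea follows, assuming rows arrive one at a time from the crawl. The function name `write_split_csv`, the `-1.csv`, `-2.csv` file naming and the choice to count the header row towards the limit are all assumptions, not documented WebScraper behaviour.

```python
import csv

MAX_ROWS = 65536  # "64k" rows; assumed here to include the header row


def write_split_csv(rows, header, basename, max_rows=MAX_ROWS):
    """Stream rows to basename-1.csv, basename-2.csv, ... and return the paths.

    A new file is started, with its own header, whenever the current
    file has reached max_rows lines.
    """
    paths = []
    out = None
    writer = None
    count = 0
    try:
        for row in rows:
            if writer is None or count >= max_rows:
                if out is not None:
                    out.close()
                path = f"{basename}-{len(paths) + 1}.csv"
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                count = 1  # the header occupies one row
                paths.append(path)
            writer.writerow(row)
            count += 1
    finally:
        if out is not None:
            out.close()
    return paths
```

Because rows are written as they arrive, nothing has to be held in memory or re-split afterwards; the trade-off is that the limit must be known before the first row is written.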

OS X 10.8 or later, 64-bit processor