Screaming Frog is a great tool, and if you know how to use it, it can be incredibly powerful. Take spidering large sites, for example. The site I'm using in this example currently has over a million pages indexed in Google. Compared with some of the sites on the internet that's a drop in the ocean, but you still wouldn't call it small by any means, yet time and time again I see or hear of people struggling to crawl a site of this size.
I'm going to run you through how you might spider this site. First things first, you'll need to identify how the site is set up and where all these URLs are coming from. In this case it's an ecommerce site, so the usual suspects are faceted navigation, pagination and site search; there are others, but these are the main offenders.
Once you have identified these, there is generally a bulk fix, or at least a few bulk fixes for sections of the site. This is where Screaming Frog's resources get eaten up: crawling pages that don't need to be crawled. We've identified the issues and put the correct directives in place for the search engines to abide by, so now we need to exclude those pages from our crawls and concentrate on the site's main pages. Luckily, SF has a handy feature for exactly this: Exclude!
Here's how to use it.
Set your spider going: do your normal set-up, specify what you want to crawl, how fast, and so on.
Look for further duplicating URLs; these will likely be in the form of query strings, so look for anything with a '?' in it. Once you've found these, make sure they were part of your first assessment; if not, then obviously address them!
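If you'd rather do this check outside the spreadsheet, the same filter is a one-liner. A minimal sketch (the URLs here are made up for illustration):

```python
# Pull out any crawled URLs that carry a query string, i.e. contain a '?'.
# These example URLs are hypothetical stand-ins for your own crawl export.
urls = [
    "https://example.com/shoes",
    "https://example.com/shoes?colour=red",
    "https://example.com/search?q=boots&page=2",
]

query_urls = [u for u in urls if "?" in u]
print(query_urls)
```

Anything this surfaces is a candidate for your exclude list later on.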
At this point just stop the spider (it's likely showing 6% or something like that and not going anywhere), add the URLs you identified from your initial review as well as the ones from your mini crawl, and prepare for a little Excel trickery.
Once in Excel, highlight the cells and use 'Text to Columns' to strip the query string from the domain. If you're not sure what to do, simply navigate to the 'DATA' tab and select 'Text to Columns', then select 'Delimited'. Next, tick 'Other:' and add in a question mark as seen in the following screenshot, then click Finish:
Next, replace the cells in column A with a question mark. Select the cells in column B and run through the 'Text to Columns' process again, this time with an equals sign, again as shown in the following screenshot. (In case you haven't realised what's going on, you are basically splitting the strings by a character you specify.)
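The two 'Text to Columns' passes above amount to splitting each URL first on '?' and then on '='. Here's the same idea as a quick sketch, with made-up URLs standing in for your crawl data:

```python
# Hypothetical example URLs; substitute the ones from your own crawl.
urls = [
    "https://example.com/shoes?colour=red",
    "https://example.com/search?q=boots",
]

rows = []
for url in urls:
    path, _, query = url.partition("?")     # first pass: split on '?'
    param, _, value = query.partition("=")  # second pass: split on '='
    rows.append((path, param, value))

print(rows)
```

Each row ends up as (path, parameter, value), which is exactly the layout the Excel steps leave you with.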
The Exclude feature reads regex, so this needs to be added into the strings when you concatenate them back together. If you don't know regex, you can in most cases get by with the examples given on the Screaming Frog site. You can see below what I have done to block the spider from those pages.
Finally, open up your previous crawl and clear its data, then go to Configuration > Exclude, copy and paste in your regex, hit OK and restart your crawl.
You may have to add a few more to the list, but now you should be able to collect all the metadata etc. for the pages of the site you actually care about. You could just as easily do this on the fly, but I have given this example in Excel in case you have a large number of URLs and want to manipulate the data en masse.
Always remember to action these pages before you strip them out!