Link Removals – Gathering & Cleaning The Data

by Craig Addyman
Bronco - Digital Marketing Agency

I've thought about writing this for some time now, as I think it must be difficult for in-house SEOs and marketing managers to tackle something like a link analysis and the removal of bad links. Sure, if you only have a handful of links pointing to your site then it's a fairly easy process, but as your backlink profile grows, the task of reviewing those links if an issue arises, or better yet reviewing them before anything catastrophic happens, can be a real challenge.

I'm going to attempt to give you a process to make your own analysis more manageable. This isn't strictly the analysis itself but more the cleaning and processing of your data, ready for your review and analysis.

This is a process that roughly mirrors what we do at Bronco, the (BIG) difference being that we have our own purpose-built crawlers and app UI, as well as a few other hidden treasures. You can see an early version of this tool here. This post is particularly for those who have tight budgets and don't have developers to hand.

The difference in the process I am going to share with you is that you'll end up working out of a spreadsheet.

Why should you manually review and remove?

Note: Just to get this covered, you might have reasons for or against removing links depending on your own situation, or on your overall business strategy and the tactics used to implement that strategy, which is fine; this post is just for those who want to remove links.

Now, getting back to the question, I think this is a two-part answer. Firstly, you should know what is linking to your site so you can remove anything that may be negative, and manually reviewing links is pretty much the only way you are going to be able to do this; Google can't fully identify paid links, and neither can any software you are about to buy.

Secondly, why you should remove links: there is lots of chatter from agencies and consultants about only using the disavow tool that Google provides – in fact, some people were quick to say this just weeks after the disavow tool came out. I'm not here to discuss or argue that point; I and we (Bronco) are firmly in the removal camp, and here's why:

  • Disavow files can be deleted and/or replaced, we have seen it happen
  • We don't know how disavow will be handled in the future

Even if you are only going to disavow, you still need a process for gathering and reviewing your links.

Gathering Link Data

To give you a quick rundown of what we are about to do, so it makes a little more sense as we do it: we are going to collect a large volume of link data, which will then be 'cleaned' by removing duplicates and checking what is actually still live. We will then gather the results ready for review.

So, first things first, we need to find all the links that are pointing to your site, and you want to get as many of these as you can. If you have been actively link building over the years, grab a list of all those links; the same applies if someone else has been doing it for you.

Next you'll want to log in to Google Webmaster Tools and, in the sidebar, go to Links to Your Site.

google-webmaster-latest-links

Within the Who Links the Most section you will find Download Latest Links.

Download-Latest-Links

Next, head over to ahrefs.com; here we again want to download all available backlinks that point to your site.

ahrefs-links

Finally, go to majestic.com and do the same, but be sure to select their 'Historic Index'.

majestic-links

When downloading your links, always run them off the root domain.

Both ahrefs and Majestic have impressive datasets, but I have yet to see a Webmaster Tools (WMT) download that didn't contain some unique domains, so always get those from WMT too (the issue with only downloading from WMT is that the datasets are usually limited once you get over a certain number of links – more on this later).

We should now be in a position where we have all three files. The next thing we want to do is grab all the referring links from each file, remove the duplicates and then write them to a new file. There are a few ways you might do this; the most obvious is manually using Microsoft Excel, which is fine for smaller datasets, but you are really going to struggle with anything over a certain size. Instead you will want to drop the data into a database (the preferred method) or run some kind of script to do the job for you.

The following is a Python script I wrote that opens each file and reads a particular column number (the index starts at 0).

Note: We are going to be running a number of Python scripts, so now might be a good time to brush up on the basics of installing and running Python scripts.

# -*- coding: utf-8 -*-
import csv
 
def find_unique_urls(filename, column_number):
    '''
    Returns a list of URL from a specified column/file.
    '''
    url_lis = []
    with open(filename) as f:
        data = csv.reader(f)
        data.next() # skips the header
        try:
            for line in data:
                url_lis.append(line[column_number].strip())
        except Exception as e:
            print 'error:', e
    return url_lis
 
 
 
results = find_unique_urls('ahrefs.csv', 3) + find_unique_urls('WMT.csv', 0) + find_unique_urls('Majestic.csv', 1) 
with open('output.csv','wb') as f:
    w = csv.writer(f)
    for i in set(results):
        w.writerow([i])
 
print 'done'

The line beginning results = find_unique_urls('ahrefs.csv', 3) towards the end of the script is where you will want to change the file names and column numbers to match your own data (double-check the columns as they do tend to change from time to time). In this example the referring-link column is 3 for ahrefs, 0 for Webmaster Tools and 1 for Majestic.

OK, so moving on, we now have a file of unique values ('output.csv'), which you might now be able to open with Excel, or at least a text editor. For clarity: I couldn't open the Majestic CSV download for this blog, but I could open the combined file of just the referring-link columns. The file may still contain a small number of duplicates; if so, these will be caused by uppercase vs. lowercase URLs.
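If those case-only duplicates bother you, the sketch below is one way to normalise them. Treat it as a best-effort clean-up under an assumption of mine: lowercasing the whole URL is usually fine for this job, but paths can be case-sensitive, so it may very occasionally merge two genuinely different pages. The output_deduped.csv name is just a placeholder.

# -*- coding: utf-8 -*-
# Rough sketch: case-insensitive de-duplication of output.csv.
# Lowercasing the whole URL may occasionally merge two genuinely
# different (case-sensitive) pages, so treat as best-effort.
import csv

urls = set()
with open('output.csv') as f:
    for row in csv.reader(f):
        if row:
            urls.add(row[0].strip().lower())

with open('output_deduped.csv', 'wb') as f:  # placeholder file name
    w = csv.writer(f)
    for url in sorted(urls):
        w.writerow([url])

print 'done'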

My new file has 87,706 links in it that we need to crawl to see which are actually still live. As we know, the web moves fast, and links that are live one minute are gone the next. Link database companies like ahrefs and Majestic understandably hold on to the data they collect (so we can look at a site's link history), but when it comes to removals those links might not be live any more – which means the data requires cleaning.

The following is another script that will read our output.csv file, strip the links to their respective domains and count their frequencies. I did attempt this with Excel to keep the coding to a minimum, but it crashed. Sigh. The script is similar to the last; again, just edit the file name in the results = find_domain_counts('output.csv', 0) line if you've changed it.

# -*- coding: utf-8 -*-
 
import csv
import collections
from urlparse import urlparse
 
def find_domain_counts(filename, column_number):
    '''
    Returns a dict of domains and their occurrences.
    '''
    domain_lis = []
    with open(filename) as f:
        data = csv.reader(f)
 
        try:
            for line in data:
                o = urlparse(line[column_number])
                domain_lis.append(o.netloc)
        except Exception as e:
            print 'error:', e
 
    counter = collections.Counter(domain_lis)
 
    return {domain:value for (domain, value) in counter.iteritems()}
 
 
 
results = find_domain_counts('output.csv', 0) 
with open('domainsout.csv','wb') as f:
    w = csv.writer(f)
    for i, v in results.items():
        w.writerow([i, v])
 
print 'done'

After running this you'll see a new file called 'domainsout.csv'. Open this up and give it a sense check, ready for what we will cover next; we are essentially trying to avoid requesting the same site thousands of times, so I'll let you use your best judgement here. If some domains are pointing high volumes of links at your site then you may want to remove them from the dataset and find out why – in most cases it will be a sidebar or footer link.
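If you want to handle those heavy linkers programmatically rather than by eye, here is a rough sketch that splits domainsout.csv on an arbitrary threshold (100 links, purely illustrative) so sitewide-looking domains can be set aside for separate review; the output file names are placeholders too.

# -*- coding: utf-8 -*-
# Rough sketch: separate domains pointing large numbers of links at the site
# (often sidebar/footer links) from the rest. The threshold and the output
# file names are illustrative only.
import csv

THRESHOLD = 100  # pick a number that suits your own data

with open('domainsout.csv') as f, \
     open('domains_heavy.csv', 'wb') as heavy_f, \
     open('domains_rest.csv', 'wb') as rest_f:
    heavy = csv.writer(heavy_f)
    rest = csv.writer(rest_f)
    for row in csv.reader(f):
        if not row:
            continue
        domain, count = row[0], int(row[1])
        if count >= THRESHOLD:
            heavy.writerow([domain, count])
        else:
            rest.writerow([domain, count])

print 'done'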

Now that we have our unique URLs, sense-checked the number of requests we are going to be making and potentially dealt with any domains pointing high numbers of links at our own, it's time to see which links are actually live. I'm going to suggest two ways of doing this…

The first and potentially more user-friendly way is to drop them into Screaming Frog: fire it up and put it in List Mode. (As mentioned, we don't personally use this at Bronco.)

screaming-frog-site-crawler-list-mode

We also want to add a Custom field, which you'll find under the Configuration drop-down. Here you want to add the domain that is being linked to, which will allow Screaming Frog to filter any page that contains it.

screaming-frog-custom-fields

Next, copy all the URLs from the original 'output.csv' (not the domain count file) into a .txt file (you can just rename it with a new extension) and then upload it; the crawl will start automatically. You might want to change the Speed setting (also found under the Configuration menu).

Again, this is fine for smaller datasets, or if you have it installed on a server – I loaded 87 thousand URLs into it and it crashed, but that is a reflection of memory allocation rather than the crawler itself. There is a nice tutorial here that takes you through a cloud installation.

The second option I'm going to offer is another Python script. This one is a little more advanced, but again you will only be required to run it, so don't worry if you are not comfortable coding.

The following script is a multi-threaded crawler that does two things: it checks whether each link is actually live and grabs the anchor text for you (in case you want it for your review). Once complete, it writes the results to a CSV file. I ran this against 23,970 links and it ran fine, taking just over 3 hours. Depending on how many links you need to check, just throw it up onto a server to run overnight; you can always crank up the number of threads if you have decent hardware to accommodate it.

# -*- coding: utf-8 -*-
import Queue
from threading import Thread
import csv
from bs4 import BeautifulSoup # pip install beautifulsoup4
import requests # pip install requests
 
domain_check = 'davidnaylor.co.uk' # change to domain you are checking
 
# create the instance
q = Queue.LifoQueue()
 
# add items to the queue
with open('urls.txt') as f:
    for line in set(f.readlines()):
        q.put(line.strip())
 
results = {} 
 
def grab_data_from_queue():
    while not q.empty(): # check that the queue isn't empty
        user_agent = {'User-agent': 'Mozilla/5.0'}
 
        try:
            url = q.get() 
            r  = requests.get(url.strip(), timeout=5, headers=user_agent)    
            data = r.text
            soup = BeautifulSoup(data, "html5lib")
 
 
            for link in soup.find_all('a', href=True):
                if domain_check in link['href']:
                    if '<img' in link.contents[0].encode('utf-8').strip():
                        results[url] = 'Image Link'
                    else:
                        results[url] = link.contents[0].encode('utf-8').strip()                  
 
        except Exception as e:
            print 'Error with {}'.format(url), e
 
        finally:
            q.task_done() 
 
for i in range(50): # aka number of threads
    t1 = Thread(target = grab_data_from_queue) # target is the above function
    t1.start() # start the thread
 
q.join() 
 
with open('output.csv','wb') as f:
    w = csv.writer(f)
    w.writerow(['Link', 'Anchor Text'])
    w.writerows(results.items())
f.close()

The script runs off a text file, 'urls.txt' (the with open('urls.txt') as f: line near the top), and spits out a new 'output.csv' (the with open('output.csv','wb') as f: line at the end), so you might want to run it in a different directory.
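If you'd rather not rename or copy and paste by hand, a quick sketch like the one below will turn the single-column output.csv from the de-duplication step into the urls.txt file the crawler expects; run it before the crawler, in the same directory.

# -*- coding: utf-8 -*-
# Rough sketch: convert the single-column output.csv from the earlier
# de-duplication step into the urls.txt file the crawler reads.
import csv

with open('output.csv') as f_in, open('urls.txt', 'w') as f_out:
    for row in csv.reader(f_in):
        if row:
            f_out.write(row[0].strip() + '\n')

print 'done'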

I ran a little test against 1,000 URLs with Screaming Frog and my script; they closely matched each other, finding 530 and 531 live links respectively.

In the line for i in range(50): there is currently the number 50; this represents the number of threads you are running. It's best to run little tests to see where your thread limit is (before the speed starts to slow down), and the bash 'time' command will come in handy here.
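If you'd rather time runs from within Python than with bash's time, a minimal sketch along these lines will do; it assumes the crawler above has been saved as crawler.py, which is a name I've made up for this example.

# -*- coding: utf-8 -*-
# Rough sketch: time a crawl run from Python rather than with the bash "time" command.
# Assumes the crawler script above has been saved as crawler.py.
import time
import subprocess

start = time.time()
subprocess.call(['python', 'crawler.py'])
print 'Run took {:.1f} seconds'.format(time.time() - start)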

So, we now have our final file, either from the last script or from the Screaming Frog export. Either way, we just want to create a PivotTable of our results, and at this point we should be fine using Excel: simply do a little slicing so you have a new column that is stripped to domain level.

You should have something looking like this (give or take the anchor text column).

link_removals_domains

If you are struggling with this, just copy and paste the Link column into a new sheet, highlight the column, click Data > Text to Columns > Delimited > Other, input / and click Finish. Now just copy the domain section back into your original sheet.
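Alternatively, if Excel's Text to Columns proves fiddly, a short script can add the domain column for you. This is only a sketch, assuming the two-column Link/Anchor Text layout written by the crawler above (a Screaming Frog export will have different columns, so adjust the index accordingly); output_with_domains.csv is just a placeholder name.

# -*- coding: utf-8 -*-
# Rough sketch: add a Domain column to the crawler's output.csv ready for
# the PivotTable. Assumes the Link, Anchor Text layout written above; the
# output file name is a placeholder.
import csv
from urlparse import urlparse

with open('output.csv') as f_in, open('output_with_domains.csv', 'wb') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)  # Link, Anchor Text
    writer.writerow(['Domain'] + header)
    for row in reader:
        if row:
            writer.writerow([urlparse(row[0]).netloc] + row)

print 'done'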

Now you want to create the PivotTable: simply highlight the entire sheet, click Insert > PivotTable and click OK. Within the new PivotTable sheet, select the 'Domain' and 'Link' columns and you should hopefully have something that looks like this.

link_removals_pivot_table

You should now have a very manageable list of links compared with when you first started; all the live URLs are listed under their respective domains for easy sorting, and all you do now is add a new 'Decisions' column and start your manual review. If it's still pretty big, you can obviously break it up into smaller sheets so multiple team members can work on it.
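If you do want to split the work up, a quick sketch like this will chop the final CSV into equal chunks, one per reviewer. It assumes the hypothetical output_with_domains.csv layout from the sketch above and sorts by domain first so each domain's links stay together as far as possible; the chunk count and file names are illustrative.

# -*- coding: utf-8 -*-
# Rough sketch: split the final review file into chunks, one per reviewer.
# The input name, chunk count and output names are illustrative only.
import csv

CHUNKS = 3

with open('output_with_domains.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = [row for row in reader if row]

rows.sort(key=lambda r: r[0])  # keep each domain's links together
chunk_size = len(rows) // CHUNKS + 1

for n in range(CHUNKS):
    with open('review_part_{}.csv'.format(n + 1), 'wb') as out:
        w = csv.writer(out)
        w.writerow(header)
        w.writerows(rows[n * chunk_size:(n + 1) * chunk_size])

print 'done'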

Disclaimers!!

This is not the same process we use in-house; I knocked the above scripts together just for this post, so they're not perfect. For example, the last script, along with Screaming Frog, will likely trip CloudFlare protection; we have more advanced ways and means of getting round this, but that is not covered here.

I’m also not one of the developers at Bronco so yes my code likely can be improved… but probably not that much 😉

Finally, and not a disclaimer: if you want us to work on your link removal campaign, please get in touch or just sign up to our newsletter below!

