You’re getting great traffic on your website. You’re even seeing an uptick in sales. But there’s that nagging feeling in the back of your mind that there should be more conversions, right? I mean, after all, there’s a lot of traffic coming in, and most of it is useful, but could there be something else going on? Is all of this genuine? Can you really trust the data you’re seeing in Google Analytics?

The answer to this isn’t a simple yes or no, sadly. Google Analytics is not a perfect tracking engine: it bases attribution only on the last known source of a visit. That may be fine for most purposes, but it’s not quite the full picture it should be. And how do you know that someone isn’t harvesting your links and making money by promoting them as their own?

This article should help you stop that happening. I know it’s a bit technical but sometimes you have to roll your sleeves up and just get dirty. Thankfully for you, you’re not alone in this. Many have trodden the path you’re now on (and perhaps you’re revisiting just to be 100% sure). The path may be a little tricky but the results (as with all things such as this) are worthwhile.

We’re going to block people from visiting our website.

What? Why? Why on earth would you even want to do that? Well, you would for two reasons:

  1. You want the truth (and yes, you CAN handle the Truth)
  2. You don’t want people making money off what you sell or are offering because that’s denying you potential revenue, right?

So knowing these two guiding principles we’re going to get started on this now.

A little technical background

You run your website on a web server. Typically this is a product called Apache, which you’ve only ever heard thrown around in meetings with the web developers. You know it’s important, but you’re not 100% certain what it does or how it does it. Don’t worry, you don’t have to become technical.

Apache has a file it knows about called .htaccess. This file is Apache’s list of rules on the door, so to speak: it says what is allowed to happen and what isn’t. We’re going to modify it to get rid of bad traffic (not all bad traffic, because the sources change over time, so we’re just going to deal with the obvious bad traffic we’re getting right now).

This file probably already contains references telling the server what to do if a page is not found, is forbidden, is unauthorised, or the server hits an error. Don’t worry if it doesn’t. It’s worth having another talk with your web developers to find out why, and asking whether they should be included, as long as it isn’t going to cost you too much to get that done.
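For reference, those error-page references are usually ErrorDocument lines, and they look something like this (the page paths here are just placeholders; yours will be whatever your developers have set up):

ErrorDocument 404 /errors/not-found.html
ErrorDocument 403 /errors/forbidden.html
ErrorDocument 401 /errors/unauthorised.html
ErrorDocument 500 /errors/server-error.html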

Somewhere in there you will find some lines that look like this (or similar to it):

order allow,deny
deny from 255.0.0.0
deny from 10.1.10.0
allow from all

This is telling us that traffic should be allowed from everywhere except the addresses in the deny rules (in this example they block specific addresses, such as machines on the network the web server is on).
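One small aside: if your web server runs Apache 2.4 or newer, your developers may have written the same kind of rule with the newer Require directives instead. A rough equivalent of the block above would look something like this (the IP addresses are the same placeholders as before):

<RequireAll>
Require all granted
Require not ip 255.0.0.0
Require not ip 10.1.10.0
</RequireAll>

Either way, the idea is identical: let everyone in except the addresses you’ve singled out.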

Getting into it

What we are looking for, and what almost certainly already exists in your .htaccess file, is a section that starts with RewriteEngine. This switches on Apache’s rewrite module, which is responsible for a number of things:

– Changing how URLs are presented on your site

– Blocking access from specific IP addresses or sites

– Preventing other sites from reusing your CSS, your images and other collateral you may have (there’s a small example of this just below).
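To illustrate that last point, a typical hotlink-protection block looks something like this. It’s only a sketch: mydomain.com is a placeholder for your own domain, and the file extensions are just examples of the sort of collateral you might want to protect:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?mydomain\.com [NC]
RewriteRule \.(jpe?g|png|gif|css|js)$ - [F,NC]

In plain English: if a request for an image, stylesheet or script comes from a page that isn’t on mydomain.com (and isn’t a direct visit with no referrer at all), refuse it.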

Now we understand what it is doing, let’s look at a sample to block one of these referrer sites:

RewriteEngine On
Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} ^https?://([^.]+\.)*blackhatworth\.com [NC,OR]
RewriteRule .* - [F]

Oh god, that looks complicated. It is, and it isn’t. Let’s break it down line by line:

– RewriteEngine On

This says to start rewriting URLs based on the set of rules that follow below.

– Options +FollowSymlinks

This one is mostly housekeeping. The rewrite module needs the FollowSymlinks option switched on before it’s allowed to run from an .htaccess file, so we enable it here; without it, the rules below can cause a server error rather than quietly doing their job.
It doesn’t change where any of your URLs go. If you want the person to end up at https://www.mydomain.com/productpages/index.html, that is still where they’ll go, even if that location doesn’t physically exist on the server, because that’s handled by the clever code the web developers have written, and we don’t want to undo any of it.

The line below is the important bit. This is the one we’re going to copy and make lots of versions of, to stop all those malicious referrals.

– RewriteCond %{HTTP_REFERER} ^https?://([^.]+\.)*blackhatworth\.com [NC,OR]

This RewriteCond directive says that if the referring site is any variation of blackhatworth.com, then we take action based on the rule below it. The %{HTTP_REFERER} part says we’re looking at where the traffic is being referred from, and all those peculiar characters after it are a Regex (a Regular Expression, for the less technically minded) that the referrer gets matched against. Regular Expressions are simple pattern matches: ^https?:// matches http:// or https://, ([^.]+\.)* matches any number of subdomains (www., refer. and so on), and blackhatworth\.com is the domain itself. So we’re simply saying here that we’ll catch any and all variations of blackhatworth.com, examples of which include:

http://blackhatworth.com

https://blackhatworth.com

https://www.blackhatworth.com

http://refer.blackhatworth.com

And so on…

The [NC] and [OR] parts of this directive simply state

– [NC] – Not Case Sensitive

– [OR] – Or, used when there are multiple RewriteCond lines in a row (there will be, honestly)

So although that line looks absolutely awful, we now understand what it’s doing. It’s saying: if you see the referral coming from any version of blackhatworth.com, then apply the RewriteRule specified after this.

– RewriteRule .* - [F]

This RewriteRule simply says: whatever matches the conditions above gets a 403 Forbidden page. This is the convention. When you’re getting bad traffic, telling it that the page is forbidden is probably the best course of action.
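If you’d prefer not to serve anything at all to this junk traffic, mod_rewrite also has a [G] flag that sends back “410 Gone” instead of “403 Forbidden”. It’s purely optional, but that variation of the rule would look like this:

RewriteRule .* - [G]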

So to sum up this block says:

“If any traffic is coming from blackhatworth.com (or any version of it), then tell it that the page it’s asking for is forbidden”
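When you start adding more bad referrers, you simply stack up RewriteCond lines, each ending in [NC,OR] except the last one, which only needs [NC]. The extra domains below are purely made-up placeholders to show the shape, not a real blocklist:

RewriteEngine On
Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} ^https?://([^.]+\.)*blackhatworth\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://([^.]+\.)*bad-referrer-one\.example [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://([^.]+\.)*bad-referrer-two\.example [NC]
RewriteRule .* - [F]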

Putting it together

Given what we know now, we could do with finding out who these bad actors are. Where can we find them? How do we know if we’ve got them all? To my mind, this page on www.htaccess-guide.com is the best place to start:

http://www.htaccess-guide.com/blocking-offline-browsers-and-bad-bots/

That page alone should target a good 60-70% of the bad traffic (if not more). There is one difference here: rather than the referrer, we’re now looking at the User Agent. This is the exchange between your web server and whoever is asking for a page, to say “Hey, what kind of computer are you?”. It’s done so that your web server presents the page in the correct way for the agent viewing it. For most of your traffic that will be a browser on a desktop or laptop computer, or maybe a mobile phone or tablet.

But because these bad referrals come from sites, bots and other automated sources, we can use the way they announce themselves to tell them that they can’t see those pages. Incidentally, the [L] flag used there says “if you’ve matched above, stop processing any further rules”; it sits alongside [F], so the forbidden response is the last word.
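To give you a feel for what that looks like, here’s a cut-down sketch of a user-agent block. The agent names are just examples of the sort of offline browsers and bots that page lists; the real list is much longer:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC]
RewriteRule .* - [F,L]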

In this way, we’re stopping all of that traffic.

A quick Google search will tell you that there are many sites out there with very good, well-maintained lists for blocking these sorts of things. I’d recommend taking some time to read some of that content too, just to familiarise yourself with the ideas.

In summary, then, here are some key pointers for you to take away.

– Google Analytics shows you both good and bad traffic, and it cannot distinguish which is which.
– To protect your Intellectual Property and Copyright, blocking these sites is worth the time and effort.
– Once this has been done, you’re looking at clean traffic to your website, with no loss of earnings and clean referral paths you know you can trust.

Author

Marcus Webb
Senior Technical Support Executive
Pure360