In the first three parts of this series, we
looked at modelling pageview events to include accurate time spent and scroll depth, then classifying
sessions based on what we know about where the user came from. Now we’re going to look at what we know about the users themselves.
One of the best things about having raw behavioural event data in an SQL database is that you have the ability
to update your knowledge about a user’s behaviour after the behaviour is recorded. The classic case for this is
when a user logs in or otherwise identifies themselves: we now know the identity behind the previous behaviour.
If your application has some way of identifying users, such as forcing login or email click-throughs, you can
merge user behaviour across multiple devices, such as a desktop browser and a mobile browser.
In our company’s case we don’t have a good way to identify individual users. At some point we’ll start a
newsletter and drip-feed nurturing process, which will allow us to pass identifiers into the click-through
URLs. Until then, most of the information we can get is based on
Clearbit data, which is created
through a lookup of the IP address and takes a stab at guessing the company behind the browser.
Bug in sessions post
A quick aside: while writing this component on users I discovered a bug in the
sessions piece. The part of my big CASE statement dealing
with sessions that don’t have a referrer should be at the bottom of the statement; otherwise the clause looking
for UTM querystring components (the mkt_* columns) will never execute in cases without a referrer. UTM strings
are specifically designed to deal with situations where there is no referrer.
I’ve updated the post, so if you want to see the fix, look for the location of this piece of code:
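To illustrate the ordering issue (this is a sketch, not the exact snippet from the post; the mkt_medium and refr_urlhost column names follow the Snowplow atomic.events schema, and the clauses here are illustrative):

```sql
CASE
  -- campaign-tagged traffic first: UTM clauses must come before the
  -- no-referrer catch-all, because campaign clicks often have no referrer
  WHEN mkt_medium IS NOT NULL THEN 'campaign/' || mkt_medium
  -- then referrer-based classification
  WHEN refr_urlhost LIKE '%google%' THEN 'search'
  -- the no-referrer catch-all comes last, otherwise it swallows
  -- the campaign-tagged rows above
  WHEN refr_urlhost IS NULL THEN 'direct'
  ELSE 'other'
END AS traffic_source
```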
Create a lookup table on domain_userid
For this installment we create a lookup_users table keyed off the first-party cookie, which is stored in domain_userid.
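A minimal sketch of what such a table might look like in Redshift, assuming the columns the two components below will populate (the names here are illustrative, not the exact DDL):

```sql
CREATE TABLE lookup_users (
  domain_userid       VARCHAR(128),  -- first-party cookie ID from atomic.events
  is_internal         BOOLEAN,       -- flagged as internal traffic
  clearbit_company    VARCHAR(255),  -- latest Clearbit company guess
  clearbit_company_n  INTEGER        -- distinct Clearbit companies seen
)
DISTKEY (domain_userid)
SORTKEY (domain_userid);
```

Keying distribution and sort on domain_userid keeps joins back to the pageview and session tables cheap.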
Labelling internal traffic
The first component looks for any users who’ve behaved in a manner that might be considered “internal” traffic.
For us that means they’ve sent in data with a host of localhost or they’ve appeared from one of our internal IP addresses.
In larger companies you might also put a data collection pixel on your intranet homepage, internal systems
and the like. Any browser that’s ever visited those locations is considered “internal”. You want to tackle
labelling internal browsers from multiple angles to be sure you capture everything. It’s entirely likely
that developers will develop and test with a browser that never sees the intranet or internal systems,
so you also want to look at IP ranges and any other signals available. If you run
automated tests, think about how you can label those too.
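The “ever behaved internally” idea can be sketched like this (atomic.events, page_urlhost and user_ipaddress follow the Snowplow schema; the IP list is a made-up example you’d maintain yourself):

```sql
-- Any cookie that has EVER been seen on localhost or an internal IP
-- is labelled internal, regardless of its other sessions.
SELECT DISTINCT domain_userid
FROM atomic.events
WHERE page_urlhost = 'localhost'
   OR user_ipaddress IN ('203.0.113.10', '203.0.113.11');  -- example office IPs
```

The DISTINCT set of cookie IDs then feeds the is_internal flag in the users lookup table.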
Clearbit company details
The second component looks for the latest Clearbit data available for the user. This isn’t going to be perfect:
I looked up my own domain_userid and saw a bunch of different places I’d been, including client sites and
hotels, but it will give you some idea. More complex logic might score users with multiple Clearbit identifiers.
You could also just record a count of unique Clearbit companies per user, to drill down into when there’s ambiguity.
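One way to sketch “latest Clearbit company plus a distinct-company count” is a window function over the event stream. The clearbit_enrichment table name and its join key are assumptions; collector_tstamp and domain_userid are standard Snowplow columns. Note that Redshift doesn’t allow COUNT(DISTINCT …) as a window function, hence the separate aggregate:

```sql
WITH ranked AS (
  SELECT e.domain_userid,
         c.company_name,
         -- rn = 1 marks the most recent Clearbit observation per cookie
         ROW_NUMBER() OVER (PARTITION BY e.domain_userid
                            ORDER BY e.collector_tstamp DESC) AS rn
  FROM atomic.events e
  JOIN clearbit_enrichment c ON c.event_id = e.event_id
),
company_counts AS (
  SELECT domain_userid,
         COUNT(DISTINCT company_name) AS company_count
  FROM ranked
  GROUP BY domain_userid
)
SELECT r.domain_userid,
       r.company_name AS latest_company,
       cc.company_count
FROM ranked r
JOIN company_counts cc ON cc.domain_userid = r.domain_userid
WHERE r.rn = 1;
```

A company_count above 1 is the signal that the “latest company” guess deserves scepticism.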
Pulling it all together
Here’s a simple query that shows recent site visitors ranked by uniques and time spent on the site, along with traffic source
categories and any Clearbit identification information we’ve been able to get for the IP address. You can see the
output here with the company details blurred out to protect the innocent. This query pulls together data from
all the lookup tables we’ve created in this series: pageviews, sessions and users.
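A sketch of how such a query might join the three lookup tables (table and column names are assumed from earlier installments, not the exact query behind the screenshot):

```sql
SELECT u.clearbit_company,
       s.traffic_source,
       COUNT(DISTINCT p.domain_userid) AS uniques,
       SUM(p.time_on_page_secs)        AS total_time_spent
FROM lookup_pageviews p
JOIN lookup_sessions  s ON s.session_id    = p.session_id
JOIN lookup_users     u ON u.domain_userid = p.domain_userid
WHERE u.is_internal = FALSE                        -- drop our own traffic
  AND p.page_view_start > DATEADD(day, -7, GETDATE())  -- last 7 days
GROUP BY 1, 2
ORDER BY uniques DESC, total_time_spent DESC;
```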
So we’ve now got techniques to build up a custom view of our audience behaviours using the three different scopes of interest.
In the next post we’ll look at automating this so that it updates every time a new batch of data is added to Redshift, then
at building some convenience views that make querying simple for everyday use and safe to expose to your end users.