In the first four parts of this series, we modelled out:
Now it’s time to move the model into production and start using it. To do this we’ll use Snowplow’s SQL Runner tool at the end of each Snowplow batch load into Redshift.
In my previous posts I’ve created a series of
lookup_* tables with my modelled data. To keep this clean you
probably want to create a new schema specifically for your modelled data. This keeps a clean separation between
raw, unopinionated event-level data and modelled, opinionated data that applies your business rules.
Once everyone has got used to it, you can remove access to the raw data for most of your users, ensuring
everyone is working from the same business rules.
By convention, Snowplow uses two schemas for data modelling steps:
scratch is for ephemeral data used in intermediate processing steps but not actually part of the output. It should be discarded as a final processing step.
derived is for your model output data.
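If you’re setting these up yourself, a minimal version in Redshift might look something like the sketch below; the analysts group is a placeholder for whichever user groups you actually have.

```sql
-- Sketch only: "analysts" is a placeholder for your own user groups.
-- Intermediate tables go in scratch, model output goes in derived,
-- and analysts read from derived rather than the raw atomic schema.
CREATE SCHEMA IF NOT EXISTS scratch;
CREATE SCHEMA IF NOT EXISTS derived;

GRANT USAGE ON SCHEMA derived TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA derived TO GROUP analysts;

-- Once everyone has switched over, withdraw direct access to the raw events
REVOKE ALL ON SCHEMA atomic FROM GROUP analysts;
```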
SQL Runner is a pretty simple tool that takes a series of SQL scripts and executes them on command against a
specified database with specified credentials. The important piece of information from the documentation is that queries within a step execute in parallel, while the steps themselves execute in the order in which they appear. If your models build on one another, you’ll need to split them into separate steps so they run in sequence. The model I’ve built out in this series doesn’t have any dependencies between its parts, so they can all run in parallel within a single step.
The syntax for SQL Runner’s playbooks is pretty simple. For our model playbook we’re just executing the following:
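What follows is a sketch of the playbook shape rather than our exact file; the target details and SQL file names are placeholders.

```yaml
# Sketch of a SQL Runner playbook; host, credentials and file names are placeholders.
:targets:
  - :name:     "Our Redshift database"
    :type:     redshift
    :host:     "<cluster-endpoint>.redshift.amazonaws.com"
    :database: snowplow
    :port:     5439
    :username: sql_runner
    :password: "<password>"
    :ssl:      false
:steps:
  - :name: "Run data models"
    :queries:
      - :name: "Engagement"
        :file: engagement.sql
      - :name: "Traffic sources"
        :file: traffic-sources.sql
      - :name: "Users and companies"
        :file: users.sql
```

Because all of the queries sit in a single step, they run in parallel after each batch load.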
If you’re a Snowplow Managed Service customer, SQL Runner is handled for you. The process is quite simple:
From then on, any changes you make on the
master branch to existing playbooks will update within about 45
minutes. You’ll need to follow the process above again if you create a new playbook running on a different schedule.
If you’re using an Open Source Snowplow pipeline, you’ll have to set up and manage scheduling yourself, probably
within the same script you’re already kicking off from cron to run enrichment and database loads. A good place
to start is the Guide for devops.
My scripts execute across the entire events dataset we have. Our site isn’t high traffic so this isn’t a huge imposition,
executing in about a minute on our Redshift database. If you have decent traffic volumes, you’ll want to scope down
the reprocessing of models somewhat. If your queries involve any kind of lookback, that will take some careful tuning
of your query windows. Be particularly careful if you’re calculating sessions in some way other than
COUNT(DISTINCT domain_sessionid), as a session can easily cross a batch boundary.
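As a rough illustration of what that scoping might look like (the table and column names here are placeholders, not the model from this series), you could rebuild only the sessions that have received events in the last few days, so a session straddling a batch boundary is always recalculated in full:

```sql
-- Sketch only: recompute sessions touched in the last 7 days rather than
-- scanning the whole events table. Table names are placeholders.
BEGIN;

CREATE TABLE scratch.sessions_to_update AS
SELECT DISTINCT domain_sessionid
FROM atomic.events
WHERE collector_tstamp > DATEADD(day, -7, GETDATE());

DELETE FROM derived.sessions
WHERE domain_sessionid IN (SELECT domain_sessionid FROM scratch.sessions_to_update);

INSERT INTO derived.sessions
SELECT e.domain_sessionid,
       MIN(e.collector_tstamp) AS session_start,
       MAX(e.collector_tstamp) AS session_end,
       COUNT(*)                AS events
FROM atomic.events e
JOIN scratch.sessions_to_update s
  ON s.domain_sessionid = e.domain_sessionid
GROUP BY e.domain_sessionid;

DROP TABLE scratch.sessions_to_update;

COMMIT;
```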
The joins you need to make for this to be useful are a bit complicated to do routinely, so it makes sense to create
a nice simple view for your
pageview events that incorporates these elements on every row. These views become
something you can expose to a visualisation tool like Tableau to enable easy exploration while incorporating
engagement metrics, your business rules for traffic source attribution and information about users.
Below is the view I’ve created for our data set. For convenience I’m also including the timezone conversion we routinely use to shift timestamps to Sydney time. For performance reasons, you might end up writing the result of this view out to a table as an additional step: that costs some storage space, but queries against it will be faster.
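Something along these lines; the lookup_* table and column names below are illustrative stand-ins for the tables built earlier in the series rather than their exact definitions.

```sql
-- Sketch only: lookup table and column names are illustrative.
CREATE VIEW derived.pageviews AS
SELECT
  CONVERT_TIMEZONE('UTC', 'Australia/Sydney', ev.derived_tstamp) AS pageview_tstamp,
  ev.page_urlpath,
  ev.page_title,
  ev.domain_userid,
  ev.domain_sessionid,
  eng.time_engaged_in_s,
  eng.percent_scrolled,
  src.channel,
  src.referrer,
  usr.company_name
FROM atomic.events ev
LEFT JOIN derived.lookup_engagement      eng ON eng.event_id         = ev.event_id
LEFT JOIN derived.lookup_traffic_sources src ON src.domain_sessionid = ev.domain_sessionid
LEFT JOIN derived.lookup_users           usr ON usr.domain_userid    = ev.domain_userid
WHERE ev.event = 'page_view';
```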
Now let’s use the models and views we’ve built and start analysing some content.
This query gives you some handy engagement metrics including bucketing pageviews into scroll depth groups.
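A sketch of what that query might look like against the view above (the scroll-depth buckets are just illustrative thresholds):

```sql
-- Sketch only: assumes the derived.pageviews view sketched earlier.
SELECT
  page_urlpath,
  COUNT(*)                         AS pageviews,
  ROUND(AVG(time_engaged_in_s), 1) AS avg_seconds_engaged,
  SUM(CASE WHEN percent_scrolled >= 75 THEN 1 ELSE 0 END)             AS scrolled_75_plus,
  SUM(CASE WHEN percent_scrolled BETWEEN 25 AND 74 THEN 1 ELSE 0 END) AS scrolled_25_to_74,
  SUM(CASE WHEN percent_scrolled < 25 OR percent_scrolled IS NULL
           THEN 1 ELSE 0 END)                                         AS scrolled_under_25
FROM derived.pageviews
GROUP BY page_urlpath
ORDER BY pageviews DESC;
```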
As our site is aimed at B2B marketing, identifying the companies coming to our site is an interesting thing to do. I’m excluding traffic from ourselves and our friends at Snowplow and might increase the exclusions as we keep exploring the data.
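A sketch of that query, again against the view above; the excluded company names are placeholders:

```sql
-- Sketch only: the exclusion list stands in for our own company and Snowplow.
SELECT
  company_name,
  COUNT(DISTINCT domain_userid) AS visitors,
  COUNT(*)                      AS pageviews
FROM derived.pageviews
WHERE company_name IS NOT NULL
  AND company_name NOT IN ('<our company>', 'Snowplow Analytics')
GROUP BY company_name
ORDER BY pageviews DESC;
```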
So that’s the end of our giant odyssey into Snowplow data models. We haven’t covered everything that can go into
a data model here. In particular, one of the best uses of data models is to map an identity back across every
historical instance of a particular
domain_userid. To do this you’d use the same technique as I’ve used
here for companies, using whatever identification point (or points!) you have. We, unfortunately, don’t have an
easy way to reliably identify people.
You might now like to have a read of the Snowplow web data model, which can be a bit daunting at first but probably makes a whole lot more sense now that you’ve looked through a simpler model.