One of the important features of Snowplow is that you can build your own custom data models to suit your unique analytical requirements in a space-efficient and performant way. It’s important that your data model can evolve and grow in complexity as your business grows and your needs get more advanced.
This is the first in a series of blog posts taking you through the thought process of building a data model. Good background reading for this series is the Snowplow blog post on event modeling.
Here at Snowflake Analytics we're putting some effort into using our own Snowplow pipeline inside the business. At the moment we're a bit like the proverbial plumber's house with non-functioning pipes.
My main focus at the moment is our sales process: tracking prospects as they come in, qualifying leads and then converting them into customers. Content like this blog forms an important part of that process, exposing our message to our target market of businesses that want to do advanced analytics.
Site visitors’ content consumption patterns give us some insight into their interests and expertise. We want to feed these insights into our sales pipeline. Where possible we can identify people or their companies and, based on their content consumption, we can decide if it’s worth trying to track down the individuals who are at the right point in the thought process to be a good candidate for our services.
The data that drops into the atomic.events table is event-level data. To make it useful we need to apply some logic to it. You can get a lot done with simple queries that show you basic metrics, replicating what you'd get in a tool like Google Analytics.
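A minimal sketch of such a query, assuming the standard Snowplow atomic.events column names (derived_tstamp, domain_userid, domain_sessionidx):

```sql
-- Quick GA-style metrics straight off the event-level table:
-- daily pageviews, unique visitors and sessions
SELECT
    DATE_TRUNC('day', derived_tstamp) AS day,
    COUNT(*) AS page_views,
    COUNT(DISTINCT domain_userid) AS unique_visitors,
    COUNT(DISTINCT domain_userid || ':' || domain_sessionidx) AS sessions
FROM atomic.events
WHERE event = 'page_view'
GROUP BY 1
ORDER BY 1;
```

Queries in this shape are cheap to write and a good way to sanity-check that your tracking is flowing at all.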
There are some limitations baked into simple queries like these, too. First up, you're accepting the default definition of a unique visitor: domain_userid has some limitations and will differ from the numbers Google Analytics generates in some odd ways. domain_userid works well for a single-domain site, but if you have multiple domains and want to merge visitors across them, or you've got identified users you'd like to deduplicate across devices, or you use AMP, which doesn't get that cookie set, you need more.
Of course you’re also missing any kind of engagement metrics. One of Snowplow’s best features is its ability to measure accurate time spent. This, or even just replicating the way GA and Adobe do it, will require something more complex and a chunk more load on your database.
Once data volumes grow, these queries become onerous quite quickly. For standard end-user reporting you want the fastest performance possible, and you want to avoid loading your database with recalculating this stuff every time. Summary tables are the answer here.
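As a sketch, a scheduled job might materialise a summary like the one below; the derived schema and table names are assumptions, not Snowplow standards:

```sql
-- Precompute daily traffic once, instead of scanning atomic.events
-- for every dashboard refresh. Rebuild on a schedule (e.g. nightly).
CREATE TABLE derived.daily_traffic AS
SELECT
    DATE_TRUNC('day', derived_tstamp) AS day,
    page_urlhost,
    COUNT(*) AS page_views,
    COUNT(DISTINCT domain_userid) AS unique_visitors
FROM atomic.events
WHERE event = 'page_view'
GROUP BY 1, 2;
```

Reporting tools then hit the small aggregate table, and the expensive scan over raw events happens once per rebuild rather than once per query.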
Keeping all your raw data around in a database can get expensive quite quickly, especially given there's a bunch of replicated data that you probably don't need often. Columns like useragent are repeated for every single event, page pings happen often to give you good time-spent resolution, and there are deprecated fields like br_gears which aren't useful to anyone.
Best to get rid of those and keep only the fields and rows you need, merging in the results of all your calculations. Your modelled data might include a unified row for each pageview, with additional columns representing time spent and scroll depth, and without any extraneous columns you don't need.
Working from a uniform set of business rules makes everyone's life easier. We've all experienced the fun of explaining why two different reports created by different groups end up with different results. By limiting access to the raw data and consolidating upstream, you can set and maintain these business rules and common calculations in one place: how you define a user, a session, or a conversion, for example. By moving these into a central data model, everyone works from the same source and you get less confusion. The business rules can be updated as things evolve.
Related to centralisation, you can hide complexity and let your analyst and reporting users get on with their work while you worry about the complex stuff. For example, the Snowplow AMP tracker has some tricky limitations. It produces no page pings, so calculations of time spent will need to be somewhat arbitrary or use the GA/Adobe method. How do you merge these with your more accurate measurements from normal web pages and inside apps? AMP also doesn't set a first-party cookie to become domain_userid. How will you deal with unique browser metrics for these users?
By building a model with a common definition, your reporting users can see a simple single row per pageview data model, with columns representing the output of complex logic for things like user deduplication, time spent, scroll depth and conversions. Their queries can be simple aggregations and filtering, while your data model holds all the complex conditional logic to deal with edge cases.
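With such a model in place, an analyst's query can stay trivially simple. The table and column names below are illustrative, not a standard Snowplow schema:

```sql
-- Analyst-facing query over the modelled pageview table:
-- plain aggregation and filtering, no edge-case logic in sight.
SELECT
    page_urlpath,
    SUM(time_engaged_in_s) AS total_time_spent_s,
    AVG(max_scroll_depth_pct) AS avg_scroll_depth_pct,
    SUM(CASE WHEN converted THEN 1 ELSE 0 END) AS conversions
FROM derived.page_views
GROUP BY 1
ORDER BY total_time_spent_s DESC;
```

All the conditional logic for AMP gaps, bot filtering and user stitching has already run upstream, so this query stays readable and fast.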
While working on the data model for this blog post I came across another weird edge case that’s worth sharing.
I found sets of rows sharing the same pageview_id — some also sharing a domain_userid — clocking up 25 pageviews for each combination, despite coming from places as far apart as Indonesia and California.
These are almost certainly bots of some kind; quite a lot have Google's fingerprints on them. It can be difficult to work out whether you want to count these at all, or whether there's a subset you want to count and another you don't. These are complexities you want to decide on once, then hide from everyone forever after.
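One way to surface this kind of traffic is to look for pageview_ids reused across many locations. A hedged sketch, assuming Redshift-style shredded contexts where the web_page context table joins back to events on root_id:

```sql
-- Flag pageview_ids that appear from multiple countries or IPs:
-- a single browser pageview shouldn't span continents.
SELECT
    wp.id AS page_view_id,
    COUNT(DISTINCT ev.user_ipaddress) AS distinct_ips,
    COUNT(DISTINCT ev.geo_country) AS distinct_countries,
    COUNT(*) AS page_views
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
    ON wp.root_id = ev.event_id
WHERE ev.event = 'page_view'
GROUP BY 1
HAVING COUNT(DISTINCT ev.geo_country) > 1
ORDER BY page_views DESC;
```

A query like this can feed a bot-flag column in the modelled pageview table, so the decision about what to count is made once, upstream.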
It’s quite common to see tracking you’ve built break because a web developer changed something unexpectedly. This happens a lot if you use page URLs to build a Goal Funnel in Google Analytics: the URL for the conversion step changes and you end up with a gap in your data. By modelling out your data you can seamlessly switch across from the old definition to the new one, and your analytical users need never know, simply seeing a boolean for whether a conversion occurred.
An important thing to think about is the scope of each object in your data model and what fits in each. Pageviews summarise things about the pageview itself. Sessions hold things that change once per session such as traffic source information. Users hold everything we know about the user, and potentially other ways we can identify users.
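To make those scopes concrete, here is one illustrative set of table shapes. Every name and type below is an assumption for the sake of the sketch, not a canonical Snowplow schema:

```sql
-- Pageview scope: one row per pageview, summarising its page pings.
CREATE TABLE derived.page_views (
    page_view_id         CHAR(36),     -- from the web_page context
    session_id           VARCHAR(128),
    page_urlpath         VARCHAR(3000),
    time_engaged_in_s    INT,          -- rolled up from page pings
    max_scroll_depth_pct INT
);

-- Session scope: things that change once per session.
CREATE TABLE derived.sessions (
    session_id     VARCHAR(128),
    user_id        VARCHAR(128),
    traffic_source VARCHAR(255)
);

-- User scope: everything we know about the user.
CREATE TABLE derived.users (
    user_id          VARCHAR(128),
    stitched_user_id VARCHAR(128)      -- cross-device / cross-domain identity
);
```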
Pulling all this together, we might come up with a data model along the following lines. To build a complete view, you'd select from the atomic.events table, then join that to com_snowplowanalytics_snowplow_web_page_1 to get the pageview identifier. From there you can join on the pageview, session and user tables to get a complete picture. It's probably easiest to create a view that starts from atomic.events and bolts on pageview-, session- and user-scope details for every row.
Eventually you would write out new data tables that replace the need to ever hit atomic.events directly, containing the items you want from atomic.events plus the identifiers for the applicable pageview, session and user objects. You wouldn't copy across page pings, because everything you need from them is summarised in the pageview table. Again, you might create a view that does the joins to give you one row per pageview with everything an analyst needs.
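A sketch of that stitching view; all derived.* table and column names here are assumed for illustration:

```sql
-- One row per event, decorated with pageview-, session- and
-- user-scope details from the modelled tables.
CREATE VIEW derived.events_enriched AS
SELECT
    ev.*,
    wp.id AS page_view_id,
    pv.time_engaged_in_s,
    pv.max_scroll_depth_pct,
    s.traffic_source,
    u.stitched_user_id
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
    ON wp.root_id = ev.event_id
LEFT JOIN derived.page_views pv ON pv.page_view_id = wp.id
LEFT JOIN derived.sessions s    ON s.session_id = pv.session_id
LEFT JOIN derived.users u       ON u.user_id = s.user_id;
```

The LEFT JOINs keep events visible even before the modelled rows for their pageview or session have been computed.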
In the next part of this series I'll show you how to build an initial pageview-scope data model to analyse content engagement metrics, specifically accurate time spent and scroll depth based on page pings.
Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.
Subscribe to this blog via RSS.