In the first part of this series on data modelling we went through the background for building a data model. In this edition we’ll go through the steps to create a basic pageview model that incorporates page pings so we can see accurate time spent and scroll depth for each pageview.
Using this model you can then aggregate the pageview rows to explore the traffic to your site and build a report like the one shown above, which gives us Pageviews, Sessions, Unique Browsers, Average Time Spent and Average Scroll Depth metrics for each page.
This model takes advantage of the web page context, which attaches a unique pageview ID to every event within the pageview and subsequently shows up in the table atomic.com_snowplowanalytics_snowplow_web_page_1. We'll create a table that aggregates all the page_ping data against that pageview ID.
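Here is a sketch of what that query might look like. It assumes Redshift-style SQL; lookup_pageview and time_spent_seconds come straight from the walk-through below, while the other column names and exact calculations are illustrative rather than the only way to do it:

```sql
-- Aggregate page_ping data against the pageview ID from the web page context.
-- With enableActivityTracking(5, 5), each page ping represents roughly 5 seconds of activity.
CREATE TABLE lookup_pageview AS
SELECT
  wp.id                                                        AS page_view_id,
  MIN(ev.collector_tstamp)                                     AS pageview_start,
  SUM(CASE WHEN ev.event = 'page_ping' THEN 1 ELSE 0 END) * 5  AS time_spent_seconds,
  MAX(ev.pp_yoffset_max)                                       AS max_scroll_depth_px,
  MAX(ev.doc_height)                                           AS max_doc_height_px,
  ROUND(100.0 * MAX(ev.pp_yoffset_max) / NULLIF(MAX(ev.doc_height), 0))
                                                               AS scroll_depth_percent
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON  wp.root_id     = ev.event_id
  AND wp.root_tstamp = ev.collector_tstamp
WHERE ev.event IN ('page_view', 'page_ping')
GROUP BY wp.id;
```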
Let’s go through what’s happening here.
We create a table lookup_pageview using a query that joins atomic.events to atomic.com_snowplowanalytics_snowplow_web_page_1, keyed on the event ID and collector_tstamp (root_id and root_tstamp on the context side), the usual way of joining context tables to the core events table.
The output table has a time_spent_seconds column with a time spent calculation based off the page ping setting we use of window.snowplow('enableActivityTracking', 5, 5); with a heartbeat of 5 seconds, each page ping represents roughly 5 seconds of activity.
Next we get the maximum values for the pageview of the scroll depth and document height in pixels, then calculate from those the percentage of the page that was scrolled.
That's really it! Quite simple.
To build the report output shown above, you can then do a query with a three-way join between atomic.events, atomic.com_snowplowanalytics_snowplow_web_page_1 and our just-created lookup_pageview table.
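A sketch of that report query, assuming the lookup_pageview table above (domain_sessionid and domain_userid are the standard atomic.events session and browser identifiers):

```sql
-- Per-page report: pageviews, sessions, unique browsers,
-- average time spent and average scroll depth.
SELECT
  ev.page_urlpath                      AS page,
  COUNT(DISTINCT wp.id)                AS pageviews,
  COUNT(DISTINCT ev.domain_sessionid)  AS sessions,
  COUNT(DISTINCT ev.domain_userid)     AS unique_browsers,
  AVG(pv.time_spent_seconds)           AS avg_time_spent_seconds,
  AVG(pv.scroll_depth_percent)         AS avg_scroll_depth_percent
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON  wp.root_id     = ev.event_id
  AND wp.root_tstamp = ev.collector_tstamp
JOIN lookup_pageview pv
  ON pv.page_view_id = wp.id
WHERE ev.event = 'page_view'
GROUP BY ev.page_urlpath
ORDER BY pageviews DESC;
```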
Eventually you would write out new data tables that replace the need to ever hit atomic.events directly, containing the items you want from atomic.events and the identifiers for the applicable pageview, session and user objects. You wouldn't copy across page pings because all you need is summarised in the pageview table.
Again, you might create a view that does the joins to give you one row per pageview with everything an analyst needs.
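A minimal sketch of such a view, with an illustrative name and column selection:

```sql
-- One row per pageview, with common event fields alongside the modelled metrics.
CREATE VIEW pageview_model AS
SELECT
  wp.id                    AS page_view_id,
  ev.page_urlpath,
  ev.domain_sessionid,
  ev.domain_userid,
  pv.time_spent_seconds,
  pv.scroll_depth_percent
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON  wp.root_id     = ev.event_id
  AND wp.root_tstamp = ev.collector_tstamp
JOIN lookup_pageview pv
  ON pv.page_view_id = wp.id
WHERE ev.event = 'page_view';
```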
Below is a query to bucket page views by scroll depth quartile, handy for getting a broader view of user interactions:
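Something along these lines does the job, assuming the scroll_depth_percent column from the lookup_pageview sketch above:

```sql
-- Bucket pageviews into scroll depth quartiles.
SELECT
  CASE
    WHEN scroll_depth_percent <= 25 THEN '0-25%'
    WHEN scroll_depth_percent <= 50 THEN '26-50%'
    WHEN scroll_depth_percent <= 75 THEN '51-75%'
    ELSE '76-100%'
  END       AS scroll_depth_bucket,
  COUNT(*)  AS pageviews
FROM lookup_pageview
GROUP BY 1
ORDER BY 1;
```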
You probably want to create a schema especially for data models instead of dropping this table into atomic. We use a models schema for this.
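Setting that up is a one-off statement (the schema name is just our convention):

```sql
-- Dedicated schema for modelled tables, kept separate from atomic.
CREATE SCHEMA IF NOT EXISTS models;
```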
The script above recreates the table for all pageviews every time it runs. To automate this, you'll want to trigger it with SQL Runner after each batch completes. We don't have enormous traffic volumes, so recreating everything each time isn't a big deal. If you have high traffic volumes, you might want to use a smaller window and keep old data around.
To do this you'll need to use a look-back window, something like 24 hours, to account for pageviews that are still happening across the batch boundary, and only update those within that scope. This would have the helpful side effect of doing less crunching on every batch, which is probably preferable. I wanted to keep the example simple.
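One way to sketch that, assuming the lookup_pageview table already exists and using Redshift-style SQL, is to find the pageviews touched within the look-back window, delete their existing rows and re-insert fresh aggregates:

```sql
BEGIN;

-- Pageviews that received any event in the last 24 hours
CREATE TEMP TABLE recent_pageviews AS
SELECT DISTINCT wp.id AS page_view_id
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON  wp.root_id     = ev.event_id
  AND wp.root_tstamp = ev.collector_tstamp
WHERE ev.collector_tstamp > DATEADD(hour, -24, GETDATE());

-- Remove their old rows, then recompute just those pageviews
DELETE FROM lookup_pageview
WHERE page_view_id IN (SELECT page_view_id FROM recent_pageviews);

INSERT INTO lookup_pageview
SELECT
  wp.id,
  MIN(ev.collector_tstamp),
  SUM(CASE WHEN ev.event = 'page_ping' THEN 1 ELSE 0 END) * 5,
  MAX(ev.pp_yoffset_max),
  MAX(ev.doc_height),
  ROUND(100.0 * MAX(ev.pp_yoffset_max) / NULLIF(MAX(ev.doc_height), 0))
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON  wp.root_id     = ev.event_id
  AND wp.root_tstamp = ev.collector_tstamp
JOIN recent_pageviews rp
  ON rp.page_view_id = wp.id
WHERE ev.event IN ('page_view', 'page_ping')
GROUP BY wp.id;

COMMIT;
```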
In the next chapter of this series we’ll look at a session model. Sessions are interesting because they have a traffic source: how the user got to your site. The classifications of these sources are very company-specific, so you often end up with a massive CASE statement to put them in appropriate buckets, much like Google Analytics has its Channel Groupings.
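As a small taste of what's coming, a hypothetical and much-abbreviated classification using the referrer and campaign fields in atomic.events might look like this (the channel names and rules are purely illustrative):

```sql
-- Classify sessions into channels based on referrer and campaign medium.
SELECT
  CASE
    WHEN mkt_medium  = 'cpc'      THEN 'Paid Search'
    WHEN refr_medium = 'search'   THEN 'Organic Search'
    WHEN refr_medium = 'social'   THEN 'Social'
    WHEN refr_medium = 'email'    THEN 'Email'
    WHEN refr_medium IS NULL
      OR refr_medium = 'internal' THEN 'Direct / Internal'
    ELSE 'Referral'
  END                                AS channel,
  COUNT(DISTINCT domain_sessionid)   AS sessions
FROM atomic.events
WHERE event = 'page_view'
GROUP BY 1;
```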
Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.