Modelling your Snowplow event data: Part 2 Pageviews

Wednesday 15 August, 2018 | By: Simon Rumble

In the first part of this series on data modelling we went through the background for building a data model. In this edition we’ll go through the steps to create a basic pageview model that incorporates page pings so we can see accurate time spent and scroll depth for each pageview.

Using this model you can aggregate the pageview rows and build reports to explore the traffic to your site, like the one shown above, which gives Pageviews, Sessions, Unique Browsers, Average Time Spent and Average Scroll Depth metrics for each page.

Create a lookup table on pageview_id

This model takes advantage of the web page context configuration option of the JavaScript tracker, which automatically generates a UUID for each pageview and attaches it to every event fired on that page. These identifiers land in the table atomic.com_snowplowanalytics_snowplow_web_page_1. We’ll create a table that aggregates all the page_ping data against that pageview_id.

-- DROP is only needed when recreating the table after the first run
DROP TABLE IF EXISTS lookup_pageview;
CREATE TABLE lookup_pageview
  DISTKEY(pageview_id)
  SORTKEY(pageview_id)
  AS (
  SELECT
    atomic.com_snowplowanalytics_snowplow_web_page_1.id AS pageview_id,
    -- With enableActivityTracking(5, 5) each row (pageview or ping)
    -- represents 5 seconds; subtract 5 so zero pings = zero seconds
    (COUNT(*) * 5) - 5                                  AS time_spent_seconds,
    MAX(pp_yoffset_max)                                 AS max_pp_yoffset_max,
    MAX(doc_height)                                     AS max_doc_height,
    -- Deepest scroll position as a fraction of the document height
    CASE WHEN MAX(pp_yoffset_max) = 0 THEN 0
         ELSE (MAX(CAST(pp_yoffset_max AS float)) / MAX(CAST(doc_height AS float)))
    END AS scroll_depth_percent
  FROM atomic.events
    JOIN atomic.com_snowplowanalytics_snowplow_web_page_1
      ON atomic.events.event_id = atomic.com_snowplowanalytics_snowplow_web_page_1.root_id AND
         atomic.events.collector_tstamp = atomic.com_snowplowanalytics_snowplow_web_page_1.root_tstamp
  WHERE
    app_id = 'snowflake-analytics.com'
    AND event IN ('page_view', 'page_ping')
  GROUP BY 1
);

Let’s go through what’s happening here.

We create a table lookup_pageview using a query that joins atomic.com_snowplowanalytics_snowplow_web_page_1 to atomic.events keyed by event_id and collector_tstamp, the usual way of joining context tables to the core event table.

The output table has a time_spent_seconds column derived from the page ping setting we use, window.snowplow('enableActivityTracking', 5, 5);. The first ping fires 5 seconds after the pageview and every 5 seconds thereafter, so a pageview with N pings represents roughly N × 5 seconds of attention, hence (COUNT(*) * 5) - 5.

Next we get the maximum scroll depth and document height in pixels for the pageview, then calculate the scroll depth as a percentage of the document height.

That’s really it! Quite simple really.
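To sanity-check the result before building reports on it, you can eyeball a few rows of the new table (a quick exploratory query, not part of the model itself):

```sql
-- Spot-check the lookup table: longest-engaged pageviews first
SELECT
  pageview_id,
  time_spent_seconds,
  max_pp_yoffset_max,
  max_doc_height,
  ROUND(scroll_depth_percent * 100) AS scroll_depth_pct
FROM lookup_pageview
ORDER BY time_spent_seconds DESC
LIMIT 10;
```

If time_spent_seconds looks implausibly high for some rows, that's usually a tab left open in the background, something worth remembering when interpreting averages.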

Reporting

To build the report output shown above, you can then do a query with a three-way join between atomic.events, atomic.com_snowplowanalytics_snowplow_web_page_1 and our just-created lookup_pageview pv.

SELECT
  ev.page_urlpath,
  SUM(CASE WHEN ev.event = 'page_view' THEN 1 ELSE 0 END) AS pageviews,
  COUNT(DISTINCT ev.domain_sessionid) AS sessions,
  COUNT(DISTINCT ev.domain_userid) AS unique_browsers,
  AVG(pv.time_spent_seconds) AS time_spent_average,
  SUM(pv.time_spent_seconds) AS time_spent_total,
  ROUND(AVG(pv.scroll_depth_percent) * 100) AS scroll_depth_percent_average
FROM atomic.events ev, atomic.com_snowplowanalytics_snowplow_web_page_1 pg, lookup_pageview pv
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.event_id = pg.root_id
  AND ev.collector_tstamp = pg.root_tstamp
  AND pg.id = pv.pageview_id
  AND ev.app_id = 'snowflake-analytics.com'
  -- One row per pageview, so totals and averages aren't weighted by ping counts
  AND ev.event = 'page_view'
GROUP BY 1
ORDER BY 2 DESC;

Eventually you would write out new data tables that replace the need to ever hit atomic.events directly, containing the items you want from atomic.events and the identifiers for the applicable pageview, session and user objects. You wouldn’t copy across page pings because all you need is summarised in the pageview table. Again, you might create a view that does the joins to give you one row per pageview with everything an analyst could need.
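As a sketch of that idea (the view and schema names here are illustrative, not from the post), a view that gives analysts one row per pageview might look like:

```sql
-- Hypothetical convenience view: one row per pageview with the
-- engagement metrics attached. Name (models.pageviews) is illustrative.
CREATE VIEW models.pageviews AS
SELECT
  ev.derived_tstamp,
  ev.page_urlpath,
  ev.domain_userid,
  ev.domain_sessionid,
  pg.id AS pageview_id,
  pv.time_spent_seconds,
  pv.scroll_depth_percent
FROM atomic.events ev
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
    ON ev.event_id = pg.root_id AND ev.collector_tstamp = pg.root_tstamp
  JOIN lookup_pageview pv
    ON pg.id = pv.pageview_id
WHERE ev.event = 'page_view';
```

Reports like the ones in this post then become simple aggregations over the view, with no joins against atomic.events for the analyst to get wrong.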

Below is a query to bucket page views by scroll depth quartile, handy for getting a broader view of user interactions:

SELECT
  ev.page_urlpath,
  SUM(CASE WHEN pv.scroll_depth_percent = 0 THEN 1 ELSE 0 END) AS "0%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0 AND pv.scroll_depth_percent <= 0.25 THEN 1 ELSE 0 END) AS "25%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0.25 AND pv.scroll_depth_percent <= 0.5 THEN 1 ELSE 0 END) AS "50%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0.5 AND pv.scroll_depth_percent <= 0.75 THEN 1 ELSE 0 END) AS "75%",
  SUM(CASE WHEN pv.scroll_depth_percent > 0.75 THEN 1 ELSE 0 END) AS "100%"
FROM atomic.events ev, atomic.com_snowplowanalytics_snowplow_web_page_1 pg, lookup_pageview pv
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.event_id = pg.root_id
  AND ev.collector_tstamp = pg.root_tstamp
  AND pg.id = pv.pageview_id
  AND ev.app_id = 'snowflake-analytics.com'
  AND ev.event = 'page_view'
GROUP BY 1
ORDER BY COUNT(*) DESC;

Automation and moving to production

You probably want to create a schema especially for data models instead of dropping this table into atomic. We use a models schema for this.
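For instance (a sketch; use whatever schema name suits your conventions):

```sql
-- Dedicated schema for modelled data, separate from the raw atomic tables
CREATE SCHEMA IF NOT EXISTS models;

-- The model script then targets that schema, e.g.:
-- CREATE TABLE models.lookup_pageview ... AS ( ... );
```

Keeping modelled tables out of atomic also makes permissions simpler: analysts can be granted access to models without touching the raw event data.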

The script above recreates the table for all pageviews every time it runs. To automate this, trigger it via SQL Runner after each batch completes. We don’t have enormous traffic volumes, so recreating every time isn’t a big deal. If you do have high traffic volumes, you might want to use a smaller window and keep the older data around.

To do this you’ll need a look-back window, something like 24 hours, to account for pageviews still happening across the batch boundary, and only update pageviews within that scope. This has the helpful side effect of doing far less crunching on every batch, which is probably preferable anyway; I wanted to keep the example simple.
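A hypothetical incremental variant might look like the following (a sketch, assuming a 24-hour look-back; note the rebuild must aggregate over all events of any pageview touched in the window, not just the last 24 hours of events):

```sql
-- 1. Remove pageviews that received events in the look-back window
DELETE FROM lookup_pageview
WHERE pageview_id IN (
  SELECT DISTINCT pg.id
  FROM atomic.events ev
    JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
      ON ev.event_id = pg.root_id AND ev.collector_tstamp = pg.root_tstamp
  WHERE ev.collector_tstamp > SYSDATE - INTERVAL '24 hours'
);

-- 2. Rebuild those pageviews from their complete event history
INSERT INTO lookup_pageview
SELECT
  pg.id AS pageview_id,
  (COUNT(*) * 5) - 5 AS time_spent_seconds,
  MAX(ev.pp_yoffset_max) AS max_pp_yoffset_max,
  MAX(ev.doc_height) AS max_doc_height,
  CASE WHEN MAX(ev.pp_yoffset_max) = 0 THEN 0
       ELSE MAX(CAST(ev.pp_yoffset_max AS float)) / MAX(CAST(ev.doc_height AS float))
  END AS scroll_depth_percent
FROM atomic.events ev
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
    ON ev.event_id = pg.root_id AND ev.collector_tstamp = pg.root_tstamp
WHERE ev.app_id = 'snowflake-analytics.com'
  AND ev.event IN ('page_view', 'page_ping')
  AND pg.id IN (
    SELECT DISTINCT pg2.id
    FROM atomic.events ev2
      JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg2
        ON ev2.event_id = pg2.root_id AND ev2.collector_tstamp = pg2.root_tstamp
    WHERE ev2.collector_tstamp > SYSDATE - INTERVAL '24 hours'
  )
GROUP BY 1;
```

Running the DELETE and INSERT inside a transaction keeps the table consistent for any reports querying it mid-update.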

Next steps

In the next chapter of this series we’ll look at a session model. Sessions are interesting because they have a traffic source: how the user got to your site. The classifications of these sources are very company-specific, so you often end up with a massive CASE statement to put them in appropriate buckets, much like Google Analytics has its Channel Groupings.
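As a taste of what that looks like (an illustrative sketch only; the groupings are made up, though mkt_medium and refr_medium are real atomic.events columns):

```sql
-- Much-simplified channel grouping; real versions run to dozens of branches
SELECT
  CASE
    WHEN mkt_medium = 'cpc'      THEN 'Paid Search'
    WHEN refr_medium = 'search'  THEN 'Organic Search'
    WHEN refr_medium = 'social'  THEN 'Social'
    WHEN refr_medium IS NULL     THEN 'Direct'
    ELSE 'Other'
  END AS channel,
  COUNT(DISTINCT domain_sessionid) AS sessions
FROM atomic.events
GROUP BY 1;
```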

Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.

Continue to Part 3: Sessions
