Accurate time spent: A killer feature of Snowplow Analytics

Web analytics tools commonly have a Time Spent metric. Understanding how long people have spent reading a page is a really valuable thing for some businesses. For publishers, the quality of engagement with content is vital, given they're effectively selling the attention of their readers.

What many people don't realise is that the way this metric is calculated is critically flawed in the most popular web analytics tools, Adobe Analytics and Google Analytics. Both tools calculate Time Spent by counting the time between pageview events.

How is Time Spent calculated?

In this example we see a user who:

  • Lands on a page and reads it for 30 seconds
  • Moves to another page and reads it for 45 seconds
  • Moves to another page and reads it for 90 seconds
  • Leaves the site

So the actual Time Spent is 2 minutes 45 seconds. However, Google Analytics and Adobe Analytics will allocate zero seconds to the final pageview, because these tools only work from the difference in timestamps between pageview events and there is no pageview event following the final page. So Google Analytics and Adobe Analytics will record 1 minute 15 seconds for this session.
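The timestamp-difference method can be sketched in a few lines of Python. This is a toy model of the example above, not how either tool is implemented internally:

```python
# Pageview timestamps in seconds from the start of the session, and the
# time the user actually spent reading each page (30s + 45s + 90s = 165s).
pageview_timestamps = [0, 30, 75]
actual_read_times = [30, 45, 90]

# Timestamp-difference method: each pageview is credited with the gap to
# the next pageview; the final pageview gets zero as nothing follows it.
measured = [nxt - cur for cur, nxt in zip(pageview_timestamps, pageview_timestamps[1:])]

print(sum(measured))           # 75  -> reported as 1 minute 15 seconds
print(sum(actual_read_times))  # 165 -> actual 2 minutes 45 seconds
```

The 90 seconds spent on the final page simply vanishes from the measured total.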

The pathological example of this problem is for a single page session:

In this example, the user enters the site, views the content for 30 seconds and then leaves. Traditional web analytics tools will record zero seconds against this session as there is only the single pageview event.

Many publishers now receive a huge amount of traffic in the form of single-page visits, primarily from aggregators and social networks. This means that despite your content receiving significant attention, your analytics will show very low Time Spent and a high bounce rate.

How Snowplow does this

Snowplow's JavaScript tracker contains an option called Activity Tracking which enables a Page Ping that is sent every n seconds. In this example I've used 5 seconds as the ping interval, starting 5 seconds after the page loads:


So once the page loads, a ping is sent every five seconds recording that the page is still open. That lets us calculate the actual time a user spent to within five seconds.
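As a minimal sketch of the idea (assuming one row per event, with hypothetical event names matching Snowplow's page_view and page_ping events), counting the pings for a pageview gives a Time Spent figure accurate to the ping interval:

```python
PING_INTERVAL = 5  # seconds between page pings, matching the tracker setting

# Hypothetical event stream for a single-page session: one page_view event
# followed by the page pings received while the page stayed open.
events = ['page_view', 'page_ping', 'page_ping', 'page_ping',
          'page_ping', 'page_ping', 'page_ping']

# Each ping confirms the page was still open ~PING_INTERVAL seconds later.
ping_count = sum(1 for e in events if e == 'page_ping')
time_spent = ping_count * PING_INTERVAL

print(time_spent)  # 30 -> the 30-second single-page visit is no longer lost
```

The same session that traditional tools would record as zero seconds now yields a sensible figure.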

Other tools

For tools that use the traditional mechanism of measuring Time Spent, there are some workarounds to get better numbers, though none are ideal. The biggest problem is that there is no reliable mechanism to ensure a pixel is sent when a user leaves your site.


Chartbeat uses a similar approach for data collection, although they attempt to measure actual engagement by monitoring user activity in the window as well. It would be interesting to apply this approach to the Snowplow page ping. Shouldn't be too hard to update the tracker to support this.

Google Analytics

  • Riveted plugin, as described here, tracks actual user engagements
  • Simo Ahava, always worth following, has a few different approaches to the problem

Adobe Analytics

  • Adobe's calculation method has changed a couple of times, so be sure you understand that.
  • Some have attempted to trap clicks on exit links and the unload event in the browser and then send a custom link to give the tool another event to extend the session for the last pageview, but these methods aren't reliable across browsers and platforms.
  • I have a conceptual approach of how to do this very accurately in Adobe Analytics without costing enormous amounts. Drop me a note if you'd like to know more.

How to analyse?

Page pings create a large number of event rows inside your Snowplow database, consistent with the way Snowplow does things. That opens up a bunch of different ways for you to calculate the Time Spent metric. In the next blog, Mike will go through different approaches for modelling the page ping data to analyse Time Spent.

Until then you can check out Snowplow's Measuring content page performance article.

Welcome to our first hire, Raad

We're pleased to welcome our first hire, Badruddin Kamal or Raad for short. He started this morning after blitzing our hiring process and impressing us enormously.

Raad's background includes a Masters of Computing at ANU, working for a big mobile payments company and some really cool projects including using drones to track wildlife for biology researchers.

Welcome aboard Raad. Snowflake is growing fast and we're all very excited to have you.


We have amazing clients

Last night we hosted a bunch of our customers, partners and friends for dinner and drinks in our new office. We had hoped to enjoy the lovely terrace here, but sadly the weather wasn't accommodating. We enjoyed cooking for you all and everyone seemed to have a good time.

Thanks for a great first six months of Snowflake Analytics. We're looking forward to helping you all get the most out of 2017.

We had a few recipe requests. The lentil salad was this one. The coleslaw came from the Cornersmith cookbook and the other two from Moro cookbooks. See below.

Decoding Snowplow real-time bad rows (Thrift)

In this tutorial we’ll look at decoding the bad rows data that comes out of Snowplow real-time. In the real-time pipeline, bad rows that are inserted into Elasticsearch (and S3) are stored as base64-encoded, binary-serialized Thrift records. We’ll walk step by step through how to decode and then deserialize these records in Python.

You can find the code for this tutorial without the accompanying text on Github here.

Before we start you’ll need three things:

1. A copy of Python 3

2. The thriftpy library which can be installed using pip install thriftpy

3. A local copy of the collector payload thrift file which can be downloaded here

If you’re running the real-time pipeline you can take a sample payload from your Elasticsearch bad rows index (the line property of a record); otherwise, we’ll use a sample event I’ve generated below.

import base64
import thriftpy
from thriftpy.protocol import TCyBinaryProtocolFactory
from thriftpy.utils import deserialize, serialize

First up we’ll import the libraries we need. base64 is part of the standard library, and the thriftpy imports come from the thriftpy library installed above. Let’s define our example payload below as sample_payload - a base64 string taken from the line property of a bad row.


Now let’s use the b64decode method to decode the payload.

decoded_payload = base64.b64decode(sample_payload)

b"\x0b\x01@\x00\x00\x00\x02/i\x0b\x00d\x00\x00\x00\t127.0.0.1\n\x00\xc8\x00\x00\x01X\xdb\x9e7h\x0bzi\x00\x00\x00Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0\x0b\x01J\x00\x00\x01hstm=1481153329000&e=pv&\x0b\x01,\x00\x00\x00mMozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x0b\x01\x90\x00\x00\x00!\x0b\x00\xd2\x00\x00\x00\x05UTF-8\x0b\x016\x00\x00\x00\x1f\x0b\x01\x9a\x00\x00\x00$4d10d3d0-c2b4-473e-a483-812799e824d1\x0b\x00\xdc\x00\x00\x00\x11ssc-0.7.0-kinesis\x0f\x01^\x0b\x00\x00\x00\x0b\x00\x00\x00'Host:\x00\x00\x00\x1dAccept: image/webp, */*;q=0.8\x00\x00\x00$Accept-Encoding: gzip, deflate, sdch\x00\x00\x007Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4\x00\x00\x00\x14Cookie: sp=abcd-1234\x00\x00\x00'Referer:\x00\x00\x00zUser-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x00\x00\x00\x1aX-Forwarded-For:\x00\x00\x00\x15X-Forwarded-Port: 443\x00\x00\x00\x18X-Forwarded-Proto: https\x00\x00\x00\x16Connection: keep-alive\x00"

You’ll see that some fields are recognisable here. We can make out a query string and some headers but there’s a lot of data in between that makes this a bit difficult to read and parse. This is our binary serialized Thrift record that we’ll deserialize next.

collector = thriftpy.load("collector-payload.thrift")
collector_payload = collector.CollectorPayload()
raw_payload = deserialize(collector_payload, decoded_payload, TCyBinaryProtocolFactory())

In the cell above we’re doing quite a bit. First we load the Snowplow-provided .thrift file, which acts as a specification for both serializing and deserializing the data. Next we initialise a structure based on the CollectorPayload struct, and finally we deserialize the base64-decoded payload by providing the class, the decoded payload and a protocol factory.


CollectorPayload(querystring='stm=1481153329000&e=pv&', collector='ssc-0.7.0-kinesis', ipAddress='', timestamp=1481153329000, path='/i', contentType=None, refererUri='', schema='iglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0', encoding='UTF-8', hostname='', userAgent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', body=None, headers=['Host:', 'Accept: image/webp, */*;q=0.8', 'Accept-Encoding: gzip, deflate, sdch', 'Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4', 'Cookie: sp=abcd-1234', 'Referer:', 'User-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'X-Forwarded-For:', 'X-Forwarded-Port: 443', 'X-Forwarded-Proto: https', 'Connection: keep-alive'], networkUserId='4d10d3d0-c2b4-473e-a483-812799e824d1')

Excellent! That’s looking a lot more readable already. If you haven’t already noticed, the properties of our CollectorPayload class match the fields defined in our collector-payload.thrift record. Let’s access some of them directly.

querystring = raw_payload.querystring
network_userid = raw_payload.networkUserId


We can now access these properties directly, but at this stage we’ve just got the raw querystring - you’ll recognise this as the network request that is sent to the collector when using the Snowplow JavaScript library (tna=js-3.6.0). Let’s split this out so we can look at the individual key-value pairs in the payload.

from urllib import parse
params = parse.parse_qs(querystring)

{'aid': ['snowflake'],
 'cd': ['32'],
 'cookie': ['1'],
 'cs': ['UTF-8'],
 'ds': ['200x26295'],
 'dtm': ['1479927857201'],
 'duid': ['4d10d3d0-c2b4-473e-a483-812799e824d1'],
 'e': ['pv'],
 'eid': ['c259c25e-f94b-4d2c-a11c-0d278c2e6d1a'],
 'fp': ['129112332'],
 'lang': ['en-AU'],
 'p': ['web'],
 'res': ['360x640'],
 'sid': ['7fb97fC6-6e0f-4022-adbd-017431520ddb'],
 'stm': ['1481153329000'],
 'tna': ['js-3.6.0'],
 'tv': ['js-2.6.0'],
 'tz': ['Australia/Sydney'],
 'url': [''],
 'vid': ['4'],
 'vp': ['0x-56']}

If you’re familiar with Python you’ll note that our parse_qs function returns a dictionary containing values in the form of lists. We can access these values easily - in this case, let’s extract the app_id for the event.

app_id = params.get('aid')[0]


In the sample above we’re using the built-in get method (on a dictionary) to access the aid property rather than app_id - you can find a list of the mappings from URL parameter to friendly name here. The [0] accesses the first element in the list for the aid key. In our example each of these lists contains only one value, so we don’t need to worry about addressing additional values.
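To go one step further, a small mapping from the short parameter names to their friendly names makes the extracted event easier to read. Only a handful of mappings are shown here for illustration - see the tracker protocol documentation for the full list:

```python
# Illustrative subset of the URL-parameter-to-friendly-name mappings.
friendly_names = {'aid': 'app_id', 'e': 'event', 'p': 'platform', 'duid': 'domain_userid'}

params = {'aid': ['snowflake'], 'e': ['pv'], 'p': ['web']}

# Rename the keys we know about and unwrap the single-element lists.
event = {friendly_names.get(key, key): values[0] for key, values in params.items()}
print(event)  # {'app_id': 'snowflake', 'event': 'pv', 'platform': 'web'}
```

Unknown keys fall through unchanged thanks to the get(key, key) default.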

That concludes this tutorial. Please post below if you’ve got questions or if parts of the post require additional clarification - this is quite an involved process.

In our next tutorial we’ll cover how to extend this simple process from decoding a single bad row to decoding all the bad rows from a run, and how to loop this data back into Redshift so we can query it!

Monitoring Snowplow bad rows using Lambda and Cloudwatch

In this tutorial we'll use AWS Lambda and Amazon Cloudwatch to set up monitoring for the number of bad rows inserted into Elasticsearch over a period of time. This allows us to set an alert on a threshold of bad rows, generating an email or notification when the threshold has been exceeded. Snowplow users on the real-time pipeline will find this most useful; however, users running loads in batch can also adapt this monitoring.

We'll use the following services throughout this tutorial so check first that they are available in your AWS region of choice:

  • Cloudwatch rules
  • Lambda

Below is an overview of what we'll be setting up.

A Cloudwatch rule will trigger a Lambda function every 5 minutes. This Lambda function will query our Elasticsearch cluster, and return the number of rows in the bad index for the last 5 minutes. Finally, we'll send this result to our own custom metric to allow for monitoring and alerting using Cloudwatch alarms.

Lambda execution role

First, let's set up an execution role for our AWS Lambda function. This execution role will allow access to the appropriate resources, namely being able to submit a metric to Cloudwatch.

  1. Navigate to IAM in the AWS console.
  2. Select 'Policy' and 'Create Policy' in this interface.
  3. Select 'Create Your Own Policy'
  4. Name the policy document and provide a description. Under the 'Policy document' we'll use the following configuration.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        }
    ]
}

This will allow the Lambda function to log to Cloudwatch for debugging as well as pushing data for the bad rows metric we will define shortly.

Next, we'll create a role for the Lambda function to use, and attach this policy.

  1. In IAM select 'Role' and 'Create New Role'.
  2. Under 'AWS Service Roles' select 'AWS Lambda'
  3. Search and select the policy you created above, review the policy and hit 'Create Role'.

Creating our Lambda function to retrieve the number of bad rows

  1. In the AWS Lambda dashboard select 'Create a Lambda function'.
  2. Select the blueprint as 'Blank function' as we'll be writing our own code for this function.
  3. For the moment, leave triggers blank and hit 'Next'.
  4. Name your function, e.g., snowplow_bad_rows_count
  5. In this example we will use the 'Python 2.7' runtime.
  6. For the function code copy and paste the function below.
import json
import urllib2
import boto3
# import required libraries

def lambda_handler(event, context):
    host = '' # your elasticsearch host
    port = 9200 # your elasticsearch port
    index_name = 'bad-alias' # your elasticsearch bad index
    url = 'http://{0}:{1}/{2}/_count'.format(host, port, index_name)
    query = '{"query": {"range": {"failure_tstamp": {"gt": "now-5m"}}}}' # query for bad rows in the last 5 minutes
    request = urllib2.urlopen(url, query) # connect to Elasticsearch and send the query
    result = json.loads(request.read()) # read and decode the JSON response
    bad_row_count = result.get('count') # select the count property
    request.close() # close the connection
    client = boto3.client('cloudwatch') # set up a Cloudwatch client
    metric_data = [
        {
            'MetricName': 'bad_rows_count',
            'Value': bad_row_count,
            'Unit': 'Count'
        }
    ] # prepare a payload to send with the name 'bad_rows_count'
    client.put_metric_data(Namespace='snowplow', MetricData=metric_data) # send the data to Cloudwatch using the 'snowplow' namespace
    return bad_row_count # return the number of bad rows in the last 5 minutes

In the above snippet change the host variable to point to your Elasticsearch cluster. This is likely to be either a load balancer if you're running Elasticsearch on EC2 instances, or the Elasticsearch endpoint if using AWS Elasticsearch Service.

You should also change port to the appropriate port, which is likely to be either 9200 (the default) or 80 if using Elasticsearch Service or port-forwarding on a load balancer.

Finally, enter the name of the index. If you're unsure what this is, add /_cat/indices to the end of your Elasticsearch URL to list all indices in your Elasticsearch cluster.

  • To modify the period in which Elasticsearch looks for bad events change the 'now-5m' string to the period required, e.g., for 1 hour use 'now-1h'.
  • To modify the name of the metric, change the 'MetricName' in the data payload
  • To modify the event namespace, change the Namespace in the put_metric_data call.
  1. The handler field can be left the same.
  2. Under 'Existing role' select the role we created in the previous steps.
  3. The memory and timeout of advanced settings can be left the same.
  4. Select an appropriate VPC. If running an Elasticsearch cluster within a VPC you may need to use this VPC; however, if your cluster is accessible externally 'No VPC' may work.
  5. Review the function and select 'Create function'.
  6. Let's quickly test the function to ensure that it is able to connect to Elasticsearch successfully and produces the expected output.
  7. Select 'Test' in the interface, and 'Hello World' as the sample event template. The input data does not matter, as our function ignores the incoming event and only produces output.
  8. Select 'Save and test'
  9. Your function should successfully execute and return the number of bad rows in the Elasticsearch index selected. You can verify this using the Kibana interface if required.
  10. If the function times out, double check the address and settings for your Elasticsearch cluster, the function may not be able to connect.

Create our Cloudwatch trigger

  1. In the Cloudwatch interface, select 'Rules' under 'Events' from the left-hand pane.
  2. Select 'Create Rule'
  3. For the event source select 'Schedule', and for the moment we'll leave it as the default of fixed rate of every 5 minutes. This sets the option of when, and how often our Lambda function will run and our metric will update. More information about the cron syntax AWS uses can be found here.
  4. Under 'Targets', select 'Lambda function' and the name of the Lambda function you created above.
  5. Select 'Configure details'
  6. Add a meaningful name and description for this rule and leave the enabled checkbox selected. Select 'Create Rule'.

Check our metric is working in Cloudwatch

  1. In the Cloudwatch interface, scroll to the bottom of the left-hand panel and, under 'Custom Metrics...', select 'snowplow'. Your metric name should appear here, and by clicking the checkbox you should see a data point every 5 minutes. If you've only just set this up you may see only one data point.

To create an alarm

  1. Select the metric and select the 'Create Alarm' button in the right hand panel under 'Actions'.
  2. Set a meaningful name and description for the alarm, and a threshold to trigger at based upon the number of bad rows you expect to have in a 5 minute period. Anything above this threshold will trigger the alarm.
  3. Under Actions, you can select a notification list (if you have a SNS topic set up) or alternately if you select 'New list' you can enter a comma delimited list of email addresses.
  4. This will notify users when the alarm is triggered based on the criteria you have set.


Running this monitoring will cost approximately $0.69 a month depending on your usage, frequency and availability zone. For many users some of this cost may be included in the AWS free tier.

$0.50 per custom metric
$0.10 per alarm
~$0.09 for Cloudwatch PutMetricData (9000 requests/month, assuming rule triggers every 5 minutes)
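As a back-of-the-envelope check (assuming PutMetricData is billed at $0.01 per 1,000 requests, consistent with the ~$0.09 figure above), the monthly total works out as follows:

```python
# A rule firing every 5 minutes -> 12 requests/hour over a 31-day month.
requests_per_month = 12 * 24 * 31                      # = 8928, roughly 9000
custom_metric = 0.50                                   # one custom metric
alarm = 0.10                                           # one alarm
put_metric_data = requests_per_month / 1000.0 * 0.01   # ~$0.09 of API requests

total = round(custom_metric + alarm + put_metric_data, 2)
print(total)  # 0.69
```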

If you have any questions or comments feel free to leave them in the comment section below, or in the Snowplow discourse forum.

Measuring podcasts

We have a few radio clients who use podcasts and audio streaming extensively. Measuring this stuff is hard, especially when the listens can happen in podcast software over which you have little control.

So I'm looking forward to listening to the latest Digital Analytics Power Hour episode with Steve Mulder from NPR talking about exactly this issue. It's always a great podcast and this is a particularly relevant topic right now. Check it out!

Appearing on the Digital Analytics Power Hour

A couple of weeks ago I recorded an episode of The Digital Analytics Power Hour with Mike and Tim. The topic of discussion is open-source analytics and we spend most of the episode talking about Snowplow.

It was a load of fun and I'm looking forward to seeing Tim again when he's in Sydney later this year.

It's due to be posted on 16th August, though I imagine that will be in a US timezone.