Snowplow Inspector extension updated to colour code and allow filtering

Monday 16 July, 2018. | By: Simon Rumble

Snowplow usage continues to go up and to the right, with new and interesting use cases proliferating. Alongside this we’ve seen a steady increase in usage of our debug tool, Snowplow Inspector.

We’ve started to see web sites with multiple Snowplow trackers on them for different use cases as businesses integrate the Open Source version into vertical-specific products. This can lead to a confusing view in the debugger with a lot of beacons appearing in the debugger and no easy way to differentiate them. Jethro has been hard at work improving the tool to handle these cases.

Colour Coding

Different colours are used for each unique combination of App ID and Collector

You’ll notice now that beacons are colour-coded in the list of beacons on the left-hand side. Each unique combination of app_id and collector gets a unique colour so you can quickly differentiate the different destinations. Mousing over the beacon in the left-hand column quickly shows you the Collector and App ID. As you scan your eye down the list of beacons, you can visually group them.


Filter by any of the fields in the debugger

In addition, if you want to filter them down by Collector or App ID, you can type that into the new filtering box. This filter actually works for any of the fields you see in the debugger (though not the Base 64-encoded ones just yet).

Let us know your thoughts

If you find any bugs or have any feature requests for the debug too, please create an issue in GitHub.

[Read More]

Setting up an AWS Athena datasource in Jetbrains DataGrip

Friday 29 June, 2018. | By: Simon Rumble

Download the JDBC driver from AWS and place it in the DataGrip JDBC driver directory. On Linux this was ~/.DataGrip2018.1/config/jdbc-drivers/.

Install Athena JDBC driver

File > Data Sources to open the Data Sources panel and click ‘+ > Driver’. Name it AWS Athena. Here’s the confusing bit: skip down to Additional Files and click + and find the location of your driver. Now back up above that select the driver Class com.simba.athena.jdbc.Driver. Click OK.

Okay so you now have the JDBC driver installed, so you need to create a connection to your Athena instance.

Create a connection for Athena File > Data Sources to open the Data Sources panel and click + > AWS Athena to create a new connection using the driver. Enter the access key and secret in the User and Password fields respectively.

Set the URL to jdbc:awsathena:// with your region string in there.

Create a connection for Athena Now click Advanced and set the two required keys, AwsRegion and S3OutputLocation. The region should be the same as you used above.

Create a connection for Athena S3OutputLocation is available in the AWS Console for Athena by clicking Settings.

Now click Test and you should get Success. You can now use Athena in DataGrip.

[Read More]

Event roundup and slides: MeasureCamp Auckland

Monday 18 June, 2018. | By: Simon Rumble

Last Saturday Mike and I made it across to the first MeasureCamp Auckland. We’ve sponsored all the previous events in Sydney and Melbourne so we were thrilled to help out the Auckland team, who did a fantastic job with a really good venue, excellent food and a really diverse crowd keen to get into it.

MeasureCamp physical board The board filled up very quickly. Clearly lots of people knew the drill and had prepared talks and sessions. I had some difficult decisions to make when sessions clashed but I didn’t go to a dud session all day!

MeasureCamp digital board At previous MeasureCamps a consistent problem has been access to the physical board. The board needs to be reachable for people to put up their talks, but that also means the people crowding around deciding on their sessions block the view of those behind them. It’s also difficult for people with visual impairments to read.

Mike has been working on a digital board. Initially as a fork of Vitaliy Matiyash’s code but eventually rebuilding it in Vue.js. The back-end is a Google Spreadsheet, allowing quite a bit of flexibility while keeping the board in sync with the physical board.

MeasureCamp digital board real-time stats from Snowplow data

Late the Friday night before MeasureCamp Auckland, Mike and I sat up instrumenting and improving the user interface. We were able to build a quick real-time dashboard in Kibana using the Snowplow instrumentation. For future MeasureCamps this should provide some helpful feedback on the popularity of sessions, allowing organisers to shift things around to ensure the most popular sessions are in the biggest rooms.

We have some other plans around automatically tracking the cards as they move around using a camera and some kind of unique marking on each card. Perhaps we could have microcontroller-drive screens at each room showing what’s on and up next in each location. Lots of ideas!

The code will be going up on Github shortly.

Johann de Boer on TV advertising impact measurement Johann did a really interesting talk about tracking television (and radio) impacts, and the limitations of the technique. This is something I’ve always wanted to play around with, tying a detailed as-run log of TV spots in different regions into an increase in visits to your site.

Myriad lunch options Mmmm donuts Food and drink at the event were spectacular. Organisers made heroic efforts to accomodate the incredible mosaic of food requirements. Free coffee in the morning was great too!

Bowling at the after party Finally we ended up at the afterparty, a bar that included a couple of bowling alleys. Loads of fun, loads of pizza and a seemingly endless bar tab. Great venue for it!

Slides from talks

Here’s the slides for my talks. Feel free to get in touch if you have any questions about anything.

Digital Advertising: How does it work?

My always-popular session going into the nuts and bolts of display advertising, how it works and why. The last slide has some handy links to learn more.

Snowplow Analytics: Introduction and demo

A quick introduction to how Snowplow Analytics works and why you might want to use it. There was some great discussion in this session.

[Read More]

GDPR emails: some early, some late

Monday 28 May, 2018. | By: Simon Rumble

In case you’ve been living in a cave, the EU’s General Data Protection Regulation came into force on Friday last week. This is Europe’s latest crack at regulating consumer privacy in the digital era, and it’s definitely a very pro-consumer regulation. It’s going to have some big impacts on the business models of a bunch of businesses.

Part of the regulation requires that EU consumers give informed consent to personal data collection and use so companies around the world have been scrambling to update their privacy policies. For people whose data is already collected, they’re sending out emails desperately trying to get permission to continue doing whatever it is they were doing.

GDPR emails by day

Last week I decided I was getting enough of these emails that it’d be interesting to chart how many of them I’d received. As you can see there was a flurry of emails just before the deadline, and quite a large amount of them that completely missed the deadline. The last one so far came from Blockchain on Sunday morning Sydney time.

There are some caveats here. When these emails first started turning up, I was mostly just binning them. So any received more than 30 days before I changed the way I dealt with them are probably gone. Similarly a bunch ended up in the spam folder so older ones would have been deleted.

[Read More]

Snowplow Inspector debug tool now validates schemas

Thursday 15 March, 2018. | By: Simon Rumble

Snowplow Inspector use around the world Last June we released our debug tool Snowplow Inspector and we’ve watched as hundreds of Snowplow users around the world have used it to make sense of their Snowplow tracking.

Latest version

Snowplow Inspector new design Jethro has been hard at work improving the debugger. First he ported it to Typescript for better maintainability, then he reworked the design to be more readable and useful. Following that he’s added a really useful new feature.

Schema validation

Snowplow Inspector new design The exciting new feature in the latest version is the ability to validate custom schemas within the debug tool. Custom schemas allow you to define your own event model to record details of events and the context in which they occur. Rather than shoehorn your event model into a static set of Custom Dimensions, you can model things using JSON, enabling really expressive data models that map very closely to your applications’ internal representation of things.

Custom data models require that you define a schema that tells Snowplow what is allowed and not allowed in the model. If an event doesn’t validate against the schema, it will end up being sent into the Bad Rows. You want to get things working in your development, staging and test environments before pushing changes to production and ending up with a lot of important events in Bad Rows.

Get set to validate your custom schemas

Out of the box Snowplow Inspector points at the standard Snowplow schemas. Once you start creating your own schemas you’ll want to point at your schema repository. Right-click on the Snowplow Inspector icon in your browser toolbar (it may be hidden under the hamburger icon ☰) and select Options. You can also click on Unrecognised in the debugger itself.

Snowplow Inspector options

Enter schema repository URLs for any custom repositories you use, one per line. You’ll need to enable access to the repository from wherever you’re running Chrome with IP whitelisting or similar.

Install the extension

To install the extension, go to the Chrome store and click Add to Chrome.

Feedback and changes

The extension is released under the GNU Public License, so you’re free to fork it and make changes. Check out the Github repo. If you find any bugs or have any feature requests, create a Github issues. Jethro is remarkably responsive to feature requests!

[Read More]

Setting up a BigQuery datasource in Jetbrains DataGrip

Monday 19 February, 2018. | By: Mike Robins

DataGrip is one of the most valuable tools for our engineers for exploring and querying a myriad of different database technologies. DataGrip doesn’t yet come bundled with a BigQuery driver so in this post we’ll explore how to setup a custom data source so that you can connect to BigQuery using DataGrip.

In this post we’ll use:

You will also require a Google service account that has access to BigQuery. If you don’t have one setup you can find instructions here.

Take note of the email address and the associated key pair with this account. You’ll need both of these pieces to setup the BigQuery connection.

Ensure that this service account has sufficient permissions to access BigQuery. If you aren’t sure if this is the case navigate to IAM, select your project and edit the role associated to the service account (e.g., BigQuery -> BigQuery Data Editor/Viewer)

Setting up the BigQuery driver

Add a new data source by clicking on the ‘+’ icon, select ‘Driver’ - the second last option in the drop down menu.

Enter ‘BigQuery’ as the name of the data source.

Under ‘Additional files’ click the ‘+’ icon and browse and select the all the jar files (ending in *.jar). There should be 7 in all.

Under ‘Class’ select com.simba.googlebigquery.jdbc42.Driver

Click the ‘+’ icon in URL templates and add the following URL template:

Name: default Template: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType=[AuthValue];

BigQuery driver setup

This concludes setting up the JDBC driver for BigQuery.

Setting up a BigQuery data source

Next let’s set up a data source to ensure we can connect.

  1. In the data sources window hit ‘+’ and select BigQuery.
  2. Enter the name of your datasource in the Name field.
  3. Leave the user and password fields blank.
  4. In the URL field customise and paste the following JDBC connection string:

where PROJECTID is your Google project ID, OAuthType=0 (for service account authorisation), EMAIL is the email address (ending in associated with the service account and KEYPATH is the absolute path to your keyfile (e.g., /Users/mike/Downloads/example.json) - both .json or .p12 keys will work here.


Hit the ‘Test Connection’ and if everything has been set up correctly Datagrip should notify you that the connection has succeeded.


Error: HttpTransport IO error

Check that your OAuthServiceAcctEmail and OAuthPvtKeyPath are set correctly and you’ve specified the absolute path to your key.


There a number of other options that can be customised for the JDBC driver by selecting ‘Advanced’ in the Data Source connection.

It is also possible to use other authentication methods to connect to BigQuery including:

Google User Account using OAuth 2 (OAuthType=1) Using pregenerated access and refresh tokens (OAuthType=2) Using the Application Default Credentials in your environment (OAuthType=3)

For more information about setting up these authentication methods please refer to the Simba JDBC Driver for Google BigQuery installation guide from the zip file earlier in the post.

[Read More]

Why would you push Google Analytics data into Snowplow?

Wednesday 14 February, 2018. | By: Simon Rumble

In late January the Snowplow team released R99 of Snowplow which includes a really interesting feature: with a small JavaScript change, you can mirror all the data being sent into Google Analytics into a Snowplow collector.

It’s a pretty awesome feature and got a lot of people talking. GTM legend Simo Ahava got really excited about it and wrote a couple of posts. Cogs started turning in the minds of a bunch of other people in the analytics community too, seeing where this might be useful.

If you need a recap of how Snowplow differs from GA and Adobe Analytics, check out How does Snowplow Analytics compare to other vendors?

Raw data

Pretty much everyone who has done digital analytics has come up with a use case where they wish they could get to the raw, underlying events happening. It might be because you have a change of definitions you want to turn into a continuous timeline for your users rather than explaining it half a dozen times a week. You might want to re-categorise things that were categorised at collection time. Or you might want to get down and dirty with every individual users’ event stream.

There’s a few ways to get to event-level data in Google Analytics. One is to pay Google a chunk of money and upgrade to GA360. It’s a good product, and if you want some of the other features it could be worth it but it’s a lot of money to pay for just getting access to your own raw data.

Other mechanisms extract data from the APIs, but they’re kinda hacky and won’t scale up to high traffic volumes as you’ll start hitting API limits.

The final approach is to duplicate all the data being sent into Google Analytics into something else. Older versions of the Google Analytics Urchin script used to explicitely support this, as Urchin had a self-hosted version available. That’s been thrown out, but it’s still possible to duplicate the payload and send it somewhere else.

Duplicate your Google Analytics pixels to Snowplow

Snowplow isn’t the first group to try this, but the important different with the Snowplow pipeline is you’re getting a robust, scalable, tested and very flexible pipeline. It also does a bunch of enrichment activities that make the data more useful, like User Agent classification, IP to geography lookups and setting a third-party pixel for cross domain use where that works.


Snowplow’s real-time pipeline collects data and makes it available in as little as a few seconds. Not aggregations, not isolated dimensions, and not fifteen minutes later. The whole lot, enriched and with everything you’ve sent available. For some use cases, this is absolutely essential. If you want to react to the user’s actions within the same session, you need real-time. Using the GA adapter, you can take your existing Snowplow pipeline and have an instant real-time upgrade without changing anything in your tagging.

Test out Snowplow with real data

Another reason this feature is interesting is for people who are keen to dip their toes in Snowplow but the prospect of defining and building all the events and data models from scratch is a bit daunting. If your GA implementation is pretty solid, you can be up-and-running really quickly and start exploring what it’s like working with event-level data.

Next you can spend some time upgrading specific events to a richer event model, building custom schemas and seeing where that leads. It’s a nice gentle introduction, and you should see some value out of it pretty much immediately.

GA360 features for cheap

The Google Analytics script doesn’t know whether you’ve paid for GA360 or not, so you can send in a full 200 custom dimensions but you’ll only see the first 20 in the UI. The Snowplow version of that data won’t have the same limitation. So you can send in 200 dimensions and get them back out in the event-level data.

Of course, Snowplow has a much richer event model than the flat 200 dimensions. You can model things like arrays of objects in Snowplow, which ranged from clunky to impossible in GA.

There’s no sampling in Snowplow data. You have all the raw events at your fingertips, so performance is dictated by the size of your data, the complexity of your query and the amount of hardware you throw at the problem.

Redshift, Postgres, Hive, Snowflake databases

Snowplow allows you to output into more databases: Redshift, Postgres, Hive and Snowflake Computing are all options for your data. BigQuery is awesome, but if you’re already using one of these databases it might be a better fit.


Sometimes you just want a copy of everything. Y’know, because backup. This is a quick and easy way to get that.

[Read More]

Presentation: Real World Big Data Architectures

Monday 4 December, 2017. | By: Simon Rumble

A few weeks back Nicholas Tan gave a presentation at the Future of Financial Services conference about architectural designs in the real world to get value from data. Nick most recently was responsible for News Corp’s large-scale Snowplow Analytics rollout and has just started at Macquarie Group. Check out his presentation.

Nicholas Tan

Nick has some great things to say about Snowplow, its cost model and the use cases for it in financial services. Of course we love that he references the Digital Analytics Hierarchy of Needs.

Snowplow cost model

[Read More]

Snowplow R95 with ZSTD support released

Tuesday 14 November, 2017. | By: Simon Rumble

Back in July, we did a bunch of work to quantify the benefits of ZSTD in the Redshift database, resulting in this blog post from Mike. The results were a clear, and massive with at least 50% reductions in storage use, improvement in nearly all use cases. We started migrating our customers to using ZSTD wherever possible so they could benefit from this huge improvement.

These changes have now been incorporated into the latest upstream Snowplow open source release. Migration for existing users is a little tricky, requiring a deep copy. New users will benefit from these latest changes straight away.

Get in touch if you have any questions.

[Read More]

Slides from Measurecamp Sydney

Monday 6 November, 2017. | By: Simon Rumble

Snowflake Analytics sponsored the always awesome Measurecamp Sydney unconference last weekend. As usual it was an incredibly high-value event with really great sessions and even better informal chats between the sessions. Such a great event and we’re looking forward to Melbourne early next year.

We gave a few talks there on various topics of interest so here’s the slides.

Display advertising: How does it work?

I’ve given this talk at least 30 times now and have been updating it for a very long time as the ecosystem evolves. Always really interesting questions about how the display ecosystem works.

Microcontrollers: Collect data from your environment

My latest obsession is these tiny, cheap little devices that allow you to collect data from and interact with the world around you. I’m throwing a few around my house to collect temperature data and send it into Snowplow for later analysis.

Enabling data-driven Innovation: Defining and developing analytics maturity

Narbeh’s talk about the philosophy behind Snowplow Analytics and how solid data collection enables and drives innovation.

[Read More]

The Digital Analytics Hierarchy of Needs

Thursday 21 September, 2017. | By: Simon Rumble

A few weeks ago I discoverd Monica Rugati’s fantastic Data Science Hierarchy of Needs. It’s a data science-centric riff on Maslow’s Hierarchy of Needs, a classic concept in pyschology. I’ve found myself using Rugati’s diagram and the concept in conversations with colleagues, partners, customers and friends ever since, as a way to explain the challenges we face in this Digital Analytics space.

Business people get excited about the latest buzzwords: Big Data, Artificial Intelligence, Deep Learning etc. Before you can break out TensorFlow and start doing bleeding-edge data science, you need to ensure you’ll be working on data that reflects reality.

In practice most new clients we audit have substantial problems with behavioural event data tracking and require at least some work to get to the minimum baseline of accurately tracking everything a user does on their web site and apps.

And so, let’s have a look at the Digital Analytics Hierarchy of Needs:

Digital Analytics Hierarchy of Needs

Where am I?

We see a full spectrum between near perfect and absolute disaster zones but the advice is generally much the same: you need to get your data collection in order if you want to do the cool stuff. Along the way you’ll unlock new insights and new capabilities. Done right, you’ll also give yourself awesome flexibility for future improvements and modifications to how you do Digital Analytics.

What does “done right” mean? That usually starts with tag management and a data layer. That’s a topic for another blog post, coming soon.

How long is this going to take?

To get to the minimum baseline, you don’t have to down tools and spend a year reworking. There’s always interim steps that will bring useful results and help you get insights. It all starts with getting a good idea of the current state and building a plan to get you to the top of the Digital Analytics nerdvana pyramid.

So how do I get to the top?

We tend to start our customer engagements with a data audit to find out where things stand and what work needs to be done to unlock the cool stuff. It’s a pretty quick and painless process, and there’s invariably a list of quick wins that bring immediate value, along with a roadmap of bigger pieces to work on.

Get in touch with us if you’d like to kick off the discussion.

[Read More]

Snowflake Analytics welcomes Conrad Yiu as an advisory board member

Wednesday 30 August, 2017. | By: Narbeh Yousefian

We are delighted to announce Conrad Yiu has joined Snowflake Analytics as an Advisory Board member. Conrad brings over 20+ years of experience as an entrepreneur, venture investor and business builder.

After life in professional services in Sydney and London, his first start-up was a mobile technology company in Cambridge in the late 90’s which is where he also met & worked with Alex Dean, one of the co-founders of Snowplow Insights.

Conrad returned to Australia in 2005, where he was Director of Corporate Development, leading investments and M&A for Newscorp’s digital division. He left in 2011 to start ArdenPoint, an investment firm that co-founded Temple & Webster (ASX:TPW), Fluent Retail and ParcelPoint. In addition to his current portfolio he is a venture partner at Tempus Partners and started a new watch company, Mountain Active Co.

[Read More]

A quick script to ZSTD all your shredded tables

Friday 25 August, 2017. | By: Simon Rumble

Mike’s recent post about compressing Snowplow tables works great for, with clients seeing compression down to 30% of the original size or so. But what about all your shredded tables?

For now you have to manually convert the output from igluctl while we wait for our pull request to make it into a release, from then on this will be automatic. There’s also a pull request in to change the default compression for too.

Run the following code from the command line from the root of your schema definitions and it’ll automatically convert everything relevant to ZSTD. This was written by one of our staff.


[Read More]

How does Snowplow Analytics compare to other vendors?

Wednesday 9 August, 2017. | By: Simon Rumble

Tonight Snowflake Analytics team members Mike and Narbeh are debating the merits of Snowplow Analytics with representatives of Google Analytics and Adobe Analytics at Web Analytics Wednesday. The head-to-head aspect of it is meant to be lighthearted, but it’s forced us to think about some of the ways Snowplow Analytics is a better match for many types of digital analytics problems.

So without further ado, here’s a rundown of features where Snowplow has unique qualities over Google and Adobe Analytics.

Feature Snowplow Insights Google Analytics Adobe Analytics
Real-time Some Some
Unsampled data and performance
Custom structured data models
Personally Identifiable Information
SQL interface
Cross-domain tracking and user identification
Data sovereignty
Change goals with historical data
Accurate time spent metric
Custom enrichments
Incoming Webhooks
High-cardinality (none) (Low-Traffic)
IP addresses and user-agent strings


Snowplow Analytics can operate in real-time mode with latency down to about 5 seconds from collection to availability. Google and Adobe Analytics offer this with a subset of dimensions and events, but Snowplow gives you access to the entire enriched event stream, including all the richness of custom structured data models. See also: Custom structured data models.

Unsampled data and performance

Because Snowplow runs in your own cloud infrastructure, you have control over the inevitable price-performance tradeoffs involved in collecting, processing and analysing large event volumes. You can decide to have full, unsampled data available speedily by throwing more infrastructure at the problem, or have it available less quickly to reduce cost. See also: High cardinality.

Custom structured data models

Snowplow’s event model enables customers to use their existing data models and provide it as context to the events without transforming them. Other tools require you to shoehorn them into their flat dimension/metric model.

This is perhaps best illustrated with an example. Let’s say you want to record an action, say a page print, for a recipe on your content site.

You could record this as an event in Google Analytics, but the only context you can provide comes from the URL of the page itself, and potentially some custom dimensions. Custom dimensions can’t hold multiple values, so you can only model something like an ingredients list as a string, which is really ugly once you get to reporting it. And truncates any string over 150 bytes. Nice!

Adobe Analytics has the concept of List Props and List Vars which would enable this, but you have to shoehorn the result into Adobe’s model. That means no nesting of things, like a list containing other items like recommended recipes.

By contrast, here’s how you would attach a self-describing context for recipes onto a Structured Event in a Snowplow implementation. Note how the model is readable and looks just like how you might store it in your CMS. These contexts can be attached to any kind of event you like, including things like pageviews and social shares.

window.snowplow('trackStructEvent', 'article','print','Chicken Chow Mein recipe', null, null, 
          schema: 'iglu:com.snowflake-analytics/recipe/jsonschema/1-0-1',
          data: {
            "title": "Chicken Chow Mein recipe",
            "author": ["John Smith", "Mary Jones"],
            "category": "recipes",
            "cuisine": "Chinese",
            "course": "main",
            "ingredients": [ "chicken", "noodles", "vegetables" ],
            "content_tags": [ "chinese recipes", "chicken", "quick meals"]

Personally Identifiable Information

Google take a particularly hard line on Personally Identifiable Information. You can’t send it, and if they find it they’ll delete all your data. Ouch.

Adobe are a bit more lenient but still cautious.

With Snowplow, the data is collected and stored in your own cloud infrastructure. It’s still best practice to be quite careful with PII, and particularly with sensitive information, but that boundary line is up to the customer. That means if PII ends up inside your Snowplow data, it can be controlled under your own policies, not those of a third-party organisation. See also: Data sovereignty.

SQL interface

Snowplow’s data shows up in a database with an SQL interface, running in your own cloud infrastructure. You can connect it with your own systems, create joins with other tables, point your existing BI teams at it and generally use it like it’s your own data, because it is.

You also get to keep the raw event data, unprocessed and ready to re-process or re-examine with tools like AWS Athena or Presto.

Adobe Analytics will provide the raw data for you to ingest into a database, and these days they even give you the header row for that table, but it’s up to you to make sense of the data, stitch together visits and the like.

Google Analytics data, with the premium product, can be pushed automatically to BigQuery, which is great. BigQuery is a very capable SQL data analysis tool, and support for it is becoming much more common than in the past, so Google probably gets a pass on this, althrough currently the latency for data to appear is 4 hours while Snowplow can deliver hourly or possibly faster. See also: Change goals with historical data.

Cross-domain tracking and user identification

Snowplow gives you access to all the identifiers: first-party domain cookie, third-party domain cookie, all the ingredients for fingerprinting, IDFA, IDFV, AdId and as many versions of login information as you have. You can then make up your own mind about how you stitch together user identities and sessions.

Adobe Analytics does third-party domain cookies with fallback to first-party domain cookies. Google Analytics has a feature with extremely narrow utility (basically, if you have a hosted checkout, it might be useful). See also IP addresses and user-agent strings

Data sovereignty

This is simple: you get to decide where your data lives, and the security and access policies. Adobe will charge you extra to store your data in some specific jurisdictions, but you have to trust them on the security and access policies. Google, well it’s in “the cloud”. See also: Personally Identifiable Information.

Change goals with historical data

You’ve got full control of your data, so you can recrunch the numbers any time. It might be a big, slow query, but you can do it. That means if you decide to include an or statement in your goal and funnel after you’ve created it, it’s totally doable.

The Google Analytics premium offering has a beta dynamic funnels report, but those don’t show up as conversions.

Accurate time spent metric

The way Google Analytics and Adobe Analytics (and Nielsen and lots of others) calculates the time spent metric is badly flawed, and have got even worse with modern traffic patterns. Snowplow is better. I’ve dealt with this in detail here.

Custom enrichments

Snowplow allows you to query external databases to automatically enrich your data as it passes through the pipeline. For example, you could look up the current exchange rate from your actual bank to record alongside a transactions’ local value.

Incoming Webhooks

Real-time integrations with external systems can be done through Webhooks, meaning your Snowplow pipeline can have immediate access to things like email opens, logistics state changes, incoming phone calls and the like.


No (other) and (Low-Traffic) in Snowplow reports. See also: Unsampled data and performance.

IP addresses and user-agent strings

All the raw materials to be able to track users, do custom fingerprinting and analyse platforms are available. Other tools throw away these raw materials. We have customers looking at fraud patterns based on IP addresses. In the past we’ve looked for dark social traffic by finding the Facebook in-app browsers that aren’t sending through accurate Referrer headers. (Hint: look for “FBAN” in the user-agent). See also: Cross-domain tracking and user identification.

[Read More]

Make big data small again with Redshift ZSTD compression

Wednesday 12 July, 2017. | By: Mike Robins

A new compression option in Redshift allows you to make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to get it happening.

In late 2016 Facebook open sourced a compression algorithm known as Zstandard that combines Lempel Ziv and tANS to achieve a compression ratio better than many algorithms with a slight tradeoff in speed.

In January 2017 AWS added ZSTD support to Redshift, so you can now use this powerful compression algorithm to reduce the storage size of your data. One immense perk of ZSTD is that it can be applied across all supported data types. It gives exceptional performance on long varchars: perfectly suited for large JSON strings that regularly appear in shredded tables.

We’ve had great success both experimentally and practically by applying ZSTD compression to multiple Snowplow tables. Experimentally across datasets ranging between 10 million and 5 billion rows we’ve achieved a mean compression ratio of ~3 meaning that the newly compressed table takes up approximately a third of the original table on disk when compared to the compression defaults in 0.8.0.

In particular we’ve found:

  • Negligible impact on speed of queries for (note this will be highly workload dependent so benchmarks here are less useful)
  • So far we haven’t experimented with the performance/space savings of compressing SORTKEYs. AWS recommend against compressing SORTKEYs (so use ENCODE RAW for collector_tstamp and root_tstamp).
  • ZSTD in almost all instances replaces LZO as the default compression method suggested by ANALYZE COMPRESSION.

Things to take into account

  • Depending on your workload you may want to run ANALYZE COMPRESSION on your table which will provide some recommendations by Redshift as to what the suggested column encodings are. One caveat of this approach is that you are limited to sampling 1 billion rows so if possible choose a sample data set that contains representative variability within columns.
  • If you have the opportunity we also recommend benchmarking common queries/jobs on an identical sample of data for a) the 0.8.0 compression defaults and b) the newly compressed tables.
  • You cannot modify compression on existing columns in a table so consider deep copying the data particularly if a large region of your table is unsorted1 as this will outperform a VACUUM. If performing a deep copy ensure you have sufficient disk space to complete the action: it’s difficult to know how much space is required but we opt for at least 50% of the cluster storage remaining. You may need to consider resizing the cluster and/or temporarily pausing your pipeline to complete this action.
  • If you decide to drop the original table ensure that you either DROP CASCADE or individually drop any dependencies that may rely on this table such as views or foreign key constraints from shredded or derived tables.
  • If you’ve dropped any foreign keys ensure that you recreate them and they reference the new object. Although Redshift does not enforce these constraints they are used by planner to generate optimised query plans.
  • As this is a new object you may also need to regrant permissions to users/groups as well as ensure the table owner is identical to the original table(s) so that Snowplow can continue to load data.

Getting started

  1. Analyze compression on your existing table by using ANALYZE COMPRESSION.
  2. If Redshift recommends using ZSTD for columns consider benchmarking a sample of data (e.g., 3 months) with the original column compression and ZSTD column compression
  3. If performance is better or equivalent deep copy data from the original table to a new atomic.events_new table. You may need to resize Redshift or free up additional disk space before performing this action.
  4. Verify that the two tables contain identical data by comparing a total row count as well as a row count per day.
  5. Drop the original table as well as any references2. ALTER the new table to atomic events3. Ensure that this new table has an identical owner.
  6. Sit back and enjoy that free disk space.

Need help?

There’s a few gotchas and edge cases involved with doing this. If you’d like help or advice on how to do it feel free to post in the comments below or get in touch.



 schema = 'atomic'
 size DESC


SELECT DISTINCT c_p.oid AS tbloid
,n_p.nspname AS schemaname
,c_p.relname AS NAME
,n_c.nspname AS refbyschemaname
,c_c.relname AS refbyname
,c_c.oid AS viewoid
FROM pg_class c_p
JOIN pg_depend d_p ON c_p.relfilenode = d_p.refobjid
JOIN pg_depend d_c ON d_p.objid = d_c.objid
JOIN pg_class c_c ON d_c.refobjid = c_c.relfilenode
LEFT JOIN pg_namespace n_p ON c_p.relnamespace = n_p.oid
LEFT JOIN pg_namespace n_c ON c_c.relnamespace = n_c.oid
 schemaname = 'atomic'


ALTER TABLE atomic.events_new RENAME TO;

[Read More]

Snowplow Inspector: a debug tool for Snowplow beacons

Tuesday 20 June, 2017. | By: Simon Rumble

Snowplow Insights is an amazingly flexible way to collect data, but with great flexibility comes some complexity. If you work on Snowplow implementations a lot, you’re likely familiar with Base 64 Decode and JSON Formatter when you’re digging into custom contexts and testing your implementation.

Snowplow Inspector

We wanted a tool to make debugging easier so we created Snowplow Inspector for the Chrome browser. It lives in the Developer Tools (☰ > More Tools > Developer Tools or Ctrl-Shift-I) under its own Snowplow tab. Here’s how it looks on the Snowplow site:

Snowplow Inspector screenshot

On the left-hand side of the panel you see the detected Snowplow pixels for each site you’re loading. Clicking on a pixel you’ll see the details of the tracking pixel, including the custom contexts and unstructured events decoded and formatted to be readable.

Install the extension

To install the extension, go to the Chrome store and click Add to Chrome.

Feedback and changes

The extension is released under the Gnu Public License, so you’re free to fork it and make changes. Check out the Github repo. If you find any bugs or have any feature requests, create a Github issues.

[Read More]

Accurate time spent: A killer feature of Snowplow Analytics

Tuesday 7 March, 2017. | By: Simon Rumble

Web analytics tools commonly have a Time Spent metric. Understanding how long people have spent reading a page is a really valuable thing for some businesses. For publishers, the quality of engagement with content is vital, given they’re effectively selling the attention of their readers.

What many people don’t realise is that the way this metric is calculated is critically flawed on the most popular web analytics tools, Adobe Analytics and Google Analytics. These tools calculate Time Spent by counting the time between pageview events.

How is Time Spent calculated?

Snowplow Analytics time spent metric

In this example we see a user who:

  • Lands on a page and reads it for 30 seconds
  • Moves to another page and reads it for 45 seconds
  • Moves to another page and reads it for 90 seconds
  • Leaves the site

So the actual Time Spent is 2 minutes 45 seconds. However, Google Analytics and Adobe Analytics will allocate zero seconds to the final pageview, because these tools only work from the difference in timestamps between pageview events and there is no pageview event following the final page. So Google Analytics and Adobe Analytics will record 1 minute 15 seconds for this session.

The pathological example of this problem is for a single page session:

General Analytics time spent metric done poorly

In this example, the user enters the site, views the content for 30 seconds and then leaves. Traditional web analytics tools will record zero seconds against this session as there is only the single pageview event.

Many publishers now receive a huge amount of traffic in the form of single-page visits, primarily coming from aggregators and social networks. This means despite the fact your content may be receiving significant attention, your analytics will be showing very low Time Spent and a high bounce rate.

How Snowplow does this

How Snowplow Analytics tracks this

Snowplow’s JavaScript tracker contains an option called Activity Tracking which enables a Page Ping that will be sent every n seconds. In this diagram I’ve used 5 seconds as the ping interval starting 5 seconds after the page loads:


So once the page loads, a ping is sent every five seconds recording that the page is still open. That gives an accuracy of at least five seconds to calculating the actual time a user spent.

Other tools

For tools that use the traditional mechanism of measuring Time Spent, there are some workarounds to get better numbers. Though none are ideal. The biggest problem is that there is no reliable mechanism to ensure a pixel is sent out when a user leaves your site.


Chartbeat uses a similar approach for data collection, although they attempt to measure actual engagement by monitoring user activity in the window as well. It would be interesting to apply this approach to the Snowplow page ping. Shouldn’t be too hard to update the tracker to support this.


Riveted plugin, as described here, tracks actual user engagements Simo Ahava, always worth following, has a few different approaches to the problem


  • Adobe’s calculation method has changed a couple of times, so be sure you understand that.
  • Some have attempted to trap clicks on exit links and the unload event in the browser and then send a custom link to give the tool another event to extend the session for the last pageview, but these methods aren’t reliable across browsers and platforms.
  • I have a conceptual approach of how to do this very accurately in Adobe Analytics without costing enormous amounts. Drop me a note if you’d like to know more.

How to analyse?

Page pings create a large number of event rows inside your Snowplow database, consistent with the way Snowplow does things. That opens up a bunch of different ways for you to calculate the Time Spent metric. In the next blog, Mike will go through different approaches for modelling the page ping data to analyse Time Spent.

Until then you can check out Snowplow’s Measuring content page performance article.

[Read More]

Decoding Snowplow real-time bad rows (Thrift)

Wednesday 14 December, 2016. | By: Mike Robins

In this tutorial we’ll look at decoding the bad rows data that comes out of Snowplow real time. In the real time pipeline bad rows that are inserted into Elasticsearch (and S3) are stored as base64’d binary serialized Thrift records. We’ll walk step by step the instructions in Python as to how to first decode, and then deserialize these records.

You can find the code for this tutorial without the accompanying text on Github.

Before we start you’ll need 3 things.

  1. A copy of Python 3
  2. The thriftpy library which can be installed using pip install thriftpy
  3. A local copy of the Snowplow Analytics collector payload thrift file

If you’re running real-time you can either take a sample payload from your Elasticsearch index (the line property) from a record in bad rows - otherwise we’ll use a sample event I’ve generated below.

import base64
import thriftpy
from thriftpy.protocol import TCyBinaryProtocolFactory
from thriftpy.utils import deserialize, serialize

First up we’ll import the libraries we need. base64 is part of the standard library and the thriftpy imports are from the thriftpy library which has been installed above. Let’s define our example payload below.


Now let’s use the b64decode method to decode the payload.

decoded_payload = base64.b64decode(sample_payload)

b"\x0b\x01@\x00\x00\x00\x02/i\x0b\x00d\x00\x00\x00\t127.0.0.1\n\x00\xc8\x00\x00\x01X\xdb\x9e7h\x0bzi\x00\x00\x00Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0\x0b\x01J\x00\x00\x01hstm=1481153329000&e=pv&\x0b\x01,\x00\x00\x00mMozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x0b\x01\x90\x00\x00\x00!\x0b\x00\xd2\x00\x00\x00\x05UTF-8\x0b\x016\x00\x00\x00\x1f\x0b\x01\x9a\x00\x00\x00$4d10d3d0-c2b4-473e-a483-812799e824d1\x0b\x00\xdc\x00\x00\x00\x11ssc-0.7.0-kinesis\x0f\x01^\x0b\x00\x00\x00\x0b\x00\x00\x00'Host:\x00\x00\x00\x1dAccept: image/webp, */*;q=0.8\x00\x00\x00$Accept-Encoding: gzip, deflate, sdch\x00\x00\x007Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4\x00\x00\x00\x14Cookie: sp=abcd-1234\x00\x00\x00'Referer:\x00\x00\x00zUser-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x00\x00\x00\x1aX-Forwarded-For:\x00\x00\x00\x15X-Forwarded-Port: 443\x00\x00\x00\x18X-Forwarded-Proto: https\x00\x00\x00\x16Connection: keep-alive\x00"

You’ll see that some fields are recognisable here. We can make out a query string and some headers but there’s a lot of data in between that makes this a bit difficult to read and parse. This is our binary serialized Thrift record that we’ll deserialize next.

collector = thriftpy.load("collector-payload.thrift")
collector_payload = collector.CollectorPayload()
raw_payload = deserialize(collector_payload, decoded_payload, TCyBinaryProtocolFactory())

In the cell above we’re doing quite a bit. First we’re loading the Snowplow provided .thrift file which acts as a specification for both serializing and deserializing the data. Next we’re initialising a structure based on the CollectorPayload struct and finally we are deserializing our base64 decoded object providing the class, base64 decoded payload and protocol factory to deserialize the object.


CollectorPayload(querystring='stm=1481153329000&e=pv&', collector='ssc-0.7.0-kinesis', ipAddress='', timestamp=1481153329000, path='/i', contentType=None, refererUri='', schema='iglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0', encoding='UTF-8', hostname='', userAgent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', body=None, headers=['Host:', 'Accept: image/webp, */*;q=0.8', 'Accept-Encoding: gzip, deflate, sdch', 'Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4', 'Cookie: sp=abcd-1234', 'Referer:', 'User-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'X-Forwarded-For:', 'X-Forwarded-Port: 443', 'X-Forwarded-Proto: https', 'Connection: keep-alive'], networkUserId='4d10d3d0-c2b4-473e-a483-812799e824d1')

Excellent! That’s looking a lot more readable already. If you haven’t already noticed, the properties of our CollectorPayload class match the fields defined in our collector-payload.thrift record. Let’s access some of them directly.

querystring = raw_payload.querystring
network_userid = raw_payload.networkUserId


We can now access these properties directly but at this stage we’ve just got the raw querystring - you’ll recognise this as the network request that is sent to the collector when using the Snowplow Javascript library (tna=js-3.6.0). Let’s split this out so we can look at the individual key value pairs that are in the payload.

from urllib import parse
params = parse.parse_qs(querystring)

{'aid': ['snowflake'],
 'cd': ['32'],
 'cookie': ['1'],
 'cs': ['UTF-8'],
 'ds': ['200x26295'],
 'dtm': ['1479927857201'],
 'duid': ['4d10d3d0-c2b4-473e-a483-812799e824d1'],
 'e': ['pv'],
 'eid': ['c259c25e-f94b-4d2c-a11c-0d278c2e6d1a'],
 'fp': ['129112332'],
 'lang': ['en-AU'],
 'p': ['web'],
 'res': ['360x640'],
 'sid': ['7fb97fC6-6e0f-4022-adbd-017431520ddb'],
 'stm': ['1481153329000'],
 'tna': ['js-3.6.0'],
 'tv': ['js-2.6.0'],
 'tz': ['Australia/Sydney'],
 'url': [''],
 'vid': ['4'],
 'vp': ['0x-56']}

If you’re familiar with Python you’ll note that our parse_qs function returns a dictionary containing values in the form of a list. We can access this reasonably easy by - in this case let’s extract out the app_id for the event.

app_id = params.get('aid')[0]


In the above sample we’re using the built in get method (on a dictionary) to access the aid property rather than app_id you can find a list of the mappings from the URL parameter to the friendly name here. The [0] above access the first element in that list for the aid key. In our example each of these lists only contains one value so we don’t need to worry about addressing additional values.

That concludes this tutorial. Please post below if you’ve got queries or parts of the post require additional clarification - this is a quite involved process.

In our next tutorial we’ll cover how we can leverage this simple process of decoding a single bad row - to multiple bad rows from a run and looping this data back into Redshift so we can query it!

[Read More]

Monitoring Snowplow bad rows using Lambda and Cloudwatch

Wednesday 19 October, 2016. | By: Mike Robins

In this tutorial we’ll use Amazon Lambda and Amazon Cloudwatch to set up monitoring for the number of bad rows that are inserted into Elasticsearch over a period of time. This allows us to set an alert for the threshold of bad rows, and generates an email or notification when this threshold has been exceeded. Snowplow users on the realtime pipeline will find this most useful, however users running loads in batch can also adapt this monitoring.

We’ll use the following services throughout this tutorial so check first that they are available in your AWS region of choice:

  • Cloudwatch rules
  • Lambda

Below is an overview of what we’ll be setting up.

A Cloudwatch rule will trigger a Lambda function every 5 minutes. This Lambda function will query our Elasticsearch cluster, and return the number of rows in the bad index for the last 5 minutes. Finally, we’ll send this result to our own custom metric to allow for monitoring and alerting using Cloudwatch alarms.

Lambda execution role

First, let’s set up an execution role for our AWS Lambda function. This execution role will allow access to the appropriate resources, namely being able to submit a metric to Cloudwatch.

  1. Navigate to IAM in the AWS console.
  2. Select ‘Policy’ and ‘Create Policy’ in this interface.
  3. Select ‘Create Your Own Policy’
  4. Name the policy document and provide a description. Under the ‘Policy document’ we’ll use the following configuration.
    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "arn:aws:logs:*:*:*"
            "Effect": "Allow",
            "Action": [
            "Resource": "*"

Lambda bad rows AWS IAM Policy

This will allow the Lambda function to log to Cloudwatch for debugging as well as pushing data for the bad rows metric we will define shortly.

Next, we’ll create a role for the Lambda function to use, and attach this policy.

  1. In IAM select ‘Role’ and ‘Create New Role’.
  2. Under ‘AWS Service Roles’ select ‘AWS Lambda’
  3. Search and select the policy you created above, review the policy and hit ‘Create Role’.

Creating our Lambda function to retrieve the number of bad rows

  1. In the AWS Lambda dashboard select ‘Create a Lambda function`
  2. Select the blueprint as ‘Blank function’ as we’ll be writing our own code for this function.
  3. For the moment, leave triggers blank and hit ‘Next’.
  4. Name your function, e.g., snowplow_bad_rows_count
  5. In this example we will use the ‘Python 2.7’ runtime.
  6. For the function code copy and paste the function below.
import json
import urllib2
import boto3
# import required libraries

def lambda_handler(event, context):
    host = '' # your elasticsearch host
    port = 9200 # your elasticsearch port
    index_name = 'bad-alias' # your elasticsearch bad index

    url = 'http://{0}:{1}/{2}/_count'.format(host, port, index_name)
    data = '{"query": {"range": {"failure_tstamp": {"gt": "now-5m"}}}}' # query to filter data

    request = urllib2.urlopen(url, data) # connect to Elasticsearch
    result = json.loads( # read and decode JSON response
    bad_row_count = result.get('count') # select the count property
    request.close() # close the connection

    client = boto3.client('cloudwatch') # setup a cloudwatch client
    data = [
                'MetricName': 'bad_rows_count',
                'Value': bad_row_count,
                'Unit': 'Count'
    # prepare a payload to send with the name 'bad_rows_count'
    client.put_metric_data(Namespace='snowplow', MetricData=data) # send the data to Cloudwatch using the 'snowplow' namespace
    return bad_row_count # return the number of bad rows in the last 5 minutes

In the above snippet change the host variable to point to your Elasticsearch cluster. This is likely to be either a load balancer if you’re running Elasticsearch on EC2 instances, or the Elasticsearch endpoint if using AWS Elasticsearch Service.

You should also change port, to the appropriate port which is likely to be either 9200 (default) or 80 is using Elasticsearch Service or port-forwarding on a load balancer.

Finally, enter the name of the index. If you’re unsure of what this is by adding /_cat/indices to the end of your Elasticsearch URL which will list all indices in your Elasticsearch cluster.

  1. To modify the period in which Elasticsearch looks for bad events change the 'now-5m' string to the period required, e.g., for 1 hour use 'now-1h'.
  2. To modify the name of the metric, change the 'MetricName' in the data payload
  3. To modify the event namespace, change the Namespace in the put_metric_data call.
  4. The handler field can be left the same.
  5. Under ‘Existing role’ select the role we created in the previous steps.
  6. The memory and timeout of advanced settings can be left the same.
  7. Select an appropriate VPC. If running an Elasticsearch cluster within a VPC you may need to use this VPC, however if you’re cluster is accessible externally ‘No VPC’ may work.
  8. Review the function and select ‘Create function’.
  9. Let’s quickly test the function to ensure that it is able to connect to Elasticsearch successfully and produces the expected output.
  10. Select ‘Test’ in the interface, and ‘Hello World’ as the sample event template. The data we use as input does not matter as our event only produces output.
  11. Select ‘Save and test’
  12. Your function should successfully execute and return the number of bad rows in the Elasticsearch index selected. You can verify this using the Kibana interface if required.
  13. If the function times out, double check the address and settings for your Elasticsearch cluster, the function may not be able to connect.

Create our Cloudwatch trigger

  1. In the Cloudwatch interface, select ‘Rules’ under ‘Events’ from the left-hand pane.
  2. Select ‘Create Rule’
  3. For the event source select ‘Schedule’, and for the moment we’ll leave it as the default of fixed rate of every 5 minutes. This sets the option of when, and how often our Lambda function will run and our metric will update. More information about the cron syntax AWS uses can be found here.
  4. Under ‘Targets’, select ‘Lambda function’ and the name of the Lambda function you created above.
  5. Select ‘Configure details’
  6. Add a meaningful name and description for this rule and leave the enabled checkbox selected. Select ‘Create Rule’.

Check our metric is working in Cloudwatch

  1. In the Cloudwatch interface scroll down in the left-hand panel to the bottom and under ‘Custom Metrics…’ select ‘snowplow’, your metric name should appear here and by clicking the checkbox you should see a point of data every 5 minutes. If you’ve only just set this up you may only see one data point.

To create an alarm

  1. Select the metric and select the ‘Create Alarm’ button in the right hand panel under ‘Actions’.
  2. Set a meaningful name and description for the alarm, and a threshold to trigger at based upon the number of bad rows you expect to have in a 5 minute period. Anything above this threshold will trigger the alarm.
  3. Under Actions, you can select a notification list (if you have a SNS topic set up) or alternately if you select ‘New list’ you can enter a comma delimited list of email addresses.
  4. This will notify users when the alarm is triggered based on the criteria you have set.


Running this monitoring will cost approximately $0.69 a month depending on your usage, frequency and availability zone. For many users some of this cost may be included in the AWS free tier.

  • $0.50 per custom metric
  • $0.10 per alarm
  • ~$0.09 for Cloudwatch PutMetricData (9000 requests/month, assuming rule triggers every 5 minutes)

If you have any questions or comments feel free to leave them in the comment section below, or in the Snowplow discourse forum.

[Read More]


We exist to make organisations better understand their businesses by enabling all decision makers in a company to work with the same version of the truth.

Social Links