Why would you push Google Analytics data into Snowplow?

Wednesday 14 February, 2018 | By: Simon Rumble

In late January the Snowplow team released R99 of Snowplow which includes a really interesting feature: with a small JavaScript change, you can mirror all the data being sent into Google Analytics into a Snowplow collector.

It’s a pretty awesome feature and got a lot of people talking. GTM legend Simo Ahava got really excited about it and wrote a couple of posts. Cogs started turning in the minds of a bunch of other people in the analytics community too, seeing where this might be useful.

If you need a recap of how Snowplow differs from GA and Adobe Analytics, check out “How does Snowplow Analytics compare to other vendors?”

Raw data

Pretty much everyone who has done digital analytics has come up with a use case where they wish they could get to the raw, underlying events. It might be because a definition has changed and you want to apply the new one retrospectively, giving your users a continuous timeline rather than explaining the break half a dozen times a week. You might want to re-categorise things that were categorised at collection time. Or you might want to get down and dirty with each individual user’s event stream.

There are a few ways to get to event-level data in Google Analytics. One is to pay Google a chunk of money and upgrade to GA360. It’s a good product, and if you want some of the other features it could be worth it, but it’s a lot of money to pay just to get access to your own raw data.

Other mechanisms extract data from the APIs, but they’re kinda hacky and won’t scale up to high traffic volumes as you’ll start hitting API limits.

The final approach is to duplicate all the data being sent into Google Analytics and send the copy somewhere else. Older versions of the Google Analytics Urchin script used to explicitly support this, as Urchin had a self-hosted version available. That’s been thrown out, but it’s still possible to duplicate the payload and send it to another endpoint.
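To make the duplication idea concrete, here is a minimal sketch of one way to do it by hand with analytics.js, by wrapping the tracker’s `sendHitTask`. It is not the Snowplow plugin’s actual code. The `mirrorHits` helper and the `sendCopy` callback are names made up for this illustration, and the collector URL in the usage comment is a placeholder.

```javascript
// Wrap an analytics.js tracker's sendHitTask so every hit is also handed
// to `sendCopy`. Both `mirrorHits` and `sendCopy` are hypothetical names
// used for illustration only.
function mirrorHits(tracker, sendCopy) {
  const originalSendHitTask = tracker.get('sendHitTask');

  tracker.set('sendHitTask', function (model) {
    // Send the hit to Google Analytics exactly as before.
    originalSendHitTask(model);

    // Forward the same URL-encoded payload to the second destination.
    // A failure here must never break the normal GA tracking.
    try {
      sendCopy(model.get('hitPayload'));
    } catch (e) {
      // Ignore errors from the secondary endpoint.
    }
  });
}

// Browser usage (assumption: the URL is a placeholder for your own endpoint):
//
//   ga('create', 'UA-XXXXX-Y', 'auto');
//   ga(function (tracker) {
//     mirrorHits(tracker, function (payload) {
//       navigator.sendBeacon('https://collector.example.com/ga-mirror', payload);
//     });
//   });
//   ga('send', 'pageview');
```

The original task is called first so that Google Analytics collection carries on unchanged even if the second endpoint is slow or down.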

Duplicate your Google Analytics pixels to Snowplow

Snowplow isn’t the first group to try this, but the important difference with the Snowplow pipeline is that you’re getting a robust, scalable, tested and very flexible pipeline. It also does a bunch of enrichment activities that make the data more useful, like User Agent classification, IP to geography lookups and setting a third-party cookie for cross-domain use where that works.

Real-time

Snowplow’s real-time pipeline collects data and makes it available in as little as a few seconds. Not aggregations, not isolated dimensions, and not fifteen minutes later. The whole lot, enriched and with everything you’ve sent available. For some use cases, this is absolutely essential. If you want to react to the user’s actions within the same session, you need real-time. Using the GA adapter, you can take your existing Snowplow pipeline and have an instant real-time upgrade without changing anything in your tagging.

Test out Snowplow with real data

Another reason this feature is interesting is for people who are keen to dip their toes in Snowplow but find the prospect of defining and building all the events and data models from scratch a bit daunting. If your GA implementation is pretty solid, you can be up and running really quickly and start exploring what it’s like working with event-level data.

Next you can spend some time upgrading specific events to a richer event model, building custom schemas and seeing where that leads. It’s a nice gentle introduction, and you should see some value out of it pretty much immediately.

GA360 features for cheap

The Google Analytics script doesn’t know whether you’ve paid for GA360 or not, so you can send in a full 200 custom dimensions but you’ll only see the first 20 in the UI. The Snowplow version of that data won’t have the same limitation. So you can send in 200 dimensions and get them back out in the event-level data.
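As a rough illustration of why the limit disappears: custom dimensions travel in the hit payload as `cd1`, `cd2` and so on, so anything that receives the raw payload can read all of them. The sketch below assumes you have the URL-encoded payload as a string. `extractCustomDimensions` is a hypothetical helper, not part of Snowplow or Google Analytics.

```javascript
// Collect every custom dimension (cd1, cd2, ...) from a URL-encoded
// Measurement Protocol payload. Hypothetical helper for illustration.
function extractCustomDimensions(payload) {
  const dimensions = {};
  for (const [key, value] of new URLSearchParams(payload)) {
    // Require at least one digit: a bare "cd" is the screen name parameter,
    // not a custom dimension.
    const match = /^cd(\d+)$/.exec(key);
    if (match) {
      dimensions[Number(match[1])] = value;
    }
  }
  return dimensions;
}

// Example: indexes above 20 come through just like the first 20.
const dims = extractCustomDimensions('v=1&t=pageview&cd1=member&cd150=variant%20B');
console.log(dims);
```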

Of course, Snowplow has a much richer event model than the flat 200 dimensions. You can model things like arrays of objects in Snowplow, which range from clunky to impossible in GA.
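Here is a small sketch of what “arrays of objects” looks like in practice. Snowplow events carry self-describing JSON, which is a schema reference plus the data itself. The schema URI, field names and the `selfDescribing` and `cartTotal` helpers below are all invented for this example.

```javascript
// Wrap data in a self-describing JSON envelope: a schema URI plus the data.
function selfDescribing(schema, data) {
  return { schema: schema, data: data };
}

// Hypothetical schema and fields, for illustration only.
const cartEvent = selfDescribing('iglu:com.example/cart_update/jsonschema/1-0-0', {
  cartId: 'c-123',
  items: [
    { sku: 'A1', quantity: 2, unitPrice: 9.5 },
    { sku: 'B7', quantity: 1, unitPrice: 20 }
  ]
});

// With the nested structure intact, per-item logic stays simple.
function cartTotal(event) {
  return event.data.items.reduce(function (sum, item) {
    return sum + item.quantity * item.unitPrice;
  }, 0);
}

console.log(cartTotal(cartEvent));
```

Squeezing the same cart into flat custom dimensions would mean either one dimension per item field or a delimited string you have to parse later.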

There’s no sampling in Snowplow data. You have all the raw events at your fingertips, so performance is dictated by the size of your data, the complexity of your query and the amount of hardware you throw at the problem.

Redshift, Postgres, Hive, Snowflake databases

Snowplow allows you to output into more databases: Redshift, Postgres, Hive and Snowflake Computing are all options for your data. BigQuery is awesome, but if you’re already using one of these databases it might be a better fit.

Backup

Sometimes you just want a copy of everything. Y’know, because backup. This is a quick and easy way to get that.
