MeasureCamp Sydney: Event roundup and slides

Wednesday 24 October, 2018. | By: Simon Rumble

The amazing analytics unconference MeasureCamp Sydney was held on Saturday 20 October and it was the best yet! We’ve been involved from the beginning as sponsors and organisers because we love the format and community, so it was great to be part of the third edition.

Highlight talks

What’s an unconference?

An Unconference is a little unusual. There’s no schedule organised in advance, no keynotes flown in, no vendor presentations. Instead the schedule is decided on the day, with people writing up topics they’re passionate about on the session board. And because it’s on a Saturday you know the people attending are all real keen beans!

Food was amazing

Sponsors cover the costs of the venue, food, drinks and things like the T-shirt handed out to everyone. We’ve proudly sponsored every MeasureCamp in three cities so far. We also have an open offer to sponsor any MeasureCamp in APAC, so get in touch if you’re planning one (come on Singapore!).

Digital session board

MeasureCamp session board

On the day everyone who wants to run a session puts cards up on the board. Cards get moved around as people resolve schedule clashes or the organisers move high-demand sessions into bigger spaces.

Mike again built a digital session board to mirror the physical board (he first built one for the June 2018 Auckland event), because there’s often limited space in front of the physical board and people struggle to see what’s on. We’ve also had requests from visually-impaired people to make the session board more accessible.

Manual data entry of the session board

This time around Mike and Sam worked on a clever combination of technologies to automatically convert the session card content into the digital version and track the cards as they get moved around the board. The concept was sound, but on the day they had to revert to manual data entry to make it work. Next time we’ll make sure there’s better lighting and that the camera isn’t blocked by people standing in front of it!

Real-time statistics from Snowplow

One component that did work well, as it had in Auckland, was the Snowplow instrumentation on the digital board. We had real-time data on the most liked sessions. We plan to improve on this for future MeasureCamps so organisers can see which sessions have high demand and might need a bigger room. The app probably needs to include notifications for users to give a reason to “like” a session. It’ll also be handy to notify people when sessions move around and what’s coming up.

Talks and conversations

Conversations are the lifeblood of MeasureCamp

Much of the value of MeasureCamp comes from meeting great people and the interesting discussions that happen. It’s such an incredible group of people: passionate and enthusiastic practitioners keen enough about analytics to give up their Saturday, and in some cases travel long distances. I had some excellent conversations about current analytics problems and enjoyed a bunch of great presentations.

Junta Sekimori demonstrates a novel use of Snowplow

One of the highlights for me was learning from Junta Sekimori about how broadcaster and publisher Seven West Media is using Snowplow. Applications send a superset of data in events to Snowplow, then Lambda functions translate and send them on to external systems: OzTam (Australia’s TV ratings system), Google Analytics and the DMP Lotame.

Adilson Mendonca dispenses SQL wisdom

Another great session was Sensei Adilson dispensing his SQL wisdom. His slides are up on Slideshare.

Slides from my talk

I did my usual talk on digital advertising, which I’ve been sharing for about six years. So far I’ve given it in three countries, and it doesn’t get old because there are still people who want to learn how this stuff works.

Digital Advertising: How does it work?

My always-popular session going into the nuts and bolts of display advertising, how it works and why. The last slide has some handy links to learn more.

[Read More]

Paper over your mistakes with data models

Tuesday 9 October, 2018. | By: Simon Rumble

Mistakes happen. In the data world, your ugly mistakes live on forever. It’s not just the embarrassment that’s a problem though. Gaps and obvious errors in historical data distract your stakeholders from more important matters. Explaining the anomalies and getting your data users to focus on things you don’t know about is tiring for everyone.

If you’re using a data model on event-level data such as Snowplow data, you can make a large number of data anomalies disappear for the majority of your users. This is where your opinionated data model provides value over the unopinionated raw data underneath.

I messed up

Data error

While working on a new version of our content data model to include the content metadata attached to our blog posts, I realised I’d dropped a letter from the content tag Snowplow Analytics. I’d further compounded the error by copying the same mistake across my entire 5-part blog series on data models. Doh!

What happens when you're not paying attention

So now we see this hilariously terrible output every time we report by content tags. A constant reminder of the fact I should pay more attention. (We’ll leave the proliferation of poor content tagging for another day.)

The fix

Fortunately this was really easy to fix! Content metadata information is recorded using a custom schema that ends up in its own table. My view folds this information into the pageview rows. All I need to do is add a really simple replace to fix my typo.

Where the query ordinarily pulls the value straight out, I’ve instead replaced my typo. It’s also important to document the change so that future maintainers understand what bug I’m papering over!

    -- Fix for Simon's stupid content tag typo
    REPLACE(page_tags, 'Analytcs', 'Analytics') AS page_tags,
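
For context, here’s roughly how that line sits inside the view definition. This is illustrative only: the content metadata table name and the join key are assumptions, not the exact view we run.

-- Illustrative sketch: the REPLACE slots in where the view would otherwise
-- pull page_tags straight through from the content metadata table.
CREATE OR REPLACE VIEW derived.view_pageviews_example AS
SELECT
  ev.page_urlpath,
  -- Fix for Simon's stupid content tag typo
  REPLACE(meta.page_tags, 'Analytcs', 'Analytics') AS page_tags
FROM atomic.events ev
  JOIN derived.content_metadata meta USING (page_urlpath);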

As simple as that, fixed and my dignity restored.

Dignity restored

[Read More]

Quantifying content velocity in Snowplow

Monday 8 October, 2018. | By: Simon Rumble

Adam Greco is something of a legend in the Adobe Analytics space. I’ve been reading his blog posts and learning from him since I first started using Omniture back in 2007 or so. He literally wrote the book on Omniture and then Adobe SiteCatalyst. The reason his blog was so useful is that very few people were writing about advanced digital analytics at the time. Between Adam and Ben Gaines, I learnt much of what I know about the then-emerging discipline.

Adam has also recently come on board as a strategic advisor for Snowplow. That’s super exciting because now there’s someone with extensive experience helping the Snowplow team refine their offering.

Content velocity

A few weeks ago Adam posted a description of a really interesting report he’d built using Adobe Analytics data. Quantifying Content Velocity is a great approach to understanding how quickly different types of content get in front of an audience.

Adam's content velocity report

I love this view because you can clearly see how quickly your content finds an audience. Some content will be taken up really quickly, spreading around the Internet like a virus: “news” type content, or something that manages to hit the Zeitgeist and become a viral hit. Other content takes time to build an audience and become an established resource for whatever it’s talking about: long-tail SEO content that gets more traffic over time.

Adam’s approach is a bit clunky in a few ways, so it got me thinking about how we’d build something similar using Snowplow data.

Adobe Analytics approach

There are a couple of problems I have with the Adobe approach that I’d like to improve upon.

Client-side clock

The first is a long-standing problem I’ve had with Adobe Analytics: the time a particular event is recorded isn’t available as an explicit dimension. This has been the case forever, with Adobe and with Omniture before it. Apparently these dimensions are available in Analysis Workspace, but not in the rest of the platform.

There are a bunch of ways you can use a timestamp dimension. You can build a time-of-day report to look at when users are visiting: hour of day, day of week and so on. The only way to do this is to use an old hack (I’m not sure if it was Adam or Ben who originally came up with it) that pushes the day part you’re interested in into a dimension for later use. There are two problems with this: every day part you want consumes a custom dimension (props and eVars), and it relies on the clients having an accurate clock. I’m sorry, but clients don’t have accurate clocks in the real world!

I’ve seen this clock problem at scale with large companies, and client clocks can be miles out. A quick query of our own site shows a few, but I know from querying much larger sites that this is very common. In real-world examples I’ve seen clocks that are years out, in both directions.

-- Difference in days between the device sending the event and the collector
-- receiving it, and how many browsers are affected.
SELECT DATEDIFF('day', dvce_sent_tstamp, collector_tstamp) AS days_difference,
       COUNT(DISTINCT domain_userid)
FROM atomic.events
WHERE app_id = 'snowflake-analytics.com'
  AND (DATEDIFF('day', dvce_sent_tstamp, collector_tstamp) > 1
   OR DATEDIFF('day', dvce_sent_tstamp, collector_tstamp) < -1)
GROUP BY 1
ORDER BY 1

Broken clocks on our site

You can see that this query takes advantage of some of the many timestamp fields Snowplow records, so it handles cases where events are queued on the device and sent later. See this post for more information about the timestamp fields.

For the record, Google Analytics has a pretty good set of time-based dimensions that you can use for this without consuming custom dimensions.

Automation

Another problem with Adam’s approach is that it’s very manual. I’m sure that with a bit of work you could automate it and make it a bit neater. Snowplow has the advantage of being able to use the very expressive SQL query language and then push the result into whatever visualisation tool you want to use.

With a somewhat complex SQL query we can actually work out the publish date for a piece of content automatically by looking for the first time we see Snowplow data for it. To make this truly accurate though, we need to filter out any hits that come in while the author is working on the piece but prior to publishing. We can take advantage of the user_internal column I created in the blog post on data modelling users.

Snowplow approach

Time parting

Here’s an example of a simple time parting report showing the day of the week people consume our content. You can see we’re pretty popular on weekdays and less so on weekends.

SELECT DATE_PART(dayofweek, CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp)) AS day_of_week,
    COUNT(*) AS page_views
FROM atomic.events
WHERE app_id = 'snowflake-analytics.com'
    AND event = 'page_view'
GROUP BY 1
ORDER BY 1

Day of week report

Content velocity

I was a bit daunted by the query I needed to write here, so I turned to the ever-reliable Jethro to recreate Adam’s report in SQL. Working together we came up with the query below. It uses a nested SELECT to find the first time a particular URL was seen, to work out the publish date, then calculates how long after the publish date each individual pageview happened.

SELECT page_urlpath,
       velocity_bucket,
       page_views
FROM (SELECT page_urlpath,
             CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp) :: date - published_date :: date AS velocity_bucket,
             COUNT(*) AS page_views
      FROM atomic.events
        LEFT JOIN derived.user USING (domain_userid)
        LEFT JOIN (SELECT page_urlpath, CONVERT_TIMEZONE('Australia/Sydney', MIN(derived_tstamp)) AS published_date
                   FROM atomic.events
                     LEFT JOIN derived.user USING (domain_userid)
                   WHERE user_internal IS NOT TRUE
                   GROUP BY 1) published USING (page_urlpath)
      WHERE app_id = 'snowflake-analytics.com'
        AND event = 'page_view'
        AND page_urlpath LIKE '/blog/data-modeling%'
        AND user_internal IS NOT TRUE
      GROUP BY 1, 2) velocity
ORDER BY page_urlpath, velocity_bucket

From here I was able to create a nice line chart in Tableau by using velocity_bucket as the X axis.

Content velocity using Snowplow and Tableau

Optimization

This is a very expensive query, going through every pageview row in the database to find the first instance of the URL. Rather than calculate this every time you run the report, you could merge it into the data model for each pageview. The approach of using the first pageview to signify publishing is also problematic unless you very carefully exclude all pageviews while the content is being written and edited. A better approach would be to derive the publish date from the CMS metadata, along with other metadata about each article.

You’d then make this available within the pageviews view I created in the final part of the data modelling blog series.
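
As a sketch of that approach, assuming the CMS publish date lands in a content metadata table (derived.content_metadata and its published_date column are assumptions here), the velocity bucket could then be computed directly against the pageviews view:

-- Hypothetical sketch: derive velocity from a CMS-supplied publish date rather
-- than from the first observed pageview.
SELECT
  pv.page_urlpath,
  pv.derived_tstamp_sydney::date - meta.published_date::date AS velocity_bucket,
  COUNT(*) AS page_views
FROM derived.view_pageviews pv
  JOIN derived.content_metadata meta USING (page_urlpath)
WHERE pv.user_internal IS NOT TRUE
GROUP BY 1, 2
ORDER BY 1, 2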

[Read More]

Modelling your Snowplow event data: Part 5 Automation

Tuesday 18 September, 2018. | By: Simon Rumble

In the first four parts of this series, we modelled out:

  • Pageviews: accurate time spent incorporating page pings and maximum scroll depth
  • Sessions: traffic sources and initial landing details
  • Users: filtering internal traffic and looking up things we know about the users

Now it’s time to move the model into production and start using it. To do this we’ll use Snowplow’s SQL Runner tool at the end of each Snowplow batch load into Redshift.

Schema for models

In my previous posts I’ve created a series of lookup_* tables with my modelled data. To keep things clean you probably want to create a new schema specifically for that modelled data. This makes it easy to separate raw, unopinionated event-level data from modelled, opinionated data that applies your business rules. Once everyone has got used to it, you can remove access to the raw data for most of your users, ensuring everyone is working from the same business rules.

By convention, Snowplow uses scratch and derived schemas for data modelling steps.

  • scratch is for ephemeral data used in intermediate processing steps but not actually part of the output. It should be discarded as a final processing step.
  • derived is for your model output data.
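
A minimal sketch of that setup might look like the following; the reporting group name is an assumption for illustration:

-- Create the conventional schemas for modelled data
CREATE SCHEMA IF NOT EXISTS scratch;
CREATE SCHEMA IF NOT EXISTS derived;

-- Let reporting users read the modelled output rather than the raw atomic data
GRANT USAGE ON SCHEMA derived TO GROUP reporting;
GRANT SELECT ON ALL TABLES IN SCHEMA derived TO GROUP reporting;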

SQL Runner

SQL Runner is a pretty simple tool that takes a series of SQL scripts and executes them on command against a specified database with specified credentials. The important piece of information from the documentation is the difference between :queries: and :steps:.

Queries within a step execute in parallel, while individual steps execute in the order in which they appear. If your models build on one another, you’ll need to separate them into steps so they run in sequence. The model I’ve built out in this series doesn’t have any dependencies between its parts, so you can easily have them run in parallel within the same step.

Configuring your scripts

The syntax for SQL Runner’s playbooks is pretty simple. For our model playbook we’re just executing the following:

:targets:
  - :name: "Snowflake Analytics data model"
    :type: redshift
    :host: redshift-endpoint # The endpoint as shown in the Redshift console
    :database: snowplow # Name of database
    :port: 5439 # Default Redshift port
    :username: datamodeling
    :password: {{secret "redshift-datamodeling-password"}}
    :ssl: true # SSL disabled by default
:steps:
  - :name: "Rebuild pageview, session and user models in parallel"
    :queries:
      - :name: pageview
        :file: model-pageview.sql
      - :name: session
        :file: model-session.sql
      - :name: user
        :file: model-user.sql

Automation

Managed Service customers

If you’re a Snowplow Managed Service customer, SQL Runner is handled for you. The process is quite simple:

  1. Clone your pipeline’s snowplow-proservices repository
  2. Create an sql-runner directory in jobs/main
  3. Place the playbook YAML file in a directory called playbooks
  4. Place your SQL scripts in a directory called sql
  5. Commit your changes to a named branch on the repository and push back upstream
  6. Raise a support ticket with Snowplow advising of the branch you’ve created and the schedule on which you’d like the new data model to run, for example “daily at UTC midnight” or “after every batch”
  7. Snowplow will implement your new data model scripts

From then on, any changes you make on the master branch for existing playbooks will update within about 45 minutes. You’ll need to follow the process above again if you create a new playbook running on a different schedule.

Open Source users

If you’re using an Open Source Snowplow pipeline, you’ll have to set up and manage scheduling yourself, probably from the same cron-driven script you currently use to kick off enrichment and database loads. A good place to start is the Guide for devops.

Optimisations

Scope down to a smaller time range

My scripts execute across the entire events dataset we have. Our site isn’t high traffic so this isn’t a huge imposition, executing in about a minute on our Redshift database. If you have decent traffic volumes, you’ll want to scope down the reprocessing of models somewhat. If your queries involve any kind of lookback, that will take some careful tuning of your query windows. Be particularly careful if you’re calculating sessions in some way different to COUNT(DISTINCT domain_sessionid) as a session can easily cross a batch boundary.
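
For example, a look-back along these lines restricts a run to sessions that were active in the last 24 hours, so sessions straddling a batch boundary get recalculated in full (the window size is something you’d tune):

-- Sessions with any activity in the last 24 hours; restricting the session
-- model to these domain_sessionids picks up sessions that cross batch boundaries.
SELECT DISTINCT domain_sessionid
FROM atomic.events
WHERE app_id = 'snowflake-analytics.com'
  AND collector_tstamp > SYSDATE - INTERVAL '24 hours'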

Creating useful SQL views

The joins you need to make for this to be useful are a bit complicated to do routinely, so it makes sense to create a nice simple view for your pageview events that incorporates these elements on every row. These views become something you can expose to a visualisation tool like Tableau to enable easy exploration while incorporating engagement metrics, your business rules for traffic source attribution and information about users.

Below is the view I’ve created for our data set. For convenience I’m also including the timezone conversion we routinely use to convert timestamps to Sydney time. For performance reasons, you might eventually write the result of this view out to a table as an additional step: there’s a storage cost, but queries against it will be faster.

CREATE VIEW derived.view_pageviews AS (
  SELECT
    CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp) AS derived_tstamp_sydney,
    ev.*,
    pv.pageview_id,
    pv.time_spent_seconds,
    pv.max_pp_yoffset_max,
    pv.max_doc_height,
    pv.scroll_depth_percent,
    se.first_refr_urlhost,
    se.first_page_referrer,
    se.first_refr_urlscheme,
    se.first_refr_urlport,
    se.first_refr_urlpath,
    se.first_refr_urlfragment,
    se.first_refr_medium,
    se.first_refr_source,
    se.first_refr_term,
    se.first_mkt_campaign,
    se.first_mkt_medium,
    se.first_mkt_source,
    se.first_page_urlpath,
    se.last_page_urlpath,
    se.channel,
    us.first_session,
    us.last_session,
    us.last_domain,
    us.last_name,
    us.last_legal_name,
    us.last_description,
    us.last_type,
    us.last_linkedin_handle,
    us.last_logo,
    us.last_metrics_alexa_global_rank,
    us.last_metrics_annual_revenue,
    us.last_metrics_employees,
    us.last_metrics_market_cap,
    us.last_crunchbase_handle,
    us.user_internal
  FROM
    atomic.events ev,
    atomic.com_snowplowanalytics_snowplow_web_page_1 pg,
    derived.pageview pv,
    derived.session se,
    derived.user us
  WHERE
    ev.app_id = 'snowflake-analytics.com'
    AND ev.event = 'page_view'
    AND ev.event_id = pg.root_id
    AND pg.id = pv.pageview_id
    AND ev.domain_sessionid = se.domain_sessionid
    AND ev.domain_userid = us.domain_userid
) WITH NO SCHEMA BINDING

Querying the data

Now let’s use the models and views we’ve built and start analysing some content.

Engagement metrics

This query gives you some handy engagement metrics including bucketing pageviews into scroll depth groups.

SELECT
  page_urlpath,
  SUM(CASE WHEN scroll_depth_percent = 0 THEN 1 END) AS "0%",
  SUM(CASE WHEN scroll_depth_percent > 0 AND scroll_depth_percent <= 0.25 THEN 1 END) AS "25%",
  SUM(CASE WHEN scroll_depth_percent > 0.25 AND scroll_depth_percent <= 0.5 THEN 1 END) AS "50%",
  SUM(CASE WHEN scroll_depth_percent > 0.5 AND scroll_depth_percent <= 0.75 THEN 1 END) AS "75%",
  SUM(CASE WHEN scroll_depth_percent > 0.75 THEN 1 END) AS "100%",
  ROUND(MEDIAN(scroll_depth_percent)*100) AS "median scroll depth percent",
  AVG(time_spent_seconds) AS "average time spent seconds",
  SUM(time_spent_seconds) AS "total time spent"
FROM derived.view_pageviews
WHERE
  page_urlpath LIKE '/blog/data-modeling%'
  AND user_internal IS NOT TRUE
GROUP BY 1
ORDER BY COUNT(*) DESC

Content engagement metrics

Bucketed scroll depth report

Looking at identified companies

As our site is aimed at B2B marketing, identifying the companies coming to our site is an interesting thing to do. I’m excluding traffic from ourselves and our friends at Snowplow and might increase the exclusions as we keep exploring the data.

SELECT
  COALESCE(last_domain, user_ipaddress) AS last_domain,
  last_crunchbase_handle,
  last_linkedin_handle,
  last_name,
  SUM(time_spent_seconds) AS time_spent_seconds,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT domain_sessionid) AS sessions,
  COUNT(DISTINCT domain_userid) AS unique_browsers,
  ROUND(AVG(scroll_depth_percent) * 100) AS avg_scroll_depth
FROM
  derived.view_pageviews
  WHERE
    user_internal IS NOT TRUE
    AND COALESCE(last_domain, '') NOT IN ('snowplowanalytics.com')
    AND derived_tstamp_sydney > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '90 days'
GROUP BY 1, 2, 3, 4
ORDER BY 8 DESC, 6 DESC, 5 DESC

Next steps?

So that’s the end of our giant odyssey into Snowplow data models. We haven’t covered everything that can go into a data model here. In particular, one of the best uses of data models is to map identities across all instances where you see a particular domain_userid back in history. To do this you’d use the same technique as I’ve used here for companies, using whatever identification point (or points!) you have. We, unfortunately, don’t have an easy way to reliably identify people.

You might now like to have a read of the Snowplow web data model, which can be a bit daunting at first but probably makes a whole lot more sense now that you’ve looked through a simpler model.

[Read More]

Modelling your Snowplow event data: Part 4 Users

Wednesday 29 August, 2018. | By: Simon Rumble

In the first three parts of this series, we looked at modelling out pageview events to include accurate time spent and scroll depth, then classifying sessions based on what we know about where the user came from. Now we’re going to look at what we know about users.

One of the best things about having raw behavioural event data in an SQL database is that you can update your knowledge about a user’s behaviour after the behaviour is recorded. The classic case is when a user logs in or otherwise identifies themselves: you now know the identity behind the previous behaviour. If your application has some way of identifying users, such as forcing login or email click-throughs, you can merge user behaviour across multiple devices, such as a desktop browser and a mobile browser.

In our company’s case we don’t have a good way to identify individual users. At some point we’ll start a newsletter and drip-feed nurturing process, which will allow us to pass identifiers into the click-through URLs. Until then, most of the information we can get comes from the Clearbit enrichment, which looks up the IP address and takes a stab at guessing the company behind the browser.

Bug in sessions post

A quick aside: while writing this component on users I discovered a bug in the sessions piece. The part of my big CASE statement dealing with sessions that don’t have a referrer should be at the bottom of the statement, otherwise the clauses looking for utm querystring components (the mkt_* columns) will never execute when there’s no referrer. UTM strings are specifically designed to deal with situations where there is no referrer.

I’ve updated the post; if you want to find the change, look for the location of this piece of code:

        WHEN first_refr_urlhost IS NULL
          THEN 'Unknown'
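
To see why the ordering matters, here’s a tiny self-contained illustration (the inline row is made up): a session with no referrer but a utm_source of adwords should be classified as Adwords, which only happens if the mkt_* checks run before the NULL-referrer fallback.

SELECT
  CASE
    WHEN first_mkt_source = 'adwords'           THEN 'Adwords'
    WHEN first_refr_urlhost LIKE 'www.google.%' THEN 'Google'
    -- the NULL-referrer fallback now comes after the mkt_* checks
    WHEN first_refr_urlhost IS NULL             THEN 'Unknown'
    ELSE 'other'
  END AS channel
FROM (SELECT CAST(NULL AS varchar) AS first_refr_urlhost,
             'adwords'             AS first_mkt_source) example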

Create a lookup table on domain_userid

For this installment we create a lookup_users table keyed off the first-party cookie which is stored in domain_userid.

Labelling internal traffic

The first component looks for any users who’ve behaved in a manner that might be considered “internal” traffic. For us that means they’ve sent in data with a host of localhost or they’ve appeared from one of our internal IP addresses.

In larger companies you might also put a data collection pixel on your intranet homepage, internal systems and the like. Any browser that’s ever visited those locations is considered “internal”. You want to tackle labelling internal browsers from multiple angles to be sure you capture everything. It’s entirely likely that developers will use a browser that never sees the intranet or internal systems to develop and test, so you also want to look at IP ranges and any other opportunities too. If you use them, think about any automated tests running and how you can label those.
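
As a sketch of what broadening those criteria might look like, the query below adds a couple of extra signals to the internal_users logic used further down. The intranet app_id and the headless browser pattern are assumptions for illustration, not something we actually run.

-- Hypothetical extra signals for labelling internal browsers: anyone who has
-- ever hit an intranet tracker, plus obvious automated-test user agents.
SELECT DISTINCT
  domain_userid,
  TRUE AS user_internal
FROM atomic.events
WHERE page_urlhost IN ('127.0.0.1', 'localhost')
   OR user_ipaddress IN ('52.123.123.123', '150.123.123.123')
   OR app_id = 'intranet'               -- a separate tracker on internal systems
   OR useragent LIKE '%HeadlessChrome%' -- automated browser tests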

Clearbit company details

The second component looks for the latest Clearbit data available for the user. This isn’t going to be perfect: I looked up my own domain_userid and saw a bunch of different places I’d been, including client sites and hotels, but it will give you some idea. More complex logic might score users with multiple Clearbit identifiers. Potentially you could just give a count of unique Clearbit companies for the user, to drill down when there’s uncertainty.

DROP TABLE IF EXISTS lookup_users;
CREATE TABLE lookup_users
  DISTKEY(1)
  SORTKEY(1)
  AS (

    WITH internal_users AS (
        SELECT DISTINCT
          domain_userid,
          TRUE user_internal
        FROM atomic.events
        WHERE page_urlhost IN ('127.0.0.1', 'localhost')
              OR
              user_ipaddress IN ('52.123.123.123', '150.123.123.123')
    ),
        clearbit_data AS (
          SELECT
            DISTINCT
            domain_userid,
            FIRST_VALUE(domain_sessionid)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) first_session,
            LAST_VALUE(domain_sessionid)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_session,
            LAST_VALUE(domain IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_domain,
            LAST_VALUE(name IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_name,
            LAST_VALUE(legal_name IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_legal_name,
            LAST_VALUE(description IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_description,
            LAST_VALUE(type IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_type,
            LAST_VALUE("linkedin.handle" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_linkedin_handle,
            LAST_VALUE(logo IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_logo,
            LAST_VALUE("metrics.alexa_global_rank" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_alexa_global_rank,
            LAST_VALUE("metrics.annual_revenue" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_annual_revenue,
            LAST_VALUE("metrics.employees" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_employees,
            LAST_VALUE("metrics.market_cap" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_market_cap,
            LAST_VALUE("crunchbase.handle" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_crunchbase_handle

          FROM atomic.events
            LEFT JOIN atomic.com_clearbit_enrichment_company_1
              ON event_id = root_id
                 AND collector_tstamp = root_tstamp
          WHERE app_id = 'snowflake-analytics.com'
          ORDER BY collector_tstamp
      )
    SELECT *
    FROM clearbit_data
      LEFT JOIN internal_users
      USING (domain_userid)
  )
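
To quantify the uncertainty mentioned above, a quick query along these lines counts how many distinct Clearbit companies each browser has been attributed to (the more-than-one threshold is arbitrary):

-- Browsers attributed to more than one Clearbit company; high counts flag
-- users whose company lookup is uncertain.
SELECT
  domain_userid,
  COUNT(DISTINCT domain) AS clearbit_companies
FROM atomic.events
  LEFT JOIN atomic.com_clearbit_enrichment_company_1
    ON event_id = root_id
   AND collector_tstamp = root_tstamp
WHERE app_id = 'snowflake-analytics.com'
  AND domain IS NOT NULL
GROUP BY 1
HAVING COUNT(DISTINCT domain) > 1
ORDER BY 2 DESC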

Pulling it all together

Here’s a simple query that shows recent site visitors ranked by uniques and time spent on the site, traffic source categories and any Clearbit identification information we’ve been able to get for the IP address. You can see the output here with the company details blurred out to protect the innocent. This query pulls together data from all the lookup tables we’ve created in this series: pageviews, sessions and users.

Example user activity report

SELECT
  ev.geo_country,
  ev.geo_city,
  NVL(us.last_domain, ev.user_ipaddress) AS last_domain,
  us.last_name,
  se.channel,
  SUM(pv.time_spent_seconds) AS time_spent_seconds,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT ev.domain_sessionid) AS sessions,
  COUNT(DISTINCT ev.domain_userid) AS unique_browsers,
  ROUND(AVG(pv.scroll_depth_percent) * 100) AS avg_scroll_depth
FROM
  atomic.events ev,
  atomic.com_snowplowanalytics_snowplow_web_page_1 pg,
  lookup_pageview pv,
  lookup_sessions se,
  lookup_users us
  WHERE
    ev.app_id = 'snowflake-analytics.com'
    AND ev.event = 'page_view'
    AND CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
    AND ev.event_id = pg.root_id
    AND pg.id = pv.pageview_id
    AND ev.domain_sessionid = se.domain_sessionid
    AND ev.domain_userid = us.domain_userid
    AND us.user_internal IS NOT TRUE
GROUP BY 1, 2, 3, 4, 5
ORDER BY 9 DESC, 6 DESC

Next steps

So we’ve now got techniques to build up a custom view of our audience behaviours using the three different scopes of interest. In the next post we’ll look at automating this so that it updates every time a new batch of data is added to Redshift, then some convenience views to make querying simple for everyday use and to expose to your end users.

Continue to Part 5: Automation

[Read More]

Modelling your Snowplow event data: Part 3 Sessions

Wednesday 22 August, 2018. | By: Simon Rumble

In the first two parts of this series, we looked at modelling out pageview events to include accurate time spent and scroll depths. Now we’ll roll up sessions.

A session, or visit, is when a user comes to your site from somewhere, does some stuff, then leaves. The “some stuff” is a bunch of pageviews and events. The “somewhere” is a traffic source, for example a Google search.

Create a lookup table on domain_sessionid

This model pulls out the various parameters from the first referrer, uses the Campaign attribution enrichment fields and also has a monster CASE statement classifying traffic sources on a number of dimensions but mostly using first_refr_urlhost.

The giant CASE statement defining channel is probably the piece you most want to dig into and adapt for your own purposes. You’ll notice that I classify a first_refr_urlhost of NULL as Unknown, in contrast to Direct or Typed/Bookmarked. This is a very conscious decision: Direct and Typed/Bookmarked give the false impression that there is some certainty about the source, when in fact we just don’t know where the session came from.

DROP TABLE IF EXISTS lookup_sessions;
CREATE TABLE lookup_sessions
  DISTKEY(1)
  SORTKEY(1)
  AS (
    WITH sessions AS
    (SELECT DISTINCT
      collector_tstamp,
      domain_sessionid,
      refr_urlhost,
      page_referrer,
      mkt_campaign,
      refr_urlscheme,
      refr_urlport,
      refr_urlpath,
      refr_urlquery,
      refr_urlfragment,
      refr_medium,
      refr_source,
      refr_term,
      mkt_content,
      mkt_medium,
      mkt_source,
      page_urlpath,
      FIRST_VALUE(refr_urlhost)
      OVER (
        PARTITION BY domain_sessionid
        ORDER BY domain_sessionid, collector_tstamp
        rows between unbounded preceding and unbounded following ) AS first_refr_urlhost,
       FIRST_VALUE(page_referrer)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_page_referrer,
       FIRST_VALUE(refr_urlscheme)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlscheme,
       FIRST_VALUE(refr_urlport)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlport,
       FIRST_VALUE(refr_urlpath)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlpath,
       FIRST_VALUE(refr_urlquery)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlquery,
       FIRST_VALUE(refr_urlfragment)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlfragment,
       FIRST_VALUE(refr_medium)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_medium,
       FIRST_VALUE(refr_source)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_source,
       FIRST_VALUE(refr_term)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_term,
       FIRST_VALUE(mkt_campaign)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_campaign,
       FIRST_VALUE(mkt_content)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_content,
       FIRST_VALUE(mkt_medium)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_medium,
       FIRST_VALUE(mkt_source)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_source,
       FIRST_VALUE(page_urlpath)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_page_urlpath,
       LAST_VALUE(page_urlpath)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS last_page_urlpath
      FROM atomic.events
      WHERE app_id IN ('snowflake-analytics.com')
      ORDER BY
        1 DESC)
    SELECT DISTINCT
      domain_sessionid,
      first_refr_urlhost,
      first_page_referrer,
      first_refr_urlscheme,
      first_refr_urlport,
      first_refr_urlpath,
      first_refr_urlquery,
      first_refr_urlfragment,
      first_refr_medium,
      first_refr_source,
      first_refr_term,
      first_mkt_campaign,
      first_mkt_content,
      first_mkt_medium,
      first_mkt_source,
      first_page_urlpath,
      last_page_urlpath,
      CASE
        WHEN first_refr_urlhost LIKE '%snowflake-analytics.com'
          THEN 'internal'
        WHEN first_refr_urlhost = 'localhost'
          THEN 'localhost'
        WHEN first_refr_urlhost = '127.0.0.1'
          THEN 'localhost'

        -- Google
        WHEN first_mkt_source = 'adwords'
          THEN 'Adwords'
        WHEN first_refr_urlhost LIKE 'www.google.%'
          THEN 'Google'
        WHEN first_refr_urlhost = 'com.google.android.googlequicksearchbox'
          THEN 'Google'
        WHEN first_refr_urlhost = 'encrypted.google.com'
          THEN 'Google'
        WHEN first_refr_urlhost LIKE '%doubleclick.net'
          THEN 'Google Doubleclick'
        WHEN first_refr_urlhost = 'plus.google.com'
          THEN 'Google Plus'
        WHEN first_refr_urlhost = 'plus.url.google.com'
          THEN 'Google Plus'
        WHEN first_refr_urlhost = 'www.googleadservices.com'
          THEN 'Google Ad Services'
        WHEN first_refr_urlhost = 'cse.google.com'
          THEN 'Google Custom Search'

        -- Email platforms
        WHEN first_refr_urlhost = 'outlook.live.com'
          THEN 'Email'
        WHEN first_refr_urlhost = 'mailchi.mp'
          THEN 'Email'
        WHEN first_refr_urlhost = 'mail.google.com'
          THEN 'Email'
        WHEN first_refr_urlhost = 'com.google.android.gm'
          THEN 'Email'

        WHEN first_refr_urlhost LIKE '%ecosia.org'
          THEN 'Ecosia Search'
        WHEN first_refr_urlhost LIKE '%search.yahoo.com'
          THEN 'Yahoo Search'
        WHEN first_refr_urlhost = 't.co'
          THEN 'Twitter'
        WHEN first_refr_urlhost LIKE '%bing.com'
          THEN 'Bing'
        WHEN first_refr_urlhost = 'discourse.snowplowanalytics.com'
          THEN 'Discourse'
        WHEN first_refr_urlhost LIKE '%snowplowanalytics.com'
          THEN 'Snowplow'
        WHEN first_refr_urlhost = 'yandex.ru'
          THEN 'Yandex'
        WHEN first_refr_urlhost = 'www.linkedin.com'
          THEN 'LinkedIn'
        WHEN first_refr_urlhost = 'com.linkedin.android'
          THEN 'LinkedIn'
        WHEN first_refr_urlhost = 'lnkd.in'
          THEN 'LinkedIn'
        WHEN first_refr_urlhost = 'chrome.google.com'
          THEN 'Chrome'
        WHEN first_refr_urlhost = 'duckduckgo.com'
          THEN 'DuckDuckGo'
        WHEN first_refr_urlhost LIKE '%measurecamp.org'
          THEN 'MeasureCamp'
        WHEN first_refr_urlhost LIKE '%meetup.com'
          THEN 'Meetup'
        WHEN first_refr_urlhost LIKE '%wawsydney.com'
          THEN 'WAW Sydney'
        WHEN first_page_referrer LIKE '%meetup.com/Web-Analytics-Wednesday-Sydney%'
          THEN 'WAW Sydney'
        WHEN first_refr_urlhost LIKE '%wawmelbourne.com'
          THEN 'WAW Melbourne'
        WHEN first_page_referrer LIKE '%meetup.com/Web-Analytics-Wednesdays-Melbourne%'
          THEN 'WAW Melbourne'
        WHEN first_refr_urlhost LIKE '%meetup.com'
          THEN 'Meetup'
        WHEN first_refr_urlhost = 'com.Slack'
          THEN 'Slack'
        WHEN first_refr_urlhost LIKE '%reddit.com'
          THEN 'Reddit'
        WHEN first_refr_urlhost LIKE '%facebook.com'
          THEN 'Facebook'
        WHEN first_refr_urlhost LIKE '%baidu.com'
          THEN 'Baidu'
        WHEN first_refr_urlhost LIKE '%github.com'
          THEN 'Github'
        WHEN first_refr_urlhost LIKE '%getpocket.com'
          THEN 'Pocket'
        WHEN first_refr_urlhost LIKE '%evernote.com'
          THEN 'Evernote'
        WHEN first_refr_urlhost IS NULL
          THEN 'Unknown'
        ELSE 'other'
        END AS channel
      FROM sessions
      ORDER BY channel
  )

Reporting

To build the report shown above, we’re just looking at the channel categorizations and the associated sessions. Your app_id would be different.

-- Sessions by channel
SELECT se.channel, COUNT(DISTINCT ev.domain_sessionid) AS sessions
FROM atomic.events ev, lookup_sessions se
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
  AND ev.event = 'page_view'
  AND ev.domain_sessionid = se.domain_sessionid
  AND ev.app_id = 'snowflake-analytics.com'
GROUP BY 1
ORDER BY 2 DESC

A more advanced report breaks out traffic sources by landing pages, showing source then the first page in the session. Your app_id would be different.

SELECT se.channel, se.first_page_urlpath, COUNT(DISTINCT ev.domain_sessionid) AS sessions, COUNT(*) AS pv
FROM atomic.events ev, lookup_sessions se
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
  AND ev.event = 'page_view'
  AND ev.domain_sessionid = se.domain_sessionid
  AND ev.app_id = 'snowflake-analytics.com'
GROUP BY 1, 2
ORDER BY 3 DESC

Next steps

In the next installment we’ll look at user data. The most interesting component of that model is the ability to identify users retrospectively and across devices.

Continue to Part 4: Users

[Read More]

Modelling your Snowplow event data: Part 2 Pageviews

Wednesday 15 August, 2018. | By: Simon Rumble

In the first part of this series on data modelling we went through the background for building a data model. In this edition we’ll go through the steps to create a basic pageview model that incorporates page pings so we can see accurate time spent and scroll depth for each pageview.

Using this model you can aggregate the pageview rows and build reports like the one shown above, which gives us Pageviews, Sessions, Unique Browsers, Average Time Spent and Average Scroll Depth metrics for each page.

Create a lookup table on pageview_id

This model takes advantage of the web page context configuration option of the JavaScript tracker. This automatically creates a UUID for each pageview, attaches it to all events within that pageview, and stores it in the table atomic.com_snowplowanalytics_snowplow_web_page_1. We’ll create a table that aggregates all the page_ping data against that pageview_id.

-- DROP only needed when you recreate after the first time
DROP TABLE IF EXISTS lookup_pageview;
CREATE TABLE lookup_pageview
  DISTKEY(2)
  SORTKEY(2, 1)
  AS (
  SELECT
    atomic.com_snowplowanalytics_snowplow_web_page_1.id AS pageview_id,
    (COUNT(*) * 5)-5                                    AS time_spent_seconds,
    MAX(pp_yoffset_max)                                 AS max_pp_yoffset_max,
    MAX(doc_height)                                     AS max_doc_height,
    CASE WHEN MAX(pp_yoffset_max) = 0 THEN 0 ELSE (MAX(CAST(pp_yoffset_max AS float)) / MAX(CAST(doc_height AS float))) END AS scroll_depth_percent
  FROM atomic.events
    JOIN atomic.com_snowplowanalytics_snowplow_web_page_1
      ON atomic.events.event_id = atomic.com_snowplowanalytics_snowplow_web_page_1.root_id AND
         atomic.events.collector_tstamp = atomic.com_snowplowanalytics_snowplow_web_page_1.root_tstamp
  WHERE
    app_id = 'snowflake-analytics.com'
    AND event IN ('page_view', 'page_ping')
  GROUP BY 1
);

Let’s go through what’s happening here.

We create a table lookup_pageview using a query that joins atomic.com_snowplowanalytics_snowplow_web_page_1 to atomic.events keyed by event_id and collector_tstamp, the usual way of joining context tables to the core event table.

The output table has a time_spent_seconds column, calculated from the page ping setting we use: window.snowplow('enableActivityTracking', 5, 5). With the first ping firing after 5 seconds and another every 5 seconds after that, a pageview with n pings represents roughly 5 × n seconds, which is what (COUNT(*) * 5) - 5 gives once the page_view row itself is included in the count.

Next we get the maximum scroll depth and document height for the pageview, both in pixels, then calculate the scroll depth as a percentage of the document height.

That’s really it. Quite simple.

Reporting

To build the report output shown above, you can then do a query with a three-way join between atomic.events, atomic.com_snowplowanalytics_snowplow_web_page_1 and our just-created lookup_pageview pv.

SELECT
  ev.page_urlpath,
  SUM(CASE WHEN ev.event = 'page_view' THEN 1 ELSE 0 END) AS pageviews,
  COUNT(DISTINCT ev.domain_sessionid) AS sessions,
  COUNT(DISTINCT ev.domain_userid) AS UBs,
  AVG(pv.time_spent_seconds) AS time_spent_average,
  SUM(pv.time_spent_seconds) AS time_spent_total,
  ROUND(AVG(pv.scroll_depth_percent)*100) AS scroll_depth_percent_average
FROM atomic.events ev, atomic.com_snowplowanalytics_snowplow_web_page_1 pg, lookup_pageview pv
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.event_id = pg.root_id
  AND pg.id = pv.pageview_id
  AND ev.app_id = 'snowflake-analytics.com'
GROUP BY 1
ORDER BY 2 DESC

Eventually you would write out new data tables that replace the need to ever hit atomic.events directly, containing the items you want from atomic.events and the identifiers for the applicable pageview, session and user objects. You wouldn’t copy across page pings because all you need is summarised in the pageview table. Again, you might create a view that does the joins to give you one row per pageview with everything an analyst could need.

Below is a query to bucket page views by scroll depth quartile, handy for getting a broader view of user interactions:

SELECT
  ev.page_urlpath,
  SUM(CASE WHEN scroll_depth_percent = 0 THEN 1 ELSE 0 END) AS "0%",
  SUM(CASE WHEN scroll_depth_percent > 0 AND scroll_depth_percent <= 0.25 THEN 1 ELSE 0 END) AS "25%",
  SUM(CASE WHEN scroll_depth_percent > 0.25 AND scroll_depth_percent <= 0.5 THEN 1 ELSE 0 END) AS "50%",
  SUM(CASE WHEN scroll_depth_percent > 0.5 AND scroll_depth_percent <= 0.75 THEN 1 ELSE 0 END) AS "75%",
  SUM(CASE WHEN scroll_depth_percent > 0.75 THEN 1 ELSE 0 END) AS "100%"
FROM atomic.events ev, atomic.com_snowplowanalytics_snowplow_web_page_1 pg, lookup_pageview pv
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '120 days'
  AND ev.event_id = pg.root_id
  AND pg.id = pv.pageview_id
  AND ev.app_id = 'snowflake-analytics.com'
  AND event = 'page_view'
GROUP BY 1
ORDER BY COUNT(*) DESC

Automation and moving to production

You probably want to create a schema especially for data models instead of dropping this table into atomic. We use a models schema for this.

The script above recreates the table for all pageviews every time it runs. To automate this, you’ll want to trigger it with SQL Runner after each batch runs. We don’t have enormous traffic volumes so recreating everything each time isn’t a big deal. If you have high traffic volumes, you might want to use a smaller window and keep old data around.

To do this you’ll need to use a look-back window, something like 24 hours, to account for pageviews that are still happening across the batch boundary, and only update those within that scope. This has the helpful side effect of doing less crunching on every batch, which is probably preferable anyway; I wanted to keep the example simple here.
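
A rough sketch of that incremental variant is below; the 24-hour window and the scratch table name are assumptions to adjust for your own volumes.

-- Recalculate only pageviews that received events in the last 24 hours,
-- then swap those rows into the existing lookup table.
CREATE TABLE scratch_recent_pageview AS
SELECT
  pg.id                               AS pageview_id,
  (COUNT(*) * 5) - 5                  AS time_spent_seconds,
  MAX(pp_yoffset_max)                 AS max_pp_yoffset_max,
  MAX(doc_height)                     AS max_doc_height,
  CASE WHEN MAX(pp_yoffset_max) = 0 THEN 0
       ELSE MAX(CAST(pp_yoffset_max AS float)) / MAX(CAST(doc_height AS float))
  END                                 AS scroll_depth_percent
FROM atomic.events ev
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
    ON ev.event_id = pg.root_id
   AND ev.collector_tstamp = pg.root_tstamp
WHERE ev.app_id = 'snowflake-analytics.com'
  AND ev.event IN ('page_view', 'page_ping')
GROUP BY 1
HAVING MAX(ev.collector_tstamp) > SYSDATE - INTERVAL '24 hours';

-- Swap the recalculated rows into the lookup table
DELETE FROM lookup_pageview
WHERE pageview_id IN (SELECT pageview_id FROM scratch_recent_pageview);

INSERT INTO lookup_pageview
SELECT * FROM scratch_recent_pageview;

DROP TABLE scratch_recent_pageview;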

Next steps

In the next chapter of this series we’ll look at a session model. Sessions are interesting because they have a traffic source: how the user got to your site. The classifications of these sources are very company-specific, so you often end up with a massive CASE statement to put them in appropriate buckets, much like Google Analytics has its Channel Groupings.

Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.

Continue to Part 3: Sessions

[Read More]

Modelling your Snowplow event data: Part 1

Wednesday 8 August, 2018. | By: Simon Rumble

One of the important features of Snowplow is that you can build your own custom data models to suit your unique analytical requirements in a space-efficient and performant way. It’s important that your data model can evolve and grow in complexity as your business grows and your needs get more advanced.

This is the first in a series of blog posts taking you through the thought process of building a data model. Good background reading for this series is the Snowplow blog post on event modeling.

Business context

Here at Snowflake Analytics we’re putting some effort into using our own Snowplow pipeline inside the business. At the moment we’re kind of like the proverbial plumber’s house with non-functioning pipes.

My main focus at the moment is our sales process: tracking prospects as they come in, qualifying leads and then converting them into customers. Content like this blog forms an important part of that process, exposing our message to our target market of businesses that want to do advanced analytics.

Site visitors’ content consumption patterns give us some insight into their interests and expertise. We want to feed these insights into our sales pipeline. Where possible we can identify people or their companies and, based on their content consumption, we can decide if it’s worth trying to track down the individuals who are at the right point in the thought process to be a good candidate for our services.

Why model data?

Snowplow architecture stage 5: Data Modelling

The data that drops into the atomic.events table is event-level data. To make it useful we need to apply some logic to it. You can get a lot done with simple queries to show you basic metrics, for example:

SELECT
  page_urlpath,
  SUM(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) AS pageviews,
  COUNT(DISTINCT domain_sessionid) AS sessions,
  COUNT(DISTINCT domain_userid) AS unique_browsers
FROM atomic.events
WHERE
  app_id = 'snowflake-analytics.com'
  AND CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
GROUP BY 1
ORDER BY 2 DESC

This will get you some quick metrics replicating what you’d get in a tool like Google Analytics.

There are some limitations baked into the query above too. First up, you’re accepting the default definitions for sessions and unique_browsers. The domain_sessionid field is generated by the client JavaScript which has some limitations and will differ from the numbers Google Analytics generates in some odd ways. domain_userid is useful for a single domain site but if you have multiple domains and want to merge those, or you’ve got identified users and you’d like to deduplicate those across devices, or you use AMP which doesn’t get that cookie set, you need more.

Of course you’re also missing any kind of engagement metrics. One of Snowplow’s best features is its ability to measure accurate time spent. This, or even just replicating the way GA and Adobe do it, will require something more complex and a chunk more load on your database.

Remove extraneous data, save size, speed up queries

If you start seeing lots of data volume, these queries will start being onerous quite quickly. For standard end-user reporting you want to have the fastest performance possible and avoid putting lots of load on your database recalculating this stuff every time. Summary tables are the answer here.

Keeping all your raw data around in a database can get expensive quite quickly, especially given there’s a bunch of replicated data that you probably don’t need often. Columns like useragent are repeated for every single event, page pings happen often to give you good time spent resolution, and there are deprecated fields like br_gears which aren’t useful to anyone.

Best to get rid of those and only keep the fields and rows you need, merging in the result of all your calculations. Your modelled data might include a unified row for each pageview with additional columns representing time spent and scroll depth, and without any extraneous columns you don’t need.

Centralise business rules upstream of reporting

Working from a uniform set of business rules makes everyone’s life easier. We’ve all experienced the fun of explaining why two different reports created by different groups end up with different results. By limiting access to the raw data and consolidating upstream you can set and maintain these business rules and common calculations in one place. For example, how you define:

  • time spent metric
  • unique browsers metric
  • bounces
  • traffic source classification
  • attribution rules
  • conversions

By moving these into a central data model, everyone works from the same source and you get less confusion. The model can be updated as the business evolves.

Keep complexity hidden

Related to centralisation, you can hide complexity and let your analyst and reporting users get on with their work while you worry about the complex stuff. For example, the Snowplow AMP tracker has some tricky limitations. It produces no page pings, so calculations of time spent will need to be somewhat arbitrary or use the GA/Adobe method. How do you merge these with your more accurate measurements from normal web pages and inside apps? AMP also doesn’t set a first-party cookie to become domain_userid. How will you deal with unique browser metrics for these users?

By building a model with a common definition, your reporting users can see a simple single row per pageview data model, with columns representing the output of complex logic for things like user deduplication, time spent, scroll depth and conversions. Their queries can be simple aggregations and filtering, while your data model holds all the complex conditional logic to deal with edge cases.

An example of the complexity rabbit hole

While working on the data model for this blog post I came across another weird edge case that’s worth sharing. The rows below have the same pageview_id, some have the same event_id and domain_userid and they get 25 pageviews for each combination, despite being in places as far apart as Indonesia and California.

Odd behaviour from Google agent

These are almost certainly bots of some kind; quite a lot have Google’s fingerprints on them. It can be quite difficult to work out whether you want to count these, and whether there’s some subset you want to count and another you don’t. These are complexities you want to decide on once, then hide from everyone forever after.

Paper over definitional changes and breakage

It’s quite common to see tracking you’ve built break because a web developer changed something unexpectedly. This happens a lot if you use page URLs to build a Goal Funnel in Google Analytics: the URL for the conversion step changes and you end up with a gap in data. By modelling out your data you can seamlessly switch across from the old to the new definition and your analytical users need never know, simply seeing a boolean for whether a conversion occurred.
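
As a sketch, with made-up URLs and a made-up switch-over date, the model could expose a single conversion flag like this so reports never see the change:

-- Hypothetical example: the thank-you page URL changed on 2018-07-01, but
-- downstream users only ever see the conversion boolean.
SELECT
  event_id,
  derived_tstamp,
  CASE
    WHEN page_urlpath = '/thank-you'     AND derived_tstamp <  '2018-07-01' THEN TRUE
    WHEN page_urlpath = '/thanks/signup' AND derived_tstamp >= '2018-07-01' THEN TRUE
    ELSE FALSE
  END AS conversion
FROM atomic.events
WHERE app_id = 'snowflake-analytics.com'
  AND event = 'page_view'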

Model scope

An important thing to think about is the scope of each object in your data model and what fits in each. Pageviews summarise things about the pageview itself. Sessions hold things that change once per session such as traffic source information. Users hold everything we know about the user, and potentially other ways we can identify users.

Scope of the different parts of your model

Pageview scope

  • Engagement metrics:
    • Time spent on page
    • Scroll depth
  • Events that happen on page:
    • in-view events
    • social share
    • form submissions
  • Conversions

Session scope

  • Attribution
    • Traffic sources
    • Classification of those sources

User scope

  • Identity from login, email clickthrough, other
  • Data tied to identity: CRM
  • IP-lookup data (Clearbit): may have multiple values

A candidate data model

Pulling all this together, we might come up with a data model that looks like this. To build a complete view, you'd select from the atomic.events table, then join that to com_snowplowanalytics_snowplow_web_page_1 to get the pageview identifier. From there you can join on the pageview, session and user tables to get a complete picture. It's probably easiest to create a view that extends atomic.events and bolts on pageview-, session- and user-scope details for every row.
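
Here's a minimal sketch of that view, assuming the modelled tables are named derived.pageviews, derived.sessions and derived.users; the bolted-on columns (and the crm_id field on the user table) are illustrative assumptions.

CREATE VIEW derived.events_extended AS
SELECT
  ev.*,
  wp.id                  AS pageview_id,
  pv.time_spent_seconds,
  pv.max_scroll_depth_px,
  se.traffic_source,
  us.crm_id              -- assumed column on the user table
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON wp.root_id = ev.event_id
LEFT JOIN derived.pageviews pv ON pv.pageview_id   = wp.id
LEFT JOIN derived.sessions  se ON se.session_id    = ev.domain_sessionid
LEFT JOIN derived.users     us ON us.domain_userid = ev.domain_userid;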

A potential final data model

Eventually you would write out new data tables that replace the need to ever hit atomic.events directly, containing the items you want from atomic.events and the identifiers for the applicable pageview, session and user objects. You wouldn’t copy across page pings because all you need is summarised in the pageview table. Again, you might create a view that does the joins to give you one row per pageview with everything an analyst could need.
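
A minimal sketch of writing out such a table, with page pings dropped because the pageview table already summarises them; the derived.events name and the short column list are illustrative.

CREATE TABLE derived.events AS
SELECT
  ev.event_id,
  ev.derived_tstamp,
  ev.event,
  wp.id               AS pageview_id,
  ev.domain_sessionid AS session_id,
  ev.domain_userid    AS user_id
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON wp.root_id = ev.event_id
WHERE ev.event <> 'page_ping';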

Next steps

In the next part of this series I’ll show you how to build an initial pageview scope data model to analyse content engagement metrics, specifically accurate time spent and scroll depth based off page pings.

Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.

Continue to Part 2: Pageviews

[Read More]

Snowplow Inspector extension updated to colour code and allow filtering

Monday 16 July, 2018. | By: Simon Rumble

Snowplow usage continues to go up and to the right, with new and interesting use cases proliferating. Alongside this we’ve seen a steady increase in usage of our debug tool, Snowplow Inspector.

We’ve started to see web sites with multiple Snowplow trackers on them for different use cases as businesses integrate the Open Source version into vertical-specific products. This can lead to a confusing view in the debugger with a lot of beacons appearing in the debugger and no easy way to differentiate them. Jethro has been hard at work improving the tool to handle these cases.

Colour Coding

Different colours are used for each unique combination of App ID and Collector

You’ll notice now that beacons are colour-coded in the list of beacons on the left-hand side. Each unique combination of app_id and collector gets a unique colour so you can quickly differentiate the different destinations. Mousing over the beacon in the left-hand column quickly shows you the Collector and App ID. As you scan your eye down the list of beacons, you can visually group them.

Filtering

Filter by any of the fields in the debugger

In addition, if you want to filter them down by Collector or App ID, you can type that into the new filtering box. This filter actually works for any of the fields you see in the debugger (though not the Base 64-encoded ones just yet).

Let us know your thoughts

If you find any bugs or have any feature requests for the debug tool, please create an issue in GitHub.

[Read More]

Setting up an AWS Athena datasource in Jetbrains DataGrip

Friday 29 June, 2018. | By: Simon Rumble

Download the JDBC driver from AWS and place it in the DataGrip JDBC driver directory. On Linux this was ~/.DataGrip2018.1/config/jdbc-drivers/.

Install Athena JDBC driver

Go to File > Data Sources to open the Data Sources panel and click '+ > Driver'. Name it AWS Athena. Here's the confusing bit: skip down to Additional Files, click + and find the location of your driver. Now, back up above that, select the driver Class com.simba.athena.jdbc.Driver. Click OK.

Now that you have the JDBC driver installed, you need to create a connection to your Athena instance.

Create a connection for Athena

File > Data Sources to open the Data Sources panel and click + > AWS Athena to create a new connection using the driver. Enter the access key and secret in the User and Password fields respectively.

Set the URL to jdbc:awsathena://athena.ap-YOURREGION.amazonaws.com:443/ with your region string in there.

Create a connection for Athena

Now click Advanced and set the two required keys, AwsRegion and S3OutputLocation. The region should be the same as you used above.

Create a connection for Athena

S3OutputLocation is available in the AWS Console for Athena by clicking Settings.

Now click Test and you should get Success. You can now use Athena in DataGrip.

[Read More]

Event roundup and slides: MeasureCamp Auckland

Monday 18 June, 2018. | By: Simon Rumble

Last Saturday Mike and I made it across to the first MeasureCamp Auckland. We’ve sponsored all the previous events in Sydney and Melbourne so we were thrilled to help out the Auckland team, who did a fantastic job with a really good venue, excellent food and a really diverse crowd keen to get into it.

MeasureCamp physical board

The board filled up very quickly. Clearly lots of people knew the drill and had prepared talks and sessions. I had some difficult decisions to make when sessions clashed but I didn't go to a dud session all day!

MeasureCamp digital board

At previous MeasureCamps a consistent problem has been access to the physical board. The board needs to be reachable for people to put up their talks, but that also means the people crowding around deciding on their sessions block the view of those behind them. It's also difficult for people with visual impairments to read.

Mike has been working on a digital board, initially as a fork of Vitaliy Matiyash's code but eventually rebuilding it in Vue.js. The back-end is a Google Spreadsheet, allowing quite a bit of flexibility while keeping the board in sync with the physical board.

MeasureCamp digital board real-time stats from Snowplow data

Late the Friday night before MeasureCamp Auckland, Mike and I sat up instrumenting and improving the user interface. We were able to build a quick real-time dashboard in Kibana using the Snowplow instrumentation. For future MeasureCamps this should provide some helpful feedback on the popularity of sessions, allowing organisers to shift things around to ensure the most popular sessions are in the biggest rooms.

We have some other plans around automatically tracking the cards as they move around using a camera and some kind of unique marking on each card. Perhaps we could have microcontroller-driven screens at each room showing what's on and up next in each location. Lots of ideas!

The code will be going up on Github shortly.

Johann de Boer on TV advertising impact measurement

Johann did a really interesting talk about tracking television (and radio) impacts, and the limitations of the technique. This is something I've always wanted to play around with, tying a detailed as-run log of TV spots in different regions into an increase in visits to your site.

Myriad lunch options

Mmmm donuts

Food and drink at the event were spectacular. Organisers made heroic efforts to accommodate the incredible mosaic of food requirements. Free coffee in the morning was great too!

Bowling at the after party

Finally we ended up at the afterparty, a bar that included a couple of bowling alleys. Loads of fun, loads of pizza and a seemingly endless bar tab. Great venue for it!

Slides from talks

Here’s the slides for my talks. Feel free to get in touch if you have any questions about anything.

Digital Advertising: How does it work?

My always-popular session going into the nuts and bolts of display advertising, how it works and why. The last slide has some handy links to learn more.

Snowplow Analytics: Introduction and demo

A quick introduction to how Snowplow Analytics works and why you might want to use it. There was some great discussion in this session.

[Read More]

GDPR emails: some early, some late

Monday 28 May, 2018. | By: Simon Rumble

In case you’ve been living in a cave, the EU’s General Data Protection Regulation came into force on Friday last week. This is Europe’s latest crack at regulating consumer privacy in the digital era, and it’s definitely a very pro-consumer regulation. It’s going to have some big impacts on the business models of a bunch of businesses.

Part of the regulation requires that EU consumers give informed consent to personal data collection and use, so companies around the world have been scrambling to update their privacy policies. For people whose data has already been collected, they're sending out emails desperately trying to get permission to continue doing whatever it is they were doing.

GDPR emails by day

Last week I decided I was getting enough of these emails that it'd be interesting to chart how many of them I'd received. As you can see there was a flurry of emails just before the deadline, and quite a few that completely missed it. The last one so far came from Blockchain on Sunday morning Sydney time.

There are some caveats here. When these emails first started turning up, I was mostly just binning them. So any received more than 30 days before I changed the way I dealt with them are probably gone. Similarly, a bunch ended up in the spam folder, so older ones would have been deleted.

[Read More]

Snowplow Inspector debug tool now validates schemas

Thursday 15 March, 2018. | By: Simon Rumble

Snowplow Inspector use around the world

Last June we released our debug tool Snowplow Inspector and we've watched as hundreds of Snowplow users around the world have used it to make sense of their Snowplow tracking.

Latest version

Snowplow Inspector new design

Jethro has been hard at work improving the debugger. First he ported it to Typescript for better maintainability, then he reworked the design to be more readable and useful. Following that he's added a really useful new feature.

Schema validation

Snowplow Inspector new design

The exciting new feature in the latest version is the ability to validate custom schemas within the debug tool. Custom schemas allow you to define your own event model to record details of events and the context in which they occur. Rather than shoehorn your event model into a static set of Custom Dimensions, you can model things using JSON, enabling really expressive data models that map very closely to your applications' internal representation of things.

Custom data models require that you define a schema that tells Snowplow what is allowed and not allowed in the model. If an event doesn’t validate against the schema, it will end up being sent into the Bad Rows. You want to get things working in your development, staging and test environments before pushing changes to production and ending up with a lot of important events in Bad Rows.

Get set to validate your custom schemas

Out of the box Snowplow Inspector points at the standard Snowplow schemas. Once you start creating your own schemas you’ll want to point at your schema repository. Right-click on the Snowplow Inspector icon in your browser toolbar (it may be hidden under the hamburger icon ☰) and select Options. You can also click on Unrecognised in the debugger itself.

Snowplow Inspector options

Enter schema repository URLs for any custom repositories you use, one per line. You’ll need to enable access to the repository from wherever you’re running Chrome with IP whitelisting or similar.

Install the extension

To install the extension, go to the Chrome store and click Add to Chrome.

Feedback and changes

The extension is released under the GNU Public License, so you're free to fork it and make changes. Check out the Github repo. If you find any bugs or have any feature requests, create a Github issue. Jethro is remarkably responsive to feature requests!

[Read More]

Setting up a BigQuery datasource in Jetbrains DataGrip

Monday 19 February, 2018. | By: Mike Robins

DataGrip is one of the most valuable tools for our engineers for exploring and querying a myriad of different database technologies. DataGrip doesn't yet come bundled with a BigQuery driver, so in this post we'll explore how to set up a custom data source so that you can connect to BigQuery using DataGrip.

In this post we’ll use:

You will also require a Google service account that has access to BigQuery. If you don't have one set up, you can find instructions here.

Take note of the email address and the key pair associated with this account. You'll need both of these pieces to set up the BigQuery connection.

Ensure that this service account has sufficient permissions to access BigQuery. If you aren't sure, navigate to IAM, select your project and edit the role associated with the service account (e.g., BigQuery -> BigQuery Data Editor/Viewer).

Setting up the BigQuery driver

Add a new data source by clicking on the ‘+’ icon, select ‘Driver’ - the second last option in the drop down menu.

Enter ‘BigQuery’ as the name of the data source.

Under 'Additional files' click the '+' icon and browse to and select all the jar files (ending in *.jar). There should be 7 in all.

Under ‘Class’ select com.simba.googlebigquery.jdbc42.Driver

Click the ‘+’ icon in URL templates and add the following URL template:

Name: default
Template: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType=[AuthValue];

BigQuery driver setup

This concludes setting up the JDBC driver for BigQuery.

Setting up a BigQuery data source

Next let’s set up a data source to ensure we can connect.

  1. In the data sources window hit ‘+’ and select BigQuery.
  2. Enter the name of your datasource in the Name field.
  3. Leave the user and password fields blank.
  4. In the URL field customise and paste the following JDBC connection string:

jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=PROJECTID;OAuthType=0;OAuthServiceAcctEmail=EMAIL;OAuthPvtKeyPath=KEYPATH

where PROJECTID is your Google project ID, OAuthType=0 is for service account authorisation, EMAIL is the email address (ending in .gserviceaccount.com) associated with the service account, and KEYPATH is the absolute path to your keyfile (e.g., /Users/mike/Downloads/example.json) - both .json and .p12 keys will work here.

Hit 'Test Connection' and, if everything has been set up correctly, DataGrip should notify you that the connection has succeeded.

Troubleshooting

Error: HttpTransport IO error

Check that your OAuthServiceAcctEmail and OAuthPvtKeyPath are set correctly and you’ve specified the absolute path to your key.

Tips

There are a number of other options that can be customised for the JDBC driver by selecting 'Advanced' in the Data Source connection.

It is also possible to use other authentication methods to connect to BigQuery including:

  • Google User Account using OAuth 2 (OAuthType=1)
  • Using pregenerated access and refresh tokens (OAuthType=2)
  • Using the Application Default Credentials in your environment (OAuthType=3)

For more information about setting up these authentication methods please refer to the Simba JDBC Driver for Google BigQuery installation guide from the zip file earlier in the post.

[Read More]

Why would you push Google Analytics data into Snowplow?

Wednesday 14 February, 2018. | By: Simon Rumble

In late January the Snowplow team released R99 of Snowplow which includes a really interesting feature: with a small JavaScript change, you can mirror all the data being sent into Google Analytics into a Snowplow collector.

It’s a pretty awesome feature and got a lot of people talking. GTM legend Simo Ahava got really excited about it and wrote a couple of posts. Cogs started turning in the minds of a bunch of other people in the analytics community too, seeing where this might be useful.

If you need a recap of how Snowplow differs from GA and Adobe Analytics, check out How does Snowplow Analytics compare to other vendors?

Raw data

Pretty much everyone who has done digital analytics has come up with a use case where they wish they could get to the raw, underlying events happening. It might be because you have a change of definitions you want to turn into a continuous timeline for your users rather than explaining it half a dozen times a week. You might want to re-categorise things that were categorised at collection time. Or you might want to get down and dirty with every individual user's event stream.

There are a few ways to get to event-level data in Google Analytics. One is to pay Google a chunk of money and upgrade to GA360. It's a good product, and if you want some of the other features it could be worth it, but it's a lot of money to pay just to get access to your own raw data.

Other mechanisms extract data from the APIs, but they’re kinda hacky and won’t scale up to high traffic volumes as you’ll start hitting API limits.

The final approach is to duplicate all the data being sent into Google Analytics into something else. Older versions of the Google Analytics Urchin script used to explicitly support this, as Urchin had a self-hosted version available. That's been thrown out, but it's still possible to duplicate the payload and send it somewhere else.

Duplicate your Google Analytics pixels to Snowplow

Snowplow isn't the first group to try this, but the important difference with the Snowplow pipeline is that you're getting a robust, scalable, tested and very flexible pipeline. It also does a bunch of enrichment activities that make the data more useful, like User Agent classification, IP to geography lookups and setting a third-party pixel for cross domain use where that works.

Real-time

Snowplow’s real-time pipeline collects data and makes it available in as little as a few seconds. Not aggregations, not isolated dimensions, and not fifteen minutes later. The whole lot, enriched and with everything you’ve sent available. For some use cases, this is absolutely essential. If you want to react to the user’s actions within the same session, you need real-time. Using the GA adapter, you can take your existing Snowplow pipeline and have an instant real-time upgrade without changing anything in your tagging.

Test out Snowplow with real data

Another reason this feature is interesting is for people who are keen to dip their toes in Snowplow but the prospect of defining and building all the events and data models from scratch is a bit daunting. If your GA implementation is pretty solid, you can be up-and-running really quickly and start exploring what it’s like working with event-level data.

Next you can spend some time upgrading specific events to a richer event model, building custom schemas and seeing where that leads. It’s a nice gentle introduction, and you should see some value out of it pretty much immediately.

GA360 features for cheap

The Google Analytics script doesn’t know whether you’ve paid for GA360 or not, so you can send in a full 200 custom dimensions but you’ll only see the first 20 in the UI. The Snowplow version of that data won’t have the same limitation. So you can send in 200 dimensions and get them back out in the event-level data.

Of course, Snowplow has a much richer event model than the flat 200 dimensions. You can model things like arrays of objects in Snowplow, which ranges from clunky to impossible in GA.

There’s no sampling in Snowplow data. You have all the raw events at your fingertips, so performance is dictated by the size of your data, the complexity of your query and the amount of hardware you throw at the problem.

Redshift, Postgres, Hive, Snowflake databases

Snowplow allows you to output into more databases: Redshift, Postgres, Hive and Snowflake Computing are all options for your data. BigQuery is awesome, but if you’re already using one of these databases it might be a better fit.

Backup

Sometimes you just want a copy of everything. Y’know, because backup. This is a quick and easy way to get that.

[Read More]

Presentation: Real World Big Data Architectures

Monday 4 December, 2017. | By: Simon Rumble

A few weeks back Nicholas Tan gave a presentation at the Future of Financial Services conference about architectural designs in the real world to get value from data. Nick most recently was responsible for News Corp’s large-scale Snowplow Analytics rollout and has just started at Macquarie Group. Check out his presentation.

Nicholas Tan

Nick has some great things to say about Snowplow, its cost model and the use cases for it in financial services. Of course we love that he references the Digital Analytics Hierarchy of Needs.

Snowplow cost model

[Read More]

Snowplow R95 with ZSTD support released

Tuesday 14 November, 2017. | By: Simon Rumble

Back in July, we did a bunch of work to quantify the benefits of ZSTD in the Redshift database, resulting in this blog post from Mike. The results were clear and massive: at least 50% reductions in storage use, and an improvement in nearly all use cases. We started migrating our customers to using ZSTD wherever possible so they could benefit from this huge improvement.

These changes have now been incorporated into the latest upstream Snowplow open source release. Migration for existing users is a little tricky, requiring a deep copy. New users will benefit from these latest changes straight away.

Get in touch if you have any questions.

[Read More]

Slides from Measurecamp Sydney

Monday 6 November, 2017. | By: Simon Rumble

Snowflake Analytics sponsored the always awesome Measurecamp Sydney unconference last weekend. As usual it was an incredibly high-value event with really great sessions and even better informal chats between the sessions. Such a great event and we’re looking forward to Melbourne early next year.

We gave a few talks there on various topics of interest so here’s the slides.

Display advertising: How does it work?

I’ve given this talk at least 30 times now and have been updating it for a very long time as the ecosystem evolves. Always really interesting questions about how the display ecosystem works.

Microcontrollers: Collect data from your environment

My latest obsession is these tiny, cheap little devices that allow you to collect data from and interact with the world around you. I’m throwing a few around my house to collect temperature data and send it into Snowplow for later analysis.

Enabling data-driven Innovation: Defining and developing analytics maturity

Narbeh’s talk about the philosophy behind Snowplow Analytics and how solid data collection enables and drives innovation.

[Read More]

The Digital Analytics Hierarchy of Needs

Thursday 21 September, 2017. | By: Simon Rumble

A few weeks ago I discovered Monica Rogati's fantastic Data Science Hierarchy of Needs. It's a data science-centric riff on Maslow's Hierarchy of Needs, a classic concept in psychology. I've found myself using Rogati's diagram and the concept in conversations with colleagues, partners, customers and friends ever since, as a way to explain the challenges we face in this Digital Analytics space.

Business people get excited about the latest buzzwords: Big Data, Artificial Intelligence, Deep Learning etc. Before you can break out TensorFlow and start doing bleeding-edge data science, you need to ensure you’ll be working on data that reflects reality.

In practice most new clients we audit have substantial problems with behavioural event data tracking and require at least some work to get to the minimum baseline of accurately tracking everything a user does on their web site and apps.

And so, let’s have a look at the Digital Analytics Hierarchy of Needs:

Digital Analytics Hierarchy of Needs

Where am I?

We see a full spectrum between near perfect and absolute disaster zones but the advice is generally much the same: you need to get your data collection in order if you want to do the cool stuff. Along the way you’ll unlock new insights and new capabilities. Done right, you’ll also give yourself awesome flexibility for future improvements and modifications to how you do Digital Analytics.

What does “done right” mean? That usually starts with tag management and a data layer. That’s a topic for another blog post, coming soon.

How long is this going to take?

To get to the minimum baseline, you don't have to down tools and spend a year reworking. There are always interim steps that will bring useful results and help you get insights. It all starts with getting a good idea of the current state and building a plan to get you to the top of the Digital Analytics nerdvana pyramid.

So how do I get to the top?

We tend to start our customer engagements with a data audit to find out where things stand and what work needs to be done to unlock the cool stuff. It’s a pretty quick and painless process, and there’s invariably a list of quick wins that bring immediate value, along with a roadmap of bigger pieces to work on.

Get in touch with us if you’d like to kick off the discussion.

[Read More]

Snowflake Analytics welcomes Conrad Yiu as an advisory board member

Wednesday 30 August, 2017. | By: Narbeh Yousefian

We are delighted to announce Conrad Yiu has joined Snowflake Analytics as an Advisory Board member. Conrad brings more than 20 years of experience as an entrepreneur, venture investor and business builder.

After life in professional services in Sydney and London, his first start-up was a mobile technology company in Cambridge in the late '90s, which is where he also met and worked with Alex Dean, one of the co-founders of Snowplow Insights.

Conrad returned to Australia in 2005, where he was Director of Corporate Development, leading investments and M&A for Newscorp’s digital division. He left in 2011 to start ArdenPoint, an investment firm that co-founded Temple & Webster (ASX:TPW), Fluent Retail and ParcelPoint. In addition to his current portfolio he is a venture partner at Tempus Partners and started a new watch company, Mountain Active Co.

[Read More]

A quick script to ZSTD all your shredded tables

Friday 25 August, 2017. | By: Simon Rumble

Mike’s recent post about compressing Snowplow tables works great for atomic.events, with clients seeing compression down to 30% of the original size or so. But what about all your shredded tables?

For now you have to manually convert the output from igluctl while we wait for our pull request to make it into a release; from then on this will be automatic. There's also a pull request in to change the default compression for atomic.events.

Run the following command from the root of your schema definitions and it'll automatically convert everything relevant to ZSTD. This was written by one of our staff.

sed -s -r -i -e '/root_(id|tstamp)/s/ENCODE (BYTEDICT|DELTA|DELTA32K|LZO|MOSTLY8|MOSTLY16|MOSTLY32|RUNLENGTH|TEXT255|TEXT32K|ZSTD)/ENCODE RAW/' -e '/root_(id|tstamp)/!s/ENCODE (BYTEDICT|DELTA|DELTA32K|LZO|MOSTLY8|MOSTLY16|MOSTLY32|RAW|RUNLENGTH|TEXT255|TEXT32K)/ENCODE ZSTD/' -- sql/com.yourcompany/*.sql

[Read More]

How does Snowplow Analytics compare to other vendors?

Wednesday 9 August, 2017. | By: Simon Rumble

Tonight Snowflake Analytics team members Mike and Narbeh are debating the merits of Snowplow Analytics with representatives of Google Analytics and Adobe Analytics at Web Analytics Wednesday. The head-to-head aspect of it is meant to be lighthearted, but it’s forced us to think about some of the ways Snowplow Analytics is a better match for many types of digital analytics problems.

So without further ado, here’s a rundown of features where Snowplow has unique qualities over Google and Adobe Analytics.

Feature | Snowplow Insights | Google Analytics | Adobe Analytics
Real-time | ✓ | Some | Some
Unsampled data and performance | ✓ | |
Custom structured data models | ✓ | |
Personally Identifiable Information | ✓ | |
SQL interface | ✓ | |
Cross-domain tracking and user identification | ✓ | |
Data sovereignty | ✓ | |
Change goals with historical data | ✓ | |
Accurate time spent metric | ✓ | |
Custom enrichments | ✓ | |
Incoming Webhooks | ✓ | |
High-cardinality | ✓ | (none) | (Low-Traffic)
IP addresses and user-agent strings | ✓ | |

Real-time

Snowplow Analytics can operate in real-time mode with latency down to about 5 seconds from collection to availability. Google and Adobe Analytics offer this with a subset of dimensions and events, but Snowplow gives you access to the entire enriched event stream, including all the richness of custom structured data models. See also: Custom structured data models.

Unsampled data and performance

Because Snowplow runs in your own cloud infrastructure, you have control over the inevitable price-performance tradeoffs involved in collecting, processing and analysing large event volumes. You can decide to have full, unsampled data available speedily by throwing more infrastructure at the problem, or have it available less quickly to reduce cost. See also: High cardinality.

Custom structured data models

Snowplow's event model enables customers to use their existing data models and provide them as context to events without transforming them. Other tools require you to shoehorn your data into their flat dimension/metric model.

This is perhaps best illustrated with an example. Let’s say you want to record an action, say a page print, for a recipe on your content site.

You could record this as an event in Google Analytics, but the only context you can provide comes from the URL of the page itself, and potentially some custom dimensions. Custom dimensions can't hold multiple values, so you can only model something like an ingredients list as a string, which is really ugly once you get to reporting on it. And GA truncates any string over 150 bytes. Nice!

Adobe Analytics has the concept of List Props and List Vars which would enable this, but you have to shoehorn the result into Adobe’s model. That means no nesting of things, like a list containing other items like recommended recipes.

By contrast, here’s how you would attach a self-describing context for recipes onto a Structured Event in a Snowplow implementation. Note how the model is readable and looks just like how you might store it in your CMS. These contexts can be attached to any kind of event you like, including things like pageviews and social shares.

window.snowplow('trackStructEvent', 'article','print','Chicken Chow Mein recipe', null, null, 
    [
      {
          schema: 'iglu:com.snowflake-analytics/recipe/jsonschema/1-0-1',
          data: {
            "title": "Chicken Chow Mein recipe",
            "author": ["John Smith", "Mary Jones"],
            "category": "recipes",
            "cuisine": "Chinese",
            "course": "main",
            "ingredients": [ "chicken", "noodles", "vegetables" ],
            "content_tags": [ "chinese recipes", "chicken", "quick meals"]
        }
      }
    ]
);

Personally Identifiable Information

Google take a particularly hard line on Personally Identifiable Information. You can’t send it, and if they find it they’ll delete all your data. Ouch.

Adobe are a bit more lenient but still cautious.

With Snowplow, the data is collected and stored in your own cloud infrastructure. It’s still best practice to be quite careful with PII, and particularly with sensitive information, but that boundary line is up to the customer. That means if PII ends up inside your Snowplow data, it can be controlled under your own policies, not those of a third-party organisation. See also: Data sovereignty.

SQL interface

Snowplow’s data shows up in a database with an SQL interface, running in your own cloud infrastructure. You can connect it with your own systems, create joins with other tables, point your existing BI teams at it and generally use it like it’s your own data, because it is.

You also get to keep the raw event data, unprocessed and ready to re-process or re-examine with tools like AWS Athena or Presto.

Adobe Analytics will provide the raw data for you to ingest into a database, and these days they even give you the header row for that table, but it’s up to you to make sense of the data, stitch together visits and the like.

Google Analytics data, with the premium product, can be pushed automatically to BigQuery, which is great. BigQuery is a very capable SQL data analysis tool, and support for it is becoming much more common than in the past, so Google probably gets a pass on this, although currently the latency for data to appear is 4 hours while Snowplow can deliver hourly or possibly faster. See also: Change goals with historical data.

Cross-domain tracking and user identification

Snowplow gives you access to all the identifiers: first-party domain cookie, third-party domain cookie, all the ingredients for fingerprinting, IDFA, IDFV, AdId and as many versions of login information as you have. You can then make up your own mind about how you stitch together user identities and sessions.

Adobe Analytics does third-party domain cookies with fallback to first-party domain cookies. Google Analytics has a feature with extremely narrow utility (basically, if you have a hosted checkout, it might be useful). See also: IP addresses and user-agent strings.

Data sovereignty

This is simple: you get to decide where your data lives, and the security and access policies. Adobe will charge you extra to store your data in some specific jurisdictions, but you have to trust them on the security and access policies. Google, well it’s in “the cloud”. See also: Personally Identifiable Information.

Change goals with historical data

You've got full control of your data, so you can recrunch the numbers any time. It might be a big, slow query, but you can do it. That means if you decide to include an OR statement in your goal and funnel after you've created it, it's totally doable.
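
As a rough example, widening a goal with an OR and recounting conversions over all of history might look something like this; the URLs are placeholders.

SELECT
  DATE_TRUNC('day', derived_tstamp) AS day,
  COUNT(DISTINCT domain_sessionid)  AS converting_sessions
FROM atomic.events
WHERE event = 'page_view'
  AND (page_urlpath = '/signup/complete' OR page_urlpath = '/quote/complete')
GROUP BY 1
ORDER BY 1;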

The Google Analytics premium offering has a beta dynamic funnels report, but those don’t show up as conversions.

Accurate time spent metric

The way Google Analytics and Adobe Analytics (and Nielsen and lots of others) calculate the time spent metric is badly flawed, and it has got even worse with modern traffic patterns. Snowplow is better. I've dealt with this in detail here.

Custom enrichments

Snowplow allows you to query external databases to automatically enrich your data as it passes through the pipeline. For example, you could look up the current exchange rate from your actual bank to record alongside a transaction's local value.

Incoming Webhooks

Real-time integrations with external systems can be done through Webhooks, meaning your Snowplow pipeline can have immediate access to things like email opens, logistics state changes, incoming phone calls and the like.

High-cardinality

No (other) and (Low-Traffic) in Snowplow reports. See also: Unsampled data and performance.

IP addresses and user-agent strings

All the raw materials to be able to track users, do custom fingerprinting and analyse platforms are available. Other tools throw away these raw materials. We have customers looking at fraud patterns based on IP addresses. In the past we’ve looked for dark social traffic by finding the Facebook in-app browsers that aren’t sending through accurate Referrer headers. (Hint: look for “FBAN” in the user-agent). See also: Cross-domain tracking and user identification.

[Read More]

Make big data small again with Redshift ZSTD compression

Wednesday 12 July, 2017. | By: Mike Robins

A new compression option in Redshift allows you to make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to get it happening.

In late 2016 Facebook open sourced a compression algorithm known as Zstandard that combines Lempel Ziv and tANS to achieve a compression ratio better than many algorithms with a slight tradeoff in speed.

In January 2017 AWS added ZSTD support to Redshift, so you can now use this powerful compression algorithm to reduce the storage size of your data. One immense perk of ZSTD is that it can be applied across all supported data types. It gives exceptional performance on long varchars: perfectly suited for large JSON strings that regularly appear in shredded tables.

We’ve had great success both experimentally and practically by applying ZSTD compression to multiple Snowplow tables. Experimentally across datasets ranging between 10 million and 5 billion rows we’ve achieved a mean compression ratio of ~3 meaning that the newly compressed table takes up approximately a third of the original table on disk when compared to the compression defaults in atomic.events 0.8.0.

In particular we’ve found:

  • Negligible impact on speed of queries for atomic.events (note this will be highly workload dependent so benchmarks here are less useful)
  • So far we haven’t experimented with the performance/space savings of compressing SORTKEYs. AWS recommend against compressing SORTKEYs (so use ENCODE RAW for collector_tstamp and root_tstamp).
  • ZSTD in almost all instances replaces LZO as the default compression method suggested by ANALYZE COMPRESSION.

Things to take into account

  • Depending on your workload you may want to run ANALYZE COMPRESSION on your table which will provide some recommendations by Redshift as to what the suggested column encodings are. One caveat of this approach is that you are limited to sampling 1 billion rows so if possible choose a sample data set that contains representative variability within columns.
  • If you have the opportunity we also recommend benchmarking common queries/jobs on an identical sample of data for a) the 0.8.0 compression defaults and b) the newly compressed tables.
  • You cannot modify compression on existing columns in a table so consider deep copying the data particularly if a large region of your table is unsorted1 as this will outperform a VACUUM. If performing a deep copy ensure you have sufficient disk space to complete the action: it’s difficult to know how much space is required but we opt for at least 50% of the cluster storage remaining. You may need to consider resizing the cluster and/or temporarily pausing your pipeline to complete this action.
  • If you decide to drop the original atomic.events table ensure that you either DROP CASCADE or individually drop any dependencies that may rely on this table such as views or foreign key constraints from shredded or derived tables.
  • If you've dropped any foreign keys ensure that you recreate them and they reference the new atomic.events object. Although Redshift does not enforce these constraints they are used by the planner to generate optimised query plans.
  • As this is a new object you may also need to regrant permissions to users/groups as well as ensure the table owner is identical to the original table(s) so that Snowplow can continue to load data.

Getting started

  1. Analyze compression on your existing table by using ANALYZE COMPRESSION.
  2. If Redshift recommends using ZSTD for columns consider benchmarking a sample of data (e.g., 3 months) with the original column compression and ZSTD column compression
  3. If performance is better or equivalent, deep copy data from the original atomic.events table to a new atomic.events_new table (see the sketch after this list). You may need to resize Redshift or free up additional disk space before performing this action.
  4. Verify that the two tables contain identical data by comparing a total row count as well as a row count per day.
  5. Drop the original table as well as any references2. ALTER the new table to rename it to atomic.events3. Ensure that this new table has an identical owner.
  6. Sit back and enjoy that free disk space.
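
For reference, here's a minimal sketch of the deep copy in step 3, with a drastically shortened column list; your real DDL should come from the Snowplow atomic.events definition with ZSTD encodings applied and the SORTKEY left as ENCODE RAW.

CREATE TABLE atomic.events_new (
  event_id         CHAR(36)      ENCODE ZSTD,
  collector_tstamp TIMESTAMP     ENCODE RAW,   -- don't compress the SORTKEY
  event            VARCHAR(128)  ENCODE ZSTD,
  page_urlpath     VARCHAR(3000) ENCODE ZSTD
)
DISTSTYLE KEY
DISTKEY (event_id)
SORTKEY (collector_tstamp);

INSERT INTO atomic.events_new
SELECT event_id, collector_tstamp, event, page_urlpath
FROM atomic.events;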

Need help?

There are a few gotchas and edge cases involved in doing this. If you'd like help or advice on how to do it feel free to post in the comments below or get in touch.

Queries

1

 SELECT
 "database",
 "table",
 diststyle,
 sortkey1,
 sortkey1_enc,
 size,
 pct_used,
 unsorted,
 stats_off,
 tbl_rows
FROM
 svv_table_info
WHERE
 schema = 'atomic'
ORDER BY
 size DESC

2

SELECT DISTINCT c_p.oid AS tbloid
,n_p.nspname AS schemaname
,c_p.relname AS NAME
,n_c.nspname AS refbyschemaname
,c_c.relname AS refbyname
,c_c.oid AS viewoid
FROM pg_class c_p
JOIN pg_depend d_p ON c_p.relfilenode = d_p.refobjid
JOIN pg_depend d_c ON d_p.objid = d_c.objid
JOIN pg_class c_c ON d_c.refobjid = c_c.relfilenode
LEFT JOIN pg_namespace n_p ON c_p.relnamespace = n_p.oid
LEFT JOIN pg_namespace n_c ON c_c.relnamespace = n_c.oid
WHERE
 schemaname = 'atomic'

3

ALTER TABLE atomic.events_new RENAME TO events;

[Read More]

Snowplow Inspector: a debug tool for Snowplow beacons

Tuesday 20 June, 2017. | By: Simon Rumble

Snowplow Insights is an amazingly flexible way to collect data, but with great flexibility comes some complexity. If you work on Snowplow implementations a lot, you’re likely familiar with Base 64 Decode and JSON Formatter when you’re digging into custom contexts and testing your implementation.

Snowplow Inspector

We wanted a tool to make debugging easier so we created Snowplow Inspector for the Chrome browser. It lives in the Developer Tools (☰ > More Tools > Developer Tools or Ctrl-Shift-I) under its own Snowplow tab. Here’s how it looks on the Snowplow site:

Snowplow Inspector screenshot

On the left-hand side of the panel you see the detected Snowplow pixels for each site you’re loading. Clicking on a pixel you’ll see the details of the tracking pixel, including the custom contexts and unstructured events decoded and formatted to be readable.

Install the extension

To install the extension, go to the Chrome store and click Add to Chrome.

Feedback and changes

The extension is released under the GNU Public License, so you're free to fork it and make changes. Check out the Github repo. If you find any bugs or have any feature requests, create a Github issue.

[Read More]

Accurate time spent: A killer feature of Snowplow Analytics

Tuesday 7 March, 2017. | By: Simon Rumble

Web analytics tools commonly have a Time Spent metric. Understanding how long people have spent reading a page is a really valuable thing for some businesses. For publishers, the quality of engagement with content is vital, given they’re effectively selling the attention of their readers.

What many people don’t realise is that the way this metric is calculated is critically flawed on the most popular web analytics tools, Adobe Analytics and Google Analytics. These tools calculate Time Spent by counting the time between pageview events.

How is Time Spent calculated?

Snowplow Analytics time spent metric

In this example we see a user who:

  • Lands on a page and reads it for 30 seconds
  • Moves to another page and reads it for 45 seconds
  • Moves to another page and reads it for 90 seconds
  • Leaves the site

So the actual Time Spent is 2 minutes 45 seconds. However, Google Analytics and Adobe Analytics will allocate zero seconds to the final pageview, because these tools only work from the difference in timestamps between pageview events and there is no pageview event following the final page. So Google Analytics and Adobe Analytics will record 1 minute 15 seconds for this session.

The pathological example of this problem is for a single page session:

General Analytics time spent metric done poorly

In this example, the user enters the site, views the content for 30 seconds and then leaves. Traditional web analytics tools will record zero seconds against this session as there is only the single pageview event.

Many publishers now receive a huge amount of traffic in the form of single-page visits, primarily coming from aggregators and social networks. This means despite the fact your content may be receiving significant attention, your analytics will be showing very low Time Spent and a high bounce rate.

How Snowplow does this

How Snowplow Analytics tracks this

Snowplow’s JavaScript tracker contains an option called Activity Tracking which enables a Page Ping that will be sent every n seconds. In this diagram I’ve used 5 seconds as the ping interval starting 5 seconds after the page loads:

snowplow('enableActivityTracking',5,5);
snowplow('trackPageView');

So once the page loads, a ping is sent every five seconds recording that the page is still open. That gives an accuracy of at least five seconds to calculating the actual time a user spent.
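
For those modelling the data in SQL, here's one minimal way to turn those pings into an approximate time spent per pageview, assuming five-second pings and the standard web page context table; it's a sketch only, and the next post goes through more thorough approaches.

SELECT
  wp.id                                                                AS pageview_id,
  -- elapsed time from the pageview to its last ping, plus one ping interval
  DATEDIFF(second, MIN(ev.derived_tstamp), MAX(ev.derived_tstamp)) + 5 AS approx_time_spent_seconds
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON wp.root_id = ev.event_id
WHERE ev.event IN ('page_view', 'page_ping')
GROUP BY wp.id;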

Other tools

For tools that use the traditional mechanism of measuring Time Spent, there are some workarounds to get better numbers, though none are ideal. The biggest problem is that there is no reliable mechanism to ensure a pixel is sent out when a user leaves your site.


CHARTBEAT

Chartbeat uses a similar approach for data collection, although they attempt to measure actual engagement by monitoring user activity in the window as well. It would be interesting to apply this approach to the Snowplow page ping. Shouldn’t be too hard to update the tracker to support this.


GOOGLE ANALYTICS

  • Riveted plugin, as described here, tracks actual user engagement
  • Simo Ahava, always worth following, has a few different approaches to the problem


ADOBE ANALYTICS

  • Adobe’s calculation method has changed a couple of times, so be sure you understand that.
  • Some have attempted to trap clicks on exit links and the unload event in the browser and then send a custom link to give the tool another event to extend the session for the last pageview, but these methods aren’t reliable across browsers and platforms.
  • I have a conceptual approach of how to do this very accurately in Adobe Analytics without costing enormous amounts. Drop me a note if you’d like to know more.


How to analyse?

Page pings create a large number of event rows inside your Snowplow database, consistent with the way Snowplow does things. That opens up a bunch of different ways for you to calculate the Time Spent metric. In the next blog, Mike will go through different approaches for modelling the page ping data to analyse Time Spent.

Until then you can check out Snowplow’s Measuring content page performance article.

[Read More]

Decoding Snowplow real-time bad rows (Thrift)

Wednesday 14 December, 2016. | By: Mike Robins

In this tutorial we'll look at decoding the bad rows data that comes out of Snowplow real time. In the real time pipeline, bad rows that are inserted into Elasticsearch (and S3) are stored as base64'd, binary-serialized Thrift records. We'll walk step by step through the instructions in Python for how to first decode, and then deserialize, these records.

You can find the code for this tutorial without the accompanying text on Github.

Before we start you’ll need 3 things.

  1. A copy of Python 3
  2. The thriftpy library which can be installed using pip install thriftpy
  3. A local copy of the Snowplow Analytics collector payload thrift file

If you're running real-time you can take a sample payload from your Elasticsearch index (the line property of a record in bad rows) - otherwise we'll use a sample event I've generated below.

import base64
import thriftpy
from thriftpy.protocol import TCyBinaryProtocolFactory
from thriftpy.utils import deserialize, serialize

First up we’ll import the libraries we need. base64 is part of the standard library and the thriftpy imports are from the thriftpy library which has been installed above. Let’s define our example payload below.

sample_payload = "CwFAAAAAAi9pCwBkAAAACTEyNy4wLjAuMQoAyAAAAVjbnjdoC3ppAAAAQWlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLnNub3dwbG93L0NvbGxlY3RvclBheWxvYWQvdGhyaWZ0LzEtMC0wCwFKAAABaHN0bT0xNDgxMTUzMzI5MDAwJmU9cHYmdXJsPWh0dHAlM0ElMkYlMkZzbm93Zmxha2UtYW5hbHl0aWNzLmNvbSZ0dj1qcy0yLjYuMCZ0bmE9anMtMy42LjAmYWlkPXNub3dmbGFrZSZwPXdlYiZ0ej1BdXN0cmFsaWElMkZTeWRuZXkmbGFuZz1lbi1BVSZjcz1VVEYtOCZyZXM9MzYweDY0MCZjZD0zMiZjb29raWU9MSZlaWQ9YzI1OWMyNWUtZjk0Yi00ZDJjLWExMWMtMGQyNzhjMmU2ZDFhJmR0bT0xNDc5OTI3ODU3MjAxJnZwPTB4LTU2JmRzPTIwMHgyNjI5NSZ2aWQ9NCZzaWQ9N2ZiOTdmQzYtNmUwZi00MDIyLWFkYmQtMDE3NDMxNTIwZGRiJmR1aWQ9NGQxMGQzZDAtYzJiNC00NzNlLWE0ODMtODEyNzk5ZTgyNGQxJmZwPTEyOTExMjMzMgsBLAAAAG1Nb3ppbGxhLzUuMCAoV2luZG93cyBOVCAxMC4wOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzU0LjAuMjg0MC43MSBTYWZhcmkvNTM3LjM2CwGQAAAAIWNvbGxlY3Rvci5zbm93Zmxha2UtYW5hbHl0aWNzLmNvbQsA0gAAAAVVVEYtOAsBNgAAAB9odHRwczovL3Nub3dmbGFrZS1hbmFseXRpY3MuY29tCwGaAAAAJDRkMTBkM2QwLWMyYjQtNDczZS1hNDgzLTgxMjc5OWU4MjRkMQsA3AAAABFzc2MtMC43LjAta2luZXNpcw8BXgsAAAALAAAAJ0hvc3Q6IGNvbGxlY3Rvci5zbm93Zmxha2UtYW5hbHl0aWNzLmNvbQAAAB1BY2NlcHQ6IGltYWdlL3dlYnAsICovKjtxPTAuOAAAACRBY2NlcHQtRW5jb2Rpbmc6IGd6aXAsIGRlZmxhdGUsIHNkY2gAAAA3QWNjZXB0LUxhbmd1YWdlOiBlbi1BVSwgZW47cT0wLjgsIGVuLVVTO3E9MC42LCBlbjtxPTAuNAAAABRDb29raWU6IHNwPWFiY2QtMTIzNAAAACdSZWZlcmVyOiBodHRwOi8vc25vd2ZsYWtlLWFuYWx5dGljcy5jb20AAAB6VXNlci1BZ2VudDogIE1vemlsbGEvNS4wIChXaW5kb3dzIE5UIDEwLjA7IFdPVzY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBDaHJvbWUvNTQuMC4yODQwLjcxIFNhZmFyaS81MzcuMzYAAAAaWC1Gb3J3YXJkZWQtRm9yOiAxMjcuMC4wLjEAAAAVWC1Gb3J3YXJkZWQtUG9ydDogNDQzAAAAGFgtRm9yd2FyZGVkLVByb3RvOiBodHRwcwAAABZDb25uZWN0aW9uOiBrZWVwLWFsaXZlAA=="

Now let’s use the b64decode method to decode the payload.

decoded_payload = base64.b64decode(sample_payload)
print(decoded_payload)

b"\x0b\x01@\x00\x00\x00\x02/i\x0b\x00d\x00\x00\x00\t127.0.0.1\n\x00\xc8\x00\x00\x01X\xdb\x9e7h\x0bzi\x00\x00\x00Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0\x0b\x01J\x00\x00\x01hstm=1481153329000&e=pv&url=http%3A%2F%2Fsnowflake-analytics.com&tv=js-2.6.0&tna=js-3.6.0&aid=snowflake&p=web&tz=Australia%2FSydney&lang=en-AU&cs=UTF-8&res=360x640&cd=32&cookie=1&eid=c259c25e-f94b-4d2c-a11c-0d278c2e6d1a&dtm=1479927857201&vp=0x-56&ds=200x26295&vid=4&sid=7fb97fC6-6e0f-4022-adbd-017431520ddb&duid=4d10d3d0-c2b4-473e-a483-812799e824d1&fp=129112332\x0b\x01,\x00\x00\x00mMozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x0b\x01\x90\x00\x00\x00!collector.snowflake-analytics.com\x0b\x00\xd2\x00\x00\x00\x05UTF-8\x0b\x016\x00\x00\x00\x1fhttps://snowflake-analytics.com\x0b\x01\x9a\x00\x00\x00$4d10d3d0-c2b4-473e-a483-812799e824d1\x0b\x00\xdc\x00\x00\x00\x11ssc-0.7.0-kinesis\x0f\x01^\x0b\x00\x00\x00\x0b\x00\x00\x00'Host: collector.snowflake-analytics.com\x00\x00\x00\x1dAccept: image/webp, */*;q=0.8\x00\x00\x00$Accept-Encoding: gzip, deflate, sdch\x00\x00\x007Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4\x00\x00\x00\x14Cookie: sp=abcd-1234\x00\x00\x00'Referer: http://snowflake-analytics.com\x00\x00\x00zUser-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x00\x00\x00\x1aX-Forwarded-For: 127.0.0.1\x00\x00\x00\x15X-Forwarded-Port: 443\x00\x00\x00\x18X-Forwarded-Proto: https\x00\x00\x00\x16Connection: keep-alive\x00"

You’ll see that some fields are recognisable here. We can make out a query string and some headers but there’s a lot of data in between that makes this a bit difficult to read and parse. This is our binary serialized Thrift record that we’ll deserialize next.

collector = thriftpy.load("collector-payload.thrift")
collector_payload = collector.CollectorPayload()
raw_payload = deserialize(collector_payload, decoded_payload, TCyBinaryProtocolFactory())

In the cell above we’re doing quite a bit. First we’re loading the Snowplow provided .thrift file which acts as a specification for both serializing and deserializing the data. Next we’re initialising a structure based on the CollectorPayload struct and finally we are deserializing our base64 decoded object providing the class, base64 decoded payload and protocol factory to deserialize the object.

print(raw_payload)

CollectorPayload(querystring='stm=1481153329000&e=pv&url=http%3A%2F%2Fsnowflake-analytics.com&tv=js-2.6.0&tna=js-3.6.0&aid=snowflake&p=web&tz=Australia%2FSydney&lang=en-AU&cs=UTF-8&res=360x640&cd=32&cookie=1&eid=c259c25e-f94b-4d2c-a11c-0d278c2e6d1a&dtm=1479927857201&vp=0x-56&ds=200x26295&vid=4&sid=7fb97fC6-6e0f-4022-adbd-017431520ddb&duid=4d10d3d0-c2b4-473e-a483-812799e824d1&fp=129112332', collector='ssc-0.7.0-kinesis', ipAddress='127.0.0.1', timestamp=1481153329000, path='/i', contentType=None, refererUri='https://snowflake-analytics.com', schema='iglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0', encoding='UTF-8', hostname='collector.snowflake-analytics.com', userAgent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', body=None, headers=['Host: collector.snowflake-analytics.com', 'Accept: image/webp, */*;q=0.8', 'Accept-Encoding: gzip, deflate, sdch', 'Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4', 'Cookie: sp=abcd-1234', 'Referer: http://snowflake-analytics.com', 'User-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'X-Forwarded-For: 127.0.0.1', 'X-Forwarded-Port: 443', 'X-Forwarded-Proto: https', 'Connection: keep-alive'], networkUserId='4d10d3d0-c2b4-473e-a483-812799e824d1')

Excellent! That’s looking a lot more readable already. If you haven’t already noticed, the properties of our CollectorPayload class match the fields defined in our collector-payload.thrift record. Let’s access some of them directly.

querystring = raw_payload.querystring
network_userid = raw_payload.networkUserId
print(querystring)

stm=1481153329000&e=pv&url=http%3A%2F%2Fsnowflake-analytics.com&tv=js-2.6.0&tna=js-3.6.0&aid=snowflake&p=web&tz=Australia%2FSydney&lang=en-AU&cs=UTF-8&res=360x640&cd=32&cookie=1&eid=c259c25e-f94b-4d2c-a11c-0d278c2e6d1a&dtm=1479927857201&vp=0x-56&ds=200x26295&vid=4&sid=7fb97fC6-6e0f-4022-adbd-017431520ddb&duid=4d10d3d0-c2b4-473e-a483-812799e824d1&fp=129112332

We can now access these properties directly but at this stage we’ve just got the raw querystring - you’ll recognise this as the network request that is sent to the collector when using the Snowplow Javascript library (tna=js-3.6.0). Let’s split this out so we can look at the individual key value pairs that are in the payload.

from urllib import parse
params = parse.parse_qs(querystring)
print(params)

{'aid': ['snowflake'],
 'cd': ['32'],
 'cookie': ['1'],
 'cs': ['UTF-8'],
 'ds': ['200x26295'],
 'dtm': ['1479927857201'],
 'duid': ['4d10d3d0-c2b4-473e-a483-812799e824d1'],
 'e': ['pv'],
 'eid': ['c259c25e-f94b-4d2c-a11c-0d278c2e6d1a'],
 'fp': ['129112332'],
 'lang': ['en-AU'],
 'p': ['web'],
 'res': ['360x640'],
 'sid': ['7fb97fC6-6e0f-4022-adbd-017431520ddb'],
 'stm': ['1481153329000'],
 'tna': ['js-3.6.0'],
 'tv': ['js-2.6.0'],
 'tz': ['Australia/Sydney'],
 'url': ['http://snowflake-analytics.com'],
 'vid': ['4'],
 'vp': ['0x-56']}

If you're familiar with Python you'll note that our parse_qs function returns a dictionary containing values in the form of lists. We can access these reasonably easily - in this case let's extract the app_id for the event.

app_id = params.get('aid')[0]
print(app_id)

snowflake

In the above sample we're using the built-in get method (on a dictionary) to access the aid property rather than app_id; you can find a list of the mappings from the URL parameters to their friendly names here. The [0] above accesses the first element in the list for the aid key. In our example each of these lists only contains one value so we don't need to worry about addressing additional values.

That concludes this tutorial. Please post below if you've got questions or if parts of the post require additional clarification - this is quite an involved process.

In our next tutorial we'll cover how to extend this simple process from decoding a single bad row to decoding multiple bad rows from a run, and looping this data back into Redshift so we can query it!

[Read More]

Monitoring Snowplow bad rows using Lambda and Cloudwatch

Wednesday 19 October, 2016. | By: Mike Robins

In this tutorial we'll use AWS Lambda and Amazon CloudWatch to set up monitoring for the number of bad rows that are inserted into Elasticsearch over a period of time. This allows us to set an alert on a threshold of bad rows, and generate an email or notification when this threshold has been exceeded. Snowplow users on the real-time pipeline will find this most useful; however, users running batch loads can also adapt this monitoring.

We’ll use the following services throughout this tutorial so check first that they are available in your AWS region of choice:

  • Cloudwatch rules
  • Lambda

Below is an overview of what we’ll be setting up.

A Cloudwatch rule will trigger a Lambda function every 5 minutes. This Lambda function will query our Elasticsearch cluster, and return the number of rows in the bad index for the last 5 minutes. Finally, we’ll send this result to our own custom metric to allow for monitoring and alerting using Cloudwatch alarms.

Lambda execution role

First, let’s set up an execution role for our AWS Lambda function. This execution role will grant access to the appropriate resources - namely, the ability to submit a metric to Cloudwatch.

  1. Navigate to IAM in the AWS console.
  2. Select ‘Policy’ and ‘Create Policy’ in this interface.
  3. Select ‘Create Your Own Policy’
  4. Name the policy document and provide a description. Under the ‘Policy document’ we’ll use the following configuration.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        }
    ]
}

Lambda bad rows AWS IAM Policy

This will allow the Lambda function to log to Cloudwatch for debugging, as well as push data for the bad rows metric we will define shortly.

Next, we’ll create a role for the Lambda function to use, and attach this policy.

  1. In IAM select ‘Role’ and ‘Create New Role’.
  2. Under ‘AWS Service Roles’ select ‘AWS Lambda’
  3. Search and select the policy you created above, review the policy and hit ‘Create Role’.
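If you’d rather script these IAM steps than click through the console, a minimal boto3 sketch is below. The policy and role names are placeholders, and it assumes you’ve saved the policy document above locally as policy.json.

import json
import boto3

iam = boto3.client('iam')

# Trust policy letting the Lambda service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

with open('policy.json') as f:  # the policy document shown above, saved locally
    policy_document = f.read()

policy = iam.create_policy(
    PolicyName='snowplow-bad-rows-lambda',  # placeholder name
    PolicyDocument=policy_document)

role = iam.create_role(
    RoleName='snowplow-bad-rows-lambda-role',  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy))

iam.attach_role_policy(
    RoleName='snowplow-bad-rows-lambda-role',
    PolicyArn=policy['Policy']['Arn'])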

Creating our Lambda function to retrieve the number of bad rows

  1. In the AWS Lambda dashboard select ‘Create a Lambda function’.
  2. Select the ‘Blank function’ blueprint, as we’ll be writing our own code for this function.
  3. For the moment, leave triggers blank and hit ‘Next’.
  4. Name your function, e.g., snowplow_bad_rows_count
  5. In this example we will use the ‘Python 2.7’ runtime.
  6. For the function code copy and paste the function below.
import json
import urllib2
import boto3
# import required libraries

def lambda_handler(event, context):
    host = 'your-hostname-here.com' # your elasticsearch host
    port = 9200 # your elasticsearch port
    index_name = 'bad-alias' # your elasticsearch bad index

    url = 'http://{0}:{1}/{2}/_count'.format(host, port, index_name)
    data = '{"query": {"range": {"failure_tstamp": {"gt": "now-5m"}}}}' # query to filter data

    request = urllib2.urlopen(url, data) # connect to Elasticsearch
    result = json.loads(request.read()) # read and decode JSON response
    bad_row_count = result.get('count') # select the count property
    request.close() # close the connection

    client = boto3.client('cloudwatch') # setup a cloudwatch client
    data = [
            {
                'MetricName': 'bad_rows_count',
                'Value': bad_row_count,
                'Unit': 'Count'
            }
            ]
    # prepare a payload to send with the name 'bad_rows_count'
    client.put_metric_data(Namespace='snowplow', MetricData=data) # send the data to Cloudwatch using the 'snowplow' namespace
    return bad_row_count # return the number of bad rows in the last 5 minutes

In the above snippet change the host variable to point to your Elasticsearch cluster. This is likely to be either a load balancer if you’re running Elasticsearch on EC2 instances, or the Elasticsearch endpoint if using AWS Elasticsearch Service.

You should also change port to the appropriate value, which is likely to be either 9200 (the default) or 80 if using Elasticsearch Service or port-forwarding on a load balancer.

Finally, enter the name of the index. If you’re unsure what this is, add /_cat/indices to the end of your Elasticsearch URL to list all indices in your Elasticsearch cluster.
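If you want to sanity-check the index name and the count query from your own machine before wiring up the Lambda function, a quick Python 3 sketch (standard library only; the host, port and ‘bad-alias’ index are the same placeholders as in the function above) might look like this:

import json
from urllib.request import urlopen, Request

host, port = 'your-hostname-here.com', 9200  # placeholders - use your own cluster

# List all indices in the cluster to find the bad index or alias name.
print(urlopen('http://{0}:{1}/_cat/indices'.format(host, port)).read().decode())

# Count bad rows from the last 5 minutes, mirroring the Lambda query.
query = json.dumps({'query': {'range': {'failure_tstamp': {'gt': 'now-5m'}}}}).encode()
req = Request('http://{0}:{1}/bad-alias/_count'.format(host, port),
              data=query, headers={'Content-Type': 'application/json'})
print(json.loads(urlopen(req).read())['count'])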

  1. To modify the period over which Elasticsearch counts bad events, change the 'now-5m' string to the period required, e.g., for 1 hour use 'now-1h'.
  2. To modify the name of the metric, change the 'MetricName' in the data payload
  3. To modify the event namespace, change the Namespace in the put_metric_data call.
  4. The handler field can be left the same.
  5. Under ‘Existing role’ select the role we created in the previous steps.
  6. The memory and timeout of advanced settings can be left the same.
  7. Select an appropriate VPC. If your Elasticsearch cluster runs within a VPC you may need to use that VPC; however, if your cluster is accessible externally, ‘No VPC’ may work.
  8. Review the function and select ‘Create function’.
  9. Let’s quickly test the function to ensure that it is able to connect to Elasticsearch successfully and produces the expected output.
  10. Select ‘Test’ in the interface, and ‘Hello World’ as the sample event template. The input data does not matter, as our function ignores the incoming event.
  11. Select ‘Save and test’
  12. Your function should successfully execute and return the number of bad rows in the Elasticsearch index selected. You can verify this using the Kibana interface if required.
  13. If the function times out, double-check the address and settings for your Elasticsearch cluster; the function may not be able to connect.
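You can also invoke the function from your own machine with boto3 rather than the console test button; a minimal sketch (assuming the function name used above) is:

import json
import boto3

lambda_client = boto3.client('lambda')
# Invoke the function synchronously with an empty test event;
# the payload is ignored by our handler.
response = lambda_client.invoke(
    FunctionName='snowplow_bad_rows_count',  # the name chosen above
    Payload=json.dumps({}).encode())
print(response['Payload'].read())  # the bad row count returned by the handler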

Create our Cloudwatch trigger

  1. In the Cloudwatch interface, select ‘Rules’ under ‘Events’ from the left-hand pane.
  2. Select ‘Create Rule’
  3. For the event source select ‘Schedule’, and for the moment we’ll leave it at the default fixed rate of every 5 minutes. This determines when, and how often, our Lambda function will run and our metric will update (example schedule expressions are shown after this list). More information about the cron syntax AWS uses can be found here.
  4. Under ‘Targets’, select ‘Lambda function’ and the name of the Lambda function you created above.
  5. Select ‘Configure details’
  6. Add a meaningful name and description for this rule and leave the enabled checkbox selected. Select ‘Create Rule’.
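For reference, Cloudwatch Events accepts two schedule expression formats. The default fixed rate used above corresponds to the first expression below; the second is an equivalent cron-style example (AWS cron expressions use six fields and are evaluated in UTC):

rate(5 minutes)
cron(0/5 * * * ? *)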

Check our metric is working in Cloudwatch

  1. In the Cloudwatch interface, scroll to the bottom of the left-hand panel and, under ‘Custom Metrics…’, select ‘snowplow’. Your metric name should appear here, and by ticking its checkbox you should see a data point every 5 minutes. If you’ve only just set this up you may only see one data point.
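If you’d rather check the metric programmatically than through the console, a minimal boto3 sketch (assuming the ‘snowplow’ namespace and ‘bad_rows_count’ metric name used above) looks like this:

import boto3
from datetime import datetime, timedelta

client = boto3.client('cloudwatch')
# Pull the last hour of the custom metric, summed over 5-minute periods.
stats = client.get_metric_statistics(
    Namespace='snowplow',
    MetricName='bad_rows_count',
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Sum'],
)
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Sum'])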

To create an alarm

  1. Select the metric, then click the ‘Create Alarm’ button in the right-hand panel under ‘Actions’.
  2. Set a meaningful name and description for the alarm, and a threshold to trigger at based upon the number of bad rows you expect to have in a 5 minute period. Anything above this threshold will trigger the alarm.
  3. Under Actions, you can select a notification list (if you have an SNS topic set up) or, if you select ‘New list’, enter a comma-delimited list of email addresses. These recipients will be notified when the alarm is triggered based on the criteria you have set.
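If you prefer to script the alarm rather than click through the console, here’s a minimal boto3 sketch; the threshold, alarm name and SNS topic ARN are placeholders you’d replace with your own values.

import boto3

client = boto3.client('cloudwatch')
# Alarm when more than 100 bad rows are recorded in a 5-minute period.
client.put_metric_alarm(
    AlarmName='snowplow-bad-rows-high',  # placeholder name
    AlarmDescription='Bad row count exceeded threshold in the last 5 minutes',
    Namespace='snowplow',
    MetricName='bad_rows_count',
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,  # placeholder threshold
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:ap-southeast-2:123456789012:bad-rows-alerts'],  # placeholder SNS topic
)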

Cost

Running this monitoring will cost approximately $0.69 a month, depending on your usage, frequency and region. For many users some of this cost may be covered by the AWS free tier.

  • $0.50 per custom metric
  • $0.10 per alarm
  • ~$0.09 for Cloudwatch PutMetricData (9000 requests/month, assuming rule triggers every 5 minutes)

If you have any questions or comments feel free to leave them in the comment section below, or in the Snowplow discourse forum.

