A new compression option in Redshift allows you to make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to put it to work on your own cluster.
In late 2016 Facebook open sourced a compression algorithm known as Zstandard (ZSTD), which combines LZ77-style matching with tANS entropy coding to achieve better compression ratios than many common algorithms with only a slight trade-off in speed.
In January 2017 AWS added ZSTD support to Redshift, so you can now use this powerful compression algorithm to reduce the storage footprint of your data. One immense perk of ZSTD is that it can be applied to all supported data types. It performs exceptionally well on long VARCHAR columns, which makes it perfectly suited to the large JSON strings that regularly appear in shredded tables.
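By way of illustration, ZSTD is set per column in the table DDL like any other encoding. The table below is a made-up example rather than part of the standard Snowplow schema, and simply shows the encoding applied across a few different data types:

CREATE TABLE atomic.example_zstd_demo (
    event_id         CHAR(36)        ENCODE ZSTD,
    collector_tstamp TIMESTAMP       ENCODE RAW,   -- sort keys are generally left unencoded
    page_urlpath     VARCHAR(3000)   ENCODE ZSTD,
    contexts         VARCHAR(65535)  ENCODE ZSTD   -- large JSON strings compress especially well
)
SORTKEY (collector_tstamp);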
We’ve had great success both experimentally and in practice by applying ZSTD compression to multiple Snowplow tables. Experimentally, across datasets ranging from 10 million to 5 billion rows, we’ve achieved a mean compression ratio of ~3, meaning the newly compressed table takes up approximately a third of the disk space of the original table when compared to the compression defaults in atomic.events 0.8.0.
In particular we’ve found:

- atomic.events shrinks to roughly a third of its previous size
- query performance is essentially unchanged (note this will be highly workload dependent so benchmarks here are less useful)
- the sort key columns are best left with their default RAW encoding (collector_tstamp and root_tstamp).

If you’d like to apply the same change to your own tables, this is the process we use (a sketch of the full sequence in SQL appears after this list):

1. Identify which tables are taking up the most space on disk, for example with the svv_table_info query below1, and start with the largest.
2. Run ANALYZE COMPRESSION on your table which will provide some recommendations by Redshift as to what the suggested column encodings are. One caveat of this approach is that you are limited to sampling 1 billion rows so if possible choose a sample data set that contains representative variability within columns.
3. Create a new table with the recommended encodings and perform a deep copy of the data into it, followed by a VACUUM. If performing a deep copy ensure you have sufficient disk space to complete the action: it’s difficult to know how much space is required but we opt for at least 50% of the cluster storage remaining. You may need to consider resizing the cluster and/or temporarily pausing your pipeline to complete this action.
4. Drop the old table with DROP CASCADE, or individually drop any dependencies that may rely on this table such as views or foreign key constraints from shredded or derived tables2 before dropping it.
5. Optionally, run ANALYZE COMPRESSION against the new table to check that no further encoding changes are suggested.
6. ALTER the new table to atomic.events3. Ensure that this new table has an identical owner.

There are a few gotchas and edge cases involved with doing this. If you’d like help or advice on how to do it feel free to post in the comments below or get in touch.
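To make the steps concrete, here’s a minimal sketch of the whole sequence against a single table. The table name, columns and encodings below are made up for illustration (a cut-down shredded-style table), not a real Snowplow definition; in practice you’d take the column list from your existing DDL and the encodings from the ANALYZE COMPRESSION output, keeping the sort key as RAW.

-- Step 2: ask Redshift for suggested encodings (samples up to 1 billion rows).
ANALYZE COMPRESSION atomic.com_example_link_click_1;

-- Step 3: create a replacement table with the suggested encodings applied.
CREATE TABLE atomic.com_example_link_click_1_new (
    root_id     CHAR(36)       ENCODE ZSTD,
    root_tstamp TIMESTAMP      ENCODE RAW,   -- sort key stays unencoded
    element_id  VARCHAR(255)   ENCODE ZSTD,
    target_url  VARCHAR(4096)  ENCODE ZSTD   -- long varchars benefit the most
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);

-- Deep copy the data across, then re-sort and reclaim space.
INSERT INTO atomic.com_example_link_click_1_new
    (SELECT * FROM atomic.com_example_link_click_1);
VACUUM atomic.com_example_link_click_1_new;

-- Step 4: remove the old table along with any views or foreign keys that depend on it.
DROP TABLE atomic.com_example_link_click_1 CASCADE;

-- Step 6: rename the replacement into place (footnote 3 shows the same for atomic.events).
ALTER TABLE atomic.com_example_link_click_1_new RENAME TO com_example_link_click_1;

We use a deep copy rather than altering the table in place because, at the time of writing, Redshift has no way to change an existing column’s encoding.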
1
SELECT
"database",
"table",
diststyle,
sortkey1,
sortkey1_enc,
size,
pct_used,
unsorted,
stats_off,
tbl_rows
FROM
svv_table_info
WHERE
schema = 'atomic'
ORDER BY
size DESC;
2
SELECT DISTINCT c_p.oid AS tbloid
,n_p.nspname AS schemaname
,c_p.relname AS NAME
,n_c.nspname AS refbyschemaname
,c_c.relname AS refbyname
,c_c.oid AS viewoid
FROM pg_class c_p
JOIN pg_depend d_p ON c_p.relfilenode = d_p.refobjid
JOIN pg_depend d_c ON d_p.objid = d_c.objid
JOIN pg_class c_c ON d_c.refobjid = c_c.relfilenode
LEFT JOIN pg_namespace n_p ON c_p.relnamespace = n_p.oid
LEFT JOIN pg_namespace n_c ON c_c.relnamespace = n_c.oid
WHERE
n_p.nspname = 'atomic';
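If you’d rather avoid DROP CASCADE, the dependent views this query returns, plus any foreign key constraints pointing back at the table from shredded or derived tables, can be dropped one at a time. The object and constraint names here are placeholders for whatever exists in your cluster:

DROP VIEW derived.sessions;                                  -- a dependent view reported by the query above
ALTER TABLE atomic.com_example_link_click_1
    DROP CONSTRAINT com_example_link_click_1_root_id_fkey;   -- a foreign key back to atomic.events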
3
ALTER TABLE atomic.events_new RENAME TO events;