28th February 2022
AZ64
encoding is the Gorilla compression method from
Facebook (2015), but with the 1-bit encoding for repeating values
removed and replaced by no less than three distinct runlength-like
compression methods, all of which seem completely crazy. With repeating
values, when the number of repetitions is 2 to 63, no runlength encoding
occurs and the full Gorilla encoding is used for each row, a lost
compression opportunity; when the number of repetitions is a positive integer power of 2 between 64 and 16384 (inclusive both ends), each group of repeating values is encoded into a “storage unit” of between 5 and 33 bytes (inclusive both ends), depending on data type, which stores a copy of the value and a 15 bit counter, allowing for billions of rows per block, where I think (but have not yet investigated) materialization is block based, and so needing to materialize any row in such a block means all rows must be materialized; and, finally, when the number of repetitions is 65 or
more, but not a positive integer power of 2, the number of rows stored
per block varies depending on the value being encoded, the number of
repetitions and the bit-pattern of the number of repetitions (!), as
Gorilla-like encoding is being used on that bit-pattern, such that
increasing the number of repetitions often decreases the number
of rows stored in a block (as the bit-pattern of the number of
repetitions has become less favourable to Gorilla encoding).
Redshift is a column-store relational database, which means that each column in a table is stored contiguously.
It is often the case that the data in a single column has similar characteristics - for example, the values might be all names, or ages, or integer values within a given range; in other words, data which is far from randomly distributed across the value range for the data type of the column.
This provides an opportunity for unusually effective data compression, as well as an opportunity to blunder terribly, as Redshift offers a range of data compression methods (known in the Redshift documentation as “encodings”), most of which work spectacularly well with and only with data which expresses the characteristics necessary for that data compression method, and which often work spectacularly badly with data lacking those characteristics.
It is then necessary to understand the type of data characteristics suitable for each of the data compression methods offered by Redshift, as well of course as the behaviours and limitations of the data compression methods, so the good choices can be made when selecting data compression for columns.
This document is one in a series, each of which examines one of the data compression methods offered by Redshift; this document investigates the AZ64 encoding.
The basic test method used is to create a test table, with a single AZ64 encoded column, such that all rows are on a single slice, and then populate the test table with rows, issuing VACUUM between INSERTs, until more than one block has been filled. This
allows us to see how many rows were needed to fill one block, the first
block.
I can then vary exactly what data is being inserted, to try and figure out how the encoding method is working, by seeing how changing the data affects the number of rows which fit into the first block.
So, for example, the test might be all rows with the same value, or alternating rows with two different values, or ascending values, and so on.
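For concreteness, here is a minimal Python sketch of that block-fill loop, assuming psycopg2 and illustrative table/column names; this is not the actual test script, which among other things arranges for all rows to land on a single slice, something this sketch omits.

```python
# A minimal sketch of the block-fill test loop described above. Table and
# column names are illustrative; the real test script differs in detail.

import itertools
import psycopg2

def rows_in_first_block(conn, data_type, values, batch_size=100_000):
    conn.autocommit = True                      # VACUUM cannot run in a transaction
    cur = conn.cursor()
    cur.execute("drop table if exists az64_test;")
    cur.execute(f"create table az64_test (c {data_type} encode az64);")
    cycle = itertools.cycle(values)
    while True:
        rows = ",".join(f"({next(cycle)})" for _ in range(batch_size))
        cur.execute(f"insert into az64_test (c) values {rows};")
        cur.execute("vacuum az64_test;")        # VACUUM between INSERTs, as in the text
        cur.execute("""select num_values
                         from stv_blocklist
                        where tbl = (select distinct id from stv_tbl_perm
                                      where name = 'az64_test')
                          and col = 0
                        order by blocknum;""")
        blocks = [row[0] for row in cur.fetchall()]
        if len(blocks) > 1:                     # more than one block now filled
            return blocks[0]                    # rows needed to fill the first block

# e.g. rows_in_first_block(conn, "int2", [0, 1]) for alternating 0/1 values
```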
The results are given here for ease of reference, but they are primarily presented, piece by piece along with explanation, in the Discussion.
See Appendix A for the Python
pprint
dump of the results dictionary.
The script used to generate these results is designed for readers to use, and is available here.
Test duration, excluding server bring-up and shut-down, was 9736 seconds.
Datum | Value |
---|---|
First value | 0 |
First value (pattern) | 0 |
Second value | 1 |
Second value (pattern) | 1 |
XOR | 1 |
Stored bit pattern | 1 |
Stored length in bits | 1 |
Overhead in bits | 0.5 |
Bits per Block | 8388608 |
Estimated rows per block | 5592405.33 |
Actual rows per block | 5591809 |
Datum | Value |
---|---|
First value | 51 |
First value (pattern) | 110011 |
Second value | 60 |
Second value (pattern) | 111100 |
XOR | 001111 |
Stored bit pattern | 1111 |
Stored length in bits | 4 |
Overhead in bits | 0.5 |
Bits per Block | 8388608 |
Estimated rows per block | 1864135.11 |
Actual rows per block | 1863937 |
Datum | Value |
---|---|
First value | -32768 |
First value (pattern) | 1000000000000000 |
Second value | -1 |
Second value (pattern) | 0000000000000001 |
XOR | 0111111111111111 |
Stored bit pattern | 111111111111111 |
Stored length in bits | 15 |
Overhead in bits | 0.5 |
Bits per Block | 8388608 |
Estimated rows per block | 541200.52 |
Actual rows per block | 541121 |
Datum | Value |
---|---|
First value | -32768 |
First value (pattern) | 1000000000000000 |
Second value | 0 |
Second value (pattern) | 0000000000000000 |
XOR | 1000000000000000 |
Stored bit pattern | 1 |
Stored length in bits | 1 |
Overhead in bits | 0.5 |
Bits per Block | 8388608 |
Estimated rows per block | 5592405.33 |
Actual rows per block | 5591809 |
Datum | Value |
---|---|
First value | 0 |
First value (pattern) | 0 |
Second value | 1 |
Second value (pattern) | 1 |
XOR | 1 |
Stored bit pattern | 1 |
Stored length in bits | 1 |
Overhead in bits | 1.0 |
Bits per Block | 8388608 |
Estimated rows per block | 4194304.00 |
Actual rows per block | 4193856 |
Datum | Value |
---|---|
First value | 51 |
First value (pattern) | 110011 |
Second value | 60 |
Second value (pattern) | 111100 |
XOR | 001111 |
Stored bit pattern | 1111 |
Stored length in bits | 4 |
Overhead in bits | 1.0 |
Bits per Block | 8388608 |
Estimated rows per block | 1677721.60 |
Actual rows per block | 1677505 |
Datum | Value |
---|---|
First value | -2147483648 |
First value (pattern) | 10000000000000000000000000000000 |
Second value | -1 |
Second value (pattern) | 00000000000000000000000000000001 |
XOR | 01111111111111111111111111111111 |
Stored bit pattern | 1111111111111111111111111111111 |
Stored length in bits | 31 |
Overhead in bits | 1.0 |
Bits per Block | 8388608 |
Estimated rows per block | 262144.00 |
Actual rows per block | 262081 |
Datum | Value |
---|---|
First value | -2147483648 |
First value (pattern) | 10000000000000000000000000000000 |
Second value | 0 |
Second value (pattern) | 00000000000000000000000000000000 |
XOR | 10000000000000000000000000000000 |
Stored bit pattern | 1 |
Stored length in bits | 1 |
Overhead in bits | 1.0 |
Bits per Block | 8388608 |
Estimated rows per block | 4194304.00 |
Actual rows per block | 4193856 |
Datum | Value |
---|---|
First value | 0 |
First value (pattern) | 0 |
Second value | 1 |
Second value (pattern) | 1 |
XOR | 1 |
Stored bit pattern | 1 |
Stored length in bits | 1 |
Overhead in bits | 2.0 |
Bits per Block | 8388608 |
Estimated rows per block | 2796202.67 |
Actual rows per block | 2795904 |
Datum | Value |
---|---|
First value | 51 |
First value (pattern) | 110011 |
Second value | 60 |
Second value (pattern) | 111100 |
XOR | 001111 |
Stored bit pattern | 1111 |
Stored length in bits | 4 |
Overhead in bits | 2.0 |
Bits per Block | 8388608 |
Estimated rows per block | 1398101.33 |
Actual rows per block | 1397952 |
Datum | Value |
---|---|
First value | -9223372036854775808 |
First value (pattern) | 1000000000000000000000000000000000000000000000000000000000000000 |
Second value | -1 |
Second value (pattern) | 0000000000000000000000000000000000000000000000000000000000000001 |
XOR | 0111111111111111111111111111111111111111111111111111111111111111 |
Stored bit pattern | 111111111111111111111111111111111111111111111111111111111111111 |
Stored length in bits | 63 |
Overhead in bits | 2.0 |
Bits per Block | 8388608 |
Estimated rows per block | 129055.51 |
Actual rows per block | 129025 |
Datum | Value |
---|---|
First value | -9223372036854775808 |
First value (pattern) | 1000000000000000000000000000000000000000000000000000000000000000 |
Second value | 0 |
Second value (pattern) | 0000000000000000000000000000000000000000000000000000000000000000 |
XOR | 1000000000000000000000000000000000000000000000000000000000000000 |
Stored bit pattern | 1 |
Stored length in bits | 1 |
Overhead in bits | 2.0 |
Bits per Block | 8388608 |
Estimated rows per block | 2796202.67 |
Actual rows per block | 2795904 |
Datum | Value |
---|---|
First value | 0 |
First value (pattern) | 0 |
Second value | 1 |
Second value (pattern) | 1 |
XOR | 1 |
Stored bit pattern | 1 |
Stored length in bits | 1 |
Overhead in bits | 4.0 |
Bits per Block | 8388608 |
Estimated rows per block | 1677721.60 |
Actual rows per block | 1677504 |
Datum | Value |
---|---|
First value | 51 |
First value (pattern) | 110011 |
Second value | 60 |
Second value (pattern) | 111100 |
XOR | 001111 |
Stored bit pattern | 1111 |
Stored length in bits | 4 |
Overhead in bits | 4.0 |
Bits per Block | 8388608 |
Estimated rows per block | 1048576.00 |
Actual rows per block | 1048448 |
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
int2 | 0/1 | 2 | 5591810 |
int2 | 0/1 | 3 | 5591811 |
int2 | 0/1 | 4 | 5591812 |
int2 | 0/1 | 62 | 5591842 |
int2 | 0/1 | 63 | 5591817 |
int2 | 0/1 | 64 | 13420416 |
int2 | 0/1 | 65 | 5694016 |
int2 | 65/119 | 2 | 1524994 |
int2 | 65/119 | 3 | 1524993 |
int2 | 65/119 | 4 | 1524996 |
int2 | 65/119 | 62 | 1525014 |
int2 | 65/119 | 63 | 1525041 |
int2 | 65/119 | 64 | 13420416 |
int2 | 65/119 | 65 | 1567800 |
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
int4 | 0/1 | 2 | 4193856 |
int4 | 0/1 | 3 | 4193856 |
int4 | 0/1 | 4 | 4193856 |
int4 | 0/1 | 62 | 4193856 |
int4 | 0/1 | 63 | 4193856 |
int4 | 0/1 | 64 | 7455744 |
int4 | 0/1 | 65 | 4251072 |
int4 | 65/119 | 2 | 1397952 |
int4 | 65/119 | 3 | 1397952 |
int4 | 65/119 | 4 | 1397952 |
int4 | 65/119 | 62 | 1397952 |
int4 | 65/119 | 63 | 1397952 |
int4 | 65/119 | 64 | 7455744 |
int4 | 65/119 | 65 | 1433770 |
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
int8 | 0/1 | 2 | 2795904 |
int8 | 0/1 | 3 | 2795904 |
int8 | 0/1 | 4 | 2795904 |
int8 | 0/1 | 62 | 2795904 |
int8 | 0/1 | 63 | 2795904 |
int8 | 0/1 | 64 | 3947136 |
int8 | 0/1 | 65 | 2821248 |
int8 | 65/119 | 2 | 1198210 |
int8 | 65/119 | 3 | 1198209 |
int8 | 65/119 | 4 | 1198212 |
int8 | 65/119 | 62 | 1198212 |
int8 | 65/119 | 63 | 1198260 |
int8 | 65/119 | 64 | 3947136 |
int8 | 65/119 | 65 | 1224470 |
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
numeric(38,0) | 0/1 | 2 | 1677504 |
numeric(38,0) | 0/1 | 3 | 1677504 |
numeric(38,0) | 0/1 | 4 | 1677504 |
numeric(38,0) | 0/1 | 62 | 1677504 |
numeric(38,0) | 0/1 | 63 | 1677504 |
numeric(38,0) | 0/1 | 64 | 2033344 |
numeric(38,0) | 0/1 | 65 | 1686592 |
numeric(38,0) | 65/119 | 2 | 931968 |
numeric(38,0) | 65/119 | 3 | 931968 |
numeric(38,0) | 65/119 | 4 | 931968 |
numeric(38,0) | 65/119 | 62 | 931968 |
numeric(38,0) | 65/119 | 63 | 931968 |
numeric(38,0) | 65/119 | 64 | 2033344 |
numeric(38,0) | 65/119 | 65 | 947765 |
Data Type | Data Type Size (bytes) | Storage Unit Size (bytes) | Storage Units per Block |
---|---|---|---|
int2 | 2 | 5 | 209715 |
int4 | 4 | 9 | 116508 |
int8 | 8 | 17 | 61680 |
numeric(38,0) | 16 | 33 | 31775 |
(Note the negative value of -859340800 is due to the Redshift
int4
system table column recording the number of rows per
block overflowing.)
Data Type | Values | # Repetitions | Storage Units per Block | Estimated Rows per Block | Actual Rows per Block |
---|---|---|---|---|---|
int2 | 0/1 | 64 | 209715 | 13421772 | 13420416 |
int2 | 0/1 | 128 | 209715 | 26843545 | 26840832 |
int2 | 0/1 | 256 | 209715 | 53687091 | 53681664 |
int2 | 0/1 | 512 | 209715 | 107374182 | 107363328 |
int2 | 0/1 | 16384 | 209715 | 3435973836 | -859340800 |
int2 | 65/119 | 64 | 209715 | 13421772 | 13420416 |
int2 | 65/119 | 128 | 209715 | 26843545 | 26840832 |
int2 | 65/119 | 256 | 209715 | 53687091 | 53681664 |
int2 | 65/119 | 512 | 209715 | 107374182 | 107363328 |
int2 | 65/119 | 16384 | 209715 | 3435973836 | -859340800 |
int4 | 0/1 | 64 | 116508 | 7456540 | 7455744 |
int4 | 0/1 | 128 | 116508 | 14913080 | 14911488 |
int4 | 0/1 | 256 | 116508 | 29826161 | 29822976 |
int4 | 0/1 | 512 | 116508 | 59652323 | 59645952 |
int4 | 0/1 | 16384 | 116508 | 1908874353 | 1908670464 |
int4 | 65/119 | 64 | 116508 | 7456540 | 7455744 |
int4 | 65/119 | 128 | 116508 | 14913080 | 14911488 |
int4 | 65/119 | 256 | 116508 | 29826161 | 29822976 |
int4 | 65/119 | 512 | 116508 | 59652323 | 59645952 |
int4 | 65/119 | 16384 | 116508 | 1908874353 | 1908670464 |
int8 | 0/1 | 64 | 61680 | 3947580 | 3947136 |
int8 | 0/1 | 128 | 61680 | 7895160 | 7894272 |
int8 | 0/1 | 256 | 61680 | 15790320 | 15788544 |
int8 | 0/1 | 512 | 61680 | 31580641 | 31577088 |
int8 | 0/1 | 16384 | 61680 | 1010580540 | 1010466816 |
int8 | 65/119 | 64 | 61680 | 3947580 | 3947136 |
int8 | 65/119 | 128 | 61680 | 7895160 | 7894272 |
int8 | 65/119 | 256 | 61680 | 15790320 | 15788544 |
int8 | 65/119 | 512 | 61680 | 31580641 | 31577088 |
int8 | 65/119 | 16384 | 61680 | 1010580540 | 1010466816 |
numeric(38,0) | 0/1 | 64 | 31775 | 2033601 | 2033344 |
numeric(38,0) | 0/1 | 128 | 31775 | 4067203 | 4066688 |
numeric(38,0) | 0/1 | 256 | 31775 | 8134407 | 8133376 |
numeric(38,0) | 0/1 | 512 | 31775 | 16268815 | 16266752 |
numeric(38,0) | 0/1 | 16384 | 31775 | 520602096 | 520536064 |
numeric(38,0) | 65/119 | 64 | 31775 | 2033601 | 2033344 |
numeric(38,0) | 65/119 | 128 | 31775 | 4067203 | 4066688 |
numeric(38,0) | 65/119 | 256 | 31775 | 8134407 | 8133376 |
numeric(38,0) | 65/119 | 512 | 31775 | 16268815 | 16266752 |
numeric(38,0) | 65/119 | 16384 | 31775 | 520602096 | 520536064 |
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
int2 | 0/1 | 64 (0b1000000) | 13420416 |
int2 | 0/1 | 65 (0b1000001) | 5694016 |
int2 | 0/1 | 66 (0b1000010) | 5796780 |
int2 | 0/1 | 67 (0b1000011) | 5793624 |
int2 | 0/1 | 68 (0b1000100) | 6003856 |
int2 | 0/1 | 69 (0b1000101) | 5890624 |
int2 | 0/1 | 70 (0b1000110) | 5991232 |
int2 | 0/1 | 71 (0b1000111) | 5985229 |
int2 | 0/1 | 72 (0b1001000) | 6424640 |
int4 | 0/1 | 64 (0b1000000) | 7455744 |
int4 | 0/1 | 65 (0b1000001) | 4251072 |
int4 | 0/1 | 66 (0b1000010) | 4308096 |
int4 | 0/1 | 67 (0b1000011) | 4306358 |
int4 | 0/1 | 68 (0b1000100) | 4421440 |
int4 | 0/1 | 69 (0b1000101) | 4359744 |
int4 | 0/1 | 70 (0b1000110) | 4414592 |
int4 | 0/1 | 71 (0b1000111) | 4411328 |
int4 | 0/1 | 72 (0b1001000) | 4645512 |
int4 | 0/1 | 90 (0b1011010) | 4854656 |
int4 | 0/1 | 91 (0b1011011) | 4846296 |
int4 | 0/1 | 92 (0b1011100) | 4946624 |
int4 | 0/1 | 93 (0b1011101) | 4883008 |
int4 | 0/1 | 94 (0b1011110) | 4927808 |
int4 | 0/1 | 95 (0b1011111) | 4918720 |
int4 | 0/1 | 96 (0b1100000) | 5920800 |
int4 | 0/1 | 97 (0b1100001) | 4953499 |
int4 | 0/1 | 98 (0b1100010) | 4996928 |
int4 | 0/1 | 99 (0b1100011) | 4987323 |
int4 | 0/1 | 100 (0b1100100) | 5083500 |
int4 | 0/1 | 101 (0b1100101) | 5020224 |
int4 | 0/1 | 102 (0b1100110) | 5062400 |
int4 | 0/1 | 103 (0b1100111) | 5052253 |
int4 | 0/1 | 104 (0b1101000) | 5255016 |
int4 | 0/1 | 105 (0b1101001) | 5083470 |
int4 | 0/1 | 106 (0b1101010) | 5124480 |
int4 | 0/1 | 107 (0b1101011) | 5113856 |
int4 | 0/1 | 108 (0b1101100) | 5206144 |
int4 | 0/1 | 109 (0b1101101) | 5143552 |
int4 | 0/1 | 110 (0b1101110) | 5183424 |
int4 | 0/1 | 111 (0b1101111) | 5172489 |
int4 | 0/1 | 112 (0b1110000) | 5591824 |
int4 | 0/1 | 113 (0b1110001) | 5200640 |
int4 | 0/1 | 114 (0b1110010) | 5239488 |
int4 | 0/1 | 115 (0b1110011) | 5228130 |
int4 | 0/1 | 116 (0b1110100) | 5316800 |
int4 | 0/1 | 117 (0b1110101) | 5254976 |
int4 | 0/1 | 118 (0b1110110) | 5292800 |
int4 | 0/1 | 119 (0b1110111) | 5281152 |
int4 | 0/1 | 120 (0b1111000) | 5470320 |
int4 | 0/1 | 121 (0b1111001) | 5306752 |
int4 | 0/1 | 122 (0b1111010) | 5343616 |
int4 | 0/1 | 123 (0b1111011) | 5331804 |
int4 | 0/1 | 124 (0b1111100) | 5417088 |
int4 | 0/1 | 125 (0b1111101) | 5356160 |
int4 | 0/1 | 126 (0b1111110) | 5392170 |
int4 | 0/1 | 127 (0b1111111) | 5379974 |
int4 | 0/1 | 128 (0b10000000) | 14911488 |
int4 | 0/1 | 129 (0b10000001) | 5464698 |
int4 | 0/1 | 130 (0b10000010) | 5563350 |
int4 | 0/1 | 131 (0b10000011) | 5549422 |
int4 | 0/1 | 132 (0b10000100) | 5766592 |
int4 | 0/1 | 133 (0b10000101) | 5634146 |
int4 | 0/1 | 134 (0b10000110) | 5734530 |
int4 | 0/1 | 135 (0b10000111) | 5718870 |
int4 | 0/1 | 136 (0b10001000) | 6199696 |
int4 | 0/1 | 137 (0b10001001) | 5803594 |
int4 | 0/1 | 138 (0b10001010) | 5905710 |
int4 | 0/1 | 139 (0b10001011) | 5888318 |
int4 | 0/1 | 140 (0b10001100) | 6116096 |
int4 | 0/1 | 190 (0b10111110) | 8131050 |
int4 | 0/1 | 191 (0b10111111) | 8091142 |
int4 | 0/1 | 192 (0b11000000) | 22367232 |
int4 | 0/1 | 193 (0b11000001) | 8175866 |
int4 | 0/1 | 194 (0b11000010) | 8302230 |
int4 | 65/119 | 64 (0b1000000) | 7455744 |
int4 | 65/119 | 65 (0b1000001) | 1433770 |
int4 | 65/119 | 66 (0b1000010) | 1470348 |
int4 | 65/119 | 67 (0b1000011) | 1469243 |
int4 | 65/119 | 68 (0b1000100) | 1545708 |
int4 | 65/119 | 69 (0b1000101) | 1504200 |
int4 | 65/119 | 70 (0b1000110) | 1541050 |
int4 | 65/119 | 71 (0b1000111) | 1538854 |
int4 | 65/119 | 72 (0b1001000) | 1705968 |
int4 | 65/119 | 90 (0b1011010) | 1870848 |
int4 | 65/119 | 91 (0b1011011) | 1863953 |
int4 | 65/119 | 92 (0b1011100) | 1948652 |
int4 | 65/119 | 93 (0b1011101) | 1894503 |
int4 | 65/119 | 94 (0b1011110) | 1932452 |
int4 | 65/119 | 95 (0b1011111) | 1924700 |
int4 | 65/119 | 96 (0b1100000) | 3050048 |
int4 | 65/119 | 97 (0b1100001) | 1954560 |
int4 | 65/119 | 98 (0b1100010) | 1992732 |
int4 | 65/119 | 99 (0b1100011) | 1984257 |
int4 | 65/119 | 100 (0b1100100) | 2071040 |
int4 | 65/119 | 101 (0b1100101) | 2013435 |
int4 | 65/119 | 102 (0b1100110) | 2051648 |
int4 | 65/119 | 103 (0b1100111) | 2042387 |
int4 | 65/119 | 104 (0b1101000) | 2236728 |
int4 | 65/119 | 105 (0b1101001) | 2071040 |
int4 | 65/119 | 106 (0b1101010) | 2109312 |
int4 | 65/119 | 107 (0b1101011) | 2099340 |
int4 | 65/119 | 108 (0b1101100) | 2188096 |
int4 | 65/119 | 109 (0b1101101) | 2127360 |
int4 | 65/119 | 110 (0b1101110) | 2165900 |
int4 | 65/119 | 111 (0b1101111) | 2155176 |
int4 | 65/119 | 112 (0b1110000) | 2609488 |
int4 | 65/119 | 113 (0b1110001) | 2182595 |
int4 | 65/119 | 114 (0b1110010) | 2221120 |
int4 | 65/119 | 115 (0b1110011) | 2209840 |
int4 | 65/119 | 116 (0b1110100) | 2300164 |
int4 | 65/119 | 117 (0b1110101) | 2236689 |
int4 | 65/119 | 118 (0b1110110) | 2275276 |
int4 | 65/119 | 119 (0b1110111) | 2263380 |
int4 | 65/119 | 120 (0b1111000) | 2466960 |
int4 | 65/119 | 121 (0b1111001) | 2289683 |
int4 | 65/119 | 122 (0b1111010) | 2328370 |
int4 | 65/119 | 123 (0b1111011) | 2315844 |
int4 | 65/119 | 124 (0b1111100) | 2407584 |
int4 | 65/119 | 125 (0b1111101) | 2341625 |
int4 | 65/119 | 126 (0b1111110) | 2380288 |
int4 | 65/119 | 127 (0b1111111) | 2367280 |
int4 | 65/119 | 128 (0b10000000) | 14911488 |
int4 | 65/119 | 129 (0b10000001) | 2404560 |
int4 | 65/119 | 130 (0b10000010) | 2455872 |
int4 | 65/119 | 131 (0b10000011) | 2441840 |
int4 | 65/119 | 132 (0b10000100) | 2562912 |
int4 | 65/119 | 133 (0b10000101) | 2479120 |
int4 | 65/119 | 134 (0b10000110) | 2531456 |
int4 | 65/119 | 135 (0b10000111) | 2516400 |
int4 | 65/119 | 136 (0b10001000) | 2795888 |
int4 | 65/119 | 137 (0b10001001) | 2553680 |
int4 | 65/119 | 138 (0b10001010) | 2606976 |
int4 | 65/119 | 139 (0b10001011) | 2590960 |
int4 | 65/119 | 140 (0b10001100) | 2718240 |
int4 | 65/119 | 190 (0b10111110) | 3589312 |
int4 | 65/119 | 191 (0b10111111) | 3560240 |
int4 | 65/119 | 192 (0b11000000) | 22367232 |
int4 | 65/119 | 193 (0b11000001) | 3597520 |
int4 | 65/119 | 194 (0b11000010) | 3664896 |
There are two major points to begin with.
The first major point is that Amazon have published very nearly no
information about the AZ64
encoder.
This is problematic.
Usually, a given encoder works staggeringly well when and only when used with data which is just right for that encoder - data which expresses characteristics which are appropriate for the way in which the encoder works.
Similarly, and critically, when a given encoder is used with data which is not appropriate for the encoder, often you find the results are as staggeringly bad as they would be good if the data had been just right.
So, for example, a runlength encoder, which stores a value and then a count of how many times that value then occurs, works staggeringly well when the values being encoded are repetitions of the same value; and staggeringly badly when this is not so.
In all cases, to select the appropriate encoder for the data in hand, it is absolutely necessary to know how the encoders work - and so what data is just right for them - and to know what data you have in hand. There’s no getting around this.
An encoder which has no documentation describing how it works, or at
least the characteristics of data it works well with, is absolutely
and totally useless in every single possible way, and cannot be
used under any circumstances whatsoever, and this is the case with
AZ64
.
As far as I can tell, the only information about AZ64 is found in an AWS blog post.
(There’s also the official doc page, but it’s completely devoid of content and as such not worth even linking to, because there’s absolutely no information in it.)
That blog post compares AZ64
to LZO
and
ZSTD
, and makes claims for AZ64
compressing
much more, and much more quickly.
As we will see, however, AZ64 is a fundamentally different type of compression method to LZO and ZSTD, and as such the data it works well with, and the data it works badly with, is fundamentally different to the data for LZO and ZSTD (which, although different methods, are fundamentally similar).
This difference means comparing AZ64
to LZO
and ZSTD
is absolutely incorrect.
It’s akin to comparing runlength encoding to delta encoding; you simply cannot compare them direct, without explanation, because they’re utterly different to each other and are used in totally different situations and with totally different data.
Additionally, I have unpublished benchmarks which utterly contradict
the performance claims made; I find AZ64
about the same, or
very slightly faster, than all the other encoders; the claims of 40% or
70% faster are, as far as I’ve found, completely and utterly
incorrect.
I regard that blog post as one of the very worst ever published by Amazon.
I suspect the unidentified chap who wrote the blog post to be the usual chap who does the docs; and although I could be wrong, I think from all I’ve seen over the years that the way it works is that someone gives him some information, which he doesn’t really understand, he writes it up, and no one technical ever reviews what is written.
The second major point is that I have spent a lot of time
trying to figure out how AZ64
works over the last four
months, but I have been only partially successful.
I believe I have discerned the core encoding methods in use, and
there are four of them, but the problem is, although I may be wrong, as
far as I can tell AZ64
looks to be half nonsense.
It appears to use one encoding method from Facebook, the Gorilla method, which is excellent and makes sense, and three encoding methods, all of which are runlength-like, which are baroquely complex, even bizarre, while at the same time always being far less efficient, more complex, and much harder to reason about than the simplest and obvious runlength method of one copy of the datum and a count of repetitions.
AZ64 to me makes no sense, and this makes it problematic to figure out how it operates; it's easy to understand a method which makes optimal design choices, because it makes sense, but hard to understand a method which does not, because it could be doing anything for any reason at all.
Having said this then, let us turn to AZ64
.
Essentially, I think AZ64
consists of the Gorilla
encoder, published by Facebook in 2015, with the runlength encoding
behaviour removed, plus three different runlength encoding methods,
which I think are hand-rolled by Amazon.
AZ64 selects one of its four encoding methods depending on the situation, but there is in this a lot of strange behaviour, which I can only really describe, as I cannot explain it in terms which make sense to me.
The four encoding methods are explained in detail, but to begin with, an overview of the methods and when they are used, to give a general feel for what’s going on;
Gorilla Encoding
This can be considered the default method; it is used when there is no repetition of values, which is to say, when we have rows where the value differs from the value in the previous row.
Gorilla/Runlength Encoding with 63 or Fewer Repetitions
Gorilla encoding specifies behaviour to handle repetition of values, but that behaviour has not been implemented.
Instead, when there is repetition of values, but the number of repetitions is 63 or less, each value is encoded as if it were a single value with no repetition (and so a great deal of compression is lost, as Gorilla encodes repeated values using a single bit).
The normal, simple and obvious method for encoding repeated values is to have a single copy of the value, and then a count of the number of repetitions. I have no idea why it was not used.
Runlength Encoding with Positive Integer Power of 2 Repetitions
This method is used when there is repetition of values, and the number of repetitions is a positive integer power of 2, e.g. 64, 128, 256, etc.
(This encoding method handles a maximum of 16384 repetitions. When there are more repetitions than this, they are encoded as blocks of 16384 repetitions, with the final block probably not a power of 2, and so handled by one of the other encoding methods.)
With this method, the value being encoded makes no difference to the number of rows in a block (unlike the other three methods).
This method fits the most rows into a block, and in fact can fit billions of rows into a single block, which is extremely dangerous, as Redshift normally can materialize rows only on a per-block basis, and so any query which causes any rows in such a block to need to be materialized causes all the rows in the block to be materialized.
It has been possible to develop math to exactly predict the behaviour of this encoding method, and so to be able to exactly predict (as is then proved with the test suite) the maximum possible number of rows per block for the different data types.
As before, the normal, simple and obvious method for encoding repeated values is to have a single copy of the value, and then a count of the number of repetitions. I have no idea why it was not used.
Gorilla/Runlength Encoding with 65 or More Repetitions
This method is used when there is repetition of values, but when the number of repetitions is 65 or more and the number of repetitions is not a positive integer power of 2 (as when this is the case, method #3 is used).
With this method, the value being encoded makes a difference to the number of rows in a block (as is normal with Gorilla encoding), but so do both the actual number of repetitions and the bit pattern of the number of repetitions (the bit pattern matters through its leading and trailing zeros, and so affects encoding in a way which is different to the actual, numeric value), and as such increasing the number of repetitions often decreases the number of rows encoded per block, because the bit pattern becomes less favourable.
This is crazy. It is difficult to reason about, and it’s worse in every way than the normal, simple and obvious method for encoding repeated values, by having a single copy of the value, and then a count of the number of repetitions. I have no idea why it was not used.
It looks like this encoding method behaves for the first 64 repetitions in the same way as method #2 (so there is no extra compression beyond Gorilla encoding with its repetition handling removed), but after that, as the number of extra repetitions increases (and depending on the bit pattern of the number of repetitions), there is an increasing gain in the number of rows per block.
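To keep the four cases straight, here is a small, purely descriptive Python sketch of these selection rules as I have observed them; it is my reading of the behaviour, not AZ64's actual internals.

```python
# A descriptive sketch of which observed encoding behaviour applies to a run
# of identical values, based purely on the rules described above.

def observed_method(repetitions: int) -> str:
    if repetitions == 1:
        return "#1: Gorilla encoding (no repetition)"
    if repetitions <= 63:
        return "#2: full Gorilla encoding repeated per row (no runlength gain)"
    if repetitions <= 16384 and repetitions & (repetitions - 1) == 0:
        return "#3: storage-unit runlength (power-of-2 repetitions, 64 to 16384)"
    if repetitions > 16384:
        return "#3 per group of 16384 repetitions, remainder by another method"
    return "#4: Gorilla-like encoding of the repetition count (65+, not a power of 2)"

for n in (1, 63, 64, 65, 96, 128, 16384, 20000):
    print(n, "->", observed_method(n))
```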
So, I will begin with the simplest and the sensible part of
AZ64
.
As far as I can tell, when the data values have no
repetition (so each value differs from the previous value),
AZ64
encodes using the Gorilla encoding method from
Facebook, published in 2015, PDF available here.
(Note Gorilla encoding specifically supports highly efficient compression where there is repetition, but as I will explain it looks to me this functionality has been removed from this implementation.)
At this point I need to explain how Gorilla encoding works.
The Gorilla encoding method is based around the use of the bit-wise
operator XOR
.
Here’s the XOR
logic chart.
A | B | XOR |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
When we come to load data into Redshift, we typically have an
incoming set of rows, either from COPY
or
INSERT
, and a table we’re going to load the rows into, and
we then load the incoming rows into the table.
To make life simple, let us here imagine the table has only a single column.
So we have a set of rows, each with one column, where each row then holds a value in that column, and these rows and their values are loaded into the table.
With Gorilla encoding, however, we store the first value in full (in the first row), but after that, we store XOR results, rather than the actual original values presented by COPY or INSERT.
So we take the first value and the second value, compute the
XOR
result, and store the result in the second row; then we
take the second value and the third value, compute the XOR
result and store the result in the third row, and so on and so on.
Apart from the first value, which is a special case, we are
not storing the actual, original values - they are thrown away
once they’ve been used to compute their XOR
result.
So we have our incoming rows and their values, from the
COPY
or INSERT
command, we iterate over them,
using them to compute XOR
results, and we store the
XOR
results, throwing away the actual original values from
COPY
or INSERT
.
Now, a key property of XOR
is that if you have one of
the two values used to produce a XOR
result, and you have
the result itself, you can from that value and the result compute the
other of the two values which generated the XOR
result.
To render this algebraically;
a XOR b = result
a XOR result = b
b XOR result = a
Now, if we think back to what we’ve stored in the column in our
table, we have the first value in full, and then a series of
XOR
results.
We can in fact fully recompute all the original values from the
COPY
or INSERT
command.
We begin with the first row in the table; this stored in full the first original value, so we directly have the first original value.
We then load the value from the second row, which as we recall is the
XOR
result of the value from the first row with the value
in the second row.
So now we have one of the original two values (the value from the
first row), and the XOR
result, and so we can directly
compute (by performing an XOR
of the original value and the
result) the other of the two values used to compute the stored
XOR
result (the original value from the second row).
So we can imagine our original incoming values were
5, 8, 10, 19, 6
, then we stored the 5
in full
in the first row, and stored the XOR
result of
5 and 8
in the second row; and so now we take the
5
from the first row, the XOR
result from the
second row, and that lets us compute that the original value in the
second row was 8
.
Since we know now the full original value in the second row, we can
in exactly the same way take that value, and the XOR
result
in the third row, to compute the original value in the
third row (the 10
).
We continue doing this, row after row, to fully compute all the original values.
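To make this concrete, here is a minimal Python sketch of the XOR chaining just described, using the example values from above.

```python
# Store the first value in full, then store each subsequent row as the XOR of
# it with the previous row's value; decoding reverses the chain exactly.

def xor_encode(values):
    encoded = [values[0]]                      # first value stored in full
    for prev, cur in zip(values, values[1:]):
        encoded.append(prev ^ cur)             # store only the XOR result
    return encoded

def xor_decode(encoded):
    values = [encoded[0]]
    for result in encoded[1:]:
        values.append(values[-1] ^ result)     # a XOR result = b
    return values

original = [5, 8, 10, 19, 6]                   # the example from the text
encoded = xor_encode(original)                 # [5, 13, 2, 25, 21]
assert xor_decode(encoded) == original
```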
Now, this is all good and well but so far pointless, because the length in bits of an XOR result is the same as the length in bits of the values being XORed, so we're not saving any space, and the whole point of this was data compression, after all.
However, we now can describe some special data processing which can be performed, which leads to data compression.
Let’s begin by looking at one or two example XOR
calculations.
First, here’s an XOR
for two values which have lots of
different bits in their bit-patterns;
datum | value |
---|---|
A | 1011 1011 1010 0000 1001 |
B | 0000 1110 1110 1000 0000 |
XOR | 1011 0101 0100 1000 1001 |
Second, here’s an XOR
for two values which have lots of
identical bits in their bit-patterns;
datum | value |
---|---|
A | 0011 1101 1100 0000 1000 |
B | 0011 1101 0101 0000 1000 |
XOR | 0000 0000 1001 0000 0000 |
Notice the difference in the results? When two bits in A
and B
are the same, you get a 0 bit in the result,
so the more similar the two bit-patterns are, the more 0 bits you get in
the result.
It’s at this point we have a lovely opportunity for some cunning data compression.
So, first, remember, we’re storing entire XOR
results.
If we could find a way somehow to need to store only partial
results, we’d save a lot of disk space.
Next, remember how we’re decompressing the rows; we start with the
first row, which stores A
in full, and then we use that
along with the XOR
result stored in the second row, to
compute the original value for the second row; and we then use the
original value from the second row along with the XOR
result in the third row, to compute the original value for third row,
and so on…
So we normally have the first value of the two values used to make
the XOR
result, and the result itself (and this allows us
to compute the second value, which is the original value of the row
we’re looking at).
Now, remember - when in the result a bit is 0, then the bits in
A
and B
were the same.
We always have A
(the original value of the current
row), so when in the result we have a 0 bit, we already know
what the bit must be in B
- it’s the same as the matching
bit in A
.
So in fact, in theory, if we could find a way not to need to
store the 0 bits in the result - but still know where they were in the
result - we would save a ton of disk space; we could store a partial
result, but still have all the information we need to recompute
B
.
Now go back and have another look at the second XOR
result, with lots of 0 bits.
Notice the large number of contiguous leading 0 bits and contiguous trailing 0 bits?
Rather than storing them all, why not store a count of the number of leading 0 bits and a count of the number of trailing 0 bits? Here there are eight leading and eight trailing 0 bits, so each run uses eight bits, but we could for example use just four bits to store the actual number eight. We also store in full the bits in-between the leading and trailing 0 bits, because there's no obvious reasonable way, given a bunch of mixed up 1 bits and 0 bits, to avoid storing the 0 bits but still be able to figure out where they are.
So, what we actually store is something like the number of leading 0 bits, the actual bits in-between the leading and trailing 0 bits, and the length of the bits in-between. We do not need to store the number of trailing 0 bits, because we know the length of the data type of the column - of the three lengths, leading 0 bits, bits in-between, trailing 0 bits - we only need to store any two, because we know the length of the data type.
In fact, Gorilla does some fancy footwork (which I’ll come to) to store the necessary length information, much more so than just storing a count of the number of contiguous 0 bits.
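Here is a simplified Python sketch of just this counting step (the core idea only, not the real Gorilla bit layout), applied to the two example XOR results above.

```python
# For an XOR result, count the contiguous leading and trailing 0 bits within
# the data type's width, so that only the "in-between" bits need storing in full.

def xor_body(a: int, b: int, width: int):
    mask = (1 << width) - 1
    x = (a ^ b) & mask                          # fixed-width XOR bit pattern
    if x == 0:
        return width, 0, ""                     # identical values: nothing in-between to store
    bits = format(x, f"0{width}b")
    leading = len(bits) - len(bits.lstrip("0"))
    trailing = len(bits) - len(bits.rstrip("0"))
    return leading, trailing, bits[leading:width - trailing]

# The second example from the text: 8 leading and 8 trailing 0 bits, so only
# the 4 bits in-between ('1001') need storing in full.
print(xor_body(0b00111101110000001000, 0b00111101010100001000, 20))

# The first example: no contiguous leading or trailing 0 bits, so the whole
# 20-bit result must be stored.
print(xor_body(0b10111011101000001001, 0b00001110111010000000, 20))
```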
Now, what do we make of this, as a data compression method?
If we look at the first XOR example, we see that we save nothing at all - there are no contiguous leading or trailing 0 bits. The worst case then is when both the very first and very last bits differ, and then Gorilla saves nothing - indeed, even if all the bits in-between were 0 bits, it wouldn't help us, because Gorilla works by compressing contiguous leading and trailing 0 bits; and the longer the data type when this happens, the worse it is, because the result is the same length as the data type.
The best case is when two numbers are identical. Then every bit in the result is a 0 bit, and in fact in this case Gorilla encoding just stores a single 0 bit and that’s it.
However, we have to remember it’s the bit-pattern which matters, so our base-10 intuition is not going to serve us well - for example, numbers which are integer powers of 2 are going to be very different to their neighbouring numbers;
4294967296 = 100000000000000000000000000000000
4294967295 = 011111111111111111111111111111111
The values differ by only 1, but the bit-patterns are almost entirely unlike each other.
However, we then see this;
4294967296 = 100000000000000000000000000000000
2147483648 = 010000000000000000000000000000000
And now we find our two numbers, which are about two billion apart, are almost identical for Gorilla encoding; we have no leading 0 bits, two bits of “in-between” bits, and then 31 bits of trailing 0 bits, which we save.
You can reason about Gorilla compression, but you have to
think about it. It operates in the domain of base 2, not base
10, and the more similar numbers are, the better they compress - but
similarity means contiguous leading and trailing 0 bits in the
XOR
result.
When we have small numbers in large data types, Gorilla works very
well, because we end up with really long contiguous leading 0 bits.
Imagine you have a 64 bit data type, with two numbers being
XOR
ed, and both only use say the least significant four or
five bits each; you’re going to save about 60 bits in the
XOR
result, from the huge number of leading contiguous 0
bits.
What this really means normally of course is that the data type is
wrong; it’s much too long for the values. However, there are going to be
situations where although normally the values are small, there is
occasionally going to be a very large value, and so the large data type
is needed; and here Gorilla does very well - although so do the
mostly
encoders.
I mentioned earlier that Gorilla encoding does some fancy footwork to encode length information. This is what we find in the PDF;
Re-use of previously stored length information is something which Gorilla chooses to do, as an optimization. When storing a value, if the length of both the leading zeros and the body are less than the most recently stored lengths (so the lengths stored for the most recent value which actually did have lengths written out), just write a 1 bit, which indicates the re-use of the most recently stored lengths, and then write the body.
Note that this is what the white paper seems to me to say, but it doesn't quite make sense to me - I think you'd need to pad the XOR body with leading and trailing 0 bits to match the lengths being used, and so you'd only perform re-use if the padding was less than the 11 bits needed to store the lengths.
Now, it’s not exactly clear to me how Redshift is encoding Gorilla length information. I know for sure (as will be described in the next section) that the standard Gorilla encoding of identical values is not in use; Redshift has deviated from the standard Gorilla spec.
Empirically, by issuing test cases, I can demonstrate that the number of bits needed to store the current row is the number of bits of contiguous difference between the bit-patterns of the current row and the previous row, plus an overhead, where the overhead is;
Data Type Length (bytes) | Overhead in Bits per Row |
---|---|
2 | 0.5 |
4 | 1 |
8 | 2 |
16 | 4 |
Experimentation shows this formula is accurate. It is impossible to be exactly accurate, because in Redshift each block has a small header, something like 120 bytes, but that header length seems to vary a bit, so it’s not possible to exactly know how much of a block is available to store user data, and you need to know that exactly, to exactly figure out how many rows will be stored in a block.
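As a cross-check, here is a small Python sketch of this estimate (my own reconstruction, not the test script); it reproduces the estimated figures in the result tables, with the actual figures coming in slightly lower because of the block header.

```python
# Bits per row is the stored XOR body length (the bits between the contiguous
# leading and trailing 0 bits) plus the empirically observed per-row overhead,
# and one 1 MB block holds 1024 * 1024 * 8 = 8,388,608 bits (block header ignored).

OVERHEAD_BITS = {2: 0.5, 4: 1.0, 8: 2.0, 16: 4.0}    # keyed by data type length in bytes
BITS_PER_BLOCK = 1024 * 1024 * 8                     # 8,388,608

def stored_bits(a: int, b: int, width_bits: int) -> int:
    x = (a ^ b) & ((1 << width_bits) - 1)            # fixed-width XOR bit pattern
    if x == 0:
        return 1                                     # identical values (not relevant here)
    trailing = (x & -x).bit_length() - 1             # contiguous trailing 0 bits
    return x.bit_length() - trailing                 # bits between leading/trailing 0s

def estimated_rows_per_block(a: int, b: int, type_bytes: int) -> float:
    bits_per_row = stored_bits(a, b, type_bytes * 8) + OVERHEAD_BITS[type_bytes]
    return BITS_PER_BLOCK / bits_per_row

print(estimated_rows_per_block(0, 1, 2))            # ~5592405.3 (actual found: 5591809)
print(estimated_rows_per_block(-32768, -1, 2))      # ~541200.5  (actual found: 541121)
print(estimated_rows_per_block(-2147483648, 0, 4))  # ~4194304.0 (actual found: 4193856)
```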
Let’s look at some examples of alternating values being stored and the number of rows we estimate will be found in one block, and the number of rows we actually find. The “Values” column has two values, and these alternate through the entire block, the “# Bits Stored” is per row, as is the “Overhead”, and “Estimated Rows per Block” is computed by dividing the number of bits in one block by the number of bits needed for one row. Finally, we then have the “Actual Rows per Block”, as found by the test script.
(The negative numbers in “Values” unfortunately do not render very well.)
Data Type | Values | # Bits Stored | Overhead | Total Bits per Row | Expected Rows per Block | Actual Rows per Block |
---|---|---|---|---|---|---|
int2 | 0/1 | 1 | 0.5 | 1.5 | 5,592,405 | 5,591,809 |
int4 | 0/1 | 1 | 1 | 2 | 4,194,304 | 4,193,856 |
int8 | 0/1 | 1 | 2 | 3 | 2,796,202 | 2,795,904 |
int2 | 51/60 | 4 | 0.5 | 4.5 | 1,864,135 | 1,863,937 |
int4 | -2147483648/-1 | 31 | 1 | 32 | 262,144 | 262,081 |
int4 | -2147483648/0 | 1 | 1 | 2 | 4,194,304 | 4,193,856 |
Particularly given the large difference in rows for the last two examples, which differ only by storing a -1 or a 0, it's hard to imagine Gorilla encoding is not in use, because we would need to dream up some other compression method which has exactly the same extremely unusual and highly distinctive characteristics and behaviour. It is only reasonably possible to explain these findings by imagining XOR based compression, which compresses contiguous leading and trailing 0 bits in the XOR result, and that is Gorilla encoding.
We can look more closely at the final two findings.
First, int4
, storing values -2147483648 and -1, with one
repetition of each
(e.g. -2147483648/-1/-2147483648/-1/-2147483648/-1/etc).
-2147483648 = 10000000000000000000000000000000
-1 = 11111111111111111111111111111111
There are 31 bits of contiguous difference between each pair of values, and a 1 bit overhead per value for int4, so we expect each value to use 32 bits. There are 1024 * 1024 * 8 = 8,388,608 bits in one block. 8,388,608 / 32 = 262,144 rows per block. The actual number of rows found, by experimentation, is 262,081.
Second, int4
, storing values -2147483648 and 0, with one
repetition of each
(e.g. -2147483648/0/-2147483648/0/-2147483648/0/etc).
-2147483648 = 10000000000000000000000000000000
0 = 00000000000000000000000000000000
There is 1 bit of contiguous difference between each pair of values, and a 1 bit overhead per value, so we expect each value to use 2 bits. There are 1024 * 1024 * 8 = 8,388,608 bits in one block. 8,388,608 / 2 = 4,194,304 rows per block. The actual number of rows found, by experimentation, is 4,193,856.
So, in summary, we can then make some hard and fast rules about the
use of AZ64
when it is using Gorilla encoding;
Never use AZ64
(with Gorilla encoding) on
any columns in interleaved tables. The values in all columns in
interleaved tables are effectively randomly ordered.
Never use AZ64
(with Gorilla encoding) on
any columns in unsorted tables.
With compound sorted tables, only use AZ64
(with Gorilla encoding) on columns which are part of the sortkey.
Never use AZ64
(with Gorilla encoding) on columns
which are outside of the sortkey.
Never use AZ64
with data which has
consecutive rows with the same value, as the 1-bit runlength encoding
behaviour in Gorilla has been removed and replaced by what appear to be
three different, all crazy, runlength methods.
In short, use AZ64 when and only when the value in a row is always different to the value in the previous row, and when the XOR result for each row has as many contiguous leading and trailing 0 bits as is possible.
Turning to the results, taking one of them to explain it, this is the
result for int2
with values -32768/-1.
Datum | Value |
---|---|
First value | -32768 |
First value (pattern) | 1000000000000000 |
Second value | -1 |
Second value (pattern) | 0000000000000001 |
XOR | 0111111111111111 |
Stored bit pattern | 111111111111111 |
Stored length in bits | 15 |
Overhead in bits | 0.5 |
Bits per Block | 8388608 |
Estimated rows per block | 541200.52 |
Actual rows per block | 541121 |
We see the first and second values and their bit-patterns, then the
XOR
result of those bit-patterns which is almost all 1
bits. Gorilla strips away the contiguous leading and trailing 0 bit,
which here saves us only the single leading 0 bit, and so we then see
the bits we’ll actually store, and that we store 15 bits.
Next is the unexplained but observed overhead per row, which for
int2
is 0.5 bits. Then we have a simple constant, the
number of bits in one block, and as one block is one megabyte, that is
8388608 bits.
We can now estimate the number of rows per block - each row will consume 15.5 bits, and 8388608 divided by 15.5 gives 541200.52; finally we have the actual number of rows found by the test script.
The slight difference is I think due to the per-block header.
Now, Gorilla encoding as defined in its PDF normally handles repeated values extremely well. The first value is stored, and after that every repetition of that value is stored using a single bit. It is expected then that repetitions will lead to excellent compression.
This is however not what is found.
Bizarrely, when storing alternating blocks of values (such as 0/1, or 65/99), when each of the values is repeated, up to 63 times, the number of values stored in a block very nearly does not change - a difference of a few rows, out of millions, as we change the number of repetitions by 1; in other words, runlength compression is not occurring.
This is to me incomprehensible. It’s simply a lost compression opportunity; there is no gain by it, and no reason for not compressing.
So here we have results from the test script, for the values 0/1 and 65/119.
Each row of the table is the result of a single test, where of the two values in a value pair, each is repeated a given number of times, as specified in the “# Repetitions” column. So for the first row in the table, the data in the table consists of 0, repeated twice, followed by 1, repeated twice, with this pattern continuing until the first block is full. We then see how many rows are in the first block.
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
int2 | 0/1 | 2 | 5591810 |
int2 | 0/1 | 3 | 5591811 |
int2 | 0/1 | 4 | 5591812 |
int2 | 0/1 | 62 | 5591842 |
int2 | 0/1 | 63 | 5591817 |
int2 | 0/1 | 64 | 13420416 |
int2 | 0/1 | 65 | 5694016 |
int2 | 65/119 | 2 | 1524994 |
int2 | 65/119 | 3 | 1524993 |
int2 | 65/119 | 4 | 1524996 |
int2 | 65/119 | 62 | 1525014 |
int2 | 65/119 | 63 | 1525041 |
int2 | 65/119 | 64 | 13420416 |
int2 | 65/119 | 65 | 1567800 |
As we can see, the number of rows stored changes as the values change, but does not change meaningfully as the number of repetitions changes (just by a very few rows, probably something to do with how much overhead is consumed in the first block) - with the exception of 64 repetitions, because then method #3 kicks in, as 64 is an integer power of 2.
If Gorilla encoding had been used as defined in its white paper, with a single bit for every repetition of a value, it would perform far better than AZ64 is doing here, where AZ64 is re-using the full Gorilla encoded bit-pattern for the repeated value for every row.
The upshot of this in terms of using AZ64 is that with 63 or fewer repetitions, there is no compression improvement over values which occur only once.
The next encoding mode presented by AZ64
comes into use
when the number of repetitions is a positive integer power of 2, between
64 and 16384 (inclusive both ends).
For these particular number of repetitions, a “storage unit” is written to disk, the size of which is the data type size in bytes times two, plus one byte, where the storage unit contains the full value being stored (and as such the value being stored no longer matters, as it did before with encoding methods #1 and #2), and the storage unit records the number of repetitions of the value (which must be a positive integer power of 2, between 64 and 16384 inclusive both ends - if it is not, this encoding method is not used).
This encoding method can store very large numbers of rows per block,
and as such in fact determines the maximum possible rows per block for
AZ64
.
We can directly compute the maximum number of values per block (which is when we have 16384 repetitions per storage unit) and these values have been found, empirically, in the test suite, to be exactly correct.
Data Type | Data Type Size (bytes) | Storage Unit Size (bytes) | Storage Units per Block | Max Rows per Block |
---|---|---|---|---|
int2 | 2 | 5 | 209694 | 3,435,626,496 |
int4 | 4 | 9 | 116496 | 1,908,670,464 |
int8 | 8 | 17 | 61674 | 1,010,466,816 |
numeric(38) | 16 | 33 | 31771 | 520,536,064 |
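As a sketch of that arithmetic in Python, assuming a 1 MB block and ignoring the block header (which is why the figures come out fractionally above the empirically found maxima in the table above):

```python
# Storage unit size is (2 * data type size) + 1 bytes, each unit holds at most
# 16384 repetitions, so the no-header ceiling on rows per block is
# (bytes per block // unit size) * 16384.

BLOCK_BYTES = 1024 * 1024

for dtype, size in (("int2", 2), ("int4", 4), ("int8", 8), ("numeric(38,0)", 16)):
    unit_bytes = 2 * size + 1
    units = BLOCK_BYTES // unit_bytes
    print(f"{dtype:13} unit = {unit_bytes:2} bytes, "
          f"units/block = {units}, ceiling = {units * 16384} rows")

# e.g. int2: unit = 5 bytes, 209715 units, ceiling 3,435,970,560 rows
# (the table above finds 209694 units and 3,435,626,496 rows)
```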
I am astounded by this behaviour.
I may be wrong, I've not checked materialization behaviour in Redshift for some time, but I think it is dangerous. It allows an extremely large number of rows in a single block, and Redshift used to and I believe still does basically materialize rows on a per-block basis (so if you need one row in a block, you get all of them), so that if Redshift attempts to materialize such a block, the cluster may well grind to a halt - a couple of billion rows usually brings most systems to their knees. Such blocks then are booby-traps, and you need to make sure when using AZ64 that your data is such that no such blocks end up being encoded.
The simplest runlength encoder design would store the entire
value being encoded, followed by a count of the number of repetitions.
The maximum number of repetitions allowed here is 16384, which is 15
bits. If we think then about say int8
we’re talking 64 bits
for the value and 15 bits for the count, which is 79 bits. With this
encoding method though, int8
uses 136 bits.
The smallest storage unit seen is for int2, at 40 bits. A runlength encoder would need 16 bits for the int2 and 15 bits for the length encoding, a total of 31 bits, so even in this case, method #3 is inferior.
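To put numbers on this comparison for all four supported data type sizes, a small sketch of the bits per run for each approach:

```python
# Bits per run of repeated values: the observed storage unit ((2 * size) + 1
# bytes) versus the simplest runlength encoder (value in full plus a 15 bit
# repetition count). The storage unit is larger for every supported size.

for dtype, size_bytes in (("int2", 2), ("int4", 4), ("int8", 8), ("numeric(38,0)", 16)):
    storage_unit_bits = (2 * size_bytes + 1) * 8
    simple_rle_bits = size_bytes * 8 + 15
    print(f"{dtype:13} storage unit: {storage_unit_bits:3} bits, "
          f"simple runlength: {simple_rle_bits:3} bits")
```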
As a method for storing runlength information, method #3 makes absolutely no sense.
As an encoding method, despite being more complex and less
effective than the simplest runlength encoder, it nevertheless
compresses profoundly more effectively than either of the other
runlength-like compression methods used in AZ64
(method #2
and #4). Apart from the problem of massively large row counts (which
could be handled with some code to limit the maximum number of rows),
the power-of-2 runlength encoding method is fantastically more efficient
than methods #2 and #4; why are the other methods in existence at all,
particularly method #4, with its baroque design which depends on the value
being encoded, the number of repetitions and the bit-pattern of
the number of repetitions? and why is this method in existence but
limited to positive integer powers of 2 repetitions only? why would you
have all three methods in the first place, and then use the best one
only very rarely? and why would you have any of them, when the
simplest runlength encoder is better than all of them? exactly which
drugs were the dev team snorting when they did this work?
The next encoding mode presented by AZ64
comes into use
when the number of repetitions is 65 or greater, and is not a
positive integer power of 2, between 64 and 16384 inclusive both
ends.
I think the first 64 repetitions receive no extra compression at all, just as with 63 or fewer repetitions, but that the additional repetitions over 64 obtain some form of runlength compression.
With this method, both the value being encoded and the number of repetitions affect compression, but, surprisingly, for the number of repetitions, both the number itself and its bit-pattern affect compression, such that increasing the number of repetitions often reduces compression because the bit-pattern has become less favourable, and the magnitude of the effect of the bit-pattern can be extremely large - many times greater than that of the number of repetitions itself.
We can examine the actual results found from the test script, for
int4
.
Note the number of repetitions tested is not contiguous. This is to reduce the time taken to run the test script.
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
int4 | 0/1 | 64 (0b1000000) | 7455744 |
int4 | 0/1 | 65 (0b1000001) | 4251072 |
int4 | 0/1 | 66 (0b1000010) | 4308096 |
int4 | 0/1 | 67 (0b1000011) | 4306358 |
int4 | 0/1 | 68 (0b1000100) | 4421440 |
int4 | 0/1 | 69 (0b1000101) | 4359744 |
int4 | 0/1 | 70 (0b1000110) | 4414592 |
int4 | 0/1 | 71 (0b1000111) | 4411328 |
int4 | 0/1 | 72 (0b1001000) | 4645512 |
int4 | 0/1 | 90 (0b1011010) | 4854656 |
int4 | 0/1 | 91 (0b1011011) | 4846296 |
int4 | 0/1 | 92 (0b1011100) | 4946624 |
int4 | 0/1 | 93 (0b1011101) | 4883008 |
int4 | 0/1 | 94 (0b1011110) | 4927808 |
int4 | 0/1 | 95 (0b1011111) | 4918720 |
int4 | 0/1 | 96 (0b1100000) | 5920800 |
int4 | 65/119 | 64 (0b1000000) | 7455744 |
int4 | 65/119 | 65 (0b1000001) | 1433770 |
int4 | 65/119 | 66 (0b1000010) | 1470348 |
int4 | 65/119 | 67 (0b1000011) | 1469243 |
int4 | 65/119 | 68 (0b1000100) | 1545708 |
int4 | 65/119 | 69 (0b1000101) | 1504200 |
int4 | 65/119 | 70 (0b1000110) | 1541050 |
int4 | 65/119 | 71 (0b1000111) | 1538854 |
int4 | 65/119 | 72 (0b1001000) | 1705968 |
int4 | 65/119 | 90 (0b1011010) | 1870848 |
int4 | 65/119 | 91 (0b1011011) | 1863953 |
int4 | 65/119 | 92 (0b1011100) | 1948652 |
int4 | 65/119 | 93 (0b1011101) | 1894503 |
int4 | 65/119 | 94 (0b1011110) | 1932452 |
int4 | 65/119 | 95 (0b1011111) | 1924700 |
int4 | 65/119 | 96 (0b1100000) | 3050048 |
int4 | 65/119 | 97 (0b1100001) | 1954560 |
64 repetitions invokes method #3, but after that we're looking at the behaviour of method #4.
What we see is that the values being stored, the number of repetitions, and the bit pattern of the number of repetitions all affect the number of rows per block.
Something Gorilla-like is happening with the number of repetitions.
This can be seen clearly by looking at 96 repetitions, where we then have a large number of trailing 0 bits; the number of rows stored per block suddenly shoots up to 3,050,048, from the 1,924,700 of 95 repetitions.
This is though different to method #3, because with the powers of 2 (which is method #3), the number of rows per block does not change when the values being encoded change, whereas with the non-powers of 2, such as 96, the number of rows stored does change depending on the values being encoded.
The bit-pattern of the number of repetitions can be much more important than the actual number of repetitions. For values 0/1 we see the following;
Data Type | Values | # Repetitions | Actual Rows per Block |
---|---|---|---|
int4 | 0/1 | 96 (0b1100000) | 5920800 |
int4 | 0/1 | 135 (0b10000111) | 5718870 |
int4 | 0/1 | 136 (0b10001000) | 6199696 |
For example, 96 repetitions gives 5,920,800 rows per block, a value we do not manage to exceed until we get to 136 repetitions, with 6,199,696 rows per block.
Needless to say, it is not expected that increasing the number of repetitions reduces compression.
Of all the encoding methods in AZ64
, this is the method
where I’ve least pinned down what’s going on, and this is due to a
complete lack of motivation, because what’s going on looks to be crazy,
and so there is no value in figuring it out. What should have been
implemented is the usual runlength method of one value, plus a count of
repetitions, plus, should you want to limit the number of repetitions so
blocks do not have too many rows, an arbitrary maximum number of
repetitions.
The AZ64
encoder appears to implement four different
compression methods, each of which is used in different
circumstances.
When a value differs to the value in the previous row, the
Gorilla compression method from Facebook (2015) is used. Gorilla
encoding is based on storing for each row the result of an
XOR
of a value with the value in the previous row, but
removing from the XOR
result all contiguous leading and
trailing 0 bits; which is to say, the more there are identical
contiguous leading and trailing bits, in the two values, the better.
Note however the Gorilla specification encodes repeated values in 1
bit, but that this behaviour has been removed from the AZ64
implementation and replaced with three different runlength encoding
methods, described below.
When a value repeats, from 2 to 63 times, the full Gorilla encoding for the value is used in each row (rather than storing 1 bit). This is a lost compression opportunity.
When a value repeats, and the number of repetitions is a positive integer power of 2, from 64 to 16384 inclusive both ends, a “storage unit” (from 5 to 33 bytes inclusive both ends in length, depending on the data type) is used to encode a copy of the value and a 15 bit counter of the number of repetitions.
This allows billions of rows per block, which I think highly dangerous when materializing rows (I have yet to investigate this, but I currently think rows are materialized on a per-block basis, so needing to materialize any row in a block means all its rows are materialized), and in fact also overflows the signed 32-bit num_values column in STV_BLOCKLIST, which becomes negative.
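As a rough sanity check on the numbers, a sketch under assumptions (a 1 MiB block, a fixed storage-unit size per data type within the 5 to 33 byte range described above, and ignoring what looks like a small per-block overhead): the rows-per-block figures for this method follow from dividing the block into storage units and multiplying by the repetition count each unit covers.

```python
# Sketch only: approximate rows per block for method #2, assuming a
# 1 MiB block divided into fixed-size storage units, each unit covering
# `repetitions` rows. Observed figures are slightly lower, presumably
# due to a small per-block overhead.
BLOCK_SIZE_BYTES = 1024 * 1024

def approx_rows_per_block(unit_size_bytes, repetitions):
    return (BLOCK_SIZE_BYTES // unit_size_bytes) * repetitions

# If int2's storage unit is at the 5 byte end of the range, 64
# repetitions gives 13,421,760 - close to the 13,420,416 observed.
print(approx_rows_per_block(5, 64))
```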
When a value repeats, more than 64 times but the number of repetitions is not a positive integer power of 2, the number of values per block is based on the value being stored, the number of repetitions, and the bit pattern of the number of repetitions, where Gorilla-like encoding is occurring on the value as well as the number of repetitions, such that increasing the number of repetitions often decreases the number of rows stored per block.
I have not pinned down exactly how this method works, because it seems crazy, and I simply can find no motivation within myself to figure it out when there are so many more pressing investigations to work on.
When it comes to using AZ64, observe the following rules;
Never use AZ64 with interleaved tables, as the values in rows are basically random, which is the worst use case for this compression method.
Never use AZ64 with unsorted tables, for the same reason (unless you control your data load so it is in fact ordered, even though the table is not).
Never use AZ64 with unsorted columns (outside of the sortkey) in compound sorted tables.
Never use AZ64 with data which repeats, as runlength encoding is better.
Note that Redshift’s default encoding selection behaviour ignores all these rules.
If you have been creating tables using CREATE TABLE AS, or using CREATE TABLE but without specifying encodings, or using COPY to load data with COMPUPDATE used in a non-sampling mode (I have not checked with sampling), you will find AZ64 is used for all data types it supports (with the exception of columns in sortkeys, which are raw).
Recently AWS also introduced an option to have Redshift select column
encodings automatically on an ongoing basis, changing encodings as it
sees fit; I absolutely would not use this functionality until it has
been investigated and found to be safe.
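If you want to see what your existing tables actually ended up with, one quick way (a sketch, assuming psycopg2 and your own connection details) is to query the PG_TABLE_DEF system view, which reports the encoding in use for each column; note it only shows tables in schemas on the current search_path.

```python
# Sketch: list the encoding actually in use for each column of a table.
# Assumes psycopg2 and your own connection details; PG_TABLE_DEF only
# shows tables in schemas on the current search_path.
import psycopg2

def column_encodings(conn, table_name):
    sql = 'select "column", type, encoding from pg_table_def where tablename = %s'
    with conn.cursor() as cur:
        cur.execute(sql, (table_name,))
        return cur.fetchall()

# conn = psycopg2.connect(host="...", dbname="...", user="...", password="...")
# for column, col_type, encoding in column_encodings(conn, "my_table"):
#     print(column, col_type, encoding)
```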
In short, AZ64 performs well when the XOR of the value in the current row and the value in the previous row has as many contiguous leading and trailing 0 bits as possible. That's not quite intuitive to work with, because for example with int2, -32768 and 0 works almost perfectly (1 bit stored per value), but -32768 and -1 works appallingly (15 bits stored per value). You need to understand how the compression method works to use it correctly; there's never any getting away from this, not for any compression method.
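A minimal sketch of that int2 example (my own illustration): XOR the two values, mask to 16 bits, and look at what remains once contiguous leading and trailing 0 bits are stripped.

```python
# -32768 XOR 0, masked to 16 bits: one 1 bit followed by 15 trailing
# zeros, so only a single meaningful bit needs storing.
print(f"{(-32768 ^ 0) & 0xFFFF:016b}")   # 1000000000000000

# -32768 XOR -1, masked to 16 bits: fifteen 1 bits remain after
# stripping the single leading 0, so 15 meaningful bits need storing.
print(f"{(-32768 ^ -1) & 0xFFFF:016b}")  # 0111111111111111
```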
I strongly recommend AZ64 is not used when any of its runlength encoding methods would come into play.
The only information I have ever seen from AWS about AZ64 is in this blog post, here.
As we can now directly see, comparing AZ64 to ZSTD and LZO is wrong, as they are different types of compression method. AZ64 works extremely well on values which, when XORed with the value in the previous row, have many contiguous leading and trailing 0 bits in the XOR result, but increasingly poorly as we move from this to data which is more and more random, whereas ZSTD and LZO are general-purpose compression methods which work reasonably well on all data.
For AZ64 to have done so much better, as described in the blog post, the data must have been carefully chosen to be highly suitable for AZ64; reporting only this test scenario, and so presenting it as the normal scenario, is unethical.
Furthermore, I have as-yet unpublished benchmarks showing AZ64 to be at best a couple of percentage points faster than LZO or ZSTD.
I assert that the information AWS have published, to the extent I think I now understand AZ64, has no credibility.
There is, as far as I can see, no connection at all between the blog post and the reality of the encoding method.
I call upon AWS to publish their benchmark method and the data used.
Note these results are completely unprocessed; they are a raw dump, so the original data is available in full.
{'proofs': {'dc2.large': {2: {'method_1': {'int2': {-32768: {-1: {1: 541121},
0: {1: 5591809}},
0: {1: {1: 5591809}},
51: {60: {1: 1863937}}},
'int4': {-2147483648: {-1: {1: 262081},
0: {1: 4193856}},
0: {1: {1: 4193856}},
51: {60: {1: 1677505}}},
'int8': {-9223372036854775808: {-1: {1: 129025},
0: {1: 2795904}},
0: {1: {1: 2795904}},
51: {60: {1: 1397952}}},
'numeric(38,0)': {0: {1: {1: 1677504}},
51: {60: {1: 1048448}}}},
'method_2': {'int2': {0: {1: {2: 5591810,
3: 5591811,
4: 5591812,
62: 5591842,
63: 5591817,
64: 13420416,
65: 5694016}},
65: {119: {2: 1524994,
3: 1524993,
4: 1524996,
62: 1525014,
63: 1525041,
64: 13420416,
65: 1567800}}},
'int4': {0: {1: {2: 4193856,
3: 4193856,
4: 4193856,
62: 4193856,
63: 4193856,
64: 7455744,
65: 4251072}},
65: {119: {2: 1397952,
3: 1397952,
4: 1397952,
62: 1397952,
63: 1397952,
64: 7455744,
65: 1433770}}},
'int8': {0: {1: {2: 2795904,
3: 2795904,
4: 2795904,
62: 2795904,
63: 2795904,
64: 3947136,
65: 2821248}},
65: {119: {2: 1198210,
3: 1198209,
4: 1198212,
62: 1198212,
63: 1198260,
64: 3947136,
65: 1224470}}},
'numeric(38,0)': {0: {1: {2: 1677504,
3: 1677504,
4: 1677504,
62: 1677504,
63: 1677504,
64: 2033344,
65: 1686592}},
65: {119: {2: 931968,
3: 931968,
4: 931968,
62: 931968,
63: 931968,
64: 2033344,
65: 947765}}}},
'method_3': {'int2': {0: {1: {64: 13420416,
128: 26840832,
256: 53681664,
512: 107363328,
16384: -859340800}},
65: {119: {64: 13420416,
128: 26840832,
256: 53681664,
512: 107363328,
16384: -859340800}}},
'int4': {0: {1: {64: 7455744,
128: 14911488,
256: 29822976,
512: 59645952,
16384: 1908670464}},
65: {119: {64: 7455744,
128: 14911488,
256: 29822976,
512: 59645952,
16384: 1908670464}}},
'int8': {0: {1: {64: 3947136,
128: 7894272,
256: 15788544,
512: 31577088,
16384: 1010466816}},
65: {119: {64: 3947136,
128: 7894272,
256: 15788544,
512: 31577088,
16384: 1010466816}}},
'numeric(38,0)': {0: {1: {64: 2033344,
128: 4066688,
256: 8133376,
512: 16266752,
16384: 520536064}},
65: {119: {64: 2033344,
128: 4066688,
256: 8133376,
512: 16266752,
16384: 520536064}}}},
'method_4': {'int2': {0: {1: {64: 13420416,
65: 5694016,
66: 5796780,
67: 5793624,
68: 6003856,
69: 5890624,
70: 5991232,
71: 5985229,
72: 6424640}}},
'int4': {0: {1: {64: 7455744,
65: 4251072,
66: 4308096,
67: 4306358,
68: 4421440,
69: 4359744,
70: 4414592,
71: 4411328,
72: 4645512,
90: 4854656,
91: 4846296,
92: 4946624,
93: 4883008,
94: 4927808,
95: 4918720,
96: 5920800,
97: 4953499,
98: 4996928,
99: 4987323,
100: 5083500,
101: 5020224,
102: 5062400,
103: 5052253,
104: 5255016,
105: 5083470,
106: 5124480,
107: 5113856,
108: 5206144,
109: 5143552,
110: 5183424,
111: 5172489,
112: 5591824,
113: 5200640,
114: 5239488,
115: 5228130,
116: 5316800,
117: 5254976,
118: 5292800,
119: 5281152,
120: 5470320,
121: 5306752,
122: 5343616,
123: 5331804,
124: 5417088,
125: 5356160,
126: 5392170,
127: 5379974,
128: 14911488,
129: 5464698,
130: 5563350,
131: 5549422,
132: 5766592,
133: 5634146,
134: 5734530,
135: 5718870,
136: 6199696,
137: 5803594,
138: 5905710,
139: 5888318,
140: 6116096,
190: 8131050,
191: 8091142,
192: 22367232,
193: 8175866,
194: 8302230}},
65: {119: {64: 7455744,
65: 1433770,
66: 1470348,
67: 1469243,
68: 1545708,
69: 1504200,
70: 1541050,
71: 1538854,
72: 1705968,
90: 1870848,
91: 1863953,
92: 1948652,
93: 1894503,
94: 1932452,
95: 1924700,
96: 3050048,
97: 1954560,
98: 1992732,
99: 1984257,
100: 2071040,
101: 2013435,
102: 2051648,
103: 2042387,
104: 2236728,
105: 2071040,
106: 2109312,
107: 2099340,
108: 2188096,
109: 2127360,
110: 2165900,
111: 2155176,
112: 2609488,
113: 2182595,
114: 2221120,
115: 2209840,
116: 2300164,
117: 2236689,
118: 2275276,
119: 2263380,
120: 2466960,
121: 2289683,
122: 2328370,
123: 2315844,
124: 2407584,
125: 2341625,
126: 2380288,
127: 2367280,
128: 14911488,
129: 2404560,
130: 2455872,
131: 2441840,
132: 2562912,
133: 2479120,
134: 2531456,
135: 2516400,
136: 2795888,
137: 2553680,
138: 2606976,
139: 2590960,
140: 2718240,
190: 3589312,
191: 3560240,
192: 22367232,
193: 3597520,
194: 3664896}}}}}}},
'tests': {'dc2.large': {2: {}}},
'versions': {'dc2.large': {2: 'PostgreSQL 8.0.2 on i686-pc-linux-gnu, '
'compiled by GCC gcc (GCC) 3.4.2 20041017 (Red '
'Hat 3.4.2-6.fc3), Redshift 1.0.35649'}}}
I am a C programmer - kernel development, high performance computing, networking, data structures and so on.
I read the C. J. Date book, the classic text on relational database theory, and having learned the principles, wrote a relational database from scratch in C, which purely by chance set me up quite nicely for what came next: moving into data engineering in late 2011, when I joined two friends in their startup as the back-end engineer.
In that startup, I began using Redshift the day it came out, in 2012 (we had been trying to get into the beta programme).
We were early, heavy users for a year and a half, and I ended up having monthly one-to-one meetings with one of the Redshift team managers; one or two features which are in Redshift today originate from suggestions made in those meetings, such as the distribution style ALL.
Once that was done, after a couple of years of non-Redshift data engineering work, I returned to Redshift work, and then in about mid-2018 contracted with a publisher to write a book about Redshift.
The book was largely written but it became apparent I wanted to do a lot of things which couldn’t be done with a book - republish on every new Redshift release, for example - and so in the end I stepped back from the contract and developed the web-site, where I publish investigation into, and ongoing monitoring of, Redshift.
So for many years now I've been investigating Redshift sub-systems full-time, one by one, and this site and these investigations are, as far as I know, the only source of this kind of information about Redshift.
I provide consultancy services for Redshift - advice, design, training, getting failing systems back on their feet pronto, the usual gamut - but in particular offer a Redshift cluster cost reduction service, where the fee is and only is one month of the savings made.
Broadly speaking, to give guidance, savings are expected to fall into one of two categories: either something like 20%, or something like 80%. The former is for systems where the business use case is such that Redshift cannot be operated correctly, and this outcome requires no fundamental re-engineering work; the latter is for systems where Redshift can be operated correctly, and usually requires fundamental re-engineering work (which you may or may not wish to engage in, despite the cost savings, in which case we're back to the 20%).
Details and contact information are on the web-site.