24th October 2021
The bytedict dictionary contains up to 256 entries, but the final entry is always used as an end-of-dictionary marker and so does not hold a value from the block. The dictionary holds a copy of the values which are placed in the dictionary, and so its size varies according to the size of the column data type; varchar values are stored at their full DDL length, regardless of the actual string length. The dictionary is stored in the block, so a large dictionary means less storage space for values. The dictionary is populated by up to the first 255 distinct values in the block (and so is affected by sorting), but is only as large as it needs to be, rather than always being the size of 256 entries. A value which is present in the dictionary is replaced by a one-byte index into the dictionary. A value which is not present in the dictionary is written out in full, plus a one-byte overhead. A NULL (as opposed to NOT NULL) bytedict column consumes an additional 1 bit per value, even for varchar, which normally consumes an additional 1 byte per value for its NULL flag.
Redshift is a column-store relational database, which means that each column in a table is stored contiguously.
It is often the case that the data in a single column has similar characteristics - for example, the values might all be names, or ages, or integers within a given range; in other words, data which is far from randomly distributed across the value range of the column's data type.
This provides an opportunity for unusually effective data compression, as well as an opportunity to blunder terribly. Redshift offers a range of data compression methods (known in the Redshift documentation as “encodings”), most of which work spectacularly well with, and only with, data which expresses the characteristics necessary for that compression method, and which often work spectacularly badly with data lacking those characteristics.
It is therefore necessary to understand the type of data suited to each of the data compression methods offered by Redshift, as well of course as the behaviours and limitations of those methods, so that good choices can be made when selecting data compression for columns.
This document is one in a series, each of which examines a single data compression method offered by Redshift; this document investigates the bytedict encoding.
An unsorted table is created which stores all rows on a single slice, and which possesses an int8 not null column which uses bytedict encoding.
The bytedict dictionary is documented as holding 256 values, so this column is populated with the values 0 to 255, inclusive both ends, to fill the dictionary.
Then rows are inserted (using a self-join, with a full vacuum and analyze after every insert), using the value 0, which is expected to be in the dictionary, until the first block is full (which is known by a second block being formed).
We then examine how many rows are present in the first block.
This process is repeated, dropping and remaking the table every time, but with the values 254, 255 and 256, checking the size of the dictionary.
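The actual test script is available (see the Results section); purely as an illustration, a minimal sketch of the int8 test might look something like the following. The table and column names are hypothetical, and the real script differs in its details - for example, in exactly how it arranges for all rows to land on a single slice.

```sql
-- Hypothetical sketch of the int8 test (not the actual test script).
-- A constant distribution key is one way to place every row on a single slice.
create table bytedict_int8_test
(
  dist_key  int2 not null encode raw,
  val       int8 not null encode bytedict
)
diststyle key
distkey( dist_key );

-- Populate the dictionary with the first 256 distinct values, 0 to 255
-- (written out in full by the real script; abbreviated here).
insert into bytedict_int8_test ( dist_key, val ) values ( 1, 0 ), ( 1, 1 ) /* ... */, ( 1, 255 );

-- Grow the table with the chosen insert value (0, 254, 255 or 256), with a
-- full vacuum and analyze after every insert, stopping once a second block
-- appears.  The real script uses a self-join; simple doubling is shown here.
insert into bytedict_int8_test ( dist_key, val )
select dist_key, 0
from   bytedict_int8_test;

vacuum full bytedict_int8_test;
analyze bytedict_int8_test;

-- Count the rows held in each block of the bytedict column
-- (user columns are numbered from 0, so val is column 1).
select bl.blocknum,
       bl.num_values
from   stv_blocklist bl
       join stv_tbl_perm tp on tp.id = bl.tbl and tp.slice = bl.slice
where  trim( tp.name ) = 'bytedict_int8_test'
and    bl.col = 1
order by bl.blocknum;
```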
The next tests focus on varchar.
The first creates an unsorted table which, as before, stores all rows on a single slice, and which possesses a varchar(1) not null column using bytedict encoding.
A single 1 byte string is created and inserted as many times as possible into the table, using a self-join, with a full vacuum and analyze after each insert, until the first block is full (known by a second block coming into existence).
We then count the number of rows in the first block.
This test is then repeated, but with the DDL indicating a 65535 length varchar. The string used remains 1 byte in length.
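Again purely as an illustration, the varchar test tables might be sketched as follows (hypothetical names; the NULL-able variant, which also appears in the results, is introduced later, in the Discussion):

```sql
-- Hypothetical sketch of the varchar tests (not the actual test script);
-- only the DDL of the val column changes between the tests.
create table bytedict_varchar_1_nn
(
  dist_key  int2 not null encode raw,
  val       varchar(1) not null encode bytedict
)
diststyle key distkey( dist_key );

create table bytedict_varchar_65535_nn
(
  dist_key  int2 not null encode raw,
  val       varchar(65535) not null encode bytedict
)
diststyle key distkey( dist_key );

create table bytedict_varchar_1_null
(
  dist_key  int2 not null encode raw,
  val       varchar(1) encode bytedict        -- NULL-able variant (see the Discussion)
)
diststyle key distkey( dist_key );

-- In every case the string actually stored is a single one-byte string,
-- inserted repeatedly (by self-join, as in the int8 test) until a second
-- block appears.
insert into bytedict_varchar_1_nn ( dist_key, val ) values ( 1, 'a' );
```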
The tests seem fairly minimal, but in fact knowing what was inserted and the number of rows allows us to deduce the answers to all the open bytedict questions.
The results are given here for ease of reference, but they are primarily presented, piece by piece along with explanation, in the Discussion.
See Appendix A for the Python pprint dump of the results dictionary.
The script used to generate these results is designed for readers to use, and is available here.
Test duration, excluding server bring-up and shut-down, was 65 seconds.
Insert Value | Rows | Block Size | Bytes/Value |
---|---|---|---|
0 | 1046405 | 1048576 | 1.002075 |
254 | 1046405 | 1048576 | 1.002075 |
255 | 116495 | 1048576 | 9.001039 |
256 | 116495 | 1048576 | 9.001039 |
Insert Value | Rows found in Dict | Rows not found in Dict | Total Bytes for Rows | Unused Space |
---|---|---|---|---|
254 | 1046405 | 0 | 1046405 | 2171 |
255 | 255 | 116240 | 1046415 | 2161 |
Insert Value | Bytes Remaining | Dict Size | Unused Space |
---|---|---|---|
254 | 2171 | 2040 | 131 |
255 | 2161 | 2040 | 121 |
DDL | NULL | Rows |
---|---|---|
1 | NOT NULL | 1048455 |
65535 | NOT NULL | 917387 |
1 | NULL | 931960 |
We begin with the official docs, which can be found here.
The docs state that each block has a table, with 256 entries, each entry storing a one-byte value which is the index number of the entry and also a value from the column, where when that value is seen in the block, it is replaced by the one byte value from the table.
This doesn’t make sense, and I suspect it is a garbled version of what you would think actually happens, which is that the table stores 256 values, and when those values are seen in the data, they are replaced by a one byte index (implicitly known) which is their position in the table.
There’s no mention of how the values are chosen, or if they can be replaced. It’s not clear if the table is stored in the block or not (the docs say “a separate dictionary of unique values is created”). It is stated that if a value is found which is not in the table, it is stored in its full, uncompressed length, but there is no mention of any overheads in this case.
So the questions are;

1. How are the values in the dictionary chosen, and can they be replaced once chosen?
2. Is the dictionary stored in the block, or held separately?
3. What overhead, if any, is there for a value which is not in the dictionary?
My guess is that the values are chosen on the basis of being the first 256 distinct values in the column.
The first test is to create an unsorted table which stores all its blocks on one slice, with a not null int8 column using bytedict. I first store the values 0 to 255, inclusive both ends, which should fill the dictionary, and then fill the rest of the first block with the value 0, which should be in the dictionary.
I then repeat this test, but with a few more values - 254, 255 and 256 (from knowledge obtained from prior experimentation).
The goal here is to store a value which is outside of the dictionary, which should be stored with the normal 8 bytes of an int8, but we will also then see what overhead there is for being outside of the dictionary.
These are the numbers we find, where the “Insert Value” is the single value being used to populate the block once the dictionary has been written.
Insert Value | Rows | Block Size | Bytes/Value |
---|---|---|---|
0 | 1046405 | 1048576 | 1.002075 |
254 | 1046405 | 1048576 | 1.002075 |
255 | 116495 | 1048576 | 9.001039 |
256 | 116495 | 1048576 | 9.001039 |
We observe;

1. With insert values 0 and 254, each value consumes just over 1 byte; these values are in the dictionary, and so are stored as a one-byte index.
2. With insert values 255 and 256, each value consumes just over 9 bytes; these values are not in the dictionary, and so are stored in full (8 bytes for an int8) plus what looks like a one-byte overhead.
3. The value 255 was one of the first 256 distinct values inserted, yet it is not in the dictionary; the dictionary therefore holds only 255 values.
Knowing all this now, we can examine this data a bit more deeply.
Insert Value | Rows found in Dict | Rows not found in Dict | Total Bytes for Rows | Unused Space |
---|---|---|---|---|
254 | 1046405 | 0 | 1046405 | 2171 |
255 | 255 | 116240 | 1046415 | 2161 |
We know the size of a block (one megabyte), and we can directly compute the space taken by rows (one byte for every row whose value is in the dictionary, nine bytes for every row whose value is not). Working out the number of rows found in the dictionary and the number not found, we know the total number of bytes used by rows, and we find there is some unused space.
A dictionary for int8 of 255 elements, each 8 bytes, is 2040 bytes.
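As a quick sanity check, the accounting for the insert value 254 case can be reproduced directly (a sketch; the figures are taken from the results tables, and the dictionary size here assumes 255 int8 entries of 8 bytes each - an assumption which is revisited below):

```sql
-- Space accounting for the insert value 254 case (figures from the results tables).
select
  1048576                              as block_size_bytes,
  1046405 * 1                          as bytes_used_by_rows,       -- every row is a one-byte dictionary index
  1048576 - (1046405 * 1)              as bytes_remaining,          -- 2,171
  255 * 8                              as assumed_dict_size_bytes,  -- 2,040
  1048576 - (1046405 * 1) - (255 * 8)  as unused_space;             -- 131
```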
It rather looks like the dictionary is in the block.
That would make sense, in fact, in general: blocks are independently compressed, and from a software engineering point of view you’d want them to be wholly independent - whenever you have a block, you have in your hand everything you need to access and use the data in that block.
A “separate dictionary”, in the sense of being separate from the block, wouldn’t make sense, both from that design perspective and from a fundamental Big Data point of view. Whenever you picked up a bytedict block, you would then have to do some extra work to find the dictionary - and were that to involve a disk seek, then death and ruin - you’ve just lost your capability to process Big Data in a timely manner.
If we take it then that the dictionary is in the block, we note we have a little unused space.
Insert Value | Bytes Remaining | Dict Size | Unused Space |
---|---|---|---|
254 | 2171 | 2040 | 131 |
255 | 2161 | 2040 | 121 |
That’s quite curious. There’s room for some more values to be stored, whether they are inside or outside of the dictionary. What gives?
So, all disk I/O in Redshift is in one megabyte blocks, which is to say, 1,048,576 bytes. I note in the official docs about the geometry type that one of the limitations is that the maximum size of a geometry value is 1,048,447 bytes, which leaves 129 bytes unaccounted for.
Interesting, eh? that’s almost what we see here, and it has the feel of a per-block header to me.
So now we have a pretty good idea of what’s going on with bytedict;

1. The dictionary holds 255 values, populated by the first distinct values in the block.
2. The dictionary is stored in the block.
3. A value which is in the dictionary is replaced by a one-byte index into the dictionary.
4. A value which is not in the dictionary is stored in full, plus a one-byte overhead.
5. There is a small amount of unused space in a full block, which has the feel of a per-block header.
Wouldn’t it have been nice if all this was just in the documentation in the first place?
By the way, as an aside, given that NULL is implemented for almost all data types as a one bit flag for each value, couldn’t not-in-dictionary have been implemented as a one bit flag also?
A few more questions come to mind;

1. What happens with varchar? the dictionary seems to be index based, but varchar values vary in length. It’s not obvious how those two work together.
2. Is there still a bit (non-varchar) or a byte (varchar) which is used as a flag to indicate the value is NULL?

These questions can be answered with three more tests; create a table with a varchar(1) not null column, storing a 1 byte string as many times as possible, and then see how many rows are in the first block. Repeat this, but change the DDL to ‘null’. Then repeat again, still with a 1 byte string, but now with the DDL specifying a varchar(65535) not null.
DDL | NULL | Rows |
---|---|---|
1 | NOT NULL | 1048455 |
1 | NULL | 931960 |
65535 | NOT NULL | 917387 |
This is fascinating!
Okay, so, let’s begin with the first and third rows, where all that changes is the length in the DDL - here we see that even though in both cases the actual string is 1 byte, when the DDL length is 65535 we have 131,068 fewer values stored in the block.
That demonstrates - although seemingly a bit strangely, given the difference in rows is much larger than the difference in DDL length - that the dictionary entries store the full length of the DDL.
The difference in the number of rows is 131,068.
The size of a dictionary entry when the DDL is 65535 is, naturally enough, 65535; and we immediately note that the difference in the number of rows is almost exactly the size of two dictionary entries, which is 65535*2 = 131,070.
It looks, the two byte difference notwithstanding, awfully like two dictionary entries have been allocated, even though there was only one unique string.
I guess the extra entry is being used to indicate the end of the dictionary, but it seems like an improper method to do so, since it is costly with long data types.
This would explain why the documentation states the dictionary is 256 entries, when really it is 255. It’s because the final entry is being used to mark end-of-dictionary.
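The arithmetic, reproduced as a quick check (figures from the results table above):

```sql
-- Two full-size dictionary entries account almost exactly for the missing rows.
select
  1048455 - 917387  as row_difference,          -- 131,068 fewer one-byte rows with the varchar(65535) DDL
  65535 * 2         as two_dictionary_entries;  -- 131,070 bytes
```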
This also means the math in the tables for the int8 work above is slightly wrong, since it assumes there are 255 entries in the dictionary, when in fact there are 256 (the end-of-dictionary marker being a full-size entry). The dictionary is therefore 2048 bytes, not 2040.
Now we can turn to the first and second rows, which examine the behaviour of NULL.
Normally, each value in a varchar column which allows NULL carries a one byte overhead, used to hold a flag which is set when the value is NULL.
We see that with a 1 byte NOT NULL string, and bytedict, we have 1048455 values in a block.
When we change to NULL, we drop to 931,960 values.
It is immediately apparent that there is not a one byte flag per value; if there was, the size of each value would have doubled, and so the number of rows would be about 500,000 or so.
Similarly, it is also apparent that there still is a NULL marker on a per-value basis, because the number of values in a block has decreased; each value is now taking up a little more space.
In fact, if we look at the numbers, there are 1048455 - 931960 = 116,495 fewer rows in the block, where each row is one byte in length.
The number of bits in 116495 bytes is 116495 * 8 = 931,960.
We also see the number of rows in the NULL block is 931,960.
We conclude that the NULL marker for bytedict encoded values is 1 bit in length, regardless of the data type.
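As a check, the arithmetic can be reproduced directly (a sketch; the figures are those from the results table above):

```sql
-- The space freed by the lost rows is exactly one bit per remaining row.
select
  1048455 - 931960        as rows_lost,           -- 116,495 fewer rows, i.e. 116,495 bytes at one byte per row
  (1048455 - 931960) * 8  as bits_in_that_space,  -- 931,960 bits
  931960                  as rows_in_null_block;  -- one bit per row in the NULL-able block
```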
So, we can infer the following;
1. With varchar, the dictionary is allocating the DDL length of the string, not the actual length. I suspect this may also mean the four byte header is being removed.
2. Columns using bytedict encoding which are NULL (as opposed to NOT NULL) consume an additional 1 bit per value, even with varchar, which normally consumes 1 byte per value.

We can now also make sense of the warning given in the documentation;
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly recommend using a different encoding, such as LZO, for VARCHAR columns.
This warning - like all in the docs - reminds me of an observation made by Richard Feynman in one of his memoirs. He said, basically, that giving people sets of rules, without explaining the principles behind those rules, doesn’t work; people can follow the rules, but because they do not know why those rules exist, they are still going to do lots of things which are wrong.
(This was something which happened during the Manhattan Project. There was a plant where all the research was done, and a plant where the bombs were made. The Army for security reasons wanted the plant where the bombs were made to know nothing about what they were doing - which was a problem, because when you’re handling radioactive materials, there are safety issues you need to know about, or maybe people die or maybe you get a big explosion. The bomb building plant had been given a whole bunch of safety rules, like “don’t put too much Uranium in one pile”. They were following all the rules - but they were doing things like having one pile of Uranium in one room, against the wall, and another pile in the neighbouring room, up against the other side of the same wall. They were following the rules, but getting it wrong, because they didn’t understand the principles.)
However, given what we’ve now learned about the dictionary, we can interpret the warning: the problem is that with varchar, developers often specify long lengths in the DDL but then use short strings. If you do this with bytedict, the dictionary will be huge, taking up most of the space in the block, almost all of it with empty space.
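To illustrate the failure mode with a hypothetical table (not from the test script):

```sql
-- Hypothetical example: a wide DDL length with short actual strings.
-- Per the findings above, with varchar(65535) the bytedict dictionary holds
-- 15 full-size entries (15 * 65535 = 983,025 bytes), leaving very little of
-- the one megabyte block for row data, even though every string is short.
create table events_bytedict
(
  event_name varchar(65535) not null encode bytedict
);

-- A safer choice for wide varchar columns, as the documentation suggests.
create table events_lzo
(
  event_name varchar(65535) not null encode lzo
);
```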
The dictionary contains 256 entries, but only 255 are used; the final entry is an end-of-dictionary marker.
The dictionary is populated by the first 255 distinct values in the block.
The dictionary holds a copy of the values which are placed in the dictionary (and so the longer the data type, the more space the dictionary consumes).
With the varchar type, the dictionary stores the DDL length of the data type, not the length of the actual string. In the extreme case, with varchar(65535), the dictionary holds 15 entries, each 65535 bytes in length (the final entry, despite being full-size, being used for the end-of-dictionary marker), and is 983,025 bytes in length.
The dictionary is stored in the block.
When a value is in the dictionary, it is replaced by a one-byte value (which is the index of its position in the dictionary).
When a value is not in the dictionary, it is written out in full, with an additional one-byte marker to indicate a non-compressed value.
When a bytedict encoded column is NULL (as opposed to NOT NULL), an additional 1 bit is stored per value, used as a flag to indicate whether the value is NULL or not. This is true for all data types, including varchar, where varchar normally uses 1 byte for this flag.
Blocks when full (in the sense that the insert of one more record would form a new block) have a little unused space; for example, int8 not null where all values are in the dictionary has 123 bytes of unused space.
The documentation contains a warning, thus;
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly recommend using a different encoding, such as LZO, for VARCHAR columns.
What this is actually referring to is that the full DDL length of a varchar is stored in the dictionary. If you make a varchar with a large DDL length, and then store short strings, the dictionary will be huge, but mainly composed of empty space.
The same is true for long char types, since they too are always stored at their full DDL length, but char only goes up to 4096.
Does Redshift scan the block for the values which will form the dictionary, build the dictionary, and then populate the block? or does Redshift build the dictionary as it goes along? this can make a big difference; imagine we have a column which is a long varchar, where the block begins with many repeated copies of the same value, and then later has many distinct values.
If the dictionary is built first, it will be full before any encoding is done, there will not be much space left for values, and so not many of the repeated values will be stored; but if the dictionary is built as Redshift goes along, it will begin with just one entry and have lots of room for the repeated values, which may then occupy most of the block and leave no room for the later distinct values.
Remembering that all I/O is in one megabyte blocks, that Redshift always reads an entire block and writes an entire block, and that whenever a block is written it is being stored permanently and so must be in a sane, readable state, I think it’s going to depend in part on the way in which rows are being added to a block.
If rows are being added one by one, by say single-row insert, then necessarily the dictionary is built as rows arrive; each row which can go into the dictionary does so at the time it is presented, and all later rows which match a dictionary entry are encoded using it.
However, we can imagine a more reasonable and common scenario where a COPY is being executed, or a VACUUM, or a multi-row INSERT, and so we have a stream of incoming rows. Is the dictionary built up as rows arrive, or is the incoming flow of rows scanned (and buffered) until the dictionary is full? is there even an actual flow of rows? that’s how I would expect the implementation to be designed, but I could be completely wrong.
So I’m not sure, and I need to test, but I need to think of a decent, simple, robust way to test this, and how it may vary in different situations.
1. How is the dictionary populated when rows arrive from COPY, multi-row INSERT or VACUUM - by scanning ahead, or as the rows arrive?
2. Confirm that the unused space found in a full block is in fact due to being a per-block header.

Note these results are completely unprocessed; they are a raw dump of the results, so the original, wholly unprocessed data, is available.
{'proofs': {'dc2.large': {2: {'bytedict_results': {'int8': {0: 1046405,
254: 1046405,
255: 116495,
256: 116495},
'varchar': {'ddl_1_string_1': 1048455,
'ddl_1_string_1_null': 931960,
'ddl_65535_string_1': 917387}}}}},
'tests': {'dc2.large': {2: {}}},
'versions': {'dc2.large': {2: 'PostgreSQL 8.0.2 on i686-pc-linux-gnu, '
'compiled by GCC gcc (GCC) 3.4.2 20041017 (Red '
'Hat 3.4.2-6.fc3), Redshift 1.0.32574'}}}
I am a C programmer - kernel development, high performance computing, networking, data structures and so on.
I read the C. J. Date book, the classic text on relational database theory, and having learned the principles, wrote a relational database from scratch in C, which purely by chance set me up quite nicely for what came next, moving into data engineering in late 2011, when I joined two friends in their startup as the back-end engineer.
In that startup, I began using Redshift the day it came out, in 2012 (we had been trying to get into the beta programme).
We were early, heavy users for a year and a half, and I ended up having monthly one-to-one meetings with one of the Redshift team managers; one or two features which are in Redshift today originate from suggestions made in those meetings, such as the distribution style ALL.
Once that was done, after a couple of years of non-Redshift data engineering work, I returned to Redshift work, and then in about mid-2018 contracted with a publisher to write a book about Redshift.
The book was largely written but it became apparent I wanted to do a lot of things which couldn’t be done with a book - republish on every new Redshift release, for example - and so in the end I stepped back from the contract and developed the web-site, where I publish investigations into, and ongoing monitoring of, Redshift.
So for many years now I’ve been investigating Redshift sub-systems full-time, one by one, and this site and these investigations are as far as I know the only source of this kind of information about Redshift.
I provide consultancy services for Redshift - advice, design, training, getting failing systems back on their feet pronto, the usual gamut - but in particular offer a Redshift cluster cost reduction service, where the fee is, and only is, one month of the savings made.
Broadly speaking, to give guidance, savings are expected to fall into one of two categories; either something like 20%, or something like 80%. The former is for systems where the business use case is such that Redshift cannot be operated correctly, and this outcome requires no fundamental re-engineering work; the latter is for systems where Redshift can be operated correctly, and usually requires fundamental re-engineering work (which you may or may not wish to engage in, despite the cost savings, in which case we’re back to the 20%).
Details and contact information are on the web-site.