9th August 2021
The number of columns in a table can affect query performance even when those columns are not used by the query. The slowdown is very small for dc2.large and ra3.xlplus, but is much larger for ds2.xlarge. However, for normally sized tables, up to say 100 columns, even on ds2.xlarge the slowdown is very small.
Redshift is a column-store database and as such tables are broken up into their constituent columns and each column is stored separately on disk.
A query typically uses only some rather than all of the columns in a table, and naturally the more columns are used, the slower the query, since more data has to be read from disk and more work has to be done to perform materialization, which is the task of joining the separated columns back up into contiguous rows for return to the client.
However, there is a question as to whether or not the columns which are not used by a query have a performance impact on that query. If they do, then the more columns a table has, the more harmful it is for performance, even though those columns are never used.
In principle those extra columns, being unused, should have no impact on query performance. The question is whether or not this is true in practice.
The basic method is to create a test table, issue a test query on the table, and measure the time taken for the query to execute.
There are four test queries.
Each test query is issued five times, with the slowest and fastest results being discarded.
The test table is dropped, created and repopulated for every iteration, and for every test query; which is to say, all the test tables are identical, but each table only ever has one query issued on it.
The test table is varied, by the number of columns (2, 800, and 1600) and the number of blocks per column, per slice (1 and 8), with the five iterations of the four queries repeated on every variant of the table.
The table uses key distribution. The even distribution style has off-by-one bugs when it comes to distributing rows, so it cannot be used, as you end up with an uneven number of blocks on the different slices.
The value stored in the distribution key column, and the number of rows, are chosen to ensure each slice has an equal number of exactly completely full blocks. Validation checks are present in the test code to ensure each slice does in fact have the correct number of blocks, each with the correct number of rows.
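A check of this kind can be made against the stv_blocklist system table; the query below is a sketch only, not the actual test code (the table name and the choice of col = 0 for the first data column are illustrative assumptions).

```sql
-- Sketch: blocks and rows per slice for the first column of the test table.
-- Table name and column number are illustrative assumptions.
select bl.slice,
       count(*)           as block_count,
       sum(bl.num_values) as row_count
from   stv_blocklist bl
       join svv_table_info ti on ti.table_id = bl.tbl
where  ti."table" = 'test_table'
and    bl.col = 0
group  by bl.slice
order  by bl.slice;
```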
For all columns the data type is bigint, not null is specified, and the encoding is raw; the table is fully vacuumed and analyzed. Result caching is disabled, and the analysis threshold is set to 0.
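To make the table definition concrete, a variant of the test table might look like the sketch below; it is illustrative only (the table and column names are invented, and only two columns are shown rather than 800 or 1600), with the first column assumed to be the distribution key.

```sql
-- Illustrative sketch of the two column variant of the test table;
-- the real test code is not shown in this white paper.
create table test_table
(
  column_0  bigint  not null  encode raw,
  column_1  bigint  not null  encode raw
)
diststyle key
distkey( column_0 );
```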
The tests described above are performed on three clusters, all two node: a dc2.large, a ds2.xlarge and an ra3.xlplus.
How long a query took to execute is taken directly from the system tables, and so reflects the times internal to Redshift for when query execution started and ended (queuing and compile times are excluded).
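The exact timing query is not shown here, but one way to obtain execution time which excludes queue time is stl_wlm_query; this is an assumption about method, not the white paper's actual query, and compile time would still need to be accounted for separately (for example via svl_compile).

```sql
-- Possible approach (an assumption): execution time in seconds for a given
-- query ID, excluding queue time; total_exec_time is in microseconds.
select query,
       total_exec_time / 1000000.0 as exec_seconds
from   stl_wlm_query
where  query = 123456;   -- hypothetical query ID
```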
The four test queries are;

1. sum() of the first column
2. sum() of the last column
3. sum() of the first column, when the second column, with one subtracted from its value, is greater than -10
4. sum() of the first column, when the last column, with one subtracted from its value, is greater than -10

The purpose of the “subtract one” is to defeat use of the Zone Map, and the comparison, greater than -10, ensures all rows are used in the sum(). This is because every column in a row has the same value, and that value is always 0 or greater.
These tests then measure how long it takes to sum() the first column, and the last column, and to access pairs of columns (first and second, first and last), as we vary the number of columns, the amount of data, and the node type.
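In SQL terms, the four test queries would look something like the sketch below; the table and column names are illustrative assumptions (a 1600 column table with columns column_0 through column_1599 is imagined).

```sql
-- Illustrative sketches of the four test queries (names are assumptions).

-- 1. sum() of the first column
select sum( column_0 )    from test_table;

-- 2. sum() of the last column
select sum( column_1599 ) from test_table;

-- 3. sum() of the first column, filtered on the second column;
--    the "- 1" defeats the Zone Map, and "> -10" keeps every row
select sum( column_0 )    from test_table where column_1 - 1    > -10;

-- 4. sum() of the first column, filtered on the last column
select sum( column_0 )    from test_table where column_1599 - 1 > -10;
```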
See Appendix A for the Python pprint dump of the results dictionary.
The X-axis is the number of exactly full blocks per column, per slice, in the table.
The Y-axis is the number of columns in the table.
The individual results are “mean/standard deviation” of the duration of the test queries in seconds.
As ever, five trials occurred, with the slowest and fastest being discarded.
Test duration was 14,375 seconds.
dc2.large, “first”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.007/0.000 | 0.013/0.000 |
| 800 | 0.008/0.000 | 0.012/0.000 |
| 1600 | 0.008/0.000 | 0.013/0.000 |
dc2.large, “last”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.011/0.003 | 0.012/0.000 |
| 800 | 0.008/0.000 | 0.013/0.000 |
| 1600 | 0.008/0.001 | 0.013/0.000 |
dc2.large, “first and second”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.012/0.003 | 0.018/0.002 |
| 800 | 0.009/0.000 | 0.025/0.000 |
| 1600 | 0.009/0.001 | 0.026/0.001 |
dc2.large, “first and last”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.009/0.001 | 0.019/0.001 |
| 800 | 0.009/0.000 | 0.017/0.000 |
| 1600 | 0.008/0.000 | 0.018/0.002 |
ds2.xlarge, “first”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.006/0.000 | 0.010/0.001 |
| 800 | 0.006/0.000 | 0.010/0.001 |
| 1600 | 0.007/0.000 | 0.012/0.001 |
ds2.xlarge, “last”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.006/0.000 | 0.011/0.002 |
| 800 | 0.007/0.000 | 0.010/0.000 |
| 1600 | 0.007/0.000 | 0.011/0.001 |
ds2.xlarge, “first and second”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.006/0.000 | 0.014/0.001 |
| 800 | 0.007/0.000 | 0.091/0.002 |
| 1600 | 0.007/0.000 | 0.145/0.014 |
ds2.xlarge, “first and last”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.007/0.000 | 0.015/0.001 |
| 800 | 0.008/0.000 | 0.015/0.000 |
| 1600 | 0.008/0.001 | 0.014/0.001 |
ra3.xlplus, “first”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.008/0.002 | 0.009/0.001 |
| 800 | 0.006/0.000 | 0.008/0.001 |
| 1600 | 0.006/0.000 | 0.009/0.001 |
ra3.xlplus, “last”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.005/0.000 | 0.008/0.001 |
| 800 | 0.006/0.000 | 0.009/0.001 |
| 1600 | 0.006/0.000 | 0.008/0.001 |
ra3.xlplus, “first and second”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.006/0.000 | 0.011/0.002 |
| 800 | 0.006/0.001 | 0.019/0.000 |
| 1600 | 0.007/0.000 | 0.019/0.000 |
ra3.xlplus, “first and last”

| Columns | 1 block | 8 blocks |
|---|---|---|
| 2 | 0.006/0.001 | 0.011/0.001 |
| 800 | 0.006/0.000 | 0.012/0.001 |
| 1600 | 0.007/0.000 | 0.011/0.001 |
The first point of note is that for all tests with one block per column per slice, there are no significant differences in performance, for any of the SQL test queries, no matter how many columns there are, no matter which node type is used.
However, when we move to the eight block tests, differences do emerge.
The query times are of course longer, because eight blocks are being read per column per slice. What matters is not how long queries take as such, but differences in how long the queries take as the number of columns (and node types) vary.
Looking first at dc2.large, we see that when accessing the first column on its own, or the last column on its own, the number of columns makes no difference.
However, we have next the two test queries which access a pair of columns. As with increasing the number of blocks, these queries will be slower - they are reading two columns not one - but as before, we are only looking at the difference in performance as we vary the number of columns.
What we find is that when accessing the first and second column, moving from two columns to 800 or 1600 results in a slowdown, from 0.018 seconds to 0.025 and 0.026 seconds, respectively. This is very slight, but it seems to be a real change.
However, when looking at accessing the first and last column, we again see no significant change in performance as the number of columns changes.
Very tentatively then, since we have so little evidence so far, it looks like the last columns in a table are accessed most quickly, and the first most slowly.
Moving on to ds2.xlarge, as before, accessing the first or last column makes no difference, and accessing first and last shows no significant change.
Now, as before, accessing first and second shows a slowdown, but the slowdown is far more pronounced with this node type. Query time goes from 0.014, to 0.091, to 0.145 seconds. That’s a lot.
Finally, moving to ra3.xlplus, we see the same pattern, but with a change in performance for first and second of much the same order as for dc2.large.
When a query accesses a single column only, the number of columns in a table has no effect on performance.
When a query accesses the first and last columns, we also see the number of columns in a table has no effect on performance.
However, when a query accesses the first and second columns, we see the number of columns in a table does have an effect on performance. As the number of columns increases, the query becomes slower.
This slowdown effect is very small on dc2.large and ra3.xlplus, with the query duration going from about 0.011 seconds for 2 columns to about 0.019 seconds for 1600 columns, but much larger on ds2.xlarge, which goes from 0.014 seconds to 0.145 seconds.
Tentatively, we can generalize this: the more columns there are between an accessed column and the final column in the table, the slower that column is to access; and so the more columns a table has, the slower accessing a column becomes, because there are increasingly more columns between it and the final column in the table.
In other words, the closer a column is to the end of the table, the faster it is to access, and the closer a column is to the start of the table, the slower.
The difference in performance is very small on dc2.large and ra3.xlplus, even with the worst case of 1600 columns, but looks considerable on ds2.xlarge, although having said that, it’s a pretty rare table which exceeds even 30 columns, and for tables of this size, the slowdown even on ds2.xlarge is going to be very small.
When you investigate Redshift, there are always unexpected findings.
The distribution style even appears to have off-by-one errors. When populating four slices with one block each, where each block has 130,994 rows (the maximum number of bigint values with raw encoding per block), even will not give you one block per slice; instead, two slices each have one block which is short one record, and the other two slices each have two blocks, one completely full and the next holding one record.
It won’t matter in normal use, but it seems indicative of a lack of testing, or of testing but a lack of care to fix such errors, and it was a problem for me with this testing work, where I need to exactly control blocks per slice. It took some hours to implement my own solution using key based distribution, and this solution has unavoidable drawbacks compared to using even.
Not quite an unexpected finding, since I knew this already from earlier work, but I’m listing it here as I had to work around this issue and would have discovered it here had I not already known of it: the slice_num() function appears, very rarely, to give the wrong slice number.
I want to use slice_num() because where I’m using key, I need to know which slice a given distribution key value will go to, so I can deliberately put rows on each slice.
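In principle, that mapping is a one-liner, along the lines of the sketch below (an assumption about usage, not the actual test code).

```sql
-- In principle: after inserting candidate distribution key values,
-- ask which slice each row ended up on.
select column_0, slice_num() as slice
from   test_table
order  by column_0;
```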
Where slice_num() is unreliable, what has to be done instead is to write a single row, check stv_blocklist to find the slice number, truncate the table, and repeat until a value is found for each slice.
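Per candidate distribution key value, the workaround looks something like the sketch below (the table, columns and the candidate value are illustrative assumptions).

```sql
-- Repeated for candidate distribution key values until every slice is covered.
truncate table test_table;
insert into test_table( column_0, column_1 ) values ( 42, 42 );

-- which slice did that single row land on?
select bl.slice
from   stv_blocklist bl
       join svv_table_info ti on ti.table_id = bl.tbl
where  ti."table" = 'test_table'
and    bl.col = 0;
```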
So to work around one bug, I had to work around another bug.
You can’t help but feel at times Redshift is a bit of a clunker.
I realised something which actually came into being a little while ago: when the devs introduced automatic sortkey selection, and made this the default, it became no longer possible to keep unsorted tables.
To be more exact, you can create an unsorted table, but such tables are created by not specifying a sorting type, and not specifying a sorting type now has Redshift set the sorting type to auto, which gives Redshift carte blanche to change the sorting type; so when you make an unsorted table, it can stop being an unsorted table at any time.
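You can see what Redshift has decided for a given table via svv_table_info, whose sortkey1 column reports the sort key; a sketch, on the assumption that tables left to automatic selection are reported as such there.

```sql
-- Inspect the sort key state of a table; tables created without an explicit
-- sort key are now subject to automatic sort key selection.
select "table", sortkey1
from   svv_table_info
where  "table" = 'test_table';
```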
This breaks existing systems, because deliberately using unsorted tables is a perfectly valid design choice (with small tables and many slices, they save a lot of disk space), and it also breaks my test work, because I’ve been using unsorted tables as they’re simpler to work with by not having an unsorted segment.
I have very strong feelings about this. The devs should never, repeat never, break existing systems - let alone do so silently - and they should never get in the way of system developers and admins running their own cluster.
The problem is, I suspect, that the devs don’t know when they’re making breaking changes; this isn’t the first time. I suspect all their work is based on fleet telemetry, and they’re not actually in touch with users at all.
This white paper was published on r/aws, leading to a review by one of the moderators, which identified two flaws in the test method. These were corrected, the test re-run, the white paper updated with the results, and then republished.
After the white paper was posted in r/aws, a subreddit mod reviewed it and pointed out flaws in the test method. As a result, the following changes were made;
Note these results are completely unprocessed; they are a raw dump, so the original, wholly unprocessed data is available.
{'proofs': {'dc2.large': {2: {}}, 'ds2.xlarge': {2: {}}, 'ra3.xlplus': {2: {}}},
'tests': {'dc2.large': {2: {1: {2: {'first': [0.00983100000000001,
0.007121,
0.007106,
0.007927,
0.006986],
'first and last': [0.009111,
0.007817,
0.008855,
0.013065,
0.007816],
'first and second': [0.009239,
0.021148,
0.015663,
0.010932,
0.008843],
'last': [0.013651,
0.006955,
0.007586,
0.05611,
0.012056]},
800: {'first': [0.007494,
0.007512,
0.007573,
0.008072,
0.00712],
'first and last': [0.008437,
0.008456,
0.008418,
0.008853,
0.009487],
'first and second': [0.00826,
0.008568,
0.008834,
0.008985,
0.019111],
'last': [0.017022,
0.007516,
0.007351,
0.007447,
0.007659]},
1600: {'first': [0.007455,
0.008847,
0.007901,
0.00799,
0.011692],
'first and last': [0.009078,
0.008211,
0.00825,
0.008508,
0.007939],
'first and second': [0.009466,
0.011271,
0.008583,
0.008835,
0.010071],
'last': [0.007188,
0.007826,
0.009507,
0.010988,
0.007804]}},
8: {2: {'first': [0.013255,
0.012184,
0.01709,
0.012838,
0.012969],
'first and last': [0.028075,
0.01886,
0.016114,
0.020439,
0.016953],
'first and second': [0.02386,
0.020475,
0.017573,
0.016527,
0.016873],
'last': [0.012078,
0.012822,
0.011611,
0.012327,
0.012086]},
800: {'first': [0.012186,
0.012387,
0.012328,
0.013211,
0.012549],
'first and last': [0.016758,
0.017129,
0.017125,
0.018622,
0.017386],
'first and second': [0.025634,
0.025044,
0.027201,
0.024404,
0.024922],
'last': [0.01231,
0.015804,
0.013119,
0.012416,
0.013035]},
1600: {'first': [0.012601,
0.014621,
0.012467,
0.012254,
0.012759],
'first and last': [0.020315,
0.016714,
0.016645,
0.021115,
0.016532],
'first and second': [0.035006,
0.024816,
0.026592,
0.026567,
0.02428],
'last': [0.013417,
0.015566,
0.013112,
0.012518,
0.012926]}}}},
'ds2.xlarge': {2: {1: {2: {'first': [0.010581,
0.006096,
0.005874,
0.006064,
0.005705],
'first and last': [0.007069,
0.007114,
0.00696,
0.021457,
0.006111],
'first and second': [0.00665499999999999,
0.006148,
0.008511,
0.005904,
0.005982],
'last': [0.005403,
0.005576,
0.00568,
0.005537,
0.005331]},
800: {'first': [0.009515,
0.006787,
0.00573,
0.006183,
0.006404],
'first and last': [0.007479,
0.006345,
0.007722,
0.007325,
0.009618],
'first and second': [0.007499,
0.007612,
0.007331,
0.007663,
0.006303],
'last': [0.006621,
0.006717,
0.008064,
0.006663,
0.006469]},
1600: {'first': [0.006767,
0.008588,
0.006473,
0.006884,
0.007529],
'first and last': [0.006403,
0.006559,
0.008803,
0.007665,
0.00828],
'first and second': [0.006299,
0.007696,
0.007924,
0.008282,
0.006877],
'last': [0.006611,
0.006213,
0.006902,
0.009258,
0.007416]}},
8: {2: {'first': [0.010172,
0.010276,
0.009098,
0.011165,
0.008997],
'first and last': [0.014892,
0.012017,
0.012556,
0.01606,
0.019128],
'first and second': [0.012179,
0.01791,
0.012364,
0.015408,
0.01507],
'last': [0.014448,
0.009436,
0.010099,
0.009261,
0.013047]},
800: {'first': [0.010728,
0.011137,
0.009314,
0.00939,
0.010986],
'first and last': [0.015952,
0.012699,
0.016202,
0.01528,
0.015233],
'first and second': [0.087814,
0.093193,
0.093329,
0.090595,
0.087897],
'last': [0.01018,
0.009748,
0.009193,
0.009391,
0.009604]},
1600: {'first': [0.011818,
0.015312,
0.012704,
0.011397,
0.011389],
'first and last': [0.015404,
0.012765,
0.015671,
0.012296,
0.012886],
'first and second': [0.092846,
0.158659,
0.157882,
0.126101,
0.150604],
'last': [0.011752,
0.009684,
0.009571,
0.011116,
0.011377]}}}},
'ra3.xlplus': {2: {1: {2: {'first': [0.009755,
0.010415,
0.005871,
0.008958,
0.005723],
'first and last': [0.009507,
0.00601,
0.006902,
0.005113,
0.005115],
'first and second': [0.00786500000000023,
0.00534,
0.005758,
0.005071,
0.005595],
'last': [0.005239,
0.004884,
0.005239,
0.005269,
0.006372]},
800: {'first': [0.005928,
0.005915,
0.005506,
0.005278,
0.006179],
'first and last': [0.006394,
0.006603,
0.00621,
0.006334,
0.007127],
'first and second': [0.005769,
0.005586,
0.006847,
0.008837,
0.005641],
'last': [0.004933,
0.005238,
0.007724,
0.005784,
0.005889]},
1600: {'first': [0.006202,
0.005788,
0.005208,
0.005904,
0.006407],
'first and last': [0.005717,
0.006912,
0.007767,
0.008512,
0.007253],
'first and second': [0.006739,
0.006878,
0.005955,
0.006739,
0.007418],
'last': [0.00646,
0.006753,
0.005414,
0.00586,
0.005476]}},
8: {2: {'first': [0.009762,
0.009675,
0.010701,
0.007362,
0.007175],
'first and last': [0.011825,
0.01177,
0.011457,
0.009661,
0.009611],
'first and second': [0.01092,
0.014204,
0.013526,
0.009901,
0.009765],
'last': [0.00701,
0.006871,
0.010002,
0.008035,
0.008293]},
800: {'first': [0.012611,
0.009181,
0.007319,
0.008433,
0.007569],
'first and last': [0.013254,
0.012883,
0.011015,
0.012974,
0.009578],
'first and second': [0.018516,
0.018681,
0.019176,
0.018859,
0.019516],
'last': [0.009171,
0.007369,
0.007532,
0.011004,
0.009196]},
1600: {'first': [0.009756,
0.007568,
0.010311,
0.007383,
0.008999],
'first and last': [0.010471,
0.010031,
0.012248,
0.011519,
0.01256],
'first and second': [0.018623,
0.019316,
0.019296,
0.018557,
0.019013],
'last': [0.007297,
0.007603,
0.011624,
0.008758,
0.007641]}}}}},
'versions': {'dc2.large': {2: 'PostgreSQL 8.0.2 on i686-pc-linux-gnu, '
'compiled by GCC gcc (GCC) 3.4.2 20041017 (Red '
'Hat 3.4.2-6.fc3), Redshift 1.0.29551'},
'ds2.xlarge': {2: 'PostgreSQL 8.0.2 on i686-pc-linux-gnu, '
'compiled by GCC gcc (GCC) 3.4.2 20041017 (Red '
'Hat 3.4.2-6.fc3), Redshift 1.0.29551'},
'ra3.xlplus': {2: 'PostgreSQL 8.0.2 on i686-pc-linux-gnu, '
'compiled by GCC gcc (GCC) 3.4.2 20041017 (Red '
'Hat 3.4.2-6.fc3), Redshift 1.0.29551'}}}
I am a C programmer - kernel development, high performance computing, networking, data structures and so on.
I read the C. J. Date book, the classic text on relational database theory, and having learned the principles, wrote a relational database from scratch in C, which purely by chance set me up quite nicely for what came next: moving into data engineering in late 2011, when I joined two friends in their startup as the back-end engineer.
In that startup, I began using Redshift the day it came out, in 2012 (we had been trying to get into the beta programme).
We were early, heavy users for a year and a half, and I ended up having monthly one-to-one meetings with one of the Redshift team managers; one or two features which are in Redshift today originate from suggestions made in those meetings, such as the distribution style ALL.
Once that was done, after a couple of years of non-Redshift data engineering work, I returned to Redshift work, and then in about mid-2018 contracted with a publisher to write a book about Redshift.
The book was largely written but it became apparent I wanted to do a lot of things which couldn’t be done with a book - republish on every new Redshift release, for example - and so in the end I stepped back from the contract and developed the web-site, where I publish investigation into, and ongoing monitoring of, Redshift.
So for many years now I’ve been investigating Redshift sub-systems full-time, one by one, and this site and these investigations are, as far as I know, the only source of this kind of information about Redshift.
I provide consultancy services for Redshift - advice, design, training, getting failing systems back on their feet pronto, the usual gamut - but in particular offer a Redshift cluster cost reduction service, where the fee is and only is one month of the savings made.
Broadly speaking, to give guidance, savings are expected to fall into one of two categories: either something like 20%, or something like 80%. The former is for systems where the business use case is such that Redshift cannot be operated correctly, and this outcome requires no fundamental re-engineering work; the latter is for systems where Redshift can be operated correctly, and usually requires fundamental re-engineering work (which you may or may not wish to engage in, despite the cost savings, in which case we’re back to the 20%).
Details and contact information are on the web-site.