In af-south-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, ap-southeast-3 (no ra3 nodes in this region), ca-central-1, eu-north-1, eu-south-1, eu-west-1, eu-west-3, me-south-1 and sa-east-1, dc2.large and ra3.xlplus nodes have received the one-third improvement in processor performance first seen in a number of other regions two weeks ago (other node types are not tested in these regions).
ap-east-1, a region where all node types are tested, has seen a one-third improvement in the processor benchmarks of all node types. Curiously, the dc2.8xlarge read benchmark became slow (0.14s vs the usual 0.06s).
Apart from that, all quiet on the Western Front.
https://www.amazonredshift-observatory.ch/cross_region_benchmarks/index.html
Compilation information all comes from the system table svl_compile, which originally recorded compilation times, but there has been so much change with regard to compilation over the years, without corresponding changes to that system table, that the meanings of what you see in that system table are now a complete mess.
Back when Redshift first came out, compilation always happened on the leader node, and there was a compiled segment cache on the leader node (note queries which run purely on the leader-node never compile; they are interpreted, rather than compiled, by the Postgres-descended leader node).
Typically, compilation time for a single segment on an unloaded dc2.large leader node was a few seconds, and so crazy queries (such as those issued by BI tooling) with dozens or even the best part of a hundred segments, on a busy leader node, would take several minutes to compile.
The compiled segment cache was kept in memory on the leader node, and so was small, but fast; if a segment in the cache could be re-used, it would be retrieved from the cache in about ten microseconds.
(Whether or not a segment can be re-used is a rather vague matter. It’s not just a case of checking the SQL and seeing if it matches the SQL of an existing segment; segments apparently are constructed rather like function calls - the internal code is fixed, but you can pass in values, and those values are used. So a segment compiled for one query can in principle be used by another query.)
When there was a cache hit, and so an already-compiled segment was re-used, obviously no compilation occurred, and the “compile time” (in quotes, as no compilation actually occurred) in the system table was exceedingly short - tens of microseconds. What this means is that the times in svl_compile are not really compile times, but something more like “how long it took to obtain this segment”.
Compilation occurred serially on a per-query basis; so multiple queries could be compiling concurrently, but each would be compiling one segment after another, and all segments compiled before a query began to execute.
Now, svl_compile has a column which indicates whether compilation occurred; if it did not, it must be that a segment cache served the segment.
Originally all was clear - either the leader node has compiled (and the column was “true”), or the segment came from the leader node segment cache (and then the column was “false”).
Then in time a second segment cache was added, external to the cluster, originally on a per-account basis but later made per-region (but always also per-Redshift-version - a segment compiled on one version of Redshift cannot be used by a different version of Redshift). Now the compilation flag, if False, meant either cache, and the only way to know which was to look at how long it took to get the segment: 10 or so microseconds meant the leader node cache, 300 or so microseconds meant the external cache.
(Already at this point svl_compile should have been improved so it could represent this new behaviour.)
AWS has always (as ever, though) spoken in the most fulsome terms of the efficacy of the segment caches. I recall claims of 95% hit rates, then 99% hit rates, then 100%.
In truth, in my experience, it’s very hit and miss; I would actually also call it erratic - and here you have to remember Redshift appears not to get much in the way of testing. I suspect the caches are actually unreliable, but we have no way of knowing, other than observing their behaviour from the outside.
For example, with one particular client, we ran a test where an ETL system ran the same UPDATE query 100 times, where the SQL did not change, but the values being updated did change.
48 of those queries were compiled, which took a few minutes in total.
One of the chaps I was working with told me he had, just once, seen the entire set of 100 queries run through in seven seconds - which is to say, although he did not realise what it meant, without any compilation at all.
I usually find that segments I have had compiled one day are often re-compiled a day or two later.
Another client ran an overnight test, with a moderate range of queries generated by BI tooling, generally varying only in the values used; they saw about 25% compilation.
My general view on the segment cache is that yes, it’s there, but you can’t predict what you’ll get from it.
Also, of course, being a cache, it’s no use at all for novel queries, and the flexibility of segments to be re-used in different queries is certainly not so great as to be able to address this problem.
As mentioned, the external cache is per-Redshift version. (I’ve not checked if it’s per-AZ, and I’ve not checked if it’s per-node-type.)
As such, when a new Redshift version is released, you’re facing an empty cache. You need to make sure your ETL system can handle the situation where all queries are compiling - and there’s no way to test for this, because there is no way to disable caching. This is a major design flaw, and to my eye symptomatic of a lack of real-world experience in the dev team. They simply do not know to think about this issue, but this is a critical issue for users.
Moving on; a couple more years passed, and what happened next was that some regions had compilation off-load added. This is a system whereby the leader node no longer performs compilation, but farms the work out to helpers, who concurrently compile, and then the leader node serially (but extremely quickly) retrieves those compiled segments.
Compilation off-load is the one major feature I’ve seen added to Redshift in the last however-many years which I thoroughly and completely approve of. The implementation isn’t perfect (when I investigated it, I found ways in which to crash the cluster) but these issues were not something you’d run into in normal use.
In particular, there was one important (but improper) use of Redshift where compilation off-load removed one of the key problems; the use of BI tools.
BI tools have no clue about sorting, and so cannot handle Big Data and are entirely inappropriate for use with Redshift. Everyone and their dog uses them, because neither AWS nor the BI tooling vendors inform users about these issues (indeed, Tableau’s marketing is a constant wall of how scalable it is - in fact, it scales only for number of concurrent users; it does not scale for data volume).
Setting aside scalability for a moment, a second killer problem was of course compile time. BI tools emit crazy and novel queries, and humans do not respond well to multi-minute pauses before their dashboard updates.
Compilation off-load reduced compile times to a consistent 4 to 8 seconds, and so the compilation problem went away. This is important, because, as we will come to later, this improvement has recently been undone.
Now, this new functionality, compilation off-load, did for svl_compile what the Yucatán meteorite did for the dinosaurs.
In regions where there is no compile off-load, the compilation flag if True means compilation occurred, and this is always on the leader-node, and if False, means either the leader node or external segment cache were used - same as before.
In regions where there is compile off-load, the compilation flag now has a different meaning.
First, it is now that the leader node can access only the leader node segment cache; the compilation off-load system is now what accesses the external segment cache.
When the compilation flag is set to False, it means the leader node obtained the segment from the leader node cache.
When the compilation flag is set to True, it means the leader invoked the compilation off-load system to compile the segment. However, the compilation off-load system checks the external segment cache, and can return a segment from there. If the segment is not in the external segment cache, then it really is compiled.
If all segments were found in the leader node segment cache, then the compilation off-load system would not be invoked, and the “compilation” phase would be extremely brief - 10 microseconds per segment.
If however any segment was not in the leader node cache, then the first segment would always be marked as compiled, and also the segment(s) which were not found in the leader node cache would be marked as compiled.
I think the first segment was not really compiled; rather, its timing covered the work of invoking the compilation off-load system and waiting for it to complete to the point where the compiled segments were ready to be retrieved, as the duration for the first segment is always about 4 to 8 seconds, and then all the other segments would “compile” very quickly, where “compile” now in fact means the leader node retrieving the already compiled segments from the compilation off-load system.
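As a minimal sketch of how I look at this (assuming the svl_compile columns I recall - query, segment, compile, starttime, endtime - and that datediff accepts microsecond as a date part; the query ID here is hypothetical):

  select query,
         segment,
         compile,                                                 -- the (now ambiguous) compilation flag
         datediff(microsecond, starttime, endtime) as duration_us
  from   svl_compile
  where  query = 123456                                           -- hypothetical query ID
  order  by segment;

On the reasoning above, durations of roughly 10 microseconds point to the leader node cache, a few hundred microseconds to the external cache or to retrieval from the off-load system, and seconds to a genuine compile (or, for the first segment, the off-load invocation overhead).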
Then another year or two pass, bringing us to about Q3 2022, and two more major changes occur.
First, it used to be that all segments were compiled before query execution. This changed: the compilation off-load system can now be invoked per stream, and this invocation occurs while the query is running (which fairly often induces pauses in query execution, where the query must wait for segments to be compiled), with each invocation taking a few seconds - and so whereas before there was only one compilation per query, there can now be two, three, or even more.
This is absolutely fine for Big Data queries, where the query execution time massively dominates compile time, but it once again meant BI tools were no longer viable for human users, because now there are often multiple 4 to 8 second delays, one for each invocation of the compile off-load system, and this is too slow for interactive use. I think the dev team did not perceive the consequence of their change to compilation off-load.
Second, it used to be that the compiled segment cache could only be interrogated for queries which had compiled. What happens now is that once compilation begins, the properties of the segment are registered with the cache, so that other queries which will be able to use that segment once it is compiled, can find that segment - and when they do so, those queries will pause, waiting for the compilation to complete and the compiled segment to be placed in the compiled segment cache, and then they will retrieve the compiled segment from the cache.
However - and here’s yet another blunder to add onto the… collective noun for blunders, anyone? the catastrophe of blunders? but anyway, yet another blunder - when a query waits for compilation to occur, the compilation flag for that query, for the segments which were waited for, is set to False, and this when the query has usually waited the full time needed for compilation (this situation of waiting usually arises when something like a BI tool issues a number of very similar queries essentially simultaneously).
The only way you can tell this is happening is to look at the compile plans (a term I use to indicate the set of segment compilation information for a query) for the immediately preceding queries, examine their query text, and see that when the earlier similar query finishes compiling a particular segment or segments, the later queries, which have very long “compile times” for a non-compiled segment, immediately retrieve their segment.
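As a rough sketch of the first step of that check (same assumed svl_compile columns as before, with the compile flag assumed to be 0 when no compilation is recorded): find segments which claim not to have compiled but whose “compile time” is far too long to be a cache hit; the compile plans and query texts of the immediately preceding queries then have to be examined by hand, as described above.

  -- segments marked as not compiled, yet which took over a second to obtain -
  -- on the reasoning above, these almost certainly waited on another query's compile
  select query,
         segment,
         datediff(millisecond, starttime, endtime) as duration_ms
  from   svl_compile
  where  compile = 0
  and    datediff(millisecond, starttime, endtime) > 1000
  order  by query, segment;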
This catastrophe of blunders is representative of the system tables in general.
I am a Redshift specialist. I’ve been investigating Redshift full time for many years. As such, I have some idea of what’s going on; but there is no other way to obtain this information. I am of the view AWS publish no meaningful information about Redshift, and indeed I aver what they do publish is dangerous, profoundly misleading, and should not be read unless you already know what Redshift is actually doing, so you are safe from being misled.
Given the knowledge I have accumulated, and with some specific investigation into svl_compile, I am in a position to figure out what’s going on, and so to write this post and describe the situation.
If you’re a normal person, who does not spend your life investigating Redshift, what possible chance do you have?
Redshift, being sorted, is a highly knowledge-intensive database, and yet AWS publish no meaningful information and the system tables are catastrophically neglected.
I have the impression many businesses now are migrating from Redshift to Snowflake, and I think this is the primary reason. Other reasons include Support (I say no more - if you’ve interacted with Support you know exactly what I mean and you’re nodding right now), and lack of reliability.
Usually ETL systems are provided with only minimal information and so are capable only of minimal actions. This leads to simple ETL systems, which for example check for new data files in some location, load them to a table, then move those files out of the way, and have no other capabilities. Such ETL systems are not robust, as they require arbitrary, unplanned human intervention to fix and to recover from bugs or data errors. The key to robustness in ETL systems is the provision of additional information such that the ETL system can undertake a wider range of actions, reducing the need for human intervention. A simple but profound enabler is the provision to the ETL system of the information of the set of available data files, and the set of data files which have been loaded into the database. This allows the automation of a wide range of behaviour, including the recovery process once human intervention has fixed bugs or data errors, and so provides a robust ETL system.
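As a minimal sketch of this idea (the table names available_files and loaded_files are hypothetical, and how available_files is populated - say, from a listing of the source location - is out of scope here): the ETL system records what it has loaded, and the set of files still to load, including files to re-load after a human has fixed a bug or data error and removed the bad entries from loaded_files, is then just a set difference.

  -- one row per data file visible in the source location (hypothetical table)
  create table available_files (filename varchar(1024) not null);

  -- one row per data file which has been loaded into the database (hypothetical table)
  create table loaded_files (filename varchar(1024) not null, loaded_at timestamp not null);

  -- the files the ETL system should now load (or re-load, after recovery)
  select a.filename
  from   available_files as a
  left join loaded_files as l on l.filename = a.filename
  where  l.filename is null;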
The Asia-Pacific regions, for and only for dc2.large, are experiencing significant to large slow-downs in the network benchmark.
ap-northeast-2 and ap-southeast-1 both slowed by about 5 seconds, from about 2.5 seconds to about 7.5 seconds. ap-northeast-1 and ap-northeast-3 slowed by 3 seconds (from 2s to 5s, and 9s to 12s, respectively). ap-south-1 and ap-southeast-2 both slowed by about 2 seconds (from about 2s to about 4s).
In all cases, however, ra3.xlplus is unaffected (larger node types are not tested in these regions).
us-east-1 dc2.large nodes have become slow for disk-read, going from 0.06 seconds to 0.18 seconds.
https://amazonredshift-observatory.ch/cross_region_benchmarks/index.html
Redshift has a leader node, and worker nodes.
Queries can execute solely on the leader node, or on the leader node and the worker nodes (which I’ll call “worker node queries” from now on, for brevity).
Leader node queries have a pid and a transaction ID (also known as an xid), but no query ID.
Worker node queries have a pid, a transaction ID and a query ID.
A query which has a query ID went into WLM, because the query ID is an artefact of WLM, and WLM is an artefact of the worker nodes.
(Leader node queries will internally have a query ID, but it’s a different query ID, leader node only, and as far as I can tell it is exposed nowhere in the system tables. The only query IDs we see in the system tables are for worker node queries.)
Queries additionally are either basically running or completed (there are lots of other states - queued, aborted, etc, but I’m not worried about them - they all come down to either “not running yet”, or “completed” by whatever means).
When managing a cluster it is necessary to have a full list of all queries; sooner or later, to fix some problem or other, you need to look at what’s been going on.
Getting this list is in fact, as far as I can tell, problematic: Redshift does not expose to users a way to see the full list of all queries which are or have been executed.
So, first, we have four fundamental classes of query:
- running leader node queries (pid, xid, no query ID)
- completed leader node queries (pid, xid, no query ID)
- running worker node queries (pid, xid, query ID)
- completed worker node queries (pid, xid, query ID)
Not all information about a query is consistently present in all tables, as we will see; stv_recents for example does not hold a transaction ID, stv_wlm_query_state does not hold a pid.
Note the transaction IDs on the leader node and on the worker nodes are in very different number ranges. It doesn’t change anything, as far as I can see; you simply use them as you normally would. Transactions involving leader node and worker node tables appear to function correctly, with no problems due to crossing the leader/worker boundary, but remember that the STV table-like objects do not participate in MVCC and so do not participate in transactions.
(The STV tables are really table-like interfaces to in-memory state, with no support for MVCC, and so when you read them, you get whatever rows are present as the table is scanned - rows can perfectly well be added and removed while you read.
By contrast, MVCC ensures that the rows you see, in all tables, are those which were present at the time your query started; you never see any changes, and so have a consistent view of the data.)
Note that queries which are handled by the leader node query result cache are not run, and so are not present in any system tables.
Finally, note the char columns for query text actually contain UTF-8. If you export the system tables to archive them, and they contain any non-ASCII characters, you cannot load them back into tables created from the original DDL, because you will end up trying to load UTF-8 into char, and Redshift will not permit this.
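As an illustration of the workaround (this archive table and its column names are hypothetical): an archive copy has to use varchar, which accepts multi-byte UTF-8, rather than repeating the char type from the original system table DDL.

  -- hypothetical archive table for exported query text;
  -- varchar accepts multi-byte UTF-8, the original char(200) does not
  create table leader_query_text_archive
  (
    xid        int8          not null,
    pid        int4          not null,
    starttime  timestamp     not null,
    seq        int4          not null,
    querytext  varchar(200)  not null
  );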
(An open question is what happens to a multi-byte UTF-8 character
which would be truncated by the 200 byte limit. I suspect part of it is
in one row, and the rest in the next, and this in some situations may be
problematic, where command-line based SQL clients in my experience fail
hard on broken UTF-8; such clients will not be able to read
such rows, because their UTF-8
parser will fall over.
However, I could be wrong - I will need to conduct this little
investigation at some point.)
So, now, let us turn to the system tables which hold information about queries.
stv_recents
This table holds one row for each of the most recently started 100 queries, of all types - leader node, worker node, running and completed. Note I say “started” and I mean this very particularly; queued queries are not shown. Only those queries which have started, and those which have (by whatever means) completed.
Historically, this table showed running queries; however, checking it right now, it appears to show only completed queries. I cannot get it to show a running query of any kind. I can see completed leader-node-only queries (select 1;) and completed worker node queries.
I do not know if queries running via CSC are listed.
The table holds little information about a query. There’s the user ID, query status (running, completed, etc), when the query started (but this timestamp differs by a small fraction of a second from the start time for the query as recorded in other tables holding information about queries), and the pid, but the pid is set to -1 if the query has completed, so you cannot use it once the query has completed. There is no transaction ID.
To some extent the lack of information is because the table is holding information about leader node queries and worker node queries, which are rather different to each other, but some of it just seems to be poor implementation. The transaction ID could have been given, for example, and I can’t see why the pid is set to -1 on completion; the leader node knew the process ID when the query was running, so the information exists. Why throw it away?
stv_wlm_query_state
This table holds per-query information about worker node queries, in every state, but I think completed queries are rapidly removed (and placed into stl_query and stl_wlm_query, which are the archive of completed queries).
stv_inflight
This table holds per-query information about running queries only, and I think it holds information only for worker queries.
This table has a column slice, which makes it look like the table should hold per-query, per-slice information. I have never seen this; there is one and only one row per query, and the slice is always set to the slice number of the leader node (that number has varied over the years, and these days appears to vary a bit by cluster - it looks to be a number larger than about 12000).
I do not know if CSC queries are listed.
stv_exec_state
This table holds per-query, per-slice, per-step information about worker node running queries only. I do not know if CSC queries are listed.
stl_query
This table holds per-query information about completed worker node queries.
CSC queries are listed. Aborted queries are listed.
This was the original table for completed query information. It contains some columns which you may need when dealing with queries but which are not available in stl_wlm_query (such as label and pid), and so I find stl_wlm_query often needs to be joined to stl_query.
Note also that here the querytxt column a while ago began to be used not only for query texts, but also for status messages. If you parse it, you need to be able to tell the difference between SQL and messages.
stl_wlm_query
This table holds per-query information about worker node completed queries.
CSC queries are listed. Aborted queries are listed.
This is a later table, with more comprehensive WLM information.
stl_ddltext
This table provides the full query texts of a particular set of SQL commands (relating to table and schema creation and modification, but the docs do not provide a complete list), with one row for every 200 bytes of query text.
It turns out these commands are leader node only, do not have a query ID, and so this table is how you find out such queries have been run. Where there is one row for every 200 bytes of query text, if you’re building a list of all queries, you need to select only one row from every query.
stl_utilitytext
This table provides the full query texts of a particular set of SQL commands (which looks to be pretty much everything which is not in stl_ddltext and which does not run on a worker node; the docs do not provide a complete list, but it is a different list to that for stl_ddltext, which obviously begs the question of why this information is split over two tables), with one row for every 200 bytes of query text.
It turns out these commands are leader node only, do not have a query ID, and so this table is how you find out such queries have been run. Where there is one row for every 200 bytes of query text, if you’re building a list of all queries, you need to select only one row from every query.
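As a sketch of what “one row for every 200 bytes” means in practice (assuming, from memory, the columns xid, pid, starttime, sequence and text, and that all chunks of one statement share its starttime) - the same applies to stl_utilitytext:

  -- one row per leader node query: take only the first 200-byte chunk
  select xid, pid, starttime, text
  from   stl_ddltext
  where  sequence = 0;

  -- or reassemble the full query text from its chunks
  -- (text is char(200), so blank padding may need handling; rtrim here is a rough approximation)
  select xid, pid, starttime,
         listagg(rtrim(text), '') within group (order by sequence) as query_text
  from   stl_ddltext
  group  by xid, pid, starttime;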
So, we have four classes of query, and we need to find a way to produce each of them.
running leader node queries
If a query is in stv_recents and not in stv_inflight, and the query status is Running, then it is a running leader node query.
Currently, I am not seeing any Running queries in stv_recents, and so this set of queries, to my eye, is not available.
(Note we have to use stv_inflight here (rather than stv_wlm_query_state) because stv_wlm_query_state does not contain pid, and pid is the only column in stv_recents which allows us to join it to other query-information tables.)
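In principle, then, the query being described looks something like the sketch below (assuming the pid and status columns in stv_recents and the pid column in stv_inflight; note that in stv_recents the query column is the query text, not a query ID). In practice, as above, I currently never see Running rows in stv_recents, so this returns nothing for me.

  -- running leader node queries: started, status Running, and unknown to the worker node tables
  select r.pid, r.starttime, r.query   -- query here is the query text
  from   stv_recents as r
  left join stv_inflight as i on i.pid = r.pid
  where  r.status = 'Running'
  and    i.pid is null;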
completed leader node queries
Leader node queries which are commands stored in stl_ddltext or stl_utilitytext can be found in those two tables. Commands which are not stored in those two tables, such as a plain select, to my eye, are not available in the system tables.
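As a sketch (same column assumptions as in the stl_ddltext example above), the record of completed leader node queries which we do have is then the union of the two tables:

  -- completed leader node queries recorded in the system tables (first chunk of each)
  select xid, pid, starttime, text from stl_ddltext     where sequence = 0
  union all
  select xid, pid, starttime, text from stl_utilitytext where sequence = 0
  order by starttime;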
In principle, stv_recents offers limited visibility of all completed leader node queries, even those not in stl_ddltext or stl_utilitytext. In practice, it seems not to be possible to figure out which queries in stv_recents are both leader node and completed.
It is easy to check if a query is completed - the pid is -1 and/or the status is Completed - but because I no longer have the pid, I can see no way to check whether the query is present in stl_query or stl_wlm_query, and even the start timestamp cannot be used, because it is not consistent across these tables. I need to perform this check, because it is only when a query is not in those tables that it is a leader node query.
So, in short, there is a considerable but not full record of completed leader node queries.
running worker node queries
This is simple; all queries in stv_wlm_query_state which have state Running. We could also simply take all rows from stv_inflight. In fact, where stv_wlm_query_state lacks some important information, which is present in stv_inflight (such as pid), it’s often necessary to join the two tables.
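A sketch of that join (assuming a query column in both tables to join on, and the state value Running as above; the other columns are as I recall them, so treat them as assumptions):

  -- running worker node queries, picking up the pid from stv_inflight
  select w.query, w.state, i.pid, i.starttime
  from   stv_wlm_query_state as w
  join   stv_inflight as i on i.query = w.query
  where  w.state = 'Running';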
completed worker node queries
This is simple; all queries in stl_query, or stl_wlm_query. I believe all worker node queries are in both of these tables, and, again, they often need to be joined, due to the columns missing from stl_wlm_query.
This is a mess.
Explosions in Kyiv this morning, within earshot. Three louder explosions (I don’t have enough experience yet to tell the difference between explosion size and distance, although I begin to think I am starting to know the difference between an in-air explosion, something being shot down, and a ground explosion, when the missile or drone hits the ground to explode), one or two more further away. A bunch of car alarms going off after the first, then after a bit, emergency services sirens. Birds all unsettled and up in the air. Almost nothing visible from my window - I can see an arc about 130 degrees, so it’s not hard for events to be out of sight.
However, one photo - see the little dark puff a little to the left of center? something exploded in the sky, to make that.
It’s interesting how independent the body is, in its reactions. In my mind, and as far as I can tell in myself, I feel fine. I was never much bothered by this, the danger is zero because Kyiv is huge, and I’m also used to it now. However, someone just pulled some furniture around in a nearby apartment - so a sudden, deep sound. My guts jumped, because it was sudden and nearby.
So here’s a cool little thing.
This is the page showing information about normal views (not late-binding, not materialized).
What’s cool is that the AMI figures out how many and which tables are used by each view - fully recursing into the view, so if a view uses other views, it’s the total set of tables for all of them - and it computes the sorted, unsorted and total blocks and rows for each view.
Now you can finally know how big that “big view” actually is :-)
Well, that got busy. About 15 or so large detonations in the last few minutes, often in rapid succession. I’m pretty sure many were explosions in the air - so, missiles or drones being destroyed by defensive fire - but I saw one plume of smoke, big enough I would say to be from a ground hit - although whether it was the intended target, or a shot down drone coming to ground and then exploding, I know not.
That’s the largest number of explosions, in the shortest time, I’ve had experience of, but I have almost no experience, and you only know of what’s close enough to hear or see. I also suspect a lot of incoming drones are being blown up in the air, and that the sound from that travels a lot further. I felt one or two blast waves, but I would say they felt like they were coming downwards, not upwards.
Couple of late-comers going off now, further away.
Hoping and praying (metaphorically - not religious) for the people near those explosions, but it’s pretty certain a couple of people will have caught it.