I’ve finally got round to writing my own email client (I’m on Linux and there are none which I’m happy with).
I’ve now fully encountered MIME, and I have to say, I have come to the view that MIME was a mistake. It has too much flexibility and too much complexity, which makes implementation complex, and almost all of that complexity simply goes unused for email.
MIME has made something simple - email - into something complex, where that complexity is not used (but you have to support it for the occasional case where it arrives, so your email client doesn’t fall over).
I’ve just been burned, again, by Apache web-site config.
This has happened too many times.
I’ve ended up serving the wrong sites, on the wrong IP names, again.
The way Apache works, which is to silently match to a default site rather than throw an error, is the core problem.
Silently doing the wrong thing is not a good idea.
A secondary problem is that the docs are a sprawling mess, so there’s nothing you can find to read to save yourself.
I use Apache at the moment because I use CGI, which is quick and easy to use, and not generally supported by other web-servers.
Time has come to find one of the successors and use that.
Well, I just had a surprise.
A little while back RS devs implemented Snapshot Isolation, a second type of transaction serialization.
I assumed it was superior to the existing implementation, because Postgres implemented it in 9.1 (I think it was), which was after 8.0 (which RS came from).
I’ve finally now sat down and looked at what it is and how it works, and at first glance it looks to me like a basket case.
It’s a weakened form of transaction serialization, which allows for types of failure the existing method does not allow.
I need to read further into it, to see what benefits come from this (presumably performance, for whatever performance means here - faster, or perhaps better able to cope, correctly, with situations which would otherwise cause serialization isolation violations), but at first glance it’s a case of: why on earth would you do this?
I’m thinking maybe a lot of the people who are using this with RS have switched to it thinking it’s better and with no idea about what they’ve actually bought into.
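(For context, the textbook example of an anomaly which snapshot isolation permits but serializable execution does not is write skew. A sketch follows; the table and the “at least one doctor must be on call” rule are hypothetical, not anything from Redshift itself.)

-- Sketch of write skew, the classic anomaly permitted under snapshot
-- isolation but prevented by serializable execution.
-- The table and the business rule are hypothetical.
create table on_call ( doctor varchar(16), is_on_call boolean );
insert into on_call values ( 'alice', true ), ( 'bob', true );

-- Transaction 1                            -- Transaction 2 (concurrently)
-- begin;                                   -- begin;
-- select count(*) from on_call             -- select count(*) from on_call
--   where is_on_call = true;               --   where is_on_call = true;
-- (sees 2, thinks it is safe to go off)    -- (sees 2, thinks it is safe to go off)
-- update on_call                           -- update on_call
--   set is_on_call = false                 --   set is_on_call = false
--   where doctor = 'alice';                --   where doctor = 'bob';
-- commit;                                  -- commit;

-- Under snapshot isolation both commits succeed and no-one is on call;
-- under serializable execution one of the two transactions aborts.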
I’ve been digging into the system tables to try to find out if I can know when a table has been ANALYZEd.
So far, it looks like;
If reltuples in pg_class is 0.001 (it’s a float4; that’s the printed value from SELECT), then ANALYZE has never been run (and also anything else which would update that column has never been run).
In pg_statistic_indicator, if staiins or staidels are greater than 0, then the table has seen changes since the most recent ANALYZE (and the larger the numbers in staiins or staidels, which are row counts, the more change).
In an up-to-date table, pg_statistic_indicator.stairows equals pg_class.reltuples, and staiins and staidels are both 0.
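Putting that together, here’s a minimal sketch of a query to check a single table; the table name my_table is a placeholder, and the 0.001 interpretation is my reading from the above, not documented behaviour.

-- Sketch only : check whether a table looks to have been ANALYZEd and
-- whether it has changed since the most recent ANALYZE.
-- 'my_table' is a placeholder; "0.001 means never" is my inference, above.
-- (stairelid is, as I understand it, the table's oid.)
select
  c.relname,
  c.reltuples,
  si.stairows,
  si.staiins,
  si.staidels,
  case when si.staiins > 0 or si.staidels > 0 then true else false end as changed_since_analyze
from
  pg_class as c
  left join pg_statistic_indicator as si on si.stairelid = c.oid
where
  c.relname = 'my_table';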
I’m investigating COPY
.
In particular, the first question I had is whether or not the amount of memory available in the slot for the query makes a difference to performance.
I’ve just discovered that it looks like an all-or-nothing scenario; if you have enough memory, the sort step runs in memory. If not, on disk.
If the sort phase runs on disk, then the amount of memory allocated to the slot makes zero difference to performance. I have a slot with 77% of system memory (about 6gb) loading at the same speed as a slot with 1/20th of system memory (1mb).
However, if the data is small enough to be sorted in the memory of the slot, it will be.
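(As an aside, for ordinary queries you can see whether a step went to disk via svl_query_summary; I am assuming here - I’ve not verified it - that the COPY sort step is reported in the same way. A sketch, with 12345 as a placeholder query ID.)

-- Sketch : check whether any step of a given query ran on disk.
-- 12345 is a placeholder query ID; whether the COPY sort step shows up
-- here is my assumption, not something I have verified.
select
  query,
  stm,
  seg,
  step,
  label,
  is_diskbased,
  workmem
from
  svl_query_summary
where
  query = 12345
order by
  stm,
  seg,
  step;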
Next questions;
Can you end up in a situation where a COPY runs in memory but ends up swapping a lot and is worse than running from disk?
What’s the performance difference between in memory and on disk? (I suspect very little in absolute terms, because to be in memory the data has to be fairly small, so the sort time even on disk is never going to be very large).
My second question regarding COPY has an answer.
When COPY runs, each slice loads one file at a time (well, in theory - sometimes, as with normal queries, you see some slices doing 2x or even 3x work, and other slices doing no work; but broadly speaking).
The question was; is there a limit of one file at a time per slice regardless of the number of queries running COPY, or can a slice load one file per query running COPY?
And the answer, unexpectedly, at least for dc2.large and dc2.8xlarge, which I have tested on, is that there is a limit of 15 (fifteen) queries per slice.
One minor exception was on dc2.8xlarge, where of the 32 slices (two nodes), I saw 4 slices max out at 16. This might just be some sort of off-by-one behaviour. In any event, it is not a substantial change from 15.
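(For what it’s worth, one way to look at how the files for a given COPY were distributed over the slices is something like the following sketch; the query ID is a placeholder.)

-- Sketch : per-slice count of files loaded by a given COPY.
-- 12345 is a placeholder query ID.
select
  slice,
  count(*) as files_loaded
from
  stl_load_commits
where
  query = 12345
group by
  slice
order by
  slice;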
So, on the face of it, the correct config for COPY is to set up a single queue of 15 slots, give that queue 1% of memory, and direct all COPY commands to that queue.
I have not, repeat NOT, tried this on a real, live cluster.
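If you do experiment with this, a quick way to sanity-check the WLM config actually in force is to look at stv_wlm_service_class_config. A sketch; my understanding (not something I guarantee) is that manual WLM user queues are service classes 6 to 13, and that query_working_mem is per-slot memory in MB.

-- Sketch : inspect the manual WLM user queues (service classes 6-13, as I
-- understand it), showing slot count and per-slot working memory.
select
  service_class,
  name,
  num_query_tasks,       -- slot count
  query_working_mem      -- per-slot working memory (MB, as I understand it)
from
  stv_wlm_service_class_config
where
  service_class between 6 and 13
order by
  service_class;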
It looks to me like another problem with SQA (and with the system tables) is that when a query is routed to SQA, you have no direct way of knowing, via the system tables, which queue it was originally intended for.
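(You can at least see which queries went to SQA - as I understand it, SQA is service class 14 in stl_wlm_query - but the row tells you nothing about the originally intended queue. A sketch;)

-- Sketch : recent queries routed to SQA (service class 14, as I understand
-- it); note there is no column here for the originally intended queue.
select
  query,
  service_class,
  queue_start_time,
  total_queue_time,
  total_exec_time
from
  stl_wlm_query
where
  service_class = 14
order by
  queue_start_time desc
limit 10;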
I’ve just been looking into UNLOAD.
On dc2.large, there is a limit of three concurrently running UNLOADs.
Issue more than that, and they’re under the hood being queued.
Also, UNLOAD for simple queries (plain SELECT with no join) is wholly insensitive to memory.
It looks then like simple UNLOADs call for a single three-slot queue with 1% of memory.
I have not, repeat NOT, tried this on a real cluster.
I’ve not yet had a look at other node types.
This is bad news.
On or about the 10th September 2022, AWS emailed end-users who were running RA3 clusters and had used AQUA in the past (there was no public announcement) with the following message;
We want to inform you that Amazon Redshift will now automatically determine when to apply performance acceleration techniques that leverage AQUA technology. Thus, the ability to enable and disable AQUA will be removed from the console starting the week of September 12, 2022.
That grandiose first statement made me think AQUA was being made mandatory.
This prompted me to drop what I was doing and spend what turned out to be the following eleven days, all day, every day, investigating while I still could (to investigate, you need to be able to compare with-AQUA to without-AQUA, which means being able to disable AQUA).
In fact, as a bit of time passed, it became clear all that AWS had done was… remove the option from the Redshift console.
From Python using boto3, the AQUA configuration options are unchanged.
However, as of 18th May 2024, I happened purely by chance to look at the create_cluster command for Redshift in boto3 and saw this;
AquaConfigurationStatus (string) – This parameter is retired. It does not set the AQUA configuration status. Amazon Redshift automatically determines whether to use AQUA (Advanced Query Accelerator).
To my knowledge there has been no announcement of this change. It looks like AQUA really is now mandatory - or rather, it’s up to Redshift whether it is in use or not.
On my clusters, where at bring-up I do set this parameter to False, I see the following;
"AquaConfiguration":
{
"AquaStatus": "disabled",
"AquaConfigurationStatus": "auto"
}
I’ve also seen this on a client’s cluster. It looks to say that Redshift can enable or disable AQUA as it sees fit, and it is currently disabled.
This is a fundamental problem. To put it simply: it is no longer completely safe to use LIKE or SIMILAR TO on Redshift.
The problem is that AQUA makes a copy of the tables in use by a query which runs on AQUA, and so for a query running on AQUA the first run is very slow (tables being copied), the excess time taken being gradually recovered over the course of further runs of that query or of queries which are using the same tables.
In other words, if you run queries on AQUA just once or twice or a few times, AQUA can perfectly well hammer your system, and by a lot, rather than speed it up.
Note that I’ve no information on how long AQUA keeps its copy of a table on Redshift, or what happens when that table changes.
So now you have, in the heart of your system, an undocumented black box, controlled by a third party, which can sabotage your system performance; it can be enabled or disabled (or rapidly toggled, who knows?) at any time, for reasons of which you know nothing; you have no control over whether AQUA is enabled or disabled, and you cannot disable it if it is a problem.
The devs are lunatics - actually, I think they’re academics, and wholly lack real-world experience.
There was an incident a few years back with the Redshift version string, which had been like this since day one, so ten years plus;
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.28422
Then one fine day back in late 2021, without warning, the devs changed it, to this;
dev=# select version();
version
-----------------------------------------------------------
Redshift 1.0.32574 on Amazon Linux, compiled by gcc-7.3.0
(1 row)
It’s much nicer (though it turned out the original version string had been lying about the GCC version for about a year, and I imagine about the OS for much longer).
The only problem, as you can imagine, is that the original version string is unstructured, and software around the world uses a thousand different ways to pull the version out of it, so changing the version string format broke a lot of systems. It broke SQLAlchemy. It broke AWS’s own JDBC driver (at which point it instantly became clear that the RS test suite, which has long been considered weak and questionable, could not include even a single connection to RS over AWS’s own JDBC driver).
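(For what it’s worth, one reasonably format-agnostic way to pull the version out - both the old and the new format contain “Redshift 1.0.nnnnn” - is a sketch like this;)

-- Sketch : extract the Redshift version from version(); both the old and
-- the new format contain "Redshift 1.0.nnnnn", so this should survive both.
select regexp_substr( version(), 'Redshift [0-9.]+' ) as redshift_version;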
Now you and I would know without a microsecond of thought this would happen.
I heard (directly - not via the usual channels) from someone on the Support team who said “we were going to introduce an option to use the old format of version string in the next release”.
These are the devs responsible for the black box which is AQUA sitting in the heart of your business critical Redshift cluster.
Until this point, you could turn off everything the devs had done which was a fundamental problem - this is the first time that is no longer so. I knew this had to come sooner or later; the most recent close call was with auto-MV, but you can disable that in the cluster configuration. With AQUA, you can only work around the risk by not using LIKE or SIMILAR TO any more - and that’s central SQL functionality.
I’m afraid I must retract one of the earlier claims about UNLOAD.
My deepest apologies.
The earlier post about UNLOAD was correct with regard to memory use, but incorrect with regard to an internal under-the-hood three query limit.
The reason for this is that I’m actually currently working on a cluster cost reduction contract.
I originally threw some test code together as a means to an end (I’m currently looking at WLM config) rather than an end in itself; the problem with that is that when code lacks structure you’re more likely to make mistakes, and to miss them. The code gradually became more complex, and got to the point where I had to sort it out, because it was no longer structured enough to be safe to rely on for correctness.
This was after I emailed the list.
I then found a bug in the test code for parallelism, such that queries were going to the wrong queue.
Having fixed this and re-structured the code, I’ve found UNLOAD has parallelism for at least 20 queries (which is the max I’ve tried so far - probably there is no limit).
I’m about to release a PDF for UNLOAD, with the test code and a proper write-up.
My apologies again.
So, late reporting on this, I’ve been very busy with a client.
Strange goings-on in the land of Redshift version numbers - so much so that there is a possibility here this is actually a bug in my code, so keep a bit of caution in mind.
https://www.redshift-observatory.ch/redshift_version_tracker/index.html
What’s happened is that on 2024-05-16 (probably earlier - I’ve not been checking as often as usual; busy), Redshift went back to a previously released version number (back to 1.0.66954, which came out on 2024-04-23).
This has never happened before. When RS does a rollback, you get the old version, but the version number increases.
Emergency? Blunder? Change in policy? No clue.
In other news, on 2024-05-26, the maintenance track moved forward, in a few regions, to 1.0.67699.
Again, this turns out to be a version which has never been in current. Given that RS, when rolling back, takes an old version but gives it a new version number, I would guess these are older releases which have been given a new version number - but this means we have no clue from the version number how far back maintenance track actually is, or indeed, simply, what version it actually is. Which is bloody stupid. It also means there’s no compile cache “warming” on the maintenance track.
However, we can make a guess as to the original version by looking at the system tables.
https://www.redshift-observatory.ch/system_table_tracker/index.html
We see 1.0.67699 has 1255/756/2978 tables/views/functions, a combination first seen 2024-02-02 with version 1.0.62614, and last seen 2024-04-27 with version 1.0.66960.
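(The counts themselves can be pulled with something like the following sketch; note I’m not certain the filtering here matches exactly how the system table tracker counts objects, so treat the numbers as approximate.)

-- Sketch : count tables, views and functions in pg_catalog, for comparison
-- with the system table tracker. The relkind/namespace filtering here is my
-- approximation of how the tracker counts objects.
select
  sum( case when c.relkind = 'r' then 1 else 0 end ) as tables,
  sum( case when c.relkind = 'v' then 1 else 0 end ) as views,
  ( select count(*)
    from pg_proc as p
    join pg_namespace as pn on pn.oid = p.pronamespace
    where pn.nspname = 'pg_catalog' ) as functions
from
  pg_class as c
  join pg_namespace as cn on cn.oid = c.relnamespace
where
  cn.nspname = 'pg_catalog';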
STL_COMMIT_STATS
Every system table is broken.
When you come to actually need to use them, you become a dwarf, because now you dig like crazy to find truth at the bottom of it all.
So, STL_COMMIT_STATS.
Here’s the docs page;
https://docs.aws.amazon.com/redshift/latest/dg/r_STL_COMMIT_STATS.html
Here’s my dump of the table (view, really) in the latest version of RS at the time of writing;
https://www.redshift-observatory.ch/system_table_tracker/1.0.68205/pg_catalog.stl_commit_stats.html
So, first things first.
What basically happens when there’s a commit (exceptions will be discussed) is that the worker nodes each do their local commit work, and then the leader node completes the commit, one transaction at a time.
So, to be clear - there is a stage in the commit process where a single and only a single query can run at a time, which is the leader node completing the commit.
(I’ve not checked, but it may be the column queuelen is measuring the queue into this bottleneck stage - it should be possible to (at least roughly) check this by computing the cumulative number of queries where the workers have finished but the leader node has yet to start.)
If you want to measure the performance of the single-commit stage of the commit queue (which I expect - but I’ve not actually checked - will be the bottleneck), you’re looking at the time between startwork and endtime for when node is -1, and that’s it (well, after you’ve handled all the issues in the table so you have a sanitized version - see below).
If you feed transactions into the commit queue faster than the leader node can commit them, you get a huge backlog forming on the leader node.
This table basically (we’ll get to the exceptions) has, for each transaction, one row for the leader node and one row for each worker node.
The leader node row has node set to -1, and this is the only row which sets startqueue, and this row does not set endflush, endstage, or endlocal (as these timestamps are for work performed only on worker nodes).
Note also that each node sets times using its local clock, and different clocks are never exactly synchronized. As such, you can see seemingly out-of-order events.
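To get a feel for the table, it’s worth just looking at the raw rows for a single transaction; a sketch, with 12345 as a placeholder xid.

-- Sketch : the raw STL_COMMIT_STATS rows for a single transaction.
-- 12345 is a placeholder xid.
select
  node, startqueue, startwork, endflush, endstage, endlocal,
  startglobal, endtime, queuelen
from
  stl_commit_stats
where
  xid = 12345
order by
  node;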
Now for the fun stuff.
2000-01-01 00:00:00 means NULL.
Rows with xid 0. Presumably something to do with Redshift itself; I don’t know what these are, so I exclude them.
Rows where startqueue is 2000-01-01 00:00:00. As of 2024-08-06, I’m of the view these represent transactions which did not queue at all; I can see, on a busy cluster, that queuelen (the number of transactions ahead of this transaction) is always 0 when startqueue is 2000-01-01 00:00:00.
VACUUM queries totally spam this table with garbage. You get a ton of commits for the same transaction - yes, really. It’s incredible. Go to STL_VACUUM, get the distinct xid values from there, and ignore them all.
ANALYZE queries are in here too; they look normal in this table, but their problem is that in STL_QUERY the querytxt column has been invalidated by them - that column is supposed to be the query text, but what you actually get are arbitrary status/update messages from the ANALYZE command. You may or may not want to include them in your results - they’re not your normal queries, but they are part of normal system workload.
Some leader node rows are duplicated (node set to -1, every other column has the same value). This looks like a bug to me - unintended duplicate rows - so I use a CTE to produce a sanitized version of STL_COMMIT_STATS by de-duplicating node -1 rows.
Having done all that, I cannot now see further errors (and I mean exactly what I write :-), and that’s enough for tonight.
Here’s the current code to produce some stats on that single-query bottleneck;
with
-- MG2 : de-dupe what look like duplicate leader-node rows
valid_stl_commit_stats as
(
select
xid, node, startqueue, startwork, endflush, endstage, endlocal, startglobal, endtime, queuelen, permblocks, newblocks, dirtyblocks, tombstonedblocks, tossedblocks, headers, numxids, oldestxid, extwritelatency, metadatawritten, batched_by
from
stl_commit_stats
where
node != -1
union all
select
xid, node, startqueue, startwork, endflush, endstage, endlocal, startglobal, endtime, queuelen, permblocks, newblocks, dirtyblocks, tombstonedblocks, tossedblocks, headers, numxids, oldestxid, extwritelatency, metadatawritten, batched_by
from
stl_commit_stats
where
node = -1
group by
xid, node, startqueue, startwork, endflush, endstage, endlocal, startglobal, endtime, queuelen, permblocks, newblocks, dirtyblocks, tombstonedblocks, tossedblocks, headers, numxids, oldestxid, extwritelatency, metadatawritten, batched_by
),
-- MG2 : get rid of all questionable rows (VACUUM, odd data values, the unexplained xid 0, etc)
valid_xids as
(
select
xid,
startwork,
endtime,
extract( epoch from (endtime - startwork) ) as duration
from
valid_stl_commit_stats
where
node = -1
and xid != 0
and xid not in ( select distinct xid from stl_vacuum )
)
select
min( duration ) as min_duration,
max( duration ) as max_duration,
avg( duration ) as avg_duration,
stddev_pop( duration ) as dur_sd,
count(*) as count
from
(
select
valid_xids.xid,
valid_xids.startwork,
valid_xids.endtime,
valid_xids.duration,
sum( dirtyblocks ) as sum_dirtyblocks,
sum( headers ) as sum_headers
from
valid_stl_commit_stats
join valid_xids on valid_xids.xid = valid_stl_commit_stats.xid
group by
valid_xids.xid,
valid_xids.startwork,
valid_xids.endtime,
valid_xids.duration
) as totals;
This should be easy. I’ve spent hours discovering and then working around the issues - and I’ve spent ten years plus getting used to the system tables and Redshift, to be able to do so, so quickly. I could be wrong, but I think a lot of people using this table are going wrong, and everyone using this table is having to repeat, with their varying measures of success, the same work to get something useful from it. Given how much AWS charge for Redshift, this is not good enough.