STL_VACUUM
Writing some views to show vacuum info.
Just been looking at STL_VACUUM.
https://docs.aws.amazon.com/redshift/latest/dg/r_STL_VACUUM.html
The docs for the column status say;
The status of the VACUUM operation for each table. Possible values are the following:
It then goes on to give a list, and at the end enumerates;
VacuumBG An automatic vacuum operation was performed in the background. This status is prepended to other statuses when they’re performed automatically. For example, a delete only vacuum performed automatically would have a starting row with the status [VacuumBG] Started Delete Only.
In fact, the column is char(30), so what happens when [VacuumBG] is prepended is that the strings are truncated.
prod=# select distinct status from stl_vacuum order by status;
status
--------------------------------
Finished
Start DeleteOnly(Sorted>=100%)
Started
Started Delete Only (To 100%)
[VacuumBG] Finished
[VacuumBG] Started
[VacuumBG] Started Delete Only
[VacuumBG] Started Range Parti
(8 rows)
Why not just have a bool indicating a background operation?
And now it gets really bad; you see, the table contains a single timestamp column to indicate when something happens. You have to figure out what happened from the status field, the string.
So I have to look at the list of status strings and now guess and hope that if I find the string “Started” in there then it’s always the beginning of a given operation, and that “Finished” is the end of the operation (and then run a window function over the rows to get the ordering right so I can pick out the start and end of each vacuum - why not just have one row per vacuum and log on completion? that’s what STL_QUERY et al do), BUT remember now the strings are being truncated by “[VacuumBG]”, so now I have to hope that the truncated status strings always have enough information in them for me to be able to figure out what I need to figure out.
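As a sketch of the kind of query this forces on you (and it is only a sketch - it assumes “Started” and “Finished” really do always bracket one operation, and that pairing rows by table_id in eventtime order is enough, neither of which I can actually know);
with
  cte_vacuum_events as
  (
    select
      table_id,
      eventtime,
      status,
      lead( status    ) over ( partition by table_id order by eventtime ) as next_status,
      lead( eventtime ) over ( partition by table_id order by eventtime ) as next_eventtime
    from
      stl_vacuum
  )
select
  table_id,
  status                     as start_status,
  eventtime                  as start_ts,
  next_eventtime             as finish_ts,
  next_eventtime - eventtime as duration
from
  cte_vacuum_events
where
      status      like '%Start%'
  and next_status like '%Finished%';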
Also, we have a bunch of progress rows, which indicate the progress of any given vacuum and which also include keywords you might look for in other status messages, so you have to know how to avoid these messages or your final output will of course be wrong. I have zero faith in the docs, I am certain the list of status messages there is not complete, so what I will have to do is manually issue every form of VACUUM and look to see what is recorded and hope I catch everything.
I then also have to hope (or automate a validation check) that new messages are not added over time, or change.
I’ve just noticed also, for delete only vacuum you can look in the status string for “delete”, for sort only you look for “sort”, but for full there is no indicator. There’s nothing which says “full”. You just get the word “Started”. You have to look for the absence of all other indicators. (In fact, this turned out to be impractical - there were too many other types of vacuum - I now rely on the exact string “Started” to mean the start of a full vacuum, while doing scans for other strings to figure out the other types of vacuum.)
Also, the formatting of the status strings is inconsistent (of course - it could hardly be otherwise). When you issue a full vacuum you do not get the “To 100%” type message, even when you do specify a percentage.
All in all, and in every possible way, this table is abysmal. How on Earth did it come to exist, and to exist for ten years, in a product which has some billions of dollars of revenue per year?
Note on that matter of point #4 - why not just have a flag to indicate something - note we do have a column is_recluster, which is a flag indicating if the fairly new recluster operation has been performed. Unfortunately, this too by the looks of it has been messed up, because it’s an int4 being used as a bool (the value is only ever 1 or 0).
Ah and finally I’ve noticed the keywords you search for vary in case - sometimes, say, “Delete”, other times “delete” - so it’s back to that old faithful with low quality string data, converting everything to lower case before performing comparisons.
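As a sketch of where all this leaves you - classification by lower-cased substring matching, assuming these are the only strings which occur (which I cannot know) and that the progress rows have already been filtered out;
select
  table_id,
  eventtime,
  case
    when lower( status ) like '%delete%' then 'delete only'
    when lower( status ) like '%sort%'   then 'sort only'
    when lower( status ) like '%range%'  then 'range partition'
    when lower( status ) like '%started' then 'full'
    else                                      'unknown'
  end as vacuum_type
from
  stl_vacuum
where
  lower( status ) like '%start%';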
(Addendum - I hate the wording of the VACUUM command when it comes to percentage. vacuum full [table] to 95 percent;. To me that reads “make this table 95% sorted”. What it actually means is “if this table is less than 95% sorted, sort this table fully”. That’s bad enough, but also it turned out the mechanism RS uses for deciding how sorted a table is, is completely fubared. It figures out the percentage by the total number of sorted and unsorted rows in the table - but what you need to do is work that out on a per slice basis, because queries run as slowly as the slowest slice. You should never use this functionality, ever, no matter what. It’s just broken. Don’t do it. You have to always set to 100 percent, to disable/avoid this functionality. Then you only have to worry about the unusual situation where a vacuum full to 100 percent leaves you with a completely, 100% unsorted table. I’ll write about it some other time - writing up all this so far has taken up a good hour or two, and I’ve got other things to get done.)
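As a footnote to the per slice point: a minimal sketch of what I mean, using stv_tbl_perm and assuming its rows and sorted_rows columns are what they appear to be - it’s the worst slice which matters, not the table-wide total.
select
  id                                             as table_id,
  min( case when rows = 0 then 100.0
            else sorted_rows::float / rows * 100.0
       end )                                     as worst_slice_sorted_pct
from
  stv_tbl_perm
group by
  id
order by
  worst_slice_sorted_pct;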
SYS_VACUUM_HISTORY
AWS have recently been coming out with a second series of system tables, prefixed with SYS_.
There is a new table for VACUUM, SYS_VACUUM_HISTORY.
https://docs.aws.amazon.com/redshift/latest/dg/SYS_VACUUM_HISTORY.html
I had a look at the doc page, and it looks like quite a nice table - but I noticed the data types looked a bit wrong, with character being listed for strings, which probably should be varchar.
So I then went and looked at my dump of the table, here;
And boy did I get a bunch of surprises.
is_automatic is documented as a bool, but it’s actually a text. The _name fields listed as text are in fact character(128), which means (this is a common problem) you can get UTF-8 in char columns (because you can make tables with UTF-8 names). So you can dump this table, but you can’t then load it into its own DDL. Next, I note the status field is still char(30) - exactly what’s in STL_VACUUM - and that looks suspicious. I went and checked, did a select distinct, and lo and behold…
prod=# select distinct status from sys_vacuum_history order by status;
status
--------------------------------
Finished
Start DeleteOnly(Sorted>=100%)
Started
Started Delete Only (To 100%)
[VacuumBG] Finished
[VacuumBG] Started
[VacuumBG] Started Delete Only
[VacuumBG] Started Range Parti
(8 rows)
It’s the same as STL_VACUUM, in all its terrible glory. Identical. The same. The docs are flatly wrong, because they now list strings which are simply not what is in the column, and this column has not improved.
I began to suspect that SYS_VACUUM_HISTORY was a view over STL_VACUUM, because of this shared column, but if I run explain select * from sys_vacuum_history; and the same again for stl_vacuum, in fact they each originate from a different table.
prod=# explain select * from stl_vacuum;
QUERY PLAN
------------------------------------------------------------------
XN Seq Scan on stll_vacuum (cost=0.00..3.60 rows=360 width=170)
(1 row)
Time: 226.999 ms
prod=# explain select * from sys_vacuum_history;
QUERY PLAN
-------------------------------------------------------------------------------
XN Seq Scan on stll_user_vacuum_history (cost=0.00..0.75 rows=50 width=1488)
(1 row)
The underlying tables are;
https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum.html
and stll_user_vacuum_history.
STLL_VACUUM looks a bit sparse - too few columns - to provide everything in STL_VACUUM (but I’ve not looked closely). There are some other related vacuum tables, which I could imagine are being pulled in.
So it looks like SYS_VACUUM_HISTORY is using the same sources of information, but populating a different table, of its own.
Finally, we come to the two fundamental problems with all the SYS tables.
First, the SYS tables use a completely new form of query ID - an int8, not an int4 - such that the same query now has two query IDs, one in the STL tables and one in the SYS tables, and so the two sets of tables cannot be joined on query ID. The system tables have been lobotomized. This is on the face of it utter madness.
Second, AWS specifically block users from viewing the SQL of the views. We can’t know how they work, and I know from long experience there’s a lot the system tables get wrong and which you have to ignore, fix or work around. Using system tables from AWS for RS without the source code is completely off the map. It’s reckless and unprofessional given all the problems with the design of the existing tables (which we as ever see again here).
STL_QUERYTEXT
I’ve just spent half a day debugging a replacement system table view. It turns out STL_QUERYTEXT logged the same query twice, which caused my use of listagg to overflow.
prod=# select count(*) from ( select query from stl_querytext where sequence = 0 group by query having count(*) >= 2 );
 count
-------
 23964
(1 row)
So, taking one query in particular;
prod=# select count(*) from stl_querytext where query = 551177675 and sequence = 0;
 count
-------
     2
(1 row)
And now looking in STL_QUERY;
prod=# select count(*) from stl_query where query = 551177675;
count
-------
1
(1 row)
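The fix for my view, as a sketch (the assumption being that the duplicated rows are identical in every column and so can safely be collapsed with distinct before aggregation);
select
  query,
  listagg( text, '' ) within group ( order by sequence ) as query_text
from
  ( select distinct query, sequence, text from stl_querytext ) as dt
group by
  query;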
I want to benchmark how long a select query takes to execute, for all the obvious reasons.
Some queries are leader node only, some are worker node.
For leader node queries, the situation appears problematic.
I believe the one and only record of select queries on the leader node is in stv_recents.
This table holds the most recent 100 queries (that this is so is undocumented).
So I need to be able to tell a query is leader node only, and then if it is, I need to hope that the query is still in stv_recents so I can get duration information.
I can’t tell if a query is leader node by dint of it being in stv_recents, because worker node queries are shown there as well.
I can’t parse the SQL to figure it out directly, because AWS do not publish a BNF.
I was able to jury-rig a method, by using pg_last_query_id(). This returns the most recent worker node query ID, or -1 if no worker node queries have been issued. So what I do is call this immediately before and immediately after issuing the query I want to benchmark. If the query IDs before and after are identical, it was a leader node query.
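As a sketch of the pattern - nothing more than compare-before-and-after;
select pg_last_query_id();  -- returns, say, A

-- the query being benchmarked goes here

select pg_last_query_id();  -- returns, say, B

-- if A = B, no new worker node query ID was allocated, and so the
-- benchmarked query was leader node only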
Now I need query duration information from stv_recents, for leader node queries. In stv_recents, there is a column starttime for “Time that the query started”, and there is a column duration which I always assumed was run time, but the docs now say this is “Number of microseconds since the session started”. So either the docs are right, and you cannot know query execution time from stv_recents, because there is only a start time, or the docs are wrong. Looking at the numbers, I think the docs are wrong, and this is in fact query duration, not session duration (but I’ve not yet sat down and proved it).
https://docs.aws.amazon.com/redshift/latest/dg/r_STV_RECENTS.html
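On the assumption the docs are wrong and duration really is run time in microseconds, the sketch for a just-finished leader node query looks like this (the like pattern is whatever the text of the benchmarked query was - there being no query ID to go by - and is made up here);
select
  pid,
  starttime,
  duration / 1000000.0 as duration_in_seconds
from
  stv_recents
where
      status = 'Done'
  and query like 'select current_setting%';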
Now, the problem here with this approach is that if a query takes a long time, I believe it is still pushed out of stv_recents by following queries. The table does not retain a query just because it is still running.
So, fundamentally, there looks to be only a broken capability to know the duration of leader node queries. You can’t benchmark longer running leader node select queries.
Moving on to worker node queries, I want execution time from start of execution up to beginning of return to client, as return to client is too dependent on external factors.
As such, I need to set aside queue time and compile time, and find out the start of return to leader.
Exec start time is easily found in stl_wlm_query, but - and note everything I now write about compilation is undocumented, but you can’t benchmark execution time correctly without knowing this - these days compiles occur during execution, and so compile time is included in execution time in stl_wlm_query. So we must now turn to svl_compile and sum the compile time for our query, and subtract it from execution time.
However, this will still not always give the correct answer, because what can happen is that a first query comes to compile but notices a second query is already compiling the segment it will need, and then the first query will block, waiting for the segment to be made.
That this waiting period occurred is recorded nowhere. In svl_compile, the blocked query indicates it obtained the segment from cache and took the usual few hundred milliseconds to do so - when in fact it blocked for some seconds, and these seconds are included in the execution time.
The problem here is that the compilation system has over time fundamentally changed, but the table showing what the compilation system has done, has very nearly not changed, and now it is unable to represent what happens.
So we cannot - repeat cannot - actually always know correct query execution times.
That’s incredible.
Anyways, back to the problem in hand.
I have a bunch of queries to benchmark, about 250 or so.
I can’t benchmark one at a time, because querying the system tables on a large cluster is much too slow; I need to build up a set of query IDs, and then issue a single query which pulls the data I need for all those queries.
So, what we end up with is;
with
cte_return_times as
(
select
query as query_id,
min( starttime ) as min_r2c_start_ts
from
stl_return
where
slice >= 12800
and query in ( %s )
group by
query
),
cte_compile_times as
(
select
query as query_id,
sum( endtime - starttime ) as sum_compile_duration
from
svl_compile
where
query in ( %s )
group by
query
),
cte_exec_start as
(
select
query as query_id,
exec_start_time as exec_start_ts
from
stl_wlm_query
where
query in ( %s )
)
select
query_id,
extract( epoch from ((min_r2c_start_ts - exec_start_ts) - sum_compile_duration) ) as exec_duration
from
cte_exec_start
natural join cte_compile_times
natural join cte_return_times;
Where I replace the %s with the comma separated list of query IDs I’m interested in.
It took me several years to know enough about Redshift to be able to figure this out, and it took me about eight hours to actually work through it, bumping into each issue one after the other.
Likely the very large majority of users benchmarking their queries are doing their best and getting the wrong numbers, and thinking they’re correct. (Worse than that, they may be using the console. Never use the console. No one knows how it computes its numbers, and as you can see, there’s a lot of ways in which to go wrong, and given the state of the system tables, would you trust the console to somehow be correct, either in the first place given its provenance, or if it is using the system tables?)
I am of the view the docs and system tables are unfit for purpose, and that this is not acceptable for a product with what is probably $2 billion to $4 billion revenue per year.
So, in the RS system tables historically there has been no indication about whether or not a query was served from the result cache. A year or two ago a new round of system tables started coming out, one of which indicates for a query whether or not it was served from the result cache. AWS censor the SQL of these system tables, so you have no idea how they come up with this data. I’ve been using one of those new tables - the only one I use - to produce percentages for result cache hits.
I’ve just now realized that if a query is logged in STL_QUERY, but not in STL_WLM_QUERY, then it is a result cache hit.
This in turn has led me to compare the numbers I can now generate with those from the new system table, and it looks to me like the new system table has all queries in - both leader node only and worker node - which roughly doubles the number of queries it contains.
However, I think only worker node queries can be served from the result cache. As such, the percentages derived from the new system table would be wrong, understating result cache hits by a factor of two, because that table includes all leader node queries, which are about half of all queries.
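A sketch of the check (the assumption, as above, being that a row in STL_QUERY with no matching row in STL_WLM_QUERY is a result cache hit; the userid filter, to drop Redshift’s own internal queries, is my usual habit and may or may not be what you want);
select
  count( * )                                       as total_queries,
  count( case when swq.query is null then 1 end )  as result_cache_hits,
  count( case when swq.query is null then 1 end )::float / count( * ) * 100.0 as result_cache_hit_pct
from
            stl_query     as sq
  left join stl_wlm_query as swq on sq.query = swq.query
where
  sq.userid > 1;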
The WLM queries issued by ANALYZE, and CREATE TEMP TABLE (and so I would imagine also other queries, but I’ve not seen them yet) appear to enter the WLM queue at its head, not its tail. They do queue, but only very briefly, and this is when there are hundreds of queries in the queue and the queue time is a minute or more.
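A rough way to look at this - a sketch only, which assumes ANALYZE’s internal queries are the ones whose query text begins with padb_fetch_sample, and that queue time is queue_end_time minus queue_start_time in stl_wlm_query;
select
  case when sq.querytxt like 'padb_fetch_sample%' then 'analyze' else 'other' end as query_source,
  avg( extract( epoch from ( swq.queue_end_time - swq.queue_start_time ) ) )      as mean_queue_seconds
from
       stl_query     as sq
  join stl_wlm_query as swq on sq.query = swq.query
group by
  1;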
This is one of the most important posts I’ve made.
So, I’ve recently - last year or two - come properly into using CTEs. I was against them, before, really, but that’s mainly I think because I saw them being badly used.
So, CTEs.
Now, I think - I’ve not checked but I think - in Postgres, CTEs are materialized. The CTE runs, the rows are generated, whatever you do after with that CTE comes from those rows.
(I’m actually not sure now, thinking about it, if this makes sense - but doesn’t matter right now.)
Now I’ve recently finally got round to producing convenient and easy to use benchmarking scripting, and I’m also finally getting on with producing a full set of replacement system tables.
The combination of the two has led to the discovery that in RS, CTEs look to be implemented as textual substitution (which is how views are implemented in all relational databases).
Wherever there is a CTE name in a query, the name will be replaced with the text of the CTE.
So I had views which had a base CTE which I then did various things with - lovely clean, well-laid out, easy to read code - which was unbelievably slow and expensive, because the base CTE was being run multiple times.
So for example I’d scan a ton of huge tables and produce one row per day that’s in the system tables (so, six or so days) - and then do some formatting work with those six rows.
Turns out the formatting work was running the scanning of the ton of huge tables three more times.
I have a representative view which has had two sets of fixes like this, and the second set of fixes took the view from 55 seconds to 9 seconds (I don’t have a benchmark to hand for the original version).
(The fixes are based on using window functions - you end up with tons of identical rows, and you group by to get rid of them, but it’s infinitely faster because all the rows are being processed once only. The code is now ugly and nothing you would normally write - but it’s literally 10x faster, and this for queries which on a big system were taking hours.)
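To illustrate - an invented minimal example, not my actual view, with some_huge_table and its columns made up - in the first form, with textual substitution, the expensive aggregation over some_huge_table runs twice; in the second form it runs once, a window function doing the work the second reference used to do.
-- first form : clean to read, but the base CTE is referenced twice
with
  cte_daily as
  (
    select event_day, sum( bytes ) as daily_bytes from some_huge_table group by event_day
  )
select
  cte_daily.event_day,
  cte_daily.daily_bytes::float / totals.all_bytes * 100.0 as pct_of_total
from
  cte_daily
  cross join ( select sum( daily_bytes ) as all_bytes from cte_daily ) as totals;

-- second form : uglier, but some_huge_table is scanned once only
select
  event_day,
  daily_bytes::float / sum( daily_bytes ) over () * 100.0 as pct_of_total
from
  ( select event_day, sum( bytes ) as daily_bytes from some_huge_table group by event_day ) as dt;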
A couple of matters have accumulated in my mind regarding the UA war.
First, the mass media, the press.
What you see in not everything but very nearly almost everything is a story which has been modified to be gripping/depressing/exciting/hopeful - you name it - but the central requirement being for it to be a story which will make you read it.
The humdrum-ness and lack of meaningful actual event in the war is not conducive to exciting stories; what we get then is a constant barrage of hyped up, distorted, magnified non-events.
I also see what looks to me to be fashions in reporting - publications see other publications publishing stories with a given theme or motif, and they all jump on the bandwagon. There’s no actual substance to any of it; it’s media feeding off of media, with nothing, no actual news at all, at the center of it all.
The real actual news about the war is this : almost nothing has changed for a long time.
Putin is pushing along as much of the front as he can, because that’s all the Russian army can manage in terms of tactics. He’s very very slowly pushing forward. All this stuff about the Russian advance - what total idiocy. Ukraine is huge. I don’t care that the front line has moved kilometers in months.
Ukraine made a very smart move going into Russia. That changed a lot - if it comes to negotiation, then it’s now “our land in exchange for your land”, not Putin going “we keep everything we have so far”. I wish very much this had been done instead of the UA summer offensive which was done, down in the south.
That “spat” with Trump is pure BS. That’s Trump being a dick and making stuff up because for whatever reason he thinks it’s best for him. The man is a total idiot and, by the number of people who believe his lies, harmful to everything the West stands on and for. Also, anything said by any Republican these days is pure BS - they’re all toeing the Trump line, and I think Trump supports Putin because Trump thinks Putin works to manipulate US elections in Trump’s favour. Thankfully, with Kamala, the Dems look to be set to win. I’m no fan of Big Government, which the Dems tend toward, but the Republicans at the moment with Trump are utterly unfit for power. Right now they’re unfit for flushing a toilet. All the Republican voters, who want small - but sane - Government, low taxes, minimal regulation, have no one to vote for.
What else… the EU continues to support UA. The USA too. Neither RU nor UA can advance against defences which have had time to be developed; the power of the defence is just too great. Attacking en masse by surprise in a weakly defended sector works (re Kursk) and the Russians can’t do that; their army isn’t up to something requiring that level of operational skill. They tried it in the north and it went nowhere.
We’re pretty much in a 1916 situation. The power of the defence is much greater than the power of the offence. The race is on to invent the next tank, which looks to be a genuinely effective AI powered attack drone.
One final very important point. I do see in the press people talking about the war being difficult, the various problems that exist for UA (manpower, electricity, etc - although electricity is a complete non-starter, it’s just not that big a problem).
You have to remember that you know your own problems in full, in detail : they are wholly apparent to you and you feel them all fully.
What you are however but dimly aware of are the multitude of problems which beset your foe. Putin is struggling, whether we see it or not, and it is straightforward to think why; he is a dictator, fundamentally insecure of his position, fundamentally deeply afraid of unrest within the Russian people, and has launched a war which drags on now for three years. The economic costs of the war are enormous; the USA and EU are affording it just fine, because they have a massive economy. Russia has an economy the size of Spain. Fighting a war is unbelievably expensive and that wealth is being taken out of the economy one way or another, be it by printing money or taking loans or spending reserves or raising taxes or whatever it is - and of course hundreds upon hundreds of thousands of Russians have been killed and maimed. Putin looks over Russia, fearful of his own people, fearful of failure or looking weak, afraid of the economy failing; he absolutely cannot allow failure in Ukraine, but there is no way out, he is unable to win, his army is unable to do anything more than incremental advances at enormous cost, and even that power is coming to its end.
Ah and also - everything Putin says about nuclear threats is total BS. Ignore it completely. You have to anyway, because if you give it credence, then you have no defence. Putin invades and you do nothing meaningful because of nuclear threats. Remember : the West has nuclear weapons too, and it’s exactly for defending against this kind of threat.
Another particular point : I suspect some sort of impression is being formed of an inexorable Russian advance. It simply is not so. Russia has expanded its war production as much as it can now do - to do more would require long lead items to be produced, we’re talking some years of lead time - and Russia is burning through ammunition and equipment much more quickly than they are produced. This is Putin’s big push, prior to the US election, and Putin’s hope that Trump will win.
So, predictions : Kamala will win, the USA and EU continue to support UA, Putin continues to slaughter and maim hundreds of thousands of people in his barbaric, sickening, and anachronistic effort to territorially expand Russia, and the war continues. UA is not short on men - despite all the popular press reports - they have a ton of units available, the problem is lack of heavy weaponry to equip them; the West needs to step up, but it is short on equipment, not having taken the war really seriously and started mass investment in arms production about two years ago, which is about how long it takes to get production going. I suspect UA has got the idea now of advancing into Russia, and this gives them a massive front-line, which I think will lack solid defences, to attack in to. The Russian offensive will die down, Kamala will win, the USA will remain on side (with long struggles to get funding passed, as it will probably be a split Government with UA being widely supported by people, but Trump and his idiots holding a balance of power), the USA/EU will continue equipping UA, and the strategic initiative will pass back to UA, who now I hope will use it to take more and more Russian territory, with a view to settlement based on return to pre-war borders. Putin of course can never accept this - I think we may then end up with a frozen war, until Putin one way or another dies. I do not see Putin being removed from power, any more than Hitler was.
All of this would have been avoided if UA had been in NATO.
The West forgot Russia is a real threat, and then ignored it when Georgia was invaded.