STL_VACUUM
Writing some views to show vacuum info. Just been looking at STL_VACUUM.
https://docs.aws.amazon.com/redshift/latest/dg/r_STL_VACUUM.html
The docs for the status column say;
“The status of the VACUUM operation for each table. Possible values are the following:”
Then goes on to give a list, and at the end enumerates;
“VacuumBG - An automatic vacuum operation was performed in the background. This status is prepended to other statuses when they’re performed automatically. For example, a delete only vacuum performed automatically would have a starting row with the status [VacuumBG] Started Delete Only.”
In fact, the column is char(30), so what happens when [VacuumBG] is prepended is that the strings get truncated.
prod=# select distinct status from stl_vacuum order by status;
status
--------------------------------
Finished
Start DeleteOnly(Sorted>=100%)
Started
Started Delete Only (To 100%)
[VacuumBG] Finished
[VacuumBG] Started
[VacuumBG] Started Delete Only
[VacuumBG] Started Range Parti
(8 rows)
Why not just have a bool indicating a background operation? And now it gets really bad; you see, the table contains a single timestamp column to indicate when something happens.
You have to figure out what happened from the status field, the string.
So I have to look at the list of status strings and now guess, and hope that if I find the string “Started” in there then it’s always the beginning of a given operation, and that “Finished” is the end of the operation (and then run a window function over the rows to get the ordering right, so I can pick out the start and end of each vacuum - why not just have one row per vacuum and log on completion? that’s what STL_QUERY et al do). BUT remember now the strings are being truncated by “[VacuumBG]”, so now I have to hope that the truncated status strings always have enough information in them for me to be able to figure out what I need to figure out.
Also, we have a bunch of progress rows, which indicate the progress of any given vacuum - which also include key words you might look for in other status messages, so you have to know how to avoid these messages or your final output of course will be wrong. I have zero faith in the docs, I am certain the list of status messages there is not complete, so what I will have to do is manually issue every form of VACUUM and look to see what is recorded and hope I catch everything.
I then also have to hope (or automate a validation check) that new messages are not added over time, or change.
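The pairing query this forces ends up looking something like the following sketch - which assumes that “Start” and “Finish” reliably mark operation boundaries, that (xid, table_id) identifies one vacuum, and that progress rows have been excluded; all of which, per the above, is guesswork.

```sql
-- Sketch only: pair start/finish rows per vacuum, assuming
-- one start and one finish per (xid, table_id), and that the
-- truncated status strings always still contain "Start" and
-- "Finish". Progress rows would need excluding here as well.
select
  xid,
  table_id,
  min( case when status like '%Start%'  then eventtime end ) as start_ts,
  max( case when status like '%Finish%' then eventtime end ) as finish_ts
from
  stl_vacuum
group by
  xid,
  table_id;
```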
I’ve just noticed also, for a delete only vacuum you can look in the status string for “delete”, for sort only you look for “sort”, but for full there is no indicator. There’s nothing which says “full”. You just get the word “Started”. You have to look for the absence of all other indicators. (In fact, this turned out to be impractical - there were too many other types of vacuum - I now rely on the exact string “Started” to mean the start of a full vacuum, while doing scans for other strings to figure out the other types of vacuum.)
Also, the formatting of the status strings is inconsistent (of course - it could hardly be otherwise). When you issue a full vacuum you do not get the “To 100%” type message, even when you do specify a percentage.
All in all, and in every possible way, this table is abysmal. How on Earth did it come to exist, and to exist for ten years, in a product which has some billions of dollars of revenue per year?
Note on that matter of point #4 - why not just have a flag to indicate something - note we do have a column is_recluster, which is a flag indicating if the fairly new recluster operation has been performed. Unfortunately, this too by the looks of it has been messed up, because it’s an int4 being used as a bool (the value is only ever 1 or 0).
Ah and finally I’ve noticed the key words you search for vary in case - sometimes “Delete”, other times “delete” - so it’s back to that old faithful with low quality string data: converting everything to lower case before performing comparisons.
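Putting all that together, classification ends up being a sketch like this - the keyword set is my guess from the strings I’ve actually seen, not from the docs, and the bare string “Started” is assumed to mean a full vacuum.

```sql
-- Sketch: classify vacuum type from the status string.
-- Keywords are lower-cased first because the case varies.
-- Order matters: "Start DeleteOnly(Sorted>=100%)" contains
-- "sort", so the delete check must come first.
select
  xid,
  table_id,
  status,
  case
    when lower( status ) like '%delete%' then 'delete only'
    when lower( status ) like '%sort%'   then 'sort only'
    when lower( status ) like '%parti%'  then 'range partition'
    when status in ( 'Started', '[VacuumBG] Started' ) then 'full'
    else 'unknown'
  end as vacuum_type
from
  stl_vacuum;
```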
(Addendum - I hate the wording of the VACUUM command when it comes to percentage: vacuum full [table] to 95 percent;. To me that reads “make this table 95% sorted”. What it actually means is “if this table is less than 95% sorted, sort this table fully”. That’s bad enough, but also it turned out the mechanism RS uses for deciding how sorted a table is, is completely fubared. It figures out the percentage from the total number of sorted and unsorted rows in the table - but what you need to do is work that out on a per-slice basis, because queries run as slowly as the slowest slice. You should never use this functionality, ever, no matter what. It’s just broken. Don’t do it. You have to always set to 100 percent, to disable/avoid this functionality. Then you only have to worry about the unusual situation where a vacuum full to 100 percent leaves you with a completely, 100% unsorted table. I’ll write about it some other time - writing up all this so far has taken up a good hour or two, and I’ve got other things to get done.)
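To illustrate the per-slice point, the two views of sortedness can be sketched from STV_TBL_PERM, which holds rows and sorted_rows per slice. This is my own sketch, not anything RS provides.

```sql
-- Sketch: aggregate sorted fraction (what RS appears to use)
-- vs. the worst per-slice sorted fraction (what actually
-- matters, since queries run as slowly as the slowest slice).
select
  id                                           as table_id,
  sum( sorted_rows ) / sum( rows )::float8     as aggregate_sorted_fraction,
  min( sorted_rows / rows::float8 )            as worst_slice_sorted_fraction
from
  stv_tbl_perm
where
  rows > 0
group by
  id;
```

A table can look 95% sorted in aggregate while one slice is almost entirely unsorted - and it is that slice which sets query performance.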
SYS_VACUUM_HISTORY
AWS have recently been coming out with a second series of system tables, prefixed with SYS_.
There is a new table for VACUUM, SYS_VACUUM_HISTORY.
https://docs.aws.amazon.com/redshift/latest/dg/SYS_VACUUM_HISTORY.html
I had a look at the doc page, and it looks like quite a nice table - but I noticed the data types looked a bit wrong, with character being listed for strings, which probably should be varchar.
So I then went and looked at my dump of the table.
And boy did I get a bunch of surprises.
is_automatic is documented as a bool, but it’s actually a text.
The _name fields listed as text are in fact character(128), which means (this is a common problem) you can get UTF-8 in char columns (because you can make tables with UTF-8 names). So you can dump this table, but you can’t then load it into its own DDL.
Next, I note the status field is still char(30) - exactly what’s in STL_VACUUM - and that looks suspicious. I went and checked, did a select distinct, and lo and behold…
prod=# select distinct status from sys_vacuum_history order by status;
status
--------------------------------
Finished
Start DeleteOnly(Sorted>=100%)
Started
Started Delete Only (To 100%)
[VacuumBG] Finished
[VacuumBG] Started
[VacuumBG] Started Delete Only
[VacuumBG] Started Range Parti
(8 rows)
It’s the same as STL_VACUUM, in all its terrible glory. Identical. The same. The docs are flatly wrong, because they now list strings which are simply not what is in the column, and this column has not improved.
I began to suspect that SYS_VACUUM_HISTORY was a view over STL_VACUUM, because of this shared column, but if I run explain select * from sys_vacuum_history; and the same again for stl_vacuum, in fact they each originate from a different table.
prod=# explain select * from stl_vacuum;
QUERY PLAN
------------------------------------------------------------------
XN Seq Scan on stll_vacuum (cost=0.00..3.60 rows=360 width=170)
(1 row)
Time: 226.999 ms
prod=# explain select * from sys_vacuum_history;
QUERY PLAN
-------------------------------------------------------------------------------
XN Seq Scan on stll_user_vacuum_history (cost=0.00..0.75 rows=50 width=1488)
(1 row)
The underlying tables are;
https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum.html
STLL_VACUUM looks a bit sparse - too few columns - to provide everything in STL_VACUUM (but I’ve not looked closely). There are some other related vacuum tables, which I could imagine are being pulled in.
So it looks like SYS_VACUUM_HISTORY is using the same sources of information, but populating a different table of its own.
Finally, we come to the two fundamental problems with all the SYS tables.
First, the SYS tables use a completely new form of query ID - an int8, not an int4 - such that the same query now has two query IDs, one in the STL tables and one in the SYS tables, and so the two sets of tables cannot be joined on query ID. The system tables have been lobotomized. This is on the face of it utter madness.
Second, AWS specifically block users from viewing the SQL of the views. We can’t know how they work, and I know from long experience there’s a lot the system tables get wrong and which you have to ignore, fix or work around. Using system tables from AWS for RS without the source code is completely off the map. It’s reckless and unprofessional given all the problems with the design of the existing tables (which we as ever see again here).
STL_QUERYTEXT
I’ve just spent half a day debugging a replacement system table view. It turns out STL_QUERYTEXT logged the same query twice, which caused my use of listagg to overflow.
prod=# select count(*) from ( select query from stl_querytext where sequence = 0 group by query having count(*) >= 2 );
 count
-------
 23964
(1 row)
So, taking one query in particular;
prod=# select count(*) from stl_querytext where query = 551177675 and sequence = 0;
 count
-------
     2
(1 row)
And now looking in STL_QUERY;
prod=# select count(*) from stl_query where query = 551177675;
count
-------
1
(1 row)
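My workaround is to dedupe before aggregating - this sketch uses row_number(), on the assumption that the duplicated rows are identical copies of each other.

```sql
-- Sketch: drop duplicate (query, sequence) rows before
-- listagg, so duplicated logging cannot overflow the result.
with
cte_dedupe as
(
  select
    query,
    sequence,
    text,
    row_number() over ( partition by query, sequence ) as copy_number
  from
    stl_querytext
)
select
  query,
  listagg( text ) within group ( order by sequence ) as full_query_text
from
  cte_dedupe
where
  copy_number = 1
group by
  query;
```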
I want to benchmark how long a select query takes to execute, for all the obvious reasons.
Some queries are leader node only, some are worker node.
For leader node queries, the situation appears problematic.
I believe the one and only record of select queries on the leader node is in stv_recents.
This table holds the most recent 100 queries (that this is so is undocumented).
So I need to be able to tell a query is leader node only, and then if it is, I need to hope that the query is still in stv_recents so I can get duration information.
I can’t tell if a query is leader node by dint of it being in stv_recents, because worker node queries are shown there as well.
I can’t parse the SQL to figure it out directly, because AWS do not publish a BNF.
I was able to jury-rig a method, by using pg_last_query_id(). This returns the most recent worker node query ID, or -1 if no worker node queries have been issued. So what I do is call this immediately before and immediately after issuing the query I want to benchmark. If the query IDs before and after are identical, it was a leader node query.
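The bracketing looks like this - current_database() here is just my stand-in for a query under test which happens to be leader node only.

```sql
-- Sketch: detect a leader-node-only query by bracketing it
-- with pg_last_query_id(); an unchanged ID means no worker
-- node query was issued in between.
select pg_last_query_id();   -- note this value
select current_database();   -- the query under test
select pg_last_query_id();   -- same value as before => leader node only
```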
Now I need query duration information from stv_recents, for leader node queries. In stv_recents, there is a column starttime for “Time that the query started”, and there is a column duration which I always assumed was run time, but the docs now say this is “Number of microseconds since the session started”. So either the docs are right, and you cannot know query execution time from stv_recents, because there is only a start time, or the docs are wrong. Looking at the numbers, I think the docs are wrong, and this is in fact query duration, not session duration (but I’ve not yet sat down and proved it).
https://docs.aws.amazon.com/redshift/latest/dg/r_STV_RECENTS.html
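One way to eyeball it - my own sketch, nothing official - is to put duration, read as microseconds, next to the query’s own elapsed time; if duration really were time since session start, it would generally dwarf the elapsed time.

```sql
-- Sketch: compare duration (read as microseconds) with the
-- elapsed time of currently running queries, to judge which
-- reading of the docs is correct.
select
  pid,
  starttime,
  duration / 1000000.0                      as duration_as_seconds,
  datediff( second, starttime, getdate() )  as elapsed_seconds
from
  stv_recents
where
  status = 'Running';
```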
Now, the problem here with this approach is that if a query takes a long time, I believe it is still pushed out of stv_recents by following queries. The table does not retain running queries just because they are running.
So, fundamentally, there looks to be only a broken capability to know the duration of leader node queries. You can’t benchmark longer running leader node select queries.
Moving on to worker node queries, I want execution time from start of execution up to beginning of return to client, as return to client is too dependent on external factors.
As such, I need to set aside queue time and compile time, and find out the start of return to leader.
Exec start time is easily found in stl_wlm_query, but - and note everything I now write about compilation is undocumented, but you can’t benchmark execution time correctly without knowing this - these days compiles occur during execution, and so compile time is included in execution time in stl_wlm_query. So we must now turn to svl_compile and sum the compile time for our query, and subtract it from execution time.
However, this will still not always give the correct answer, because what can happen is that a first query comes to compile but notices a second query is already compiling the segment it will need, and then the first query will block, waiting for the segment to be made.
That this waiting period occurred is recorded nowhere. In svl_compile, the blocked query indicates it obtained the segment from cache and took the usual few hundred milliseconds to do so - when in fact the query blocked for some seconds, and these seconds are included in the execution time.
The problem here is that the compilation system has over time fundamentally changed, but the table showing what the compilation system has done has barely changed, and it is now unable to represent what happens.
So we cannot - repeat cannot - actually always know correct query execution times.
That’s incredible.
Anyways, back to the problem in hand.
I have a bunch of queries to benchmark, about 250 or so.
I can’t benchmark one at a time, because querying the system tables on a large cluster is much too slow; I need to build up a set of query IDs, and then issue a single query which pulls the data I need for all those queries.
So, what we end up with is;
with
cte_return_times as
(
select
query as query_id,
min( starttime ) as min_r2c_start_ts
from
stl_return
where
slice >= 12800
and query in ( %s )
group by
query
),
cte_compile_times as
(
select
query as query_id,
sum( endtime - starttime ) as sum_compile_duration
from
svl_compile
where
query in ( %s )
group by
query
),
cte_exec_start as
(
select
query as query_id,
exec_start_time as exec_start_ts
from
stl_wlm_query
where
query in ( %s )
)
select
query_id,
extract( epoch from ((min_r2c_start_ts - exec_start_ts) - sum_compile_duration) ) as exec_duration
from
cte_exec_start
natural join cte_compile_times
natural join cte_return_times;
Where I replace the %s with the comma separated list of query IDs I’m interested in.
It took me several years to know enough about Redshift to be able to figure this out, and it took me about eight hours to actually work through it, bumping into each issue one after the other.
Likely the very large majority of users benchmarking their queries are doing their best and getting the wrong numbers, and thinking they’re correct. (Worse than that, they may be using the console. Never use the console. No one knows how it computes its numbers, and as you can see, there’s a lot of ways in which to go wrong, and given the state of the system tables, would you trust the console to somehow be correct, either in the first place given its provenance, or if it is using the system tables?)
I am of the view the docs and system tables are unfit for purpose, and that this is not acceptable for a product with what is probably $2 billion to $4 billion revenue per year.