Redshift Observatory

Blog

2024-09-05

STL_VACUUM

Writing some views to show vacuum info.

Just been looking at STL_VACUUM.

https://docs.aws.amazon.com/redshift/latest/dg/r_STL_VACUUM.html

Column status;

The status of the VACUUM operation for each table. Possible values are the following:

The page then goes on to give a list, where the final entry is;

VacuumBG An automatic vacuum operation was performed in the background. This status is prepended to other statuses when they’re performed automatically. For example, a delete only vacuum performed automatically would have a starting row with the status [VacuumBG] Started Delete Only.

In fact, the column is char(30), so what happens when [VacuumBG] is prepended is that the strings get truncated.

prod=# select distinct status from stl_vacuum order by status;
             status             
--------------------------------
 Finished                      
 Start DeleteOnly(Sorted>=100%)
 Started                       
 Started Delete Only (To 100%) 
 [VacuumBG] Finished           
 [VacuumBG] Started            
 [VacuumBG] Started Delete Only
 [VacuumBG] Started Range Parti
(8 rows)
  1. The docs are wrong.
  2. Telling people about the flag which indicates a background operation at the bottom of a list of status codes is bad documentation.
  3. Prepending a string to a string field to indicate a flag is nuts.
  4. Why not just add a bool indicating a background operation?

And now it gets really bad; you see, the table contains a single timestamp column to indicate when something happens.

You have to figure out what happened from the status field, the string.

So I have to look at the list of status strings and now guess, and hope, that if I find the string “Start” in there then it’s always the beginning of a given operation, and that “Finished” is the end of the operation (and then run a window function over the rows to get the ordering right, so I can pick out the start and end of each vacuum - why not just have one row per vacuum and log on completion? that’s what STL_QUERY et al do). BUT remember the strings are being truncated by “[VacuumBG]”, so now I have to hope that the truncated status strings always contain enough information for me to be able to figure out what I need to figure out.

Also, we have a bunch of progress rows, which indicate the progress of any given vacuum - and which also include key words you might look for in other status messages, so you have to know how to exclude these rows or your final output will of course be wrong. I have zero faith in the docs - I am certain the list of status messages there is not complete - so what I will have to do is manually issue every form of VACUUM, look at what is recorded, and hope I catch everything.
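To be concrete, here is a minimal sketch of the window function approach I mean; the column names are those of STL_VACUUM, but the matching patterns, and the assumption that for a given table the start and finish rows of one vacuum are adjacent in eventtime order, are mine, and progress rows are dropped simply by keeping only rows matching a start or finish pattern;

-- sketch only; pairs each start row with the next finish row for the
-- same table, and excludes progress rows by keeping only rows which
-- match a start or finish pattern
with
  cte_starts_and_finishes as
  (
    select
      table_id,
      eventtime,
      trim( status ) as status,
      lead( trim(status) ) over ( partition by table_id order by eventtime ) as next_status,
      lead( eventtime )    over ( partition by table_id order by eventtime ) as next_eventtime
    from
      stl_vacuum
    where
         lower( status ) like '%start%'
      or lower( status ) like '%finish%'
  )
select
  table_id,
  status         as start_status,
  eventtime      as start_ts,
  next_eventtime as finish_ts
from
  cte_starts_and_finishes
where
      lower( status )      like '%start%'
  and lower( next_status ) like '%finish%';

Which is exactly as fragile as it sounds, for all the reasons given above.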

I then also have to hope (or automate a validation check) that new messages are not added over time, or change.

I’ve also just noticed that for a delete only vacuum you can look in the status string for “delete”, and for sort only you look for “sort”, but for full there is no indicator. There’s nothing which says “full”. You just get the word “Started”. You have to look for the absence of all other indicators. (In fact, this turned out to be impractical - there were too many other types of vacuum - so I now rely on the exact string “Started” to mean the start of a full vacuum, while scanning for other strings to figure out the other types of vacuum.)

Also, the formatting of the status strings is inconsistent (of course - it could hardly be otherwise). When you issue a full vacuum you do not get the “(To 100%)” type of message, even when you do specify a percentage.

All in all, and in every possible way, this table is abysmal. How on Earth did it come to exist, and to exist for ten years, in a product which has some billions of dollars of revenue per year?

A note on that matter of point #4 - why not just have a flag to indicate something - we do in fact have a column is_recluster, which is a flag indicating whether the fairly new recluster operation has been performed. Unfortunately, this too by the looks of it has been messed up, because it’s an int4 being used as a bool (the value is only ever 1 or 0).

Ah, and finally, I’ve noticed the key words you search for vary in case - sometimes “Delete”, other times “delete” - so it’s back to that old faithful for low quality string data: converting everything to lower case before performing comparisons.
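Putting the last few points together, the kind of classification I end up with looks something like this sketch (the pattern list here is what I’ve seen so far, and is not guaranteed complete);

-- sketch; classify vacuum type from the start rows only, lower-casing
-- first because the key words vary in case; the exact string 'started'
-- is taken to mean a full vacuum, as full has no indicator of its own
select
  eventtime,
  table_id,
  trim( status ) as status,
  case
    when lower( trim(status) ) = 'started'            then 'full'
    when lower( trim(status) ) = '[vacuumbg] started' then 'full (background)'
    when lower( status ) like '%delete%'              then 'delete only'
    when lower( status ) like '%sort%'                then 'sort only'
    else 'other/unknown'
  end            as vacuum_type
from
  stl_vacuum
where
  lower( status ) like '%start%';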

(Addendum - I hate the wording of the VACUUM command when it comes to percentage. vacuum full [table] to 95 percent;. To me that reads “make this table 95% sorted”. What it actually means is “if this table is less than 95% sorted, sort this table fully”. That’s bad enough, but it also turned out the mechanism RS uses for deciding how sorted a table is, is completely fubared. It figures out the percentage from the total number of sorted and unsorted rows in the table - but what you need to do is work that out on a per slice basis, because queries run as slowly as the slowest slice. You should never use this functionality, ever, no matter what. It’s just broken. Don’t do it. You have to always set to 100 percent, to disable/avoid this functionality. Then you only have to worry about the unusual situation where a vacuum full to 100 percent leaves you with a completely, 100% unsorted table. I’ll write about that some other time - writing up all this so far has taken a good hour or two, and I’ve got other things to get done.)
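(For completeness, the form of the command I mean by always setting 100 percent - the table name here is just a placeholder;

-- always force the threshold to 100 percent, so the broken "how sorted
-- is this table" estimate never gets a say; the table name is a placeholder
vacuum full my_table to 100 percent;

)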

SYS_VACUUM_HISTORY

AWS have recently been coming out with a second series of system tables, prefixed with SYS_.

There is a new table for VACUUM, SYS_VACUUM_HISTORY.

https://docs.aws.amazon.com/redshift/latest/dg/SYS_VACUUM_HISTORY.html

I had a look at the doc page, and it looks like quite a nice table - but I noticed the data types looked a bit wrong, with character being listed for strings, which probably should be varchar.

So I then went and looked at my dump of the table, here;

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.sys_vacuum_history.html

And boy did I get a bunch of surprises.

  1. is_automatic is documented as a bool, but it’s actually a text.
  2. The _name fields listed as text are in fact character(128), which means (this is a common problem) you can get UTF-8 in char columns (because you can make tables with UTF-8 names). So you can dump this table, but you can’t then load the dump back into the table’s own DDL.
  3. The status field is a char(30), which looks suspicious - because that’s exactly what’s in STL_VACUUM. I went and checked, did a select distinct, and lo and behold…
prod=# select distinct status from sys_vacuum_history order by status;
             status             
--------------------------------
 Finished                      
 Start DeleteOnly(Sorted>=100%)
 Started                       
 Started Delete Only (To 100%) 
 [VacuumBG] Finished           
 [VacuumBG] Started            
 [VacuumBG] Started Delete Only
 [VacuumBG] Started Range Parti
(8 rows)

It’s the same as STL_VACUUM, in all its terrible glory. Identical. The same. The docs are flatly wrong, because they list strings which are simply not what is in the column, and this column has not improved.

I began to suspect that SYS_VACUUM_HISTORY was a view over STL_VACUUM, because of this shared column, but if I run explain select * from sys_vacuum_history; and the same again for stl_vacuum, it turns out they originate from different tables.

prod=# explain select * from stl_vacuum;
                            QUERY PLAN                            
------------------------------------------------------------------
 XN Seq Scan on stll_vacuum  (cost=0.00..3.60 rows=360 width=170)
(1 row)

Time: 226.999 ms
prod=# explain select * from sys_vacuum_history;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 XN Seq Scan on stll_user_vacuum_history  (cost=0.00..0.75 rows=50 width=1488)
(1 row)

The underlying tables are;

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum.html

and

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_user_vacuum_history.html

STLL_VACUUM looks a bit sparse - too few columns - to provide everything in STL_VACUUM (but I’ve not looked closely). There are some other related vacuum tables;

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum_detail.html

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum_history.html

Which I could imagine are being pulled in.

So it looks like SYS_VACUUM_HISTORY is using the same sources of information, but populating a different table, of its own.

Finally, we come to the two fundamental problems with all the SYS tables.

First, the SYS tables use a completely new form of query ID - an int8, not an int4 - such that the same query now has two query IDs, one in the STL tables and one in the SYS tables, and so the two sets of tables cannot be joined on query ID. The system tables have been lobotomized. This is, on the face of it, utter madness.

Second, AWS specifically block users from viewing the SQL of the views. We can’t know how they work, and I know from long experience there’s a lot the system tables get wrong and which you have to ignore, fix or work around. Using system tables from AWS for RS without the source code is completely off the map. It’s reckless and unprofessional given all the problems with the design of the existing tables (which we as ever see again here).

2024-09-10

STL_QUERYTEXT

I’ve just spent half a day debugging a replacement system table view. It turns out STL_QUERYTEXT logged the same query twice, which caused my use of listagg to overflow.

prod=# select count(*) from ( select query from stl_querytext where sequence = 0 group by query having count(*) >= 2 );
 count 
-------
 23964
(1 row)

So, taking one query in particular;

prod=# select count(*) from stl_querytext where query = 551177675 and sequence = 0;
 count 
-------
     2
(1 row)

And now looking in STL_QUERY;

prod=# select count(*) from stl_query where query = 551177675;
 count 
-------
     1
(1 row)
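
A minimal sketch of the kind of defensive de-duplication this forces on you before calling listagg (this is not my actual view, just the idea; the choice of xid as the tie-breaker is arbitrary);

-- sketch; keep one row per (query, sequence) so a double-logged query
-- does not blow up listagg
with
  cte_deduplicated as
  (
    select
      query,
      sequence,
      text,
      row_number() over ( partition by query, sequence order by xid ) as row_rank
    from
      stl_querytext
  )
select
  query,
  listagg( text ) within group ( order by sequence ) as query_text
from
  cte_deduplicated
where
  row_rank = 1
group by
  query;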

2024-09-12

Benchmarking Queries

I want to benchmark how long a select query takes to execute, for all the obvious reasons.

Some queries are leader node only, some are worker node.

For leader node queries, the situation appears problematic.

I believe the one and only record of select queries on the leader node is in stv_recents.

This table holds the most recent 100 queries (that this is so is undocumented).

So I need to be able to tell a query is leader node only, and then if it is, I need to hope that the query is still in stv_recents so I can get duration information.

I can’t tell if a query is leader node by dint of it being in stv_recents, because worker node queries are shown there as well.

I can’t parse the SQL to figure it out directly, because AWS do not publish a BNF.

I was able to jury-rig a method, by using pg_last_query_id(). This returns the most recent worker node query ID, or -1 if no worker node queries have been issued. So what I do is call this immediately before and immediately after issuing the query I want to benchmark. If the query IDs before and after are identical, it was a leader node query.
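A sketch of the jury-rig, as a plain sequence of statements (the middle query is just an example of a leader node only query);

-- sketch; bracket the query of interest with pg_last_query_id()
select pg_last_query_id();  -- note the result; call it before_id

select version();           -- the query being classified (here, leader node only)

select pg_last_query_id();  -- note the result; call it after_id

-- if before_id = after_id, no worker node query ID was generated, so the
-- query was leader node only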

Now I need query duration information from stv_recents, for leader node queries. In stv_recents, there is a column starttime for “Time that the query started”, and there is a column duration which I always assumed was run time, but the docs now say this is “Number of microseconds since the session started”. So either the docs are right, and you cannot know query execution time from stv_recents, because there is only a start time, or the docs are wrong. Looking at the numbers, I think the docs are wrong, and this is in fact query duration, not session duration (but I’ve not yet sat down and proved it).

https://docs.aws.amazon.com/redshift/latest/dg/r_STV_RECENTS.html
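By “looking at the numbers” I mean nothing more sophisticated than eyeballing the columns side by side, something like this;

-- sketch; if duration really were "microseconds since the session
-- started", the values would grow with session age rather than track
-- per-query run time
select
  pid,
  starttime,
  duration,
  trim( status ) as status,
  trim( query )  as query_text
from
  stv_recents
order by
  starttime desc
limit 10;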

Now, the problem with this approach is that if a query takes a long time, I believe it is still pushed out of stv_recents by following queries. The table does not retain queries merely because they are still running.

So, fundamentally, there looks to be only a broken capability to know the duration of leader node queries. You can’t benchmark longer running leader node select queries.

Moving on to worker node queries, I want execution time from start of execution up to beginning of return to client, as return to client is too dependent on external factors.

As such, I need to set aside queue time and compile time, and find out the start of return to leader.

Exec start time is easily found in stl_wlm_query, but - and note everything I now write about compilation is undocumented, but you can’t benchmark execution time correctly without knowing this - these days compiles occur during execution, and so compile time is included in execution time in stl_wlm_query. So we must now turn to svl_compile and sum the compile time for our query, and subtract it from execution time.

However, this will still not always give the correct answer, because what can happen is that a first query comes to compile but notices a second query is already compiling the segment it will need, and then the first query will block, waiting for the segment to be made.

That this waiting period occurred is recorded nowhere. In svl_compile, the second query indicates it obtained the segment from cache and took the usual few hundred milliseconds to do so - when in fact the query blocked for some seconds, and these seconds are included in the execution time.

The problem here is that the compilation system has over time fundamentally changed, but the table showing what the compilation system has done, has very nearly not changed, and now it is unable to represent what happens.

So we cannot - repeat cannot - actually always know correct query execution times.

That’s incredible.

Anyways, back to the problem in hand.

I have a bunch of queries to benchmark, about 250 or so.

I can’t benchmark one at a time, because querying the system tables on a large cluster is much too slow; I need to build up a set of query IDs, and then issue a single query which pulls the data I need for all those queries.

So, what we end up with is;

  1. get most recent query ID
  2. run query
  3. get most recent query ID
  4. if before and after query IDs identical, dive into stv_recents and hope our query is there and that the docs are wrong and grab the duration
  5. if before and after query IDs differ, record the “after” query ID
  6. repeat above until a reasonable batch of queries have been issued
  7. for the worker node queries in the batch, now issue the following query to get exec times (which is to say, get start of return to leader, compile time as best we can, start of exec, and do the necessary math to compute exec time).
with
  cte_return_times as
  (
    select
      query            as query_id,
      min( starttime ) as min_r2c_start_ts
    from
      stl_return
    where
          slice >= 12800
      and query in ( %s )
    group by
      query
  ),

  cte_compile_times as
  (
    select
      query                      as query_id,
      sum( endtime - starttime ) as sum_compile_duration
    from
      svl_compile
    where
      query in ( %s )
    group by
      query
  ),

  cte_exec_start as
  (
    select
      query           as query_id,
      exec_start_time as exec_start_ts
    from
      stl_wlm_query
    where
      query in ( %s )
  )

select
  query_id,
  extract( epoch from ((min_r2c_start_ts - exec_start_ts) - sum_compile_duration) ) as exec_duration
from
               cte_exec_start
  natural join cte_compile_times
  natural join cte_return_times;

Where I replace the %s with the comma separated list of query IDs I’m interested in.

It took me several years to know enough about Redshift to be able to figure this out, and it took me about eight hours to actually work through it, bumping into each issue one after the other.

Likely the very large majority of users benchmarking their queries are doing their best and getting the wrong numbers, and thinking they’re correct. (Worse than that, they may be using the console. Never use the console. No one knows how it computes its numbers, and as you can see, there’s a lot of ways in which to go wrong, and given the state of the system tables, would you trust the console to somehow be correct, either in the first place given its provenance, or if it is using the system tables?)

I am of the view the docs and system tables are unfit for purpose, and that this is not acceptable for a product with what is probably $2 billion to $4 billion revenue per year.


