Redshift Observatory

Blog

2024-09-05

STL_VACUUM

Writing some views to show vacuum info.

Just been looking at STL_VACUUM.

https://docs.aws.amazon.com/redshift/latest/dg/r_STL_VACUUM.html

Column status;

The status of the VACUUM operation for each table. Possible values are the following:

The page then goes on to give a list, where the final entry is;

VacuumBG An automatic vacuum operation was performed in the background. This status is prepended to other statuses when they’re performed automatically. For example, a delete only vacuum performed automatically would have a starting row with the status [VacuumBG] Started Delete Only.

In fact, the column is char(30), so what happens when [VacuumBG] is prepended is that the strings get truncated.

prod=# select distinct status from stl_vacuum order by status;
             status             
--------------------------------
 Finished                      
 Start DeleteOnly(Sorted>=100%)
 Started                       
 Started Delete Only (To 100%) 
 [VacuumBG] Finished           
 [VacuumBG] Started            
 [VacuumBG] Started Delete Only
 [VacuumBG] Started Range Parti
(8 rows)
  1. The docs are wrong.
  2. Telling people about the flag which indicates a background operation at the bottom of a list of status codes is bad documentation.
  3. Prepending a string to a string field to indicate a flag is nuts.
  4. Why not just add a bool indicating a background operation?

And now it gets really bad; you see, the table contains a single timestamp column to indicate when something happens.

You have to figure out what happened from the status field, the string.

So I have to look at the list of status strings and now guess, and hope, that if I find the string “Start” in there then it’s always the beginning of a given operation, and that “Finished” is the end of the operation (and then run a window function over the rows to get the ordering right, so I can pick out the start and end of each vacuum - why not just have one row per vacuum and log on completion? that’s what STL_QUERY et al do). BUT remember the strings are being truncated by “[VacuumBG]”, so now I have to hope that the truncated status strings always contain enough information for me to be able to figure out what I need to figure out.

Also, we have a bunch of progress rows, which indicate the progress of any given vacuum - and which also include key words you might look for in other status messages, so you have to know how to exclude these rows or your final output will of course be wrong. I have zero faith in the docs - I am certain the list of status messages there is not complete - so what I will have to do is manually issue every form of VACUUM, look at what is recorded, and hope I catch everything.
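To be concrete, here is a minimal sketch of the window function approach I mean; the column names are those of STL_VACUUM, but the matching patterns, and the assumption that for a given table the start and finish rows of one vacuum are adjacent in eventtime order, are mine, and progress rows are dropped simply by keeping only rows matching a start or finish pattern;

-- sketch only; pairs each start row with the next finish row for the
-- same table, and excludes progress rows by keeping only rows which
-- match a start or finish pattern
with
  cte_starts_and_finishes as
  (
    select
      table_id,
      eventtime,
      trim( status ) as status,
      lead( trim(status) ) over ( partition by table_id order by eventtime ) as next_status,
      lead( eventtime )    over ( partition by table_id order by eventtime ) as next_eventtime
    from
      stl_vacuum
    where
         lower( status ) like '%start%'
      or lower( status ) like '%finish%'
  )
select
  table_id,
  status         as start_status,
  eventtime      as start_ts,
  next_eventtime as finish_ts
from
  cte_starts_and_finishes
where
      lower( status )      like '%start%'
  and lower( next_status ) like '%finish%';

Which is exactly as fragile as it sounds, for all the reasons given above.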

I then also have to hope (or automate a validation check) that new messages are not added over time, or change.

I’ve also just noticed that for a delete only vacuum you can look in the status string for “delete”, and for sort only you look for “sort”, but for full there is no indicator. There’s nothing which says “full”. You just get the word “Started”. You have to look for the absence of all other indicators. (In fact, this turned out to be impractical - there were too many other types of vacuum - so I now rely on the exact string “Started” to mean the start of a full vacuum, while scanning for other strings to figure out the other types of vacuum.)

Also, the formatting of the status strings is inconsistent (of course - it could hardly be otherwise). When you issue a full vacuum you do not get the “(To 100%)” type of message, even when you do specify a percentage.

All in all, and in every possible way, this table is abysmal. How on Earth did it come to exist, and to exist for ten years, in a product which has some billions of dollars of revenue per year?

A note on that matter of point #4 - why not just have a flag to indicate something - we do in fact have a column is_recluster, which is a flag indicating whether the fairly new recluster operation has been performed. Unfortunately, this too by the looks of it has been messed up, because it’s an int4 being used as a bool (the value is only ever 1 or 0).

Ah, and finally, I’ve noticed the key words you search for vary in case - sometimes “Delete”, other times “delete” - so it’s back to that old faithful for low quality string data: converting everything to lower case before performing comparisons.
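Putting the last few points together, the kind of classification I end up with looks something like this sketch (the pattern list here is what I’ve seen so far, and is not guaranteed complete);

-- sketch; classify vacuum type from the start rows only, lower-casing
-- first because the key words vary in case; the exact string 'started'
-- is taken to mean a full vacuum, as full has no indicator of its own
select
  eventtime,
  table_id,
  trim( status ) as status,
  case
    when lower( trim(status) ) = 'started'            then 'full'
    when lower( trim(status) ) = '[vacuumbg] started' then 'full (background)'
    when lower( status ) like '%delete%'              then 'delete only'
    when lower( status ) like '%sort%'                then 'sort only'
    else 'other/unknown'
  end            as vacuum_type
from
  stl_vacuum
where
  lower( status ) like '%start%';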

(Addendum - I hate the wording of the VACUUM command when it comes to percentage. vacuum full [table] to 95 percent;. To me that reads “make this table 95% sorted”. What it actually means is “if this table is less than 95% sorted, sort this table fully”. That’s bad enough, but it also turned out the mechanism RS uses for deciding how sorted a table is, is completely fubared. It figures out the percentage from the total number of sorted and unsorted rows in the table - but what you need to do is work that out on a per slice basis, because queries run as slowly as the slowest slice. You should never use this functionality, ever, no matter what. It’s just broken. Don’t do it. You have to always set to 100 percent, to disable/avoid this functionality. Then you only have to worry about the unusual situation where a vacuum full to 100 percent leaves you with a completely, 100% unsorted table. I’ll write about that some other time - writing up all this so far has taken a good hour or two, and I’ve got other things to get done.)
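(For completeness, the form of the command I mean by always setting 100 percent - the table name here is just a placeholder;

-- always force the threshold to 100 percent, so the broken "how sorted
-- is this table" estimate never gets a say; the table name is a placeholder
vacuum full my_table to 100 percent;

)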

SYS_VACUUM_HISTORY

AWS have recently been coming out with a second series of system tables, prefixed with SYS_.

There is a new table for VACUUM, SYS_VACUUM_HISTORY.

https://docs.aws.amazon.com/redshift/latest/dg/SYS_VACUUM_HISTORY.html

I had a look at the doc page, and it looks like quite a nice table - but I noticed the data types looked a bit wrong, with character being listed for strings, which probably should be varchar.

So I then went and looked at my dump of the table, here;

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.sys_vacuum_history.html

And boy did I get a bunch of surprises.

  1. is_automatic is documented as a bool, but it’s actually a text.
  2. The _name fields listed as text are in fact character(128), which means (this is a common problem) you can get UTF-8 in char columns (because you can make tables with UTF-8 names). So you can dump this table, but you can’t then load the dump back into the table’s own DDL.
  3. The status field is a char(30), which looks suspicious - because that’s exactly what’s in STL_VACUUM. I went and checked, did a select distinct, and lo and behold…
prod=# select distinct status from sys_vacuum_history order by status;
             status             
--------------------------------
 Finished                      
 Start DeleteOnly(Sorted>=100%)
 Started                       
 Started Delete Only (To 100%) 
 [VacuumBG] Finished           
 [VacuumBG] Started            
 [VacuumBG] Started Delete Only
 [VacuumBG] Started Range Parti
(8 rows)

It’s the same as STL_VACUUM, in all its terrible glory. Identical. The same. The docs are flatly wrong, because they list strings which are simply not what is in the column, and this column has not improved.

I began to suspect that SYS_VACUUM_HISTORY was a view over STL_VACUUM, because of this shared column, but if I run explain select * from sys_vacuum_history; and the same again for stl_vacuum, it turns out they originate from different tables.

prod=# explain select * from stl_vacuum;
                            QUERY PLAN                            
------------------------------------------------------------------
 XN Seq Scan on stll_vacuum  (cost=0.00..3.60 rows=360 width=170)
(1 row)

Time: 226.999 ms
prod=# explain select * from sys_vacuum_history;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 XN Seq Scan on stll_user_vacuum_history  (cost=0.00..0.75 rows=50 width=1488)
(1 row)

The underlying tables are;

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum.html

and

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_user_vacuum_history.html

STLL_VACUUM looks a bit sparse - too few columns - to provide everything in STL_VACUUM (but I’ve not looked closely). There are some other related vacuum tables;

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum_detail.html

https://www.redshift-observatory.ch/system_table_tracker/1.0.74097/pg_catalog.stll_vacuum_history.html

Which I could imagine are being pulled in.

So it looks like SYS_VACUUM_HISTORY is using the same sources of information, but populating a different table, of its own.

Finally, we come to the two fundamental problems with all the SYS tables.

First, the SYS tables use a completely new form of query ID - an int8, not an int4 - such that the same query now has two query IDs, one in the STL tables and one in the SYS tables, and so the two sets of tables cannot be joined on query ID. The system tables have been lobotomized. This is, on the face of it, utter madness.

Second, AWS specifically block users from viewing the SQL of the views. We can’t know how they work, and I know from long experience there’s a lot the system tables get wrong and which you have to ignore, fix or work around. Using system tables from AWS for RS without the source code is completely off the map. It’s reckless and unprofessional given all the problems with the design of the existing tables (which we as ever see again here).

2024-09-10

STL_QUERYTEXT

I’ve just spent half a day debugging a replacement system table view. It turns out STL_QUERYTEXT logged the same query twice, which caused my use of listagg to overflow.

prod=# select count(*) from ( select query from stl_querytext where sequence = 0 group by query having count(*) >= 2 );
 count 
-------
 23964
(1 row)

So, taking one query in particular;

prod=# select count(*) from stl_querytext where query = 551177675 and sequence = 0;
 count 
-------
     2
(1 row)

And now looking in STL_QUERY;

prod=# select count(*) from stl_query where query = 551177675;
 count 
-------
     1
(1 row)
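
A minimal sketch of the kind of defensive de-duplication this forces on you before calling listagg (this is not my actual view, just the idea; the choice of xid as the tie-breaker is arbitrary);

-- sketch; keep one row per (query, sequence) so a double-logged query
-- does not blow up listagg
with
  cte_deduplicated as
  (
    select
      query,
      sequence,
      text,
      row_number() over ( partition by query, sequence order by xid ) as row_rank
    from
      stl_querytext
  )
select
  query,
  listagg( text ) within group ( order by sequence ) as query_text
from
  cte_deduplicated
where
  row_rank = 1
group by
  query;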

2024-09-12

Benchmarking Queries

I want to benchmark how long a select query takes to execute, for all the obvious reasons.

Some queries are leader node only, some are worker node.

For leader node queries, the situation appears problematic.

I believe the one and only record of select queries on the leader node is in stv_recents.

This table holds the most recent 100 queries (that this is so is undocumented).

So I need to be able to tell a query is leader node only, and then if it is, I need to hope that the query is still in stv_recents so I can get duration information.

I can’t tell if a query is leader node by dint of it being in stv_recents, because worker node queries are shown there as well.

I can’t parse the SQL to figure it out directly, because AWS do not publish a BNF.

I was able to jury-rig a method, by using pg_last_query_id(). This returns the most recent worker node query ID, or -1 if no worker node queries have been issued. So what I do is call this immediately before and immediately after issuing the query I want to benchmark. If the query IDs before and after are identical, it was a leader node query.
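A sketch of the jury-rig, as a plain sequence of statements (the middle query is just an example of a leader node only query);

-- sketch; bracket the query of interest with pg_last_query_id()
select pg_last_query_id();  -- note the result; call it before_id

select version();           -- the query being classified (here, leader node only)

select pg_last_query_id();  -- note the result; call it after_id

-- if before_id = after_id, no worker node query ID was generated, so the
-- query was leader node only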

Now I need query duration information from stv_recents, for leader node queries. In stv_recents, there is a column starttime for “Time that the query started”, and there is a column duration which I always assumed was run time, but the docs now say this is “Number of microseconds since the session started”. So either the docs are right, and you cannot know query execution time from stv_recents, because there is only a start time, or the docs are wrong. Looking at the numbers, I think the docs are wrong, and this is in fact query duration, not session duration (but I’ve not yet sat down and proved it).

https://docs.aws.amazon.com/redshift/latest/dg/r_STV_RECENTS.html
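By “looking at the numbers” I mean nothing more sophisticated than eyeballing the columns side by side, something like this;

-- sketch; if duration really were "microseconds since the session
-- started", the values would grow with session age rather than track
-- per-query run time
select
  pid,
  starttime,
  duration,
  trim( status ) as status,
  trim( query )  as query_text
from
  stv_recents
order by
  starttime desc
limit 10;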

Now, the problem with this approach is that if a query takes a long time, I believe it is still pushed out of stv_recents by following queries. The table does not retain queries merely because they are still running.

So, fundamentally, there looks to be only a broken capability to know the duration of leader node queries. You can’t benchmark longer running leader node select queries.

Moving on to worker node queries, I want execution time from start of execution up to beginning of return to client, as return to client is too dependent on external factors.

As such, I need to set aside queue time and compile time, and find out the start of return to leader.

Exec start time is easily found in stl_wlm_query, but - and note everything I now write about compilation is undocumented, but you can’t benchmark execution time correctly without knowing this - these days compiles occur during execution, and so compile time is included in execution time in stl_wlm_query. So we must now turn to svl_compile and sum the compile time for our query, and subtract it from execution time.

However, this will still not always give the correct answer, because what can happen is that a first query comes to compile but notices a second query is already compiling the segment it will need, and then the first query will block, waiting for the segment to be made.

That this waiting period occurred is recorded nowhere. In svl_compile, the second query indicates it obtained the segment from cache and took the usual few hundred milliseconds to do so - when in fact the query blocked for some seconds, and these seconds are included in the execution time.

The problem here is that the compilation system has over time fundamentally changed, but the table showing what the compilation system has done, has very nearly not changed, and now it is unable to represent what happens.

So we cannot - repeat cannot - actually always know correct query execution times.

That’s incredible.

Anyways, back to the problem in hand.

I have a bunch of queries to benchmark, about 250 or so.

I can’t benchmark one at a time, because querying the system tables on a large cluster is much too slow; I need to build up a set of query IDs, and then issue a single query which pulls the data I need for all those queries.

So, what we end up with is;

  1. get most recent query ID
  2. run query
  3. get most recent query ID
  4. if before and after query IDs identical, dive into stv_recents and hope our query is there and that the docs are wrong and grab the duration
  5. if before and after query IDs differ, record the “after” query ID
  6. repeat above until a reasonable batch of queries have been issued
  7. for the worker node queries in the batch, now issue the following query to get exec times (which is to say, get start of return to leader, compile time as best we can, start of exec, and do the necessary math to compute exec time).
with
  cte_return_times as
  (
    select
      query            as query_id,
      min( starttime ) as min_r2c_start_ts
    from
      stl_return
    where
          slice >= 12800
      and query in ( %s )
    group by
      query
  ),

  cte_compile_times as
  (
    select
      query                      as query_id,
      sum( endtime - starttime ) as sum_compile_duration
    from
      svl_compile
    where
      query in ( %s )
    group by
      query
  ),

  cte_exec_start as
  (
    select
      query           as query_id,
      exec_start_time as exec_start_ts
    from
      stl_wlm_query
    where
      query in ( %s )
  )

select
  query_id,
  extract( epoch from ((min_r2c_start_ts - exec_start_ts) - sum_compile_duration) ) as exec_duration
from
               cte_exec_start
  natural join cte_compile_times
  natural join cte_return_times;

Where I replace the %s with the comma separated list of query IDs I’m interested in.

It took me several years to know enough about Redshift to be able to figure this out, and it took me about eight hours to actually work through it, bumping into each issue one after the other.

Likely the very large majority of users benchmarking their queries are doing their best and getting the wrong numbers, and thinking they’re correct. (Worse than that, they may be using the console. Never use the console. No one knows how it computes its numbers, and as you can see, there’s a lot of ways in which to go wrong, and given the state of the system tables, would you trust the console to somehow be correct, either in the first place given its provenance, or if it is using the system tables?)

I am of the view the docs and system tables are unfit for purpose, and that this is not acceptable for a product with what is probably $2 billion to $4 billion revenue per year.


