SYS_TRANSACTION_HISTORY
As regular readers (both of you =-) will know, I’m a big fan of the system tables.
As irregular readers will not know, that was irony.
As a part of this, I absolutely avoid using all of the new SYS_ views, because AWS have blocked users from seeing the source code of those views, I think as part of their effort to hide that Serverless is not serverless.
As such, I have no idea what the SYS views are doing, and there are so many issues in the system tables that you MUST be able to see the source of the views, to figure out the problems, so you can work around the problems, so you can get at least something out of them.
Without that, you’re going to get screwed over, because the information you’re getting is not going to be what you think you’re getting.
With a current client, someone was using SYS_TRANSACTION_HISTORY, and coming up with very different numbers/stats to me for commit behaviour.
So today I had a bit of a look at it - and what I've found is that the column commit_start_time is in fact the start of the commit queuing, not the start of the commit work.
Commit queue times on our system vary from zero (rare) to two minutes - which completely destroys any analysis which thinks it's looking at the time taken for actual work.
The docs say nothing useful (they say this - I quote - "The start time of the commit", which is in fact wrong), and I believe the client was using these numbers under the mistaken impression they were measuring actual work time only. I mean, that's what you would expect, because having only the start of queuing and the end of work is absolutely bloody useless, and no one with any sense (for which I think we could well read real-world experience) would provide that.
I think SYS_TRANSACTION_HISTORY is actually pulling in data from STL_COMMIT_STATS, which has separate columns for start of queuing and start of work, and if we could view the source, we'd be able to see that and know what was wrong. As it is we can't, so the SYS views simply have to be avoided. They might as well not exist for all the good they do, except they do exist, and people are using them and being harmed by them.
AWS, with regard to Redshift, have a culture of secrecy, and this is absolutely and categorically harmful to Redshift users.
I use Pandoc for all my doc work - I write markdown, and convert.
Right now Pandoc is being a complete fail.
I have a text file, an email, I want to convert to PDF.
I’ve quoted with ">", and added backticks where I want to add content into the document.
What do I get?
The quotes disappear and there’s no way to control the font size (I mean there is, it just doesn’t do anything).
So I’m converting the email to HTML, setting the style for <p> to give me pages when I need it, and using wkhtmltopdf, because it’s easier to do this than the work to figure out how to solve these incredible problems with Pandoc.
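For what it's worth, the workaround pipeline is short. A sketch, assuming the email is in email.txt and the <p> styling is hand-edited into the intermediate HTML (filenames are made up):

```shell
# Convert the email to standalone HTML with Pandoc, then render the PDF
# with wkhtmltopdf instead of Pandoc's own PDF path.
pandoc email.txt -f markdown -t html -s -o email.html
# ...hand-edit the <p> style in email.html for page breaks, then:
wkhtmltopdf email.html email.pdf
```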
Rumour has it that STL tables (views after 2019) are not tables as such, but are table-like interfaces onto log files.
I’ve just done a bit of ad hoc checking; they do not appear to be column store (processing two columns takes as long as one column), which is expected, but unexpectedly, they do not appear to take part in transactions.
I ran a query on STL_QUERY, on a busy system. The first part is a CTE which produces one row per row in the table. Then I have two more CTEs, both of which use the first.
In both CTEs, I count the number of rows. The number of rows should be identical - the second and third CTE are both using, and using only, the first CTE as the one and only table in their from clause.
However - I get different counts. The AWS metrics queue differs by 9, and one of the user-defined queues, which is busy, differs by 204 (the total query time was some minutes, so it fits just fine).
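The shape of the query was like this (a reconstruction, not the exact original - the CTE names are mine):

```sql
-- One base CTE over stl_query, then two CTEs which each count its rows.
-- On a busy system the two counts come back different, which only makes
-- sense if each reference to the base CTE re-reads the still-growing log.
with base as (
    select query, starttime
    from   stl_query
),
c1 as ( select count(*) as n from base ),
c2 as ( select count(*) as n from base )
select c1.n as first_count,
       c2.n as second_count
from   c1 cross join c2;
```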
So it looks like to read STL tables correctly, we need to add a limiting clause, something like starttime < current_timestamp, to ensure we get the same data throughout a query.
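A sketch of what that limiting clause looks like in practice. One caveat: if CTEs really are textual substitution, current_timestamp may end up being evaluated separately at each reference, so a literal captured just before running the query is the safer form:

```sql
-- Pin reads of stl_query to a single cutoff, so the data is stable for
-- the duration of the query. Safer still is a pre-captured literal
-- timestamp in place of current_timestamp.
select count(*)
from   stl_query
where  starttime < current_timestamp;
```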
This, assuming I have it right, absolutely and categorically needed to be documented by AWS. It’s incredible that it is not.
(I can do a stronger check of this - emit current_timestamp from the query, then look to see how many queries turned up after that time, and see if it matches up - but that’s for when I have a bit of spare time.)
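That stronger check could look something like this (a sketch; the second statement runs afterwards, with the captured timestamp pasted in by hand):

```sql
-- Step 1: count the rows and capture the time the statement ran.
select getdate() as ran_at, count(*) as n from stl_query;
-- Step 2 (run afterwards, substituting the captured ran_at value):
-- select count(*) from stl_query where starttime > 'ran_at-value-here';
-- If this second count is about the size of the CTE mismatch, the
-- "rows arriving mid-query" explanation holds.
```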
What also surprises me here is that the first CTE isn’t being materialized. Maybe it’s because the underlying table is an STL. It looks to me like all that might be happening is textual substitution within the query.
This is in the RS release notes;
https://docs.aws.amazon.com/redshift/latest/mgmt/cluster-versions.html#cluster-version-183
Amazon Redshift patch 183 Improves performance of INSERT/COPY statements for Serverless or provisioned data warehouses elastically resized by 2x or higher in size.
(Because Serverless is an elastically resized normal cluster.)
SYS_QUERY_HISTORY
prod=# select count(*) from stl_wlm_query;
count
---------
7327667
(1 row)
prod=# select count(*) from sys_query_history;
count
----------
14815791
(1 row)
prod=# select count(*) from stl_wlm_query where query in ( select query_id from sys_query_history );
count
-------
0
(1 row)
query_id in sys_query_history looks like it is NOT the query ID everywhere else.
Incredibly, this is not documented.
I’m absolutely speechless.
If this is really true, the people doing this are raving lunatics, and the people responsible for docs need to be put in a chair and have kids’ TV slimey goo dumped on them until they quit AWS.
Also worth reminding here that AWS specifically block you from viewing the SQL of the SYS views, so we now can’t figure out what’s going on.
SYS views
An AWS chap pointed me at this;
https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html
Which has in it this;
Improving query identifier tracking using the SYS monitoring views
And this tells us the query identifiers in STL views differ to those in SYS views. There is documentation about this.
What I would say though is that if you have something called “query ID” in one view, and something else called “query ID” in another view, and nothing on the page of either view telling readers the two things with the same name are not the same, then no one is going to think they are two different things with the same name, and so no one will look in the docs for an explanation of this.
Also, as far as I can see (the docs to my eye specifically avoid saying this, but it looks to be so), there remains no way to map STL query ID to SYS query ID, so the new system tables are wholly segregated from all the existing system tables with regard to queries (transaction ID is the same in both, and maybe you can try to kludge it that way, but that’s not a solution, that’s a slow and awkward workaround).
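For completeness, a sketch of that transaction-ID kludge. Transaction IDs do match across the old and new system tables (xid in STL, transaction_id in SYS), so a SYS query can at least be narrowed down to the set of STL queries in the same transaction - which is not a mapping, since one transaction can hold many queries:

```sql
-- Narrow a SYS query down to the STL queries sharing its transaction.
-- This is a workaround, not a query-ID mapping.
select sqh.query_id as sys_query_id,
       sq.query     as stl_query_id,
       sqh.transaction_id
from   sys_query_history sqh
       join stl_query sq on sq.xid = sqh.transaction_id
limit  10;
```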
I have to say, the looseness of the writing in the docs drives me crazy.
In SYS views, the query_id column records user-submitted queries in their original form.
First, the query_id column records the query ID of queries. It does not record the query. Trivial issue? Here, yes. Other places, NO. Other places you find out what’s written is just plain wrong, and you could only come to know that by investigating what actually goes on, so you can then understand how the author wrote loosely.
Second, what is “original form”? It’s not explained. We then get an explanation that a query can be broken down into child queries, and if you want to see those queries, you use another table. We are left to infer “original form” means before being broken down. That’s not how you write technical documentation.
Third, it’s glossed over, but I suspect (I’ve not checked) where the text says “user-submitted” it means, for example, all the queries issued by user rdsdb are being censored. All the queries used to run the console, health checks, and so on. They actually perform a fair bit of work. You should understand what you’re not seeing.
I stay away from the docs. Firstly, my experience has been I cannot find what I’m looking for and secondly, when I do read the docs, I don’t trust them, and the writing often drives me crazy.
I’ve been working on a product, replacement system tables.
One of the views gives information about nodes.
Today we had the weekly cluster reboot.
The reboot ran, and I had a look at the nodes, and saw this;
node | ps | fs | ts | disks | disk_size | total_disk_size | disk_used | rms_used | total_used
------+----+----+----+-------+-----------+-----------------+-----------+----------+------------
0 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 12401559 | -1 | 12401558
1 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 12492962 | -2 | 12492960
2 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 12452633 | -2 | 12452631
3 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 12464522 | -2 | 12464520
4 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 12238519 | -2 | 12238517
5 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 12546885 | -2 | 12546883
6 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 2213868 | 10102487 | 12316355
7 | 0 | 16 | 16 | 4 | 6772561 | 27090244 | 12061677 | -2 | 12061675
(8 rows)
rms_used is the number of blocks in RMS (S3 or S3-like). Where you see -1 or -2, really this means 0.
disk_used is the number of blocks on local disk (blocks being counted from stv_blocklist, one row per block).
ps/fs/ts is partial slices, full slices, total slices (compute/data slices, to use AWS terminology).
So what do we see here? Well, we had 8 nodes, and we still do, but one of them has almost all its data in RMS.
I waited a bit and looked again and saw that the number in RMS was going down, at maybe 5 GB per second (my sense of it by eye) and those blocks were being transferred onto local disk.
What we had in fact was a new replacement node, and it arrived when we rebooted.
With about 12 terabytes on the node, and 5 GB per second, it would copy down fairly quickly - but then we issued an elastic resize.
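As a sanity check on “fairly quickly”: Redshift blocks are 1 MB each, so the rms_used figure converts directly to a data volume, and the eyeballed 5 GB/s gives a rough copy-down time:

```python
# Back-of-envelope copy-down time. Assumes the standard Redshift block
# size of 1 MB; the rate is eyeballed, not measured.
blocks_in_rms = 10_102_487   # rms_used for node 6 in the first listing
block_size_mb = 1            # Redshift block size in MB
total_gb = blocks_in_rms * block_size_mb / 1024
rate_gb_per_s = 5            # eyeballed
minutes = total_gb / rate_gb_per_s / 60
print(f"~{total_gb:,.0f} GB in RMS, ~{minutes:.0f} minutes at {rate_gb_per_s} GB/s")
```

That comes out at roughly 9,900 GB and about half an hour, which squares with what I saw.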
After that, we saw (as expected) this;
node | ps | fs | ts | disks | disk_size | total_disk_size | disk_used | rms_used | total_used
------+----+----+----+-------+-----------+-----------------+-----------+----------+------------
0 | 0 | 19 | 19 | 4 | 6772561 | 27090244 | 12997353 | 1570626 | 14567979
1 | 0 | 19 | 19 | 4 | 6772561 | 27090244 | 13097940 | 1636651 | 14734591
2 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 13051688 | 815274 | 13866962
3 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 13074460 | 1063083 | 14137543
4 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 12837869 | 878998 | 13716867
5 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 13156211 | 878311 | 14034522
6 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 3937495 | 9899686 | 13837181
(7 rows)
So, here is what we have: all the nodes were now downloading, and the query which produces this output took about 30s to 35s to run, rather than being instant.
I watched this for a while and data copied down pretty quickly, and about five mins later we were like this;
node | ps | fs | ts | disks | disk_size | total_disk_size | disk_used | rms_used | total_used
------+----+----+----+-------+-----------+-----------------+-----------+----------+------------
0 | 0 | 19 | 19 | 4 | 6772561 | 27090244 | 13919059 | 644379 | 14563438
1 | 0 | 19 | 19 | 4 | 6772561 | 27090244 | 14042085 | 690853 | 14732938
2 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 13865367 | -2 | 13865365
3 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 14038631 | 97511 | 14136142
4 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 13715259 | -2 | 13715257
5 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 14032959 | -2 | 14032957
6 | 1 | 18 | 19 | 4 | 6772561 | 27090244 | 4932616 | 8902971 | 13835587
(7 rows)
And a minute or two later everything was down except for the new node. The query to produce this output was back to running instantly.
We then re-activated the ETL system, and at that point the rate at which the new node downloaded became a lot slower - maybe 100 GB/hour.
So we’re currently looking at about another four hours before the new node has caught up.
I’m surprised the shadow of an elastic resize is so long once the system is active, but hey, it makes sense - so it looks best to let the system fully download data from RMS before going to work. So you should totally get hold of this view from me, so you can know when that’s happened :-)
My experience of cluster level work is that RS does quite a bit of work to look as if the operation has been done quickly, but actually under the hood it’s still working hard (and you just don’t know about it - you only see it in query performance).