Redshift Observatory

Blog

2024-11-02

K-Safety

On dc2 nodes, you can see in stv_partitions additional disks, which contain data belonging to other nodes; these are what provide k-safety of one.

You can filter them out using the owner and host columns.
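By way of example, a minimal sketch of the kind of filter I mean, keeping only the disks each node actually owns (owner, host, diskno, capacity and used are standard stv_partitions columns);

select
  owner,
  host,
  diskno,
  capacity,
  used
from
  stv_partitions
where
  owner = host
order by
  owner,
  diskno;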

On ra3 nodes, if there are local disks doing k-safety, they are not visible in stv_partitions.

I’ve just checked this manually, by writing to disk until it filled, while watching local disk and RMS usage.

So the disk size you see in stv_partitions for ra3 really is what you get.

Relating to this, it looks to me like the new ra3.large nodes now have the disk space ra3.xlplus used to have (932 GB), and ra3.xlplus now has double that - they’ve been upgraded.

2024-11-05

Privacy Package

Last month I did a little work for a Stateside healthcare client. Healthcare data in the US is regulated to the hilt, and as such they could not grant me syslog unrestricted (full access to the system tables), because I would then be able to see query texts, and those texts could contain PII.

To resolve this, I developed a little package of views and configuration which meant I could be granted syslog unrestricted, and so see all rows in all system tables, but with the columns containing query texts returning NULL, so I would not see the texts.

This post documents this work.

So, to begin with, I need to write about the privileges model in Postgres/Redshift (it’s the same model).

First, we have objects - schemas, tables, views, functions, etc.

Second, we have users.

Third, we have groups.

Fourth, we have privileges.

Each type of object has a set of privileges, which make sense for that object - so you have the select privilege on a table, but not on a function, and you have the execute privilege on a function, but you won’t find that on a schema, because it makes no sense to execute a schema.

Privileges for an object are granted to either a group, or to a user.

Users can be members of groups.

A user has all the privileges which are granted directly to that user, and also, through membership of a group, all the privileges which have been granted to that group, for as long as they remain a member. When the user stops being a member of a group, the privileges which belong to that group are no longer conferred upon them.

It can be that a user receives the same privilege directly and also through membership of a group - and it might even be that the user is a member of many groups, each of which grants the same privilege. As long as at least one source provides the privilege, the user has that privilege. So users can end up holding a privilege many times over, each from a different source. When the user leaves a group, they lose access to the privileges held by that group, but it may be the user has those privileges anyway, from elsewhere.

The usual way to arrange privileges in Postgres/Redshift is to grant privileges to a group, and then assign users to the group. This makes revoking sets of privileges from users, as well as the general management of who-has-what, infinitely simpler.
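As a sketch of what I mean (the group, schema, and user names here are invented for illustration);

create group report_readers;

grant usage on schema reporting to group report_readers;
grant select on all tables in schema reporting to group report_readers;

alter group report_readers add user alice;

-- and when alice should lose that set of privileges
alter group report_readers drop user alice;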

Fifth, we have super-users.

Super-users are easy to understand : the privileges mechanism is not invoked when super-users issue commands. An SU can for example revoke every privilege that exists from himself, and it doesn’t make a blind bit of difference, because Postgres simply is not examining or honouring privileges, for the commands the super-user issues.

Now, finally, we have also the group-like object public.

As you will know, there are groups; you will have made them and assigned users to them, and you will have granted privileges to them. You will be aware that you can also grant privileges to something called public, and it appears and behaves like a group containing all users. Granting a privilege to public is the same as granting the privilege to all users.

However, there is no actual group public. If you look in pg_group, where groups are listed, it’s not there. You can’t drop the group public, because there is no group public.

Rather, what happens is that when you grant a privilege to public, Postgres records in the system tables that the privilege has been granted to public, with the understanding that all users now have that privilege. It’s a special, built-in behaviour.

What this means then, is that all users, always, are inherently part of the group-like object public, because whenever a privilege is granted to public, Postgres allows all users that privilege.

This means you cannot remove users from public, because there is no group from which to remove them.

Now we come to the beginning of the work to hide query texts.

There are five system tables (which have all in fact been views since 2019) which contain query texts.

The privilege to read these views is granted to public.

Now, if public were actually a real group, we could just remove the user to be restricted from the group. However, public is not a real group, so we can’t do this.

So what we do first is to revoke the select privilege from public for these views.
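As a sketch, using stl_querytext as a stand-in for the five (one such statement per view);

revoke select on pg_catalog.stl_querytext from public;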

This means now that no one (except an admin) can read these tables.

The next step is to make a group, to which we grant privileges to read these views.

Then we run a procedure (which we made earlier :-), which adds every user to that group, except for the user who should not read these tables.
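A sketch of these two steps - the group and procedure names are invented, and again stl_querytext stands in for the five views;

create group syslog_readers;

grant select on pg_catalog.stl_querytext to group syslog_readers;
-- ...and likewise for the other four views

create or replace procedure add_all_users_to_group( restricted_user varchar )
as $$
declare
  r record;
begin
  -- skip the restricted user and the internal rdsdb user
  for r in select usename from pg_user where usename not in ( restricted_user, 'rdsdb' ) loop
    execute 'alter group syslog_readers add user "' || r.usename || '"';
  end loop;
end;
$$ language plpgsql;

call add_all_users_to_group( 'the_restricted_user' );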

At this point, everyone else is back to being able to read these views, except for the user we’re restricting, who currently cannot read them at all. Remember here our goal is for that user to be able to read those views, but not see the column containing query text.

The next step is that we create a schema (which I will call pp, for “privacy package” - that these views have their own schema is part of the design, as we will see) and within it create five views, one for each of the system views, where, critically, each view has the same name as its matching view in the system tables. These views select every column from their matching system table view, except that for the query text column they emit NULL.
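Again as a sketch, for stl_querytext (I’m assuming here the usual stl_querytext column list, with text being the query text column);

create schema pp;

create view pp.stl_querytext as
  select
    userid,
    xid,
    pid,
    query,
    sequence,
    null::char(200) as text
  from
    pg_catalog.stl_querytext;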

The views are created by and owned by a user who has the privilege to read the system tables in question.

The select privilege on these views is granted to the user we’re restricting (as well as the matching usage privilege on the schema).

Now the final piece of the puzzle; the schema which holds these new views is added to the schema search path of the user being restricted, and it is added earlier in the path than pg_catalog. (The default schema search path contains pg_catalog.)
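Concretely, something like this (the user name is invented, and the exact search path will vary with your setup);

grant usage on schema pp to the_restricted_user;
grant select on all tables in schema pp to the_restricted_user;

alter user the_restricted_user set search_path to pp, pg_catalog, public;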

As such, when he uses the bare names of these views (which is to say, with no schema name, which is the usual way to access system tables, as pg_catalog is on the default schema search path), he will use the new views, which because they are owned by a user who can read the system tables, will read the system tables, but will return NULL for the query text column.

If the restricted user were to try to read any of these five system tables directly in pg_catalog, he would as before not have the privilege to do so, and so would be unable to read them.

At this point, the restricted user can be granted syslog unrestricted, and can read all rows in all system tables without being able to see query texts.
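For reference, that grant is (same invented user name as before);

alter user the_restricted_user syslog access unrestricted;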

(Actually, the system tables which are veneers over row-producing functions, such as the views giving information about late-binding views, do not honour syslog unrestricted. So the user can’t actually read everything, because only an SU can read everything. But they can at least read as much as syslog unrestricted allows.)

bill.com

A non-Redshift post.

A client recently paid me via bill.com.

They were hideous. If you’re using them, I advise you to stop, and if you’re thinking of using them, don’t.

I was charged 35 GBP for a simple, straightforward money transfer (and this via a UI doing everything it could to deceive me into believing I was getting a good deal with “no fees”), was forced to accept a currency conversion I did not want, and closing the account took 45 minutes.

The SSL certs on their help site are invalid; I had to disable SSL validation to use the site. Their customer support is everything you would hate to experience but know so well from a large, bloated, robotic company - a wall of repeated boilerplate - and it took 15 minutes of searching their web-site just to find customer support, along the way turning up various instructions for closing an account which do not match what is shown in the UI.

Also, you can’t close the account yourself, from your account. It is made difficult - as indeed it was - to prevent people from leaving.

Avoid.

(I’ve also just found that some of their emails are broken - multi-part MIME, where they do have text as well as HTML, but the text part is specifically empty, so it is rendered, but it is empty, so there’s nothing. As an aside, their web-site is slow. It takes the log in page about 30 seconds to display the form for username/password, with that time being mainly an idle spinner logo.)

2024-11-07

Commit Queue

This is an important post; it’s about the commit queue.

I think I now finally understand what I set out to understand, weeks ago.

I’m going to write a little about queuing in general, and talk specifically about the commit queue.

Essentially in RS there are two queues; the query queue, and the commit queue.

The query queue we all know and love as the WLM queues, and their slots. That’s fine and well understood.

The commit queue is not well defined, or documented, and the system table for it, stl_commit_stats, is in fact totally and utterly misleading. To the extent that what I now think is correct - and you’ll soon see the timestamps and numbers which led me to it - if you’re using numbers from that table, everything you’ve computed is wrong.

Now for the good stuff.

So, when we issue commit, the transaction commits.

Now, RS doesn’t commit single transactions - it groups them up into a batch and commits them together. (But you can have a batch with a single transaction only, and you’ll get that on quiet systems.) On a busy system you’re looking at 10 or 15 transactions in a batch, and the batch will have been collecting them for about 10 or 15s (before deciding it has enough and starting processing).

Batches are basically processed like this : the batch starts, all the worker nodes do all their work for all transactions all at once, and then each transaction in turn has processing performed by the leader node, and then we’re done.

In stl_commit_stats, every transaction in a batch has one row for the leader node and one row for each of the worker nodes. The row with node = -1 is the leader node row.

What happens now is that the timestamp of when we issued the commit is stored as startqueue in stl_commit_stats (for the leader node row - startqueue is set to 2000-01-01 00:00:00 for worker node rows, which means NULL; that timestamp always means NULL, and you see it a lot in stl_commit_stats because some columns are for the leader only and some for workers only).
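When querying, I treat that sentinel value as NULL, along these lines;

select
  xid,
  node,
  nullif( startqueue, '2000-01-01 00:00:00'::timestamp ) as startqueue
from
  stl_commit_stats;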

The commit query itself is stored in stl_utilitytext, and the start and end times recorded for the commit are the startwork and endwork times for that transaction in stl_commit_stats (for the row where node = -1).

Note specifically the start time recorded in stl_utilitytext is not the time the user issued the commit - it is the time the commit began to be processed, in the commit batch.

Note also here then that if you are adding up total transaction time by computing the difference between start of first query and end of last query, you are including more than the time the queries took to run - you are also including the commit time, because the maximum end time is the end time for commit, and that’s the end of the batch.

Now, batches run serially. RS starts with nothing, then it makes a batch, puts commits into the batch as the commits are issued, and when RS is happy (unknown criteria) it stops accepting new transactions into the batch, and processes the batch.

New commits during this time are queued - but this is NOT the commit queue - I’ll get to that. This is just a short pause while the current batch processes, which on a busy system with large commits almost always takes anywhere from almost no time at all to maybe 2s tops, with maybe 10% or so coming in over this, and with rare outliers of a minute or two. So you can see the delay here is characteristically short.

So the batch finishes processing, and then a new batch begins, and the waiting commits go in, and when RS is happy, that batch then begins to process. Repeat ad infinitum.

I see on current client system about one commit per second, and I think they almost always go very nearly immediately into a batch.

So - we have batches forming up, startqueue is when the commit was issued, and the worker nodes do their thing and then the leader node does its thing, and when the leader node has finished with the final transaction then the next batch begins to form up.

Now, here’s the key - for a long time I assumed the commit query returned at the endtime of the batch.

Stunningly, THIS IS NOT THE CASE.

The commit actually returns to the caller a long time AFTER the endtime of the batch.

There’s a whole chunk of time, and on current client system it’s the large majority of the time, which is completely undocumented in the system tables.

So what I see on current system is;

  1. issue commit - maybe a second or two of queuing, tops, but usually none (there’s an open batch)
  2. batch runs - call it 10s
  3. batch completes - but now we wait for commit to actually come back to us - 60 to 180 seconds!!!!

Let me show you some output, with timestamps, so you can see this.

This is output from a script, calling timeofday(), which gives the time that the query was issued (not the time of the start of the transaction), then calling commit, and then calling timeofday() again.

              timeofday              
-------------------------------------
 Thu Nov 07 09:36:54.032014 2024 UTC
(1 row)

Time: 215.520 ms
COMMIT
Time: 109456.758 ms (01:49.457)
              timeofday              
-------------------------------------
 Thu Nov 07 09:38:43.683259 2024 UTC
(1 row)

So we see;

Immediately before commit : 09:36:54.032014
Immediately after commit : 09:38:43.683259

Commit duration : about 1m 49s.

Now, when I look at the startqueue time for this commit, it is 09:36:54.366534 - very nearly immediately after the “before” timestamp - and this is what I always find. The startqueue time is in fact very close to, or exactly, the time the user issued commit. I do occasionally see a delay, and I think this is when a batch is already executing - the commit has to wait for the current batch to finish, and for the next batch to begin accepting transactions. These delays are usually short, because batches usually do not take long to process - a few seconds at most.

I also then see that the first startwork in the batch, which is the time the batch begins to run, is immediately after the final startqueue in the batch, and so only some seconds after the startqueue of any transaction in the batch. Batches do not wait very long to accumulate transactions.

So here we have;

startqueue : 09:36:54.366534
first batch startwork : 09:37:11.481855

So what this means is that commits ARE going into batches pretty much at, or very soon after, the commit is issued by the user. Batches ARE then executing not very long after this, and batches do NOT take very long to execute.

But now look at the endtime for the batch - the very last timestamp recorded for the batch and the time stl_commit_stats tells us the batch ended.

endtime : 09:37:47.204562

And now go back and look at the “immediately after commit” timestamp, which was 09:38:43.683259

There was just less than a minute of delay AFTER endtime.

Stunning!

And this explains something we see in stv_recents.

That system table is to my knowledge the only table which shows queries running on the leader node.

When I look on current client system, busy system, I see 100 or 200 or more commits on the leader node.

For a long time I thought these were commits which were queuing before running - but now I understand that these are commits AFTER they have run, doing whatever it is they do in that time.

The only idea I have for what’s going on is that this might be when SIV detection occurs - but that’s a wild guess.

So, where does this leave us?

We can measure WLM queuing just fine, now that we know to ignore the end time of commit, because it includes the commit batch processing time.

We can measure how long a transaction had to wait to be processed in its batch by looking at its startqueue and startwork. We can know how long the batch took to run, either at the transaction level (endtime for node = -1) or for the batch as a whole.
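A sketch of pulling those per-transaction numbers (datediff here is in milliseconds);

select
  xid,
  startqueue,
  startwork,
  endtime,
  datediff( ms, startqueue, startwork ) as queue_wait_ms,
  datediff( ms, startwork, endtime ) as batch_run_ms
from
  stl_commit_stats
where
  node = -1
order by
  startqueue desc;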

That leaves us with the question of how to know how long the post-batch delay is.

Since this delay is not in the system tables, the only way I can see to even get a rough number is to subtract the start time of the oldest running commit in stv_recents from the current timestamp, like so;

select
  count(*) as number_commits,
  extract( epoch from (current_timestamp - min(starttime)) ) as post_commit_duration
from
  stv_recents
where
      status = 'Running'
  and query  = 'COMMIT';

And running this right now I get this, which is in seconds.

 number_commits | post_commit_duration 
----------------+----------------------
             82 |             93.33252
(1 row)

Now, having written all this, a thought occurs - it might be that these “after” commits are not queuing. Maybe they finish independently of each other, and some are very quick and some are slow.

I have tooling which lets me run real-time monitoring of RS, and so I wrote a query which shows the pid and age of every COMMIT (in any state - which means running or done, there are only two states) on the leader node, and this is the kind of output I get, which updates in this case about once per second;

# 177864.408751395

interval = 1.5318

pid        | status  | age       
1073852806 | Running | 136.759696
1074124278 | Running | 136.759679
1073974011 | Running | 136.759637
1074256732 | Running | 136.759613
1074192789 | Running | 136.759599
1073916170 | Running | 136.759578
1074245724 | Running | 136.759559
1074147115 | Running | 136.759536
1073861430 | Running | 136.613052
1074008483 | Running | 132.399287
1074471840 | Running | 132.399267
1073981195 | Running | 132.399257
1074490841 | Running | 131.992516
1073761737 | Running | 131.42072 
1073924212 | Running | 131.244186
1073777895 | Running | 131.244155
[snip]

(Unhelpfully, a fairly recent change on RS is that a done query has its pid changed to -1. Classic Redshift.)
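For reference, the query behind that output is essentially this - a sketch, as my tooling wraps it in a refresh loop;

select
  pid,
  status,
  extract( epoch from (current_timestamp - starttime) ) as age
from
  stv_recents
where
  query = 'COMMIT'
order by
  age desc;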

And what do I see?

Basically, four things.

First, you see the same pids, getting older and older, with new pids joining the list and those also getting older. The list builds up, getting longer and older, as the seconds go by.

Second, you see a smattering of shorter-lived pids, which convert to done after between something like 1s and 20s or so.

Third, and this is the biggie, every two minutes or so, the entire built-up list of running pids converts to done in about one to three seconds.

And now for the kicker - the pid of my test script, which creates a table, puts a few gigs in it, and commits - never showed up. Now, I’m polling every 1s to 2s, so if that pid moved immediately or nearly immediately to done, I would not see it (particularly so as RS changes the pid to -1 when done).

(And this is with the long post-commit delay, which was described above - the most recent test script run had the batch endtime being 1m 23s before the commit returned to the user.)

Finally, sometimes there were long - 10s, 20s - delays for the monitoring query to run. I had the feeling the leader node was busy doing something else, and sometimes those long pauses co-incided with the built-up list of running pids being changed to done; the responses from the leader node at times felt “jerky”, not smooth and consistent.

So, where (the hell :-) does this leave us?

Okay.

  1. When commit is issued, there is always (at least on a busy system) a substantial delay between the commit batch ending, and the commit actually returning to the user.

  2. This delay is independent of whether or not the pid issuing the commit ends up in the accumulating running list on the leader node.

  3. I don’t know what to make of the accumulating list of commits on the leader node - is it that these commits have the normal (#1) delay plus the additional one or two (or three!) minute delay we see for them, as they wait on the leader node? I don’t know.

  4. There is to my knowledge absolutely no information logged anywhere in the system tables about when commit actually returns to the user. The only information I can see to obtain is the age of the running commits on the leader node, but that gives us information only about those pids/commits which are showing up in that list, and I also don’t know if this is in addition to the delay in #1 (or is that delay, in their case).

It looks to me like the only way to know your actual commit time is to obtain all the timing stats which are available in the system tables - which gets you up to the end of the commit batch - and then manually record the timestamp of when your commit returns to you.


