On dc2 nodes, you can see in stv_partitions additional disks, which contain data belonging to other nodes and which provide k-safety of one. You can identify and ignore them by means of the owner and host columns.
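For example, a minimal sketch (capacity and used in stv_partitions are in units of 1 MB blocks, so dividing by 1024 gives GB):

select
  host,
  sum( capacity ) / 1024 as capacity_gb,
  sum( used )     / 1024 as used_gb
from
  stv_partitions
where
  owner = host   -- excludes the mirror partitions holding other nodes' data
group by
  host
order by
  host;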
On ra3 nodes, if there are local disks providing k-safety, they are not visible in stv_partitions. I’ve just checked this, manually, by writing to the disk until it filled, and watching local disk and RMS usage. So the disk size you see in stv_partitions for ra3 really is what you get.
Relating to this, it looks to me like the new ra3.large nodes now have the disk space ra3.xlplus used to have (932 GB), and ra3.xlplus now has double that - they’ve been upgraded.
Last month I did a little work for a Stateside healthcare client.
Health-care data in the US is regulated to the hilt, and as such the client could not grant me syslog unrestricted (full access to the system tables), because I would then be able to see query texts, and those texts could contain PII.
To resolve this, I developed a little package of views and configuration which meant I could be granted syslog unrestricted and so see all rows in all system tables, but with the columns containing query texts being NULL, empty, so I would not see the texts.
This post documents this work.
So, to begin with, I need to write about the privileges model in Postgres/Redshift (it’s the same model).
First, we have objects - schemas, tables, views, functions, etc.
Second, we have users.
Third, we have groups.
Fourth, we have privileges.
Each type of object has a set of privileges, which make sense for that object - so you have the select privilege on a table, but not on a function, and you have the execute privilege on a function, but you won’t find that on a schema, because it makes no sense to execute a schema.
Privileges for an object are granted to either a group, or to a user.
Users can be members of groups.
A user has all the privileges which are granted directly to that user, and also, by their membership of a group, all the privileges which have been granted to that group, for as long as they are a member. When the user stops being a member of a group, the user no longer has conferred upon them the privileges which belong to that group.
It can be that a user receives the same privilege directly and also through membership of a group - and it might even be that the user is a member of many groups, each granting the same privilege. As long as one source provides the privilege, the user has that privilege. So users can end up with a privilege many times over, each from a different source. When the user leaves a group, they lose access to the privileges held by that group, but it may be the user has those privileges anyway, from another source.
The usual way to arrange privileges in Postgres/Redshift, is to assign privileges to a group, and then assign users to the group. This makes revoking sets of privileges from users, as well as the general management of who-has-what, infinitely simpler.
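By way of illustration, a minimal sketch, with hypothetical group, table and user names:

create group analytics_readers;

grant select on table public.sales to group analytics_readers;

alter group analytics_readers add user alice;   -- alice now has select on sales, via the group
alter group analytics_readers drop user alice;  -- and now she does not (unless also granted directly)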
Fifth, we have super-users.
Super-users are easy to understand : the privileges mechanism is not invoked when super-users issue commands. An SU can for example revoke every privilege that exists from himself, and it doesn’t make a blind bit of difference, because Postgres simply is not examining or honouring privileges, for the commands the super-user issues.
Now, finally, we also have the group-like object public.
As you will know, there are groups; you will have made them and
assigned users to them, and you will have granted privileges to them.
You will be aware that you can also grant privileges to something called public, which appears and behaves like a group containing all users. Granting a privilege to public is the same as granting the privilege to all users.
However, there is no actual group public. If you look in pg_group, where groups are listed, it’s not there. You can’t drop the group public, because there is no group public.
Rather, what happens is that when you grant a privilege to public, Postgres records in the system tables the privilege as being granted to public, with the understanding that all users now have that privilege. It’s a special, built-in behaviour.
What this means then is that all users, always, are inherently part of the group-like object public, because whenever a privilege is granted to public, Postgres allows all users that privilege.
This means you cannot remove users from public, because there is no group from which to remove them.
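A quick sketch of this (the table name is hypothetical):

-- public accepts grants as if it were a group...
grant select on table public.sales to public;

-- ...but it is not listed where groups are listed
select groname from pg_group;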
Now we come to the beginning of the work to hide query texts.
There are five system tables (which have all been views since 2019) which contain query texts.
The privilege to read these views is granted to public.
Now, if public were actually a real group, we could just remove the user to be restricted from the group. However, public is not a real group, so we can’t do this.
So what we do first is to revoke the select privilege from public for these views.
This means now that no one (except an admin) can read these tables.
The next step is to make a group, to which we grant privileges to read these views.
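For example, a minimal sketch of these two steps, using stl_querytext as a stand-in for the five views (which I’ve not named here) and a hypothetical group name:

revoke select on table pg_catalog.stl_querytext from public;

create group query_text_readers;

grant select on table pg_catalog.stl_querytext to group query_text_readers;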
Then we run a procedure (which we made earlier :-), which adds every user to that group, except for the user who should not read these tables.
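Something along these lines would do it - a sketch only, assuming the hypothetical group name from above, and ignoring users who are already members and unusual user names:

create or replace procedure pp_add_all_users_except( p_restricted_user varchar )
as $$
declare
  r record;
begin
  for r in select usename from pg_user where usename <> p_restricted_user
  loop
    -- dynamic SQL, as the user name varies per row
    execute 'alter group query_text_readers add user "' || r.usename || '"';
  end loop;
end;
$$ language plpgsql;

call pp_add_all_users_except( 'restricted_user' );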
At this point, everyone else is back to being able to read these views, except for the user we’re restricting, who currently cannot read them at all. Remember here our goal is for that user to be able to read those views, but not see the column containing query text.
The next step is that we create a schema (which I will call pp, for “privacy package”; that these views have their own schema is a part of all this, as we will see) and within it create five views, one for each of the system views, where, critically, these views have the same name as their matching view in the system tables. Those views select every column from their matching system table view, except that for the query text column they emit NULL.
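A sketch of one such view, assuming stl_querytext is one of the five, with its column list as per the current docs:

create schema pp;

create view pp.stl_querytext as
  select
    userid,
    xid,
    pid,
    query,
    sequence,
    null::char(200) as text   -- the query text column, blanked
  from
    pg_catalog.stl_querytext;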
The views are created by and owned by a user who has the privilege to read the system tables in question.
The select privilege is granted on these views to the user we’re restricting (as well as the matching usage privilege on the schema).
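For example, with a hypothetical user name:

grant usage on schema pp to restricted_user;
grant select on all tables in schema pp to restricted_user;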
Now the final piece of the puzzle; the schema which holds these new views is added to the schema search path of the user being restricted, and it is added earlier in the path than pg_catalog. (The default schema search path contains pg_catalog.)
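A sketch, with pg_catalog listed explicitly so that pp is searched before it:

alter user restricted_user set search_path to pp, pg_catalog, '$user', public;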
As such, when he uses the bare names of these views (which is to say, with no schema name, which is the usual way to access system tables, as pg_catalog is on the default schema search path), he will use the new views, which, because they are owned by a user who can read the system tables, will read the system tables, but will return NULL for the query text column.
If the restricted user were to try to read any of these five system tables directly in pg_catalog, he would as before not have the privilege to do so, and so would be unable to read them.
At this point, the restricted user can be granted syslog unrestricted, and can read all rows in all system tables without being able to see query texts.
(Actually, the system tables which are veneers over row-producing functions, such as the views giving information about late-binding views, do not honour syslog unrestricted. So the user can’t actually read everything, because only an SU can read everything. But they can at least read as much as syslog unrestricted allows.)
A non-Redshift post.
A client recently paid me via bill.com.
They were hideous. If you’re using them, I advise you not to, and you’re thinking of using them, don’t.
I was charged 35 GBP for a simple, straightforward money transfer (and this via a UI doing everything it could to deceive me into believing I was getting a good deal with “no fees”), was forced to accept a currency conversion I did not want, and closing the account took 45 minutes.
The SSL certs on their help site are invalid; I had to disable SSL validation to use the site. Their customer support is everything you would hate to experience but know so well from a large, bloated, robotic company - imagine a wall of repeated boilerplate, plus you had to search their web-site for 15 minutes, having found various instructions for closing an account which do not match up to what is shown in the UI, to find customer support.
Also, you can’t close the account yourself, from your account. It is made difficult, it would seem, to prevent people from leaving.
Avoid.
(I’ve also just found that some of their emails are broken - multi-part MIME, where they do have text as well as HTML, but the text part is specifically empty, so it is rendered, but it’s empty, so there’s nothing. As an aside, their web-site is slow. It takes the log in page about 30 seconds to display the form for username/password, with that time being mainly an idle spinner logo.)
This is an important post; it’s about the commit queue.
I think I now finally understand what I set out to understand, weeks ago.
I’m going to write a little about queuing in general, and talk specifically about the commit queue.
Essentially in RS there are two queues; the query queue, and the commit queue.
The query queue we all know and love as the WLM queues, and their slots. That’s fine and well understood.
The commit queue is not well defined, or documented, and the system table for it, stl_commit_stats, is in fact totally and utterly misleading. To the extent that what I now think is correct - and you’ll soon see the timestamps and numbers which led me to it - if you’re using numbers from that table, everything you’ve computed is wrong.
Now for the good stuff.
So, when we issue commit, the transaction commits.
Now, RS doesn’t commit single transactions - it groups them up into a batch and commits them together. (But you can have a batch with a single transaction only, and you’ll get that on quiet systems.) On a busy system you’re looking at 10 or 15 transactions in a batch, and the batch will have been collecting them for about 10 or 15s (before deciding it has enough and starting processing).
Batches are basically processed like this : the batch starts, all the worker nodes do all their work for all transactions all at once, and then each transaction in turn has processing performed by the leader node, and then we’re done.
In stl_commit_stats, every transaction in a batch has one row for the leader node and one row for each of the worker nodes. The row with node = -1 is the leader node row.
What happens now is that the timestamp of when we issued the commit is stored as startqueue in stl_commit_stats (for the leader node row - startqueue is set to 2000-01-01 00:00:00 for worker node rows, which means NULL - that timestamp means NULL, always, and you see it a lot in stl_commit_stats because some columns are for leaders only and some for workers only).
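So to look at one transaction’s batch rows, something like this (the xid is hypothetical):

select
  node,
  startqueue,
  startwork,
  endtime
from
  stl_commit_stats
where
  xid = 1234567890
order by
  node;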
The commit query itself is stored in stl_utilitytext, and the start and end times recorded for the commit are the startwork and endwork times for that transaction in stl_commit_stats (for the row where node = -1).
Note specifically the start time recorded in stl_utilitytext is not the time the user issued the commit - it is the time the commit began to be processed, in the commit batch.
Note also here then that if you are adding up total transaction time by computing the difference between the start of the first query and the end of the last query, you are including more than the time the queries took to run - you are also including the commit time, because the maximum end time is the end time for commit, and that’s the end of the batch.
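For example, a sketch of that (mistaken) computation, using svl_statementtext, which includes the COMMIT row and so includes the commit batch time:

select
  xid,
  max( endtime ) - min( starttime ) as elapsed_including_commit
from
  svl_statementtext
group by
  xid;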
Now, batches run serially. RS starts with nothing, then it makes a batch, puts commits into the batch as the commits are issued, and when RS is happy (unknown criteria) it stops accepting new transactions into the batch, and processes the batch.
New commits during this time are queued - but this is NOT the commit queue - I’ll get to that. This is just a short pause while the current batch processes, which on a busy system with large commits almost always takes between almost no time at all and maybe 2s tops, with maybe 10% or so coming in over this, and with rare outliers of a minute or two. So you can see the delay here is characteristically short.
So the batch finishes processing, and then a new batch begins, and the waiting commits go in, and when RS is happy, that batch then begins to process. Repeat ad infinitum.
I see on current client system about one commit per second, and I think they almost always go very nearly immediately into a batch.
So - we have batches forming up, startqueue is when the commit was issued, and the worker nodes do their thing and then the leader node does its thing, and when the leader node has finished with the final transaction then the next batch begins to form up.
Now, here’s the key - for a long time I assumed the commit query returned at the endtime of the batch.
Stunningly, THIS IS NOT THE CASE.
The commit actually returns to the caller a long time AFTER the endtime of the batch.
There’s a whole chunk of time, and on current client system it’s the large majority of the time, which is completely undocumented in the system tables.
So what I see on the current system is that the time for commit to actually come back to us is 60 to 180 seconds!!!!
Let me show you some output, with timestamps, so you can see this.
This is output from a script, calling timeofday(), which gives the time that the query was issued (not the time of the start of the transaction), then calling commit, and then calling timeofday() again.
              timeofday
-------------------------------------
Thu Nov 07 09:36:54.032014 2024 UTC
(1 row)
Time: 215.520 ms
COMMIT
Time: 109456.758 ms (01:49.457)
timeofday
-------------------------------------
Thu Nov 07 09:38:43.683259 2024 UTC
(1 row)
So we see;
Immediately before commit : 09:36:54.032014
Immediately after commit : 09:38:43.683259
Commit duration : about 1m 49s.
Now, when I look at the startqueue time for this commit, it is 09:36:54.366534 - very nearly immediately after the “before” timestamp - and this is what I always find. The startqueue time is in fact very close to, or is, the time the user issued commit. I do occasionally see a delay, and I think this is when a batch is already executing - the commit has to wait for the current batch to finish, so that the next batch can accept transactions. These delays are usually short, because batches usually do not take long to process - a few seconds at most.
I also then see that the first startwork in the batch, which is the time the batch begins to run, is immediately after the final startqueue in the batch, and so only some seconds after the startqueue of any transaction in the batch. Batches do not wait very long to accumulate transactions.
So here we have;
startqueue : 09:36:54.366534
first batch startwork : 09:37:11.481855
So what this means is that commits ARE going into batches pretty much at, or very soon after, the commit is issued by the user. Batches ARE then executing not very long after this, and batches do NOT take very long to execute.
But now look at the endtime for the batch - the very last timestamp recorded for the batch and the time stl_commit_stats tells us the batch ended.
endtime : 09:37:47.204562
And now go back and look at the “immediately after commit” timestamp, which was 09:38:43.683259.
There was just less than a minute of delay AFTER endtime.
Stunning!
And this explains something we see in stv_recents.
That system table is to my knowledge the only table which shows queries running on the leader node.
When I look on current client system, busy system, I see 100 or 200 or more commits on the leader node.
For a long time I thought these were commits which were queuing before running - but now I understand that these are commits AFTER they have run, doing whatever it is they do in that time.
The only idea I have for what’s going on is that this might be when SIV detection occurs - but that’s a wild guess.
So, where does this leave us?
We can measure WLM queuing just fine - well, now that we know to ignore the end time of commit, because it includes the commit batch processing time.
We can measure how long a transaction had to wait to be processed in its batch by looking at its startqueue and startwork. We can know how long the batch took to run, either at the transaction level (endtime for node = -1) or for the batch as a whole.
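A sketch of both measures, taken from the leader node rows:

select
  xid,
  startwork - startqueue as commit_queue_wait,
  endtime   - startwork  as batch_duration
from
  stl_commit_stats
where
  node = -1
order by
  startqueue desc
limit
  20;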
That leaves us with the question of how to know how long the post-batch delay is.
Since this delay is not in the system tables, the only way I can see to even get a rough number is to subtract the start time of the oldest running commit in stv_recents from the current timestamp, like so;
select
count(*) as number_commits,
extract( epoch from (current_timestamp - min(starttime)) ) as post_commit_duration
from
stv_recents
where
status = 'Running'
and query = 'COMMIT';
And running this right now I get this, which is in seconds.
number_commits | post_commit_duration
----------------+----------------------
82 | 93.33252
(1 row)
Now, having written all this, a thought occurs - it might be that these “after” commits are not queuing. Maybe they finish independently of each other, and some are very quick and some maybe are slow.
I have tooling which lets me run real-time monitoring of RS, and so I wrote a query which shows the pid and age of every COMMIT (in any state - which means running or done, there are only two states) on the leader node, and this is the kind of output I get, which updates in this case about once per second;
# 177864.408751395
interval = 1.5318
pid | status | age
1073852806 | Running | 136.759696
1074124278 | Running | 136.759679
1073974011 | Running | 136.759637
1074256732 | Running | 136.759613
1074192789 | Running | 136.759599
1073916170 | Running | 136.759578
1074245724 | Running | 136.759559
1074147115 | Running | 136.759536
1073861430 | Running | 136.613052
1074008483 | Running | 132.399287
1074471840 | Running | 132.399267
1073981195 | Running | 132.399257
1074490841 | Running | 131.992516
1073761737 | Running | 131.42072
1073924212 | Running | 131.244186
1073777895 | Running | 131.244155
[snip]
(Unhelpfully, a fairly recent change on RS is that a done query has its pid changed to -1. Classic Redshift.)
And what do I see?
Basically, four things.
First, you see what are the same pids, getting older and older, with new pids joining the list, and those also getting older. The list is building up, getting longer and older, as the seconds go by.
Second, you see a smattering of shorter-lived pids, which convert to done after between something like 1s and 20s or so.
Third, and this is the biggie, every two minutes or so, the entire built-up list of running pids converts to done in about one to three seconds.
And now for the kicker - the pid of my test script, which creates a table, puts a few gigs in it, and commits - never showed up.
Now I’m polling every 1s to 2s, so if that pid moved immediately or nearly immediately to done, I would not see it (particularly so as RS changes the pid to -1 when done).
(And this is with the long post-commit delay, which was described above - the most recent test script run had the batch endtime being 1m 23s before the commit returned to the user.)
Finally, sometimes there were long - 10s, 20s - delays for the monitoring query to run. I had the feeling the leader node was busy doing something else, and sometimes those long pauses co-incided with the built-up list of running pids being changed to done; the responses from the leader node at times felt “jerky”, not smooth and consistent.
So, where (the hell :-) does this leave us?
Okay.
When commit is issued, there is always (at least on a busy system) a substantial delay between the commit batch ending and the commit actually returning to the user.
This delay is independent of whether or not the pid issuing the commit ends up in the accumulating running list on the leader node.
I don’t know what to make of the accumulating list of commits on the leader node - is it that these commits have the normal (#1) delay plus the additional one or two (or three!) minute delay we see for them, as they wait on the leader node? I don’t know.
There is to my knowledge absolutely no information logged anywhere in the system tables about when commit actually returns to the user. The only information I can see to obtain is the age of the running commits on the leader node, but that gives us information only about those pids/commits which are showing up in that list, and I also don’t know if this is in addition to the delay in #1 (or is that delay, in their case).
It looks to me like the only way to know your actual commit time is to obtain all the timing stats you can which are available in the system tables - which gets you up to the end of the commit batch - and then manually record the timestamp of when your commit returns to you.
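Which is to say, the method used for the output shown earlier - a sketch:

select timeofday();   -- record immediately before issuing commit
commit;
select timeofday();   -- record immediately after commit returns; the difference is the real commit time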