So, a number of updates have all come today (because I knew they would come pretty closely together, so I’ve pulled them into one).
Bring-Up Times have been updated.
https://www.amazonredshift-observatory.ch/bring_up_times/index.html
me-south-1
ds2.xlarge
remains way out in
the lead with a mean of 15 minutes for a two node clusters (and a
maximum of 30:59!) generally though across regions “slow” means three to
five minutes, “fast” means one or two minutes - but this is all for two
node clusters only. I’ve done no investigation to see how bring-up time
relates to cluster sizing.
Redshift Version Tracker has been updated.
https://www.amazonredshift-observatory.ch/redshift_version_tracker/index.html
Been quite a bit of action recently. New versions across all regions
on 2022-04-30 (mainly 1.0.37954
, but one or two
1.0.37944
, which seems like a tiny difference?), three
regions updated on 2022-05-03 to 1.037988
, and another
almost-all regions update on 2022-05-06 to 1.0.38094
,
except the three regions updated on the 3rd. Who knows why? the devs
keep many secrets.
Cross-Region Benchmarks have been updated.
https://www.amazonredshift-observatory.ch/cross_region_benchmarks/index.html
Yup, another two weeks have gone by. It’s a quiet benchmark, this week.
ap-northeast-2
dc2.large
nodes have gone
from being slow to fast (disk read benchmark takes half as long, disk
read-write down 5 seconds from 15).ap-south-1
, eu-central-1
,
eu-north-1
dc2.large
nodes have received (or
in the case of eu-north-1
had returned, as was fast two
benchmarks ago, but not in the previous benchmark) the disk read
performance improvementSystem Table Explorer has been updated.
https://www.amazonredshift-observatory.ch/system_table_explorer/index.html
Bringing it up to 1.0.37988
.
The explorer is going to be re-implemented once the AMI is out. I will in the new version take full copies of certain system tables, and so then be able to produce diffs between arbitrary Redshift version - or histories; you could look at how, say, a given table has changed over, say, the last five releases.
Alexandre Acker, who works for sevDesk, a few days ago with both
eagle-eyes and the wit to realise what he had noticed found what looks
to be a bug in the lag()
window function.
I think the rows generated are still correct, but are
emitted in reverse order because it looks like an unnecessary
sort
step is occurring.
So, here’s a test table. All int4
, nice and simple, note
the sortkey, which is customer_id, date_day
;
create table int_subscriptions
(
customer_id int4 not null distkey,
date_day int4 not null,
contract_id int4 not null
)
diststyle key
compound sortkey( customer_id, date_day );
Now we get the table ready for use by inserting some rows. When we
run the window function, we will partition by customer_id
(which is always 1
) and order by date_day
,
which is ascending, from 1
through to 5
,
inclusive both ends.
insert into int_subscriptions( customer_id, date_day, contract_id ) values (1,1,1), (1,2,2), (1,3,3), (1,4,4), (1,5,5);
vacuum full int_subscriptions to 100 percent;
analyze int_subscriptions;
Okay. Now time for the window function, first the
EXPLAIN
, then the output.
dev=# explain select date_day, lag(contract_id) over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------------
XN Window (cost=1000000000000.11..1000000000000.18 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Sort (cost=1000000000000.11..1000000000000.12 rows=5 width=12)
Sort Key: customer_id, date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(6 rows)
dev=# select date_day, lag(contract_id) over (partition by customer_id order by date_day asc) from int_subscriptions;
date_day | lag
----------+-----
5 | 4
4 | 3
3 | 2
2 | 1
1 |
(5 rows)
So, first, in the EXPLAIN
we can see a sort step. Now,
that’s weird; columns in Redshift are sorted ascending, so the sorting
order of the table already matches the combined window function
partition and order by and so what’s expected is that there is no sort
step. We can also see the order of the output from the window function
is the reverse of what is expected - the column is stored ascending,
order by is ascending, we should start with the first row
(1
) and ascending values. What we get is the
largest row, and then we descend through the values.
It then turns out if we use lead()
, the sort step goes
away; and what’s more, if we stay with lag()
, but change
the order by to desc
, the sort step goes away and
output is then in ascending order! we can see this below;
dev=# explain select date_day, lag(contract_id) over (partition by customer_id order by date_day desc) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# select date_day, lag(contract_id) over (partition by customer_id order by date_day desc) from int_subscriptions;
date_day | lag
----------+-----
1 | 2
2 | 3
3 | 4
4 | 5
5 |
(5 rows)
So, what’s happening? well, it makes my head hurt. I think
it’s that lag()
has flipped the consequence of
asc
and desc
, and so is sorting when there’s
an asc
(which it should not) and not sorting when there’s
desc
(when it should), and also, which is how the
output is still correct, the lag is not the previous row but
the following row.
As a result, we’re seeing the rows we expect from asc
when we use desc
, and the rows from desc
when
we use asc
.
As a check, we can run this test on Postgres.
create table int_subscriptions
(
customer_id int4 not null,
date_day int4 not null,
contract_id int4 not null
);
insert into int_subscriptions( customer_id, date_day, contract_id ) values (1,1,1), (1,2,2), (1,3,3), (1,4,4), (1,5,5);
postgres=# select day_date, lag(contract_id) over (partition by customer_id order by date_day asc) from int_subscriptions;
date_day | lag
----------+-----
1 |
2 | 1
3 | 2
4 | 3
5 | 4
(5 rows)
That’s what asc
output looks like, and that’s
is the output Redshift produces - just Redshift is performing
an unneeded sort
and emitting the rows in reverse
order.
Following is the full list of window functions, and we see that
lag()
seems to be the only affected function
(median()
need a sort normally).
dev=# explain select date_day, avg(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, count(*) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
--------------------------------------------------------------------------
XN Window (cost=0.00..0.10 rows=5 width=8)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=8)
(4 rows)
dev=# explain select date_day, max(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, median(contract_id) over (partition by customer_id) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------------
XN Window (cost=1000000000000.11..1000000000000.18 rows=5 width=12)
Partition: customer_id
Order: contract_id
-> XN Sort (cost=1000000000000.11..1000000000000.12 rows=5 width=12)
Sort Key: customer_id, contract_id
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(6 rows)
dev=# explain select date_day, min(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, stddev_samp(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, stddev_pop(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, sum(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, var_samp(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, var_pop(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=#
dev=# explain select date_day, cume_dist() over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
--------------------------------------------------------------------------
XN Window (cost=0.00..0.10 rows=5 width=8)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=8)
(4 rows)
dev=# explain select date_day, dense_rank() over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
--------------------------------------------------------------------------
XN Window (cost=0.00..0.10 rows=5 width=8)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=8)
(4 rows)
dev=# explain select date_day, lag(contract_id) over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------------
XN Window (cost=1000000000000.11..1000000000000.18 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Sort (cost=1000000000000.11..1000000000000.12 rows=5 width=12)
Sort Key: customer_id, date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(6 rows)
dev=# explain select date_day, lead(contract_id) over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, ntile(1) over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
--------------------------------------------------------------------------
XN Window (cost=0.00..0.10 rows=5 width=8)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=8)
(4 rows)
dev=# explain select date_day, percent_rank() over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
--------------------------------------------------------------------------
XN Window (cost=0.00..0.10 rows=5 width=8)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=8)
(4 rows)
dev=# explain select date_day, rank() over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
--------------------------------------------------------------------------
XN Window (cost=0.00..0.10 rows=5 width=8)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=8)
(4 rows)
dev=# explain select date_day, row_number() over (partition by customer_id order by date_day asc) from int_subscriptions;
QUERY PLAN
--------------------------------------------------------------------------
XN Window (cost=0.00..0.10 rows=5 width=8)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=8)
(4 rows)
dev=#
dev=# explain select date_day, first_value(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, last_value(contract_id) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=# explain select date_day, ratio_to_report(contract_id) over (partition by customer_id) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(3 rows)
dev=# explain select date_day, nth_value(contract_id,1) over (partition by customer_id order by date_day asc rows between unbounded preceding and unbounded following) from int_subscriptions;
QUERY PLAN
---------------------------------------------------------------------------
XN Window (cost=0.00..0.11 rows=5 width=12)
Partition: customer_id
Order: date_day
-> XN Seq Scan on int_subscriptions (cost=0.00..0.05 rows=5 width=12)
(4 rows)
dev=#
Good evening all.
The bi-weekly (ish) cross-region benchmarks are up (I was away for a week :-)
https://www.amazonredshift-observatory.ch/cross_region_benchmarks/index.html
This fortnight’s changes;
Curious change in the disk read/write benchmark in
ap-east-1
and us-east-1
. No major changes,
but, across the board - every node type, not just the small nodes,
there’s a speed up of about 0.5 seconds (out of 3.5 to 5 seconds).
That’s a useful little boost.
ap-northeast-2
, ap-southeast-1
,
ap-southeast-2
, eu-west-2
and
eu-west-3
received the read benchmark improvement, and so
that benchmark now runs in half the time (about 0.06s vs 0.14s)
’ap-northeast-3dc2.large` nodes have lost the read
benchmark improvement, and returned to about 0.14 seconds; the other
node type have not. Question is, do AWS even know this is happening? or
is it deliberate?
ap-southeast-3
, which has always been staggeringly
slow (0.72s read benchmark, compared to the “fast” norm now of 0.06s)
has been improved, for the read benchmark only, to the current norm for
a “slow” dc2.large
node, which is about 0.14s (this being
the only node type available in this region).
eu-north-1
dc2.large
node read
benchmark went from fast to slow (the other benchmarks were slow,
already).
eu-west-1
dc2.large
by the looks of it
had a serious hiccup when running the read benchmark, as the standard
deviation in the current results is 0.25s, instead of the normal 0.00s,
with the benchmark itself being 0.43s, instead of an expected 0.15s.
Just a glitch, I’d say.
us-east-1
dc2.large
became, well, oddly
slow. It’s a third speed band, slower than the normal “slow” nodes. To
give a sense of it, on the read benchmark, fast is 0.06s, normal is
0.14s, and now we have us-east-1
at 0.2s.
That’s all, folks!
So, the existing System Table Explorer is basically a huge mistake :-)
It actually emits the pages you see; there’s no record of the original data.
I’ve now immplemented the back-end part of a properly-designed replacement, and the back-end simply makes copies of certain system tables, dumping them into Postgres.
This will allow be to produce both full listings of information on tables, views, etc, but vitally will allow me to produce diffs across Redshift versions, for anything and everything.
Writing the front-end will take a certain time, but the back-end is now operative, so I’m collecting data which will become available in the future.
Home 3D Друк Blog Bring-Up Times Cross-Region Benchmarks Email Forums Mailing Lists Redshift Price Tracker Redshift Version Tracker Replacement System Tables Reserved Instances Marketplace System Table Tracker The Known Universe White Papers