The chap who runs my AirBnB went through basic and is now part of a mobile air defence unit in Kyiv.
We were talking about New Year’s Eve/Day, which were busy, and he sent me some footage from the 2nd Jan (which had been quiet for me - but you only really know about whatever is close enough for you to see or hear). I’ve annotated the footage to the best of my knowledge and local experience.
00:00 : to begin with, we see (and hear) AAA fire in the sky, by dint of tracer rounds. I wondered why, given it’s all radar guided. My cousin (former Captain in the UK army) tells me standard ammo boxing has one round in five being tracer. I can imagine it’s useful, as it helps inform other defenders you’re firing at something.
00:07 : then a single point of light rises up into the sky; this is a surface-to-air missile, I would guess shoulder-launched. It disappears when it comes into the cloud base (which is not particularly visible, as it is dark). Missiles seem surprisingly slow, but we are looking here at something covering a couple of kilometres in a couple of seconds. I’ve seen these things myself; the CCD blurs out their exhaust a little. To the eye, they’re sharp points of light.
00:07 : while the missile is rising, there’s a quick single explosion in the sky over to the right, below the datestamp. This, I think from its size and shape, is a drone being blown up.
00:15 : a very quick explosion a bit to the right of the top center - I suspect this is the missile from 00:07 finding its mark. I suspect what you or I would have seen, had we been there, with the human eye, would have been somewhat different to what the CCD captures; I think that explosion was above the cloud base, and so we would have seen a white flash lighting up the clouds from above. I suspect the CCD doesn’t pick up such a sky-wide flash of light very well, but it does pick up the core of the explosion, even through the clouds.
00:16 : AAA fire begins again, from the left.
00:17 : then there follows, in line with the AAA fire, first, two big explosions right next to each other and one right after the other, then a short string of similar explosions, a bit further on, in the line of fire. I think the first two explosions are from proximity fuses (fuses with little radars in, invented during WW2), and the later explosions are from timed fuses, in part to try and bracket the target, and in part to stop the rounds flying off into the distance and then coming down onto residential Kyiv.
00:18 : while the AAA is ongoing, another shoulder-launched surface-to-air missile begins to rise into the sky, from the same location and so the same team who fired the first missile. As that missile proceeds, we become aware it is in fact coming towards us; it is indeed aimed at a target behind the camera, and so it flies toward the camera, and then passes overhead. The smoke trail is very clear, even in the dark.
00:24 : the missile which passed overhead finds its mark, the explosion being clear, loud and nearby. That type of bang, very crisp, is an explosion in the air. Ground explosions are longer, deeper, much more rolling and “boomy”. Air explosions are like a balloon bursting.
00:25 : more AAA rises up into the sky, from the same source, on the left. While this is occurring, a rapid sequence of explosions occurs in the very top right. This I think is AAA fire, detonating around a target (proximity fuses), as the explosions are multiple, rapid, and in the same location.
00:29 : now, this I don’t quite get; we have another string of explosions, as at 00:17 - but it’s not in line with the AAA fire we can see rising into the sky. I’m not sure where this comes from. I do note the location is pretty much the same as the earlier string of explosions.
In any event, the explosions illuminate a smoke trail, hanging in the air, which given its location is very likely from an earlier shoulder-launched surface-to-air missile, from the same team we saw firing at 00:07 and 00:24.
00:32 : the blast waves from explosions are often strong enough to set off nearby car alarms, which is what now occurs.
00:34 : explosions in the top-right, the edge of which can just be seen, but which can be heard clearly.
00:36 : and here - another string of explosions, and again, unconnected to any tracer fire, in much the same place as the earlier strings of explosions, with the lingering missile smoke trail again being made visible by the light. My guess is some of the ammunition being fired had no tracer rounds.
00:40 : a big flash, at ground level, just off the left side of the screen; that was an incoming drone or missile, and it hit something, as the street lighting goes out about a second later. Exactly what it hit, I do not know. It might just have blown up a bunch of transmission wires, or it might have hit a substation or something like that directly (although it would need some luck to do so - the drones and missiles in use do not seem all that accurate). I have no idea if, or how many, people caught it.
After that, it’s over, just like that. A couple of cars drive by; there’s a curfew at 11pm and I think this was after it, so they would be military, or civilian authorities.
I would now draw your attention to a larger, hopeful, and encouraging picture : that I am here in Kyiv, writing this, because the West has stepped up, and provided weapons, supplies and support, to Ukraine.
The actions of the Governments of most of the people who will read this have been and are absolutely central to the strong and ongoing resistance by Ukraine. We are not bystanders or helpless : we are all, through our taxes put to this end, doing our bit, and it’s working.
Putin’s underestimation of the power of the West is as great as that power will prove to be.
In Postgres, roles replaced users and groups (both become roles). Roles in Redshift do not replace users or groups, but exist alongside them, as a third, separate, first-class type. Roles, like groups, can be granted privileges, but roles, unlike groups, can be granted to roles. Granting a role is in effect the same as adding a user to a group. Along with roles come a new set of Redshift-specific privileges, which can be granted to and only to roles (roles can also be granted the existing Postgres-style privileges). These new privileges are unlike the existing privileges (which are per-user, per-object) as they are global; they apply to everything, everywhere, always, in all databases. They are essentially fragments of the super user. Finally, note the documentation for roles is particularly egregious, and there are a few of the new privileges which are properly tantamount to granting super user, as they allow a user to elevate themselves to super user.
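To make the mechanics concrete, here is a minimal sketch of the syntax involved - the role, user, and table names are invented for illustration, and the system privilege shown is just one example of the new Redshift-only, role-only, global privileges;

```sql
-- Hypothetical names throughout; a minimal sketch of role syntax in Redshift.
CREATE ROLE etl_admin;

-- Postgres-style, per-object privilege - can be granted to a role, just as to a user or group.
GRANT SELECT ON TABLE public.sales TO ROLE etl_admin;

-- New Redshift-specific, global privilege - grantable to roles only, applies everywhere.
GRANT CREATE USER TO ROLE etl_admin;

-- Roles can be granted to roles (which groups cannot do)...
CREATE ROLE junior_etl_admin;
GRANT ROLE junior_etl_admin TO ROLE etl_admin;

-- ...and granting a role to a user is in effect adding that user to a group.
GRANT ROLE etl_admin TO alice;
```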
ap-southeast-1 dc2.large benchmarked 0.3s for the disk read benchmark, vs the usual 0.06s. May just be a bad cluster.
ap-southeast-1 dc2.large has received a large upgrade to the disk-read-write benchmark and the network benchmark. Disk-read-write has gone from the historically constant and slow ~13.5s to the normal ~4.5s or so, network from the usual ~12.3s to the normal ~4s. Disk read is still slow (0.14s vs 0.06s) but this region has now seen a serious upgrade, moving from the slow lane to the normal lane.
eu-west-1 dc2.large disk read benchmark also having a bad hair day, clocking at 0.18s, rather than the usual 0.06s.
https://www.amazonredshift-observatory.ch/cross_region_benchmarks/index.html
I’ve been working for many months on a browser-based AMI for managing Redshift - and it brings me very great pleasure to proudly present the one, the only, the amazingly fantabulous - Max Ganz’s Redshift Combobulator!
There is now a live Combobulator running, which is pointing at a Redshift cluster I’ve populated with an artificial database.
As of 19:00 UTC 2023-02-13 the site is up, and I’m babysitting it, and I’ll leave it running overnight and tomorrow.
To avoid bots walking all over it, I’m going to obfuscate the IP address;
https:// fifty-four. one-hundred-and-sixty-two. fifty-eight. one-hundred-and-twenty-seven
Note - the AMI is designed to run locally, completely firewalled (for PII), and so it has to use a self-signed certificate, because there is no domain name for the IP.
Just click through the warning the browser gives (I promise not to use your credit cards :-)
If you have any questions, or, Heaven forbid!, problems, email me! I’ll be keeping an eye on email, so I should respond in a timely manner.
The ra3.xlplus node type in ap-northeast-2, ca-central-1, eu-south-1, eu-west-1, eu-west-2, us-west-1 and us-west-2 all experienced issues with the disk-read-write benchmark.
The usual performance, which is highly consistent with this node type, is 2.40/0.01s (mean and standard deviation). In all these regions, the benchmark is now about 3.40/2.5s - a huge standard deviation. Something is glitching in all these regions.
ap-northeast-2 dc2.large network benchmark is back to normal, after a month of being slow (>7s vs ~4s).
ap-southeast-1 dc2.large disk-read benchmark has returned to normal after last benchmark’s unusually slow (and possibly just a bad cluster) result.
ap-southeast-3 dc2.large looks to have received the disk-read upgrade, and is now at 0.06s (rather than the slower class of 0.14s nodes).
ca-central-1 dc2.large has lost the disk-read upgrade, and is now at 0.14s. I begin to suspect there may be multiple populations of the same node type. I need to sit down and run a set of benchmarks where I take one region and simply run the benchmarks on one cluster after another, for say ten clusters in a row, and see how much variance is found.
me-south-1 ra3.xlplus processor benchmark returned to normal (2.19/0.00s now, last benchmark was 3.47/0.77s).
us-east-2 ra3.xlplus disk-read-write went from slow (3.50/2.54s - the huge standard deviation again, as seen in many regions for this benchmark) back to normal (2.49/0.02s).
us-west-2 dc2.large and ra3.xlplus both received processor upgrades, going from 3.95/0.08 to 2.91/0.02, and from 2.05/0.06 to 2.21/0.01, respectively. Normally the processor result is pretty much bang-on identical across the bi-weekly benchmarks, so these are stark and large upgrades.
https://www.amazonredshift-observatory.ch/cross_region_benchmarks/index.html
A small note to say releases seem to have been coming much more slowly over the last two or so months.
https://www.amazonredshift-observatory.ch/redshift_version_tracker/index.html
Updated a week ago with the new 1.0.46607 release.
https://www.amazonredshift-observatory.ch/system_table_tracker/index.html
Republished today.
I think I may need to partition the results into months or something like that, so changes over time can be seen. I changed the order of the time columns from mean/sd-max-min to max-mean/sd-min, which I think is significantly easier to take in.
https://www.amazonredshift-observatory.ch/bring_up_times/index.html
Not one price has changed in the history of tracking =-)
https://www.amazonredshift-observatory.ch/redshift_price_tracker/index.html
The AWS Big Data Blog, on the 13th Feb 2023, so just a few days ago, published a blog post about a migration to ra3 by a company, OLX.
I worked at OLX, as a Redshift specialist contractor, for about four months, over the end of 2019, start of 2020.
I was admin on and working on the very Redshift cluster that was migrated.
I worked with Miguel (one of the two blog post authors, the other being from AWS) most days over that time.
Now, by and large, you never speak about the internal state of a client. It’s a simple breach of confidence. You would no more do so than pass on a secret your best friend had confided in you.
However, I have some concern about what has been published in the blog post. I feel it paints a significantly inaccurate picture, and that this picture is being used by AWS to persuade people to give money to AWS.
Given these considerations, I’ve written about the blog post, but with tact and discretion, and reducing information about OLX itself to the absolute minimum. I speak as much as possible in general terms.
So, having explained the situation, let’s have a look at what’s been written.
I’ve quoted only relevant sections. To see the whole blog post, go and read it.
The first thing to mention, then, is that to my knowledge this was in fact not the first migration attempt. There was one before this, after I left, which was, as I understood it, pretty much unplanned - suck it and see. The cluster was moved over, it did not work well, and they rapidly reverted.
This migration, described in the blog post, was to my eye much more carefully planned, and in the end, they were able to stay on ra3.
It is important to publish failures, not only successes, so that people can know what they really do need to do, to be successful.
(I must mention that I think Miguel would not have been in charge of that earlier attempt; what you see in the blog post, the preparation and care, that’s him - I worked with him, and I think he’s good; careful, thoughtful, thorough, and a thoroughly nice chap.)
Before migrating to RA3, we were using a 16 DC2.8xlarge nodes cluster with a highly tuned workload management (WLM), and performance wasn’t an issue at all.
Redshift is highly knowledge intensive. The admin, devs and the users must fully understand the impact sorting has upon table design and queries. When there are large numbers of ordinary users, this is highly unlikely to be the case.
As such, and speaking now in general terms about Redshift in industry, I may be wrong, but I am of the view that about 95% of all Redshift clusters are incorrectly operated and the users are having a terrible time; typically, no one involved knows about sorting or its implications for them and their work, and Redshift is used as if it were Postgres - but with the expectation that, as a “data warehouse”, it has much greater capacity.
In fact, Redshift is a wholly different type of database, with entirely different use cases. It is not a general purpose database, and cannot be used by all and sundry. It is extremely knowledge intensive.
I must also mention the issue of selection bias. It can be that potential users of Redshift, data analysts for example, have other data processing systems available to them, and if they are not happy with Redshift, they stop using Redshift. With any data system which has issues, those users who are unhappy and can leave, do leave, and you’re left only with those who are getting on just fine, and those who cannot leave - and they, I find, usually give up complaining because nothing changes.
However, we kept facing challenges with storage demand due to having more users, more data sources, and more prepared data. Almost every day we would get an alert that our disk space was close to 100%, which was about 40 TB worth of data.
I think when there are large numbers of users, and their database use is uncontrolled, there is normally going to be a surprisingly rapid and ongoing increase in disk use.
With a Redshift cluster, once you get to 90% full, you start finding yourself spending time keeping the cluster running. Once it gets to 95% full, you’re spending all your time keeping the cluster running.
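As an aside, a sketch of the kind of query which can be used to keep an eye on this - it uses the stv_partitions system table, and what thresholds you act on are, of course, your own choice;

```sql
-- Approximate percentage of total disk in use across the cluster.
SELECT ROUND( SUM(used)::DECIMAL / SUM(capacity) * 100, 1 ) AS pct_disk_used
FROM stv_partitions;
```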
Being short on disk space, and having a considerable and ongoing increase in disk use, obviously leads to a difficult situation. What’s needed are more nodes - to expand the cluster - but here we can run into administrative issues; there can be irreducible delays in getting approval for new nodes, perhaps a need to wait for the next budget cycle - indeed, there can even be political delays, where questions come to be asked as to why the cluster is growing so quickly.
(In fact, part of what I did there was write a generalized table DDL modifier, which converted a large number of small tables to unsorted, and freed up about 20% disk space. It was quite a piece of work - if you want to change a table, you have to figure out its dependencies, enumerate them all, record their DDL (views, etc) so you can recreate them, and record all privs which have been granted on everything, drop all dependencies, copy the table into its new form, then rebuild all dependencies - in the correct order - and regrant all privs. I wrote this code, although I’d written quite a lot of it in my own time prior to the gig, and just brought it into play. One off-thought about this… this was before auto for table sorting, so not specifying a sortkey meant you were unsorted. However, once auto was introduced, not specifying a sortkey means you get auto. When this feature was introduced, did it convert all those unsorted small tables to auto? Did a lot of them stop being unsorted, and so end up going back to using huge amounts of overhead?)
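To give a feel for the core of what that tool did for a single small table - leaving aside the dependency and privilege handling, which was most of the real work - here is a conceptual sketch only, with invented table and column names;

```sql
-- Conceptual sketch; the real tool enumerated dependencies, recorded view DDL and
-- privileges, and rebuilt everything in the correct order. All names are hypothetical.

-- 1. Create the replacement table with no SORTKEY (which, pre-auto, meant unsorted).
CREATE TABLE public.country_codes_unsorted
(
  code  VARCHAR(2),
  name  VARCHAR(64)
)
DISTSTYLE ALL;

-- 2. Copy the data across.
INSERT INTO public.country_codes_unsorted SELECT * FROM public.country_codes;

-- 3. Drop the (previously recorded) dependent views, then swap the tables.
DROP TABLE public.country_codes;
ALTER TABLE public.country_codes_unsorted RENAME TO country_codes;

-- 4. Recreate the recorded views, in dependency order, and re-grant all recorded privileges.
```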
Our usual method to solve storage problems used to be to simply increase the number of nodes. Overall, we reached a cluster size of 18 nodes. However, this solution wasn’t cost-efficient enough because we were adding compute capacity to the cluster even though computation power was underutilized.
In the usual case, when Redshift is incorrectly operated, almost all joins are hash joins, because admin and users lack the necessary knowledge to allow for merge joins. Hash joins are expensive in every way - disk space, network, compute.
When a cluster is operated incorrectly, I would always expect compute to be heavily used, because everything is a hash join.
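By way of contrast, here is a sketch - table and column names invented - of the kind of table design which permits merge joins; both tables distributed and sorted on the join column, and kept sorted;

```sql
-- Hypothetical tables; with matching DISTKEY and SORTKEY on the join column,
-- and with both tables fully sorted, the planner can use a merge join.
CREATE TABLE customers
(
  customer_id  INT8 NOT NULL,
  name         VARCHAR(128)
)
DISTSTYLE KEY
DISTKEY ( customer_id )
SORTKEY ( customer_id );

CREATE TABLE sales
(
  customer_id  INT8 NOT NULL,
  amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY ( customer_id )
SORTKEY ( customer_id );

-- This join can now be a merge join - no hashing, no redistribution.
SELECT c.name, SUM(s.amount)
FROM sales AS s
JOIN customers AS c ON c.customer_id = s.customer_id
GROUP BY c.name;
```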
This cluster’s performance was generally good, ETL (extract, transform, and load) and interactive queries barely had any queue time, and 80% of them would finish in under 5 minutes.
If the ETL job was processing a small amount of data, perhaps it ought to finish in 30 seconds, and if it’s taking 300, then that would mean there is a problem - it means nothing to only say how long a job takes.
I note also specifically interactive queries had barely any queue time - but what proportion of queries were interactive? and how long did they take once they were running?
To my eye, this paragraph is too brief, and so creates questions rather than providing answers.
Initially, we chose the RA3.4xlarge node type for more granular control in fine-tuning the number of nodes. However, we overlooked one important detail: the same instance type is used for worker and leader nodes.
Now, this is interesting. This is a basic Redshift fact. OLX may not have known this, but the AWS staff involved would have.
A leader node needs to manage all the parallel processing happening in the cluster, and a single RA3.4xlarge wasn’t enough to do so.
In the normal case, I am of the view the leader node for a cluster, regardless of node type, is just fine. I think the AWS staff thought this too - and bear in mind here, OLX went to an 18 node cluster when they were trying ra3.4xlarge - but the maximum number of nodes is 128! It’s not like they were pushing the limits. I also have reports from skilled Redshift admin running very large node count clusters that the leader node can seem to become a processor-time bottleneck, when you have a lot of nodes, but apart from that, it’s okay.
To my eye then, something with the workload on that cluster was abnormally stressing the leader node, rather than the leader node in general normally being underpowered; and this caught out the AWS staff, because it’s out of the ordinary.
Use 6 x RA3.4xlarge, restored from 6 x RA3.4xlarge snapshot.
Use 18 x RA3.4xlarge, elastic resize from 6 x RA3.4xlarge.
This was likely a mistake. A new cluster of 6 ra3.4xlarge would have 4 full slices per node - a total of 24 full slices. The elastic resize to 18 nodes would have distributed those slices over the 18 nodes, so 12 nodes would have 1 full slice, and 6 nodes 2 full slices - the balance on each node, to 4 slices, being made up by partial slices, which do not read data from RMS and participate in a limited number of step types only.
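As an aside, the slice-to-node mapping is visible in the stv_slices system table, so a sketch of a query to see how slices ended up distributed after a resize;

```sql
-- Number of slices per compute node (column availability varies a little by Redshift version).
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;
```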
AWS would have known this, and they would not have told OLX - AWS tell clients “they don’t need to know about the difference”. Nor, I would say, could AWS have said anything about the fact that this resize would lead to what I expect was quite different performance to that after a classic resize, because that would require them to say a classic resize was necessary - and a classic resize would take days; information too negative to share.
In case the performance of hourly and daily ETL is not acceptable, the contingency plan is triggered:
Add one more node to deal with the unexpected workload. Increase the limit of concurrency scaling hours. Reassess the parameter group.
The next morning, we woke up to what we dreaded: a slow cluster.
Which must mean queries were in fact running more slowly than before.
There are, to my eye, two primary considerations when moving to ra3.

The first is that dollar for dollar, you’re getting less RAM. I have heard of other skilled Redshift admin experimenting with migrations, and this seemed to be the issue they were running into - the conversion ratio which AWS present for dc2 and ds2 to ra3 is over-optimistic, and if you follow it, you end up with insufficient memory, and performance can fall off a cliff.

The second issue is that on dc2 and ds2, all storage is local and runs at the speed of local disk. With ra3, storage is now in the cloud, and my benchmarking of ra3.xlplus indicates RMS runs at 25% of the speed of the local disk (which is a block-based LRU cache of RMS).
So, put simply, the storage devices have been reduced in speed by a factor of 4 (but now have a local cache) - in other words, storage is slower. Whatever you might say about this, you can’t say you expect performance to increase because of the move to RMS.
We triggered our contingency plan and in the following few days we ended up implementing all three actions we had in the contingency plan.
Adding one extra node itself didn’t provide much help,
Presumably an elastic resize, as it seems it was done quickly, which would not have given quite the changes OLX were expecting.
however users did experience good performance during the hours concurrency scaling was on.
I am very strongly against concurrency scaling. First, I consider it inappropriate - palliative. Redshift scales by operating sorting correctly which means queries are lightning fast. You do not need to run many queries in parallel, because the queries which are issued are over very quickly. When Redshift is operated incorrectly, and queries are slow because sorting is not being used, concurrency scaling can help; but it’s expensive. You pay at the per-second rate.
In this situation, it was expedient.
The concurrency scaling feature allows Amazon Redshift to temporarily increase cluster capacity whenever the workload requires it. We configured it to allow a maximum of 4 hours per day—1 hour for free and 3 hours paid. We chose this particular value because price-wise it is equivalent to adding one more node (taking us to nine nodes) with the added advantage of only using and paying for it when the workload requires it.
Right, but I would expect all the CSC time is being used.
Therefore, we decided to try auto WLM with the following configuration.
I am very strongly against AutoWLM. It is a black box, which has a troubled history, and it changes without notification from AWS. You cannot knowingly design a correct system when that system contains black boxes, let alone black boxes which change without warning.
What I currently know of AutoWLM (information and rumours slowly emerge, one way and another) is that every 200 queries, AutoWLM recalculates slots and memory allocation.
The problem is, if you run a bunch of light queries, and so get tiny slots, and then run a big query, suddenly you’re in trouble.
What I hear is that people using it see it runs fine for a while, and then sooner or later, a big backlog of queries forms.
I would be interested to know more about OLX’s experience with AutoWLM, and I am not sure that we are necessarily getting the whole picture here.
After applying concurrency scaling and auto WLM, we achieved stable performance for a whole week, and considered the migration a success.
Stable performance for a week, I have to say, feels like an underwhelming attainment; but I think here Miguel’s innate truthfulness as an engineer is manifesting itself.
We are now at 300 monthly active users. Cluster costs did increase due to the new node type and concurrency scaling,
Which brings to mind the blog post title;
How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics
Cost-effective, by being more expensive?
AWS will have picked the blog post title, of course, not Mig or OLX.
but we now feel prepared for the future and don’t expect any cluster resizing anytime soon.
So, I may be wrong, but to my eye, what’s happened here is that the disk space issue has been resolved, from the point of view of the Redshift admin team.
Previously, with dc2, new nodes were needed, and they were regularly needed, and this was a problem. Now with ra3, there’s RMS; the rate at which new data is coming in hasn’t changed at all - and it’s being paid for, RMS is not free - but it no longer intrudes. There’s no longer a discrete, particular bill for another reserved instance; RMS costs are just a quiet part of the ongoing AWS bill.
This benefit has come at a price; the cluster is slower, and so now needs concurrency scaling and concurrency scaling is expensive for what it is, and the nodes in the cluster itself are more expensive than before.
I have to say, also, that I can’t see anyone here asking the users how things are for them; and changing the node type is of course not going to influence whether the cluster is correctly or incorrectly operated.
(To be fair, no one ever does ask users. Users to my eye are like tax payers using free at the point of use health services; they have no power or influence of any kind. The only way users have power is if they pay for the services they use, and can go elsewhere.)
Let’s have a look at pricing.
Originally, 18xdc2.8xlarge, no concurrency scaling.
In the Frankfurt region, with one year reservation and all up-front, this costs 35,910 per node, which is 646,380 USD in total.
Now we have 8xra3.16xlarge, I’ll work out RMS and concurrency after the nodes.
In the Frankfurt region, with one year reservation and all up-front, this costs 90,065 per node, which is 720,520 in total.
RMS is 2.4 cents per gigabyte per month, so 40 terabytes, which is 40,960 gigabytes, is 983.04 USD per month, which is 11,796 USD per year in total.
CSC is three hours per day (I’ll assume weekdays only) at the per-second rate.
For ra3.16xlarge, that’s 15.578 USD per node per hour, so three hours a day, 260 working days (give or take one or two) per year, eight nodes, so 97,207 USD per year.
Grand total : 829,523 USD.
Previous total : 646,380 USD.
Increase : 183,143 USD.
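For the record, the arithmetic, as a single query - the figures are those given above; the 260 weekdays and three CSC hours per day are my assumptions;

```sql
-- Cost arithmetic from the figures above, in USD (node prices are one-year, all up-front reservations).
SELECT
  18 * 35910                        AS old_dc2_nodes,  -- 646,380
  8 * 90065                         AS new_ra3_nodes,  -- 720,520
  ROUND( 40960 * 0.024 * 12 )       AS rms_per_year,   -- ~11,796
  ROUND( 15.578 * 3 * 260 * 8 )     AS csc_per_year,   -- ~97,207
  8 * 90065
    + ROUND( 40960 * 0.024 * 12 )
    + ROUND( 15.578 * 3 * 260 * 8 ) AS new_total;      -- ~829,523
```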
Jesus. I could save these guys so much money :-)
Know what would be interesting? figuring out the hardware cost of the nodes, and then thinking about just buying them (one off cost of 150k maybe? including 80 TB storage) and running clustered Postgres on them.
The best photo I’ve taken in Kyiv.
In the bi-weekly benchmarks, from fortnight to fortnight, there’s a lot of variation between dc2.large clusters.
Now, it could be that those clusters are identical if you fire them up one after the other, and the changes seen in the bi-weekly benchmark reflect longer-term changes in the nodes or their environment - or it might be clusters vary a lot, even if you’re just making one cluster directly after another.
I’ve been meaning to investigate this for ages.
So this evening, finally, I fired up and benchmarked ten 2xdc2.large clusters, one directly after the other.
Here are the results;
cluster_id | bring_up_start_utc | bm_name | node_id | max_duration | avg_duration | min_duration
------------+----------------------------+---------+---------+-----------------+-----------------+-----------------
pIDw0 | 2023-02-24 18:01:29.844692 | dr | 0 | 00:00:00.06885 | 00:00:00.065582 | 00:00:00.063572
MjdMZ | 2023-02-24 18:07:20.308433 | dr | 0 | 00:00:01.051501 | 00:00:00.716193 | 00:00:00.346439
Cz6Gl | 2023-02-24 18:13:32.535197 | dr | 0 | 00:00:00.063639 | 00:00:00.062683 | 00:00:00.062258
dga69 | 2023-02-24 18:18:09.281297 | dr | 0 | 00:00:00.06043 | 00:00:00.059754 | 00:00:00.059462
pCsny | 2023-02-24 18:26:21.088642 | dr | 0 | 00:00:00.064252 | 00:00:00.062574 | 00:00:00.061975
GnLyz | 2023-02-24 18:39:44.874939 | dr | 0 | 00:00:00.064635 | 00:00:00.062722 | 00:00:00.060401
lvmWk | 2023-02-24 18:50:59.088157 | dr | 0 | 00:00:00.217796 | 00:00:00.115286 | 00:00:00.0661
cC1v9 | 2023-02-24 18:58:49.007792 | dr | 0 | 00:00:00.07829 | 00:00:00.068956 | 00:00:00.063517
MBVwj | 2023-02-24 19:10:12.009875 | dr | 0 | 00:00:01.046866 | 00:00:00.729639 | 00:00:00.349607
QoJ3w | 2023-02-24 19:16:48.52368 | dr | 0 | 00:00:00.692264 | 00:00:00.524915 | 00:00:00.25648
pIDw0 | 2023-02-24 18:01:29.844692 | drw | 0 | 00:00:08.214498 | 00:00:05.606798 | 00:00:03.041891
pIDw0 | 2023-02-24 18:01:29.844692 | drw | 1 | 00:00:10.011731 | 00:00:05.299469 | 00:00:03.147677
MjdMZ | 2023-02-24 18:07:20.308433 | drw | 0 | 00:00:06.893068 | 00:00:04.382553 | 00:00:03.062129
MjdMZ | 2023-02-24 18:07:20.308433 | drw | 1 | 00:00:10.461527 | 00:00:06.719437 | 00:00:04.182195
Cz6Gl | 2023-02-24 18:13:32.535197 | drw | 0 | 00:00:12.364854 | 00:00:05.640658 | 00:00:03.179939
Cz6Gl | 2023-02-24 18:13:32.535197 | drw | 1 | 00:00:11.404383 | 00:00:07.081053 | 00:00:03.948546
dga69 | 2023-02-24 18:18:09.281297 | drw | 0 | 00:00:07.586488 | 00:00:04.774773 | 00:00:03.066495
dga69 | 2023-02-24 18:18:09.281297 | drw | 1 | 00:00:07.84923 | 00:00:05.224053 | 00:00:03.169702
pCsny | 2023-02-24 18:26:21.088642 | drw | 0 | 00:00:09.104722 | 00:00:04.521909 | 00:00:02.987616
pCsny | 2023-02-24 18:26:21.088642 | drw | 1 | 00:00:07.463868 | 00:00:05.39949 | 00:00:02.967225
GnLyz | 2023-02-24 18:39:44.874939 | drw | 0 | 00:00:13.019179 | 00:00:07.253439 | 00:00:03.191199
GnLyz | 2023-02-24 18:39:44.874939 | drw | 1 | 00:00:13.98035 | 00:00:07.360802 | 00:00:03.044719
lvmWk | 2023-02-24 18:50:59.088157 | drw | 0 | 00:00:07.401188 | 00:00:04.765769 | 00:00:03.098318
lvmWk | 2023-02-24 18:50:59.088157 | drw | 1 | 00:00:12.000361 | 00:00:06.5008 | 00:00:03.297659
cC1v9 | 2023-02-24 18:58:49.007792 | drw | 0 | 00:00:06.068048 | 00:00:04.266258 | 00:00:03.046517
cC1v9 | 2023-02-24 18:58:49.007792 | drw | 1 | 00:00:09.708544 | 00:00:06.930116 | 00:00:03.12746
MBVwj | 2023-02-24 19:10:12.009875 | drw | 0 | 00:00:07.388561 | 00:00:04.866444 | 00:00:03.137688
MBVwj | 2023-02-24 19:10:12.009875 | drw | 1 | 00:00:10.531674 | 00:00:07.835683 | 00:00:04.639588
QoJ3w | 2023-02-24 19:16:48.52368 | drw | 0 | 00:00:06.425808 | 00:00:04.532703 | 00:00:03.205344
QoJ3w | 2023-02-24 19:16:48.52368 | drw | 1 | 00:00:07.122254 | 00:00:04.543377 | 00:00:03.122562
pIDw0 | 2023-02-24 18:01:29.844692 | n | 0 | 00:00:05.375796 | 00:00:04.289109 | 00:00:03.95347
MjdMZ | 2023-02-24 18:07:20.308433 | n | 0 | 00:00:04.489905 | 00:00:04.354318 | 00:00:04.236918
Cz6Gl | 2023-02-24 18:13:32.535197 | n | 0 | 00:00:04.255905 | 00:00:04.033638 | 00:00:03.814545
dga69 | 2023-02-24 18:18:09.281297 | n | 0 | 00:00:07.373131 | 00:00:04.93187 | 00:00:04.128055
pCsny | 2023-02-24 18:26:21.088642 | n | 0 | 00:00:04.203148 | 00:00:03.981212 | 00:00:03.702442
GnLyz | 2023-02-24 18:39:44.874939 | n | 0 | 00:00:04.552793 | 00:00:04.399579 | 00:00:04.257074
lvmWk | 2023-02-24 18:50:59.088157 | n | 0 | 00:00:05.207047 | 00:00:04.702628 | 00:00:04.289306
cC1v9 | 2023-02-24 18:58:49.007792 | n | 0 | 00:00:05.844316 | 00:00:04.76548 | 00:00:03.953268
MBVwj | 2023-02-24 19:10:12.009875 | n | 0 | 00:00:05.663703 | 00:00:04.941265 | 00:00:04.649054
QoJ3w | 2023-02-24 19:16:48.52368 | n | 0 | 00:00:09.035593 | 00:00:05.742293 | 00:00:04.7567
pIDw0 | 2023-02-24 18:01:29.844692 | p | -1 | 00:00:02.955424 | 00:00:02.926266 | 00:00:02.896877
MjdMZ | 2023-02-24 18:07:20.308433 | p | -1 | 00:00:03.720681 | 00:00:03.093161 | 00:00:02.881999
Cz6Gl | 2023-02-24 18:13:32.535197 | p | -1 | 00:00:03.316626 | 00:00:03.068281 | 00:00:02.881514
dga69 | 2023-02-24 18:18:09.281297 | p | -1 | 00:00:03.104251 | 00:00:02.95971 | 00:00:02.904142
pCsny | 2023-02-24 18:26:21.088642 | p | -1 | 00:00:03.814585 | 00:00:03.195639 | 00:00:02.906804
GnLyz | 2023-02-24 18:39:44.874939 | p | -1 | 00:00:03.114035 | 00:00:02.981399 | 00:00:02.895505
lvmWk | 2023-02-24 18:50:59.088157 | p | -1 | 00:00:03.432149 | 00:00:03.061231 | 00:00:02.919551
cC1v9 | 2023-02-24 18:58:49.007792 | p | -1 | 00:00:03.363392 | 00:00:03.063311 | 00:00:02.870398
MBVwj | 2023-02-24 19:10:12.009875 | p | -1 | 00:00:03.255543 | 00:00:03.008074 | 00:00:02.893565
QoJ3w | 2023-02-24 19:16:48.52368 | p | -1 | 00:00:03.236271 | 00:00:03.049509 | 00:00:02.904692
(50 rows)
This needs a little explanation.
Each row is one benchmark.
Each benchmark has five iterations, and the max/avg/min are for all five.
Ten clusters were benchmarked. All in eu-west-1, benchmarked one after the other, with at most a few tens of seconds pause between a cluster being shut down and the next being brought up (although bring up usually takes a couple of minutes). The order of rows is by benchmark type, then within the benchmark type, by cluster, in the order the clusters were brought up.
(Usually I discard the fastest and slowest, but the point of this test is specifically to look to see how much variation there is.)
Here are the columns;
column | description |
---|---|
cluster_id | the first five letters of the AWS cluster ID |
bring_up_start_utc | well, what do you think? :-) |
bm_name | benchmark name (dr = disk-read, drw = disk-read-write, n = network, p = processor) |
node_id | each cluster is 2xdc2.large (so four nodes), for some benchmarks only one node produces a result, for others half the nodes produce a result (note node -1 is the leader node) |
max/avg/min duration | what you think they are, all times in seconds |
The benchmarks are the usual four - disk-read, disk-read-write, network and processor; the PDF describing the benchmarks is here.
Without going into too much detail, because the numbers speak for themselves, we see a great deal of variation.
Particularly intriguing are the clusters “MjdMZ” and “MBVwj”, for the disk-read benchmark.
Their times are almost identical. I suspect I was allocated the same node both times.
We see also great variation between nodes in the same cluster for the disk-read-write benchmarks, where all nodes in the cluster produce a result.
Note though that in the bi-weekly benchmarks, all other node types demonstrate a lot less variation. I need to run the test here on ra3.xlplus, and I also need to re-run this test but on a single cluster, running the benchmarks ten times, to see how much variation we get from a single cluster.
The normal-to-worst differences;
There seems to be a need for software to check for slow nodes, so you can make clusters until you have decent nodes.
(I implemented this a while ago, but it’s not published yet - I’ll add it to the Combobulator AMI.)
Regarding the Combobulator AMI (browser-based PII-safe Redshift management AMI), it’s ready for release, but I’ve just run into what you have to do to get into the AWS AMI Marketplace, and it’s profoundly intrusive and bureaucratic, and it’s given me pause, and I’m thinking now about how to get to market.