I spent 600 USD on Redshift last month =-)
A paper describing materialized views has been published.
I’m looking to stick to a weekly rate, publishing on the weekend, but this one was a lot of work - I kept on figuring out new things.
As such, it’s published, the content is good, but I’m not happy with
the presentation of SQL in the results section. In future I’m going to
start using pglast
to format SQL. It’s an extra dependency
(the only dependency now is boto3
) but the results have to
be readable.
Redshift 1.0.30840 is out. The System Table Explorer now has a dump from this version.
I’ve finally had a decent idea about how to begin comparing RA3 node disk performance to non-RA3.
An ra3.xlplus
has 932 GB of SSD.
So - make 932 1 GB tables - table_0
,
table_1
, etc, to table_931
.
I know how many rows you get in one raw block of say
int8
, so that’s straightforward to do with
INSERT
and self-join, to make the first table, and the
later tables will just select from the first table.
Now create one more table - this should evict the first table from the cache.
Now benchmark a full scan (sum(column_1)
, or something
like that anyway as sum()
can fail due to overflow) on the
most recent table, and a read on the first table.
We should see a big difference (SSD vs S3).
We can repeat the test on a dc2.large
, where of course
it’ll all be SSD, and then get an idea of the cost of cold read from S3
for RA3. Be interesting to do the same on a ds2.2xlarge
,
but then I’ll need to make 2048 tables - that probably will take
hours.
Actually, this leads to another thought - maybe there is a much simpler way.
When an RA3 cluster is rebooted, is the SSD cache flushed? if it is, I only need to make one table (well, I mean, some extra whatever so I can benchmark cleanly from SSD and S3, but you get the point).
If reboot doesn’t flush, would making a snapshot and restoring the cluster from snapshot clear the cache?
I think I may well try both before taking the original, slow route.
I just had another thought, too… so, imagine the 1 GB table I benchmark is equally distributed across all RA3 slices - okay, I benchmark it. Now imagine it’s skewed, and the entire table is on a single slice/node. Will I find only one slice/node is pulling down data from S3, and so the read time is much longer?
With an SSD, this wouldn’t matter much - it would mean only one slice was accessing the SSD, but then that entire slice will get all the bandwidth from the SSD, so total read time doesn’t change.
With S3, if there is bandwidth per slice, and you only use one slice, then it matters.
One thing though - this is a great start, measuring read performance, but what I really, really want to know about it write performance. S3 lets you perform reads of arbitrary byte ranges, but it only allows you to write an entire file at once. That’s a fundamental difference to SSD/HD. I really want to know what’s happening with that in practise.
I suspect though the writing to the SSD is decoupled from the writing to S3. If I write 932 1 GB tables, and wait, then all of them will be in S3, and writing another table will not be slowed down in any way; the oldest table will simply be immediately evicted, as it’s already in S3.
If I can think of a way to measure write performance, then I’ll have enough (with the read performance) for a white paper.
Had an idea for a method to perform some decent basic initial testing with auto vacuum. It’s worked well. Script is written and tested, run time I think will be about 85 minutes, will kick it off in the morning. Once the results are in, I’ll do the write up and publish.
(Sneak peak - auto-vacuum is a black box, which was known, but a surprise is that one of the inputs looks to be a second black box.)
So a while back I created some code which would analyze tables - for example, checking primary keys, foreign keys and uniques were valid (no duplicates), figuring out the smallest possible data type for columns, checking if a column can be made NOT NULL, count(distinct) of values, min() and max() and so on - and produce a very nice PDF report.
It’s an excellent way to get to know a database, also to validate existing databases, and to see where some types of improvement (data types) can be made.
Problem is, how to make this code available?
Answer is an AMI.
You spin up the AMI in your Redshift cluster VPC, so you have security - the AMI can’t emit data outside of your VPC.
The AMI runs a web-site, which provides a WWW based interface.
The interface guides you through creating a read-only user with the minimum privileges necessary to generate the report - you enter a username/password and the schemas/tables you want analyzed, and it emits the list of SQL commands necessary to create and configure the user - and then you kick off report generation.
Report generation takes as long as the amount of work you give it - so it can be ten seconds or some hours. You can view a list of running reports, and of course you can view the list of generated reports, PDFs, and download them.
I’ve spent today working on an MVP.
Most of it is done, I have to debug now the code which kicks off report generation (which means firing up a cluster and so on) and check reports are produced correctly.
One more day of dev time will have the system ready, then I need to produce an AMI for the first time, and test that out inside a VPC and so on.
In the longer term, there’s actually a number of other cluster-level utilities I’ve written (such as copy and pasting a schema from one cluster to another) and those too can become functions offered by the AMI.
Right now, there are three data types for time; date
,
timestamp
and time
.
date
is four bytes, and delta
compression
works really well - you get one or two bytes of signed integer of
difference, where one unit is one day. That’s most of a year, and most
of one hundred years, respectively.
timestamp
is eight bytes and has microsecond resolution,
so delta
(which I think but I don’t think I’ve proved) has
one unit being one microsecond, which is useless. All you can do is
zstd
.
We could imagine a variant of timestamp
which has a
resolution of seconds. Then the two delta
encodings would
be about two minutes and about half a day, which looks like it could be
useful.
Finally, we have time
, which is nuts - it’s still eight
bytes, because although it’s only HH:MM:SS, it’s still microseconds,
which means delta
is again no use.
So we’d also want a variant of time
which is seconds
resolution only. This would be a four byte data type.
Then in fact what we could do is store a timestamp
as
two columns, one date
and one time
, four bytes
each, both with delta
. You’d get a year or one hundred
years on the date
(for one and two bytes per value), and
two minutes or half a day, on the time
. So you’d probably
go from eight bytes with zstd
(what you have to do now) to
three bytes with two columns, delta
for date
and delta32k
for the time part.
Been doing finally again some more development work on the replacement system tables.
Modified lf.views_v1
to show the type of the view
(normal, materialized, late-binding).
Modified lf.objects_v1
to use the new view code to
distinguish between view types.
As ever when working with the system tables, you find issues.
First, with the improved lf.views_v1
code, I made three
test views (one of each kind). If I run the query manually, and in the
where add a clause indicating I’m only interested in schema ‘public’,
only one row is returned. If I make the query into a view, and query the
view and indicate I want schema public
only, I get all
three.
Second, there was a problem for quite a while with the system tables
where in a query which used the leader node and work nodes you could no
longer simply join to a system table if it had leader-node only type
data types; even though you were not using those columns, the query
planner would still reject the query. I had to modify a lot of code in
the replacement system tables, converting joins to system tables (like
pg_namespace
, which contains an array) into sub-queries,
where I pulled out only the columns I actually wanted.
That problem seems to have gone away.
Will have to return the code to its original form, at some point.
Also, just had cause to examine the official docs for STL_HASH. It documents 23 columns. Then I looked at the doc made by the System Table Explorer. It lists 27 columns, and comparing the two documents also shows 6 of the columns in the official docs no longer exist (although I’ve not yet spun up a cluster to manually check this, to make sure).
I’m pretty sure all the system table docs are like this. The thing is the system tables are a catastrophe. I think no one is using them, because they’re impossible to use, so no one notices the docs are staggeringly inaccurate.
Fortunately the one column I care about right now is still there :-)
Very happy - finally found a way, and a fairly good way, to get right justified text :-)
Given a string column, use a window function to find the maximum
length in the column, and then use greater()
to chose the
larger of that maximum length, or the length of the column title - call
that resultant length field_width
.
Then use
lpad( string, greater(max(len(string)) over (), length('column_title')) ) as right_justified_string
.
Ta-dah!
I think if you compute the length in a subquery and pass that column
name to max()
, the query planner might be able to do
something efficient to figure out the max value, but I’ve not
investigated to see if this is so.
A fascinating question turned up today on StackOverflow.
A Redshift user found that ILIKE
was not working
correctly for accented characters, but if he issued
UPPER(string) ILIKE 'pattern'
, it worked (the user in this
case passing in an upper case version of the pattern).
What’s going on here is that Redshift is not actually Unicode aware.
You can store any bit-pattern you like in a varchar
.
However, some functions are Unicode aware, like
upper()
and lower()
- they have to be, you
can’t change case on a multi-byte Unicode character without knowing
about Unicode.
The core code base, though, and that includes the operators,
such as ILIKE
(rather than functions, which are written
separately) do not understand Unicode.
So it turns out LIKE
and ILIKE
, being
operators, are not Unicode aware, and perform byte-by-byte
comparison.
The upper()
and lower()
functions are
aware, though.
So what you have to do is help LIKE/ILIKE
out. You use
upper()
(or lower()
, either will be fine) on
both the string and pattern, so they have the same case in Unicode, and
then LIKE/ILIKE
will work.
Home 3D Друк Blog Bring-Up Times Cross-Region Benchmarks Email Forums Mailing Lists Redshift Price Tracker Redshift Version Tracker Replacement System Tables Reserved Instances Marketplace System Table Tracker The Known Universe White Papers