Redshift Observatory

Reddit

My Reddit account, and, since I am its founder, the sub r/AmazonRedshift, appear on 2023-09-30 to have been banned by an automated system.

The sub appeared to be working normally; I posted about the Amazon Redshift Serverless PDF, and then Reddit began behaving oddly.

After some investigation, and some guesswork, I concluded my account had been silently shadow-banned, and the sub banned (and then shortly after, deleted).

(Shadow-banning means that when you log in as yourself, you see all your posts, and you see them in the threads where they were made. If you view Reddit when logged out, you see that all your posts have been deleted.)

Two years of posts and the sub disappeared, instantly, abruptly, without warning, reason, appeal process or notification, and Reddit is trying to lead me into thinking my account is still active. Make of that what you will.

About a year ago, a post was pinned to all subs, as if it were posted by the moderator, exhorting readers to use the sub lounge (a new real-time chat mechanism). This post was deliberately hidden from the moderators - if the moderator viewed their sub while logged in, the post was specifically hidden from them. I discovered this by accident, when by chance I happened to browse the sub when not logged in. Then I found this post at the very bottom of the (very long) list of all posts, completely out of chronological order. I could then delete it.

Then more recently the API pricing debacle occurred, when all major subs and untold numbers of minor subs went private to protest the actions of Reddit management, who adjusted pricing in such a way as to kill off all third party software for reading Reddit; it wasn't just that this was done, but how it was done, and the conduct of management in their response to the protests.

As such, I've been working towards moving away from Reddit, which is why I've been developing the forum on this site. That day has now come, unfortunately without me being able to inform readers where to go instead - however, they all know the site, so I think they will find their way here, and so also find out what happened.

I have now used a bulk-delete utility to scramble all my posts on Reddit. You cannot in fact delete posts on Reddit - there is a button which says "delete", but that's not what it does (it merely removes your name from the post) - so what's done is that every post is edited and the content replaced by garbage. I have also deleted my account.

Reddit Post Archive

A well-wisher, who wishes to remain anonymous, emailed me an archive of my Reddit posts, which I've posted here so they can be searched.

2021-04-26 17:32:47

I'm not familiar with Big Query. I am extremely familiar with Redshift. I have read the Snowflake white paper. I can offer an opinion about Redshift, and I have a view about Snowflake.

Redshift is a sorted column-store data warehouse. When and only when operated correctly, it offers, and only offers, timely SQL on Big Data. It does not offer lots of users, lots of queries, rapid updates or inserts, etc. Critically, both the developers who create systems on Redshift and the end-users of those systems need to understand how Redshift works, because Redshift is only able to offer timely SQL on Big Data when operated correctly.

Snowflake is also a sorted column-store database, and so is the same technology as Redshift. Snowflake however is fully automated; the only adjustment you can make is how much you spend. This automation is Snowflake's strength and its weakness. With Redshift, you absolutely have to know what you're doing and if you do not, you will have a terrible time. With Snowflake, if you don't know what you're doing, the automation rescues you; you will still be okay - but this is only true up to a certain extent. The issue is that automation isn't really great. It's a ton better than developers and users who don't know what they're doing, but it's much less effective than developers and users who *do* know what they're doing. So Snowflake rescues you at the low end of knowledge, and holds you back at the high end.

A serious problem with Redshift though is finding out the knowledge needed to use it correctly. The official docs are no use at all for this.

Amazon are currently investing a *lot* in Redshift and doing a lot of work on automation. I'm concerned the features they add may needlessly obstruct developers and users who do know what they're doing. I'm also concerned Redshift is basically always going to be playing catch-up with Snowflake in terms of automation, hindered as it is by its heritage.

2021-04-26 17:44:45

1. If your TSV files are gzipped, that cannot be your actual SQL, because you are missing the GZIP argument to COPY (a minimal sketch of such a COPY follows after this list).

2. What compression level is gzip using when compressing the TSV files?

3. Why is MAXERROR set to 1?

4. How many nodes, and of what type, do you have?

5. What other load is present on the cluster while you load data?

6. How many source files are you loading?

7. How long does the load actually take? You say it takes longer than you'd like, but not how long it actually takes: it might be that what you are experiencing is actually about right.

8. How many tables are you loading?

9. How many columns are in those tables?
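
Regarding point 1, a minimal sketch of such a COPY, with hypothetical table, bucket and IAM role names (and reflecting the MAXERROR and COMPUPDATE advice given elsewhere in this archive):

```sql
-- Minimal sketch of a COPY for gzipped TSV files; all names are hypothetical.
COPY staging.my_table
FROM 's3://my-bucket/tsv/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '\t'   -- tab-separated source files
GZIP             -- the source files are gzipped
MAXERROR 0       -- any broken row causes the COPY to fail and back out
COMPUPDATE OFF;  -- do not let Redshift silently change column encodings
```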

2021-04-26 17:52:02

I'm hearing from cluster admins that the actual conversion ratio of dc2.8xlarge to ra3.16xlarge is about 5:3, not 2:1. The problem is that the ra3 nodes are relatively short on memory.

2021-04-26 17:54:40

>You'll speed up the copy by going into bare tables and then adding the bells and whistles after.

I've not benchmarked this, so I can't authoritatively say, but I very much doubt this. The loading bottleneck is almost always disk I/O. Compression when loading costs processor time but reduces disk I/O, and so I would expect it to significantly reduce loading times. You do need to make correct encoding choices, though. The wrong choices will increase the size of a column, and so increase the amount of disk I/O.

2021-04-26 18:00:43

LZO is faster to encode than ZSTD, but slower to decode. ZSTD is slower to encode than LZO, but faster to decode (which is usually what you want). ZSTD has significantly better compression. I regard LZO as obsolete; ZSTD should always be used instead.
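
For illustration, a minimal sketch of choosing ZSTD explicitly in the DDL (the table and column names are hypothetical):

```sql
-- Encodings are chosen per column in the table DDL.
CREATE TABLE my_schema.events
(
    event_id   BIGINT       ENCODE ZSTD,
    event_type VARCHAR(32)  ENCODE ZSTD,
    event_ts   TIMESTAMP    ENCODE ZSTD
);
```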

AZ64 in my testing so far behaves like a runlength encoder. I've been advised by a dev it is actually a delta based encoder. It is not, repeat NOT, comparable with LZO or ZSTD - they are different types of encoder. Until I finish investigating AZ64 and figure out what it's doing, I categorically advise that it is not used, because no one knows what kind of data it should be used with. I'm afraid you may well be getting absolutely no compression from it at all.

2021-04-27 13:10:00

I'm not sure what you have in mind when you write about index generation and partition management. Postgres et al have indexes, and they are indeed major factors in load performance, but Redshift does not have indexes, or anything I can think of which it calls partitions (which I think is something in Postgres).

Distribution keys don't really have a continuum of good to bad - the normal situation is that they distribute rows evenly, or nearly evenly, over all slices. If a key does not do this, it is catastrophically bad and an outright gross failure. This is an extreme and unusual situation. We can reasonably, I think, assume rows are being distributed normally over all slices; that's not a special case, or something where there is a continuum of good-to-bad. Either you are distributing equally, or you're not, and not means all bets are off.

Regarding disk, when we come to perform work, there is then a certain amount of processor time, network and disk required. For a given piece of work, the resources required are what they are. Increasing the provisioning of one or more of those types of resource will reduce the wall-clock time needed for the work to complete, but it does not change the amount of each resource which must be consumed for the work to complete. I was arguing that disk is normally the largest part of that work.

In fact, writing this now, I have some thought to disagree with myself :-) I am thinking it might actually be network. The disk now can perform gigabytes per second, which I suspect is going to be on the same order as the network bandwidth to a node.

2021-04-27 13:21:07

1. Did you omit anything else?

2. I'm very surprised that works. I would have expected a decode error.

3. No. You want MAXERROR 0. MAXERROR [n] allows for [n] broken rows without causing COPY to fail and back out. It's a really, really bad idea to have a value other than 0 (I realize it was your intent to have zero - I am speaking in general). On the other hand it often allows me to know immediately if an ETL system is any good. In poor systems, it's usually set to the largest possible value (which used to be 10k, but is now, sigh, 100k).

4. If the number and type of nodes are unknown, then it's not really possible to advise you, because what you're experiencing could be absolutely bang on the money, or it could be bizarrely slow, but there's no way to know. You can't reasonably ask people to spend time helping when in fact you might have no need at all to be helped.

5. Okay.

6. Okay. I suspect you wouldn't do any harm by making the gzipped files larger. A gigabyte would be fine.

7. I can't know if this is reasonable or not, because the cluster spec is not known.

8 and 9. What's the row width, roughly, in bytes? Do you have any long strings?

I think I recall that this is a staging table, and so presumably will be truncated between loads? Out of curiosity, what are you doing about VACUUM in the final destination tables?

Can you provide the DDL for this table? Have you specified the distribution key and sort order? If so, what are they? Also, have you selected an encoding for every column?

Ah, something occurred to me. In your COPY, specify COMPUPDATE OFF. It's better to be safe than sorry. It's functionality you never want, but unless you actually specify OFF, it can occur unexpectedly. The docs as ever are opaque, but my understanding from experience is that if your table is empty and you specified no encodings when you made the table (or if you did specify encodings but set them all to `raw` - which is surprising, because you *have* explicitly specified encodings), then Redshift will - and silently - set the encoding choices for each column. This is bad, because Redshift makes very poor encoding choices. I lost an evening of work from this, once, because I thought a column was raw when Redshift had changed it to LZO. After that, I always explicitly turned this functionality off.
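
As a sketch, one way to check what encodings a table has actually ended up with (note `pg_table_def` only shows tables in schemas on the current `search_path`; the table name is hypothetical):

```sql
-- Verify the encodings actually applied to each column.
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'my_table';
```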

2021-04-27 20:42:51

Be cautious with advice from Support. Don't fall into the trap of unconsciously assuming their advice, since it comes from Support, must be correct and appropriate. I've worked with Support a lot, over a number of years, and I don't think they particularly understand Redshift or the advice they give out, and that advice is, in my experience, rather superficial and rote-like. For any given broad subject, say COPY or queries being slow or whatever, there's a fixed set of advice given to end-users, regardless of the actual situation, and I often have had considerable disagreement with that advice. To be fair, they're in a position where there's a revolving door of end-users, so they can't spend much time with any particular end-user, and I also understand AWS internally is profoundly metric driven and I think this acts to discourage Support from spending time with end-users (as their performance metrics will be reduced).

Oh - and watch out for their boilerplate replies which are there to shut you up. If you start asking questions which they cannot or will not answer, you start to get misleading boilerplate replies, things like "We can't comment on your message, but something like that must be happening". There is no truth whatsoever in these formulations. Don't be - as I initially was, many years ago - misled in the development of your understanding of what is or is not going on.

2021-04-28 06:59:48

I just checked out catting gzip, and indeed this is legal. Gzip has a header of course, and then a body, but indeed a single gzip file is permitted to have multiple header-body units.

2021-05-03 11:01:33

Redshift is a cluster, which is to say, composed of multiple nodes. There is a single leader node and one or more worker nodes. The leader node was originally (a long, long time ago, in a galaxy far, far away) Postgres 8. It to this day retains everything Postgres 8 has - unsorted row-store tables, indexes, data types like arrays, arbitrary length strings, a massive range of functions, and so on.

The worker nodes are independent of Postgres 8, do not come from that code base, and have a minimalistic set of functionality - a few basic data types, no arrays, a minimal set of functions, and so on.

If you issue a query which uses *any* leader-node specific functionality, that query can run on, and only on, the leader-node. If that query also uses any functionality which recruits the worker nodes, the query planner will block the query, with the error that you have seen.

The "system tables" (which are mainly views), including the system_information tables, are used by Redshift to keep track of information about itself, and are mainly stored on the leader-node and often use leader-node specification functionality. When they do, they must as described above run on and only on the leader-node. If you issue a query which takes the output from such a system table and try to write it to a normal table of your own making, the act of writing to a normal table requires the worker nodes to be used and the query planner blocks the query.

I never use the information_schema views, so I don't know them, but I do know the normal system tables extremely (painfully) well. They are a catastrophic mess. If I remember correctly, the problem with the relevant table/column information tables is the ACL column, because it is an array. You can use `pg_class` and `pg_attribute` to get column information by specifically using only columns with data types which are supported on worker nodes.
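
A minimal sketch of that approach, using only catalogue columns whose data types are worker-node friendly (the output columns chosen here are illustrative):

```sql
-- Column information from pg_class/pg_attribute, avoiding leader-node only
-- columns (such as the ACL array) and leader-node only functions.
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       a.attname AS column_name,
       t.typname AS type_name,
       a.attnum  AS column_ordinal
FROM pg_namespace n
JOIN pg_class     c ON c.relnamespace = n.oid
JOIN pg_attribute a ON a.attrelid     = c.oid
JOIN pg_type      t ON t.oid          = a.atttypid
WHERE a.attnum > 0
  AND c.relkind = 'r'
ORDER BY schema_name, table_name, column_ordinal;
```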

Note however some vital information is present in columns which are extensions to the Postgres tables (`pg_`), where these extensions are not documented in the Redshift official docs, and of course cannot be documented in the Postgres 8 docs (which the official docs point you at). You can only discover this by programmatically listing all the system tables and seeing what columns they actually have (or by examining the SQL which forms the system table views, which is available in `pg_views`).

A word of warning about the official docs: they're absolutely no good. They are superficial, often misleading, sometimes flatly factually wrong, I think not reviewed by anyone technical, and obfuscate everything which is not a strength. Do not trust them. You are on your own, and you need to figure most things out either from first principles or from experimentation.

2021-05-09 08:29:49

Just offhand - you're using `MANIFEST`, which I think means the `FROM` parameter needs to be the S3 pathname of the manifest file. You seem, though, in your command to be giving your *bucket* to `FROM`.

As an aside, I would advise you with `COPY` always to set `COMPUPDATE` to `OFF`, to ensure Redshift does not change column encoding types, as it makes very poor choices.

Also, I advise always specifying column names to the `COPY` command, to forestall any transposition errors.
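
Putting those three points together, a minimal sketch (all names are hypothetical):

```sql
-- FROM names the manifest file itself, not the bucket; column names are
-- given explicitly; COMPUPDATE is forced off.
COPY my_schema.my_table (col_a, col_b, col_c)
FROM 's3://my-bucket/loads/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
MANIFEST
COMPUPDATE OFF;
```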

2021-05-09 16:01:23

`NULL AS \000` will not help you, because you are not using binary zero for NULL (and the text in the documentation about `NULL AS` is quite new, and I have to say it's terribly written).

Right now I'm guessing you intend empty CSV fields to be NULL, but for strings I think maybe they will be empty strings, rather than NULL. I am not completely certain, and in any event I don't know what the data type is for those fields, because the table DDL is unknown.

In any event, can you paste the DDL for the destination table? I think it will be necessary.

Offhand, I'm actually not sure what will happen if you try to give an empty CSV field to a non-string column. I guess you'd get NULL, but I wonder if you might get an error.
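
For what it's worth, a hedged sketch of the kind of COPY options which exist for this situation - `EMPTYASNULL` and `BLANKSASNULL` are standard COPY parameters, though they were not part of this exchange, and whether they are appropriate depends on the DDL (all names hypothetical):

```sql
-- Load CSV, treating empty and all-blank CHAR/VARCHAR fields as NULL.
COPY my_schema.my_table
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
EMPTYASNULL
BLANKSASNULL;
```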

2021-05-10 11:17:02

Jesus.

Silent data type changes are completely insane. Redshift does that at times, too, unfortunately (silently changing encoding types on tables).

2021-05-12 12:06:50

What's a "DR"?

So, if I understand you correctly, you want to programmatically duplicate a Redshift cluster? have a way in which to move it from one AZ to another?

I doubt very much you can do this with lambda. If you have a large table, individual steps you need to take (such as exporting the data) will take more than 15 minutes.

In principle you could do this task programmatically, by extracting from the source Redshift all the necessary information on views, tables, functions, procedures, users, privileges, all the data and so on (and in the correct order - you might have views which depend on views which depend on tables, so you need to know the dependency order to rebuild correctly), and using that information to rebuild the cluster elsewhere. That approach is complex enough to be impossible for anyone who is not an absolute expert who has already invested a great deal of time - a year or more - in solving most of those issues, and so can just roll some fairly superficial code for you.

The other way :-), I think, is to create a snapshot - I recall multi-zone snapshot support was added a year or two ago? Snapshots are stored as diffs from the most recent snapshot, so if you have an up-to-date snapshot and make a new snapshot, it's a quick process; something which could happen in a lambda function. The time taken for new cluster bring-up however depends on two factors, the size of the snapshot (larger means slower) and the size of the cluster (larger means faster). In any event, bring-up will almost certainly take more than 15 minutes. I recall the minimum time (zero size snapshot, small cluster) for snapshot bring-up is about 30 minutes.

On the face of it, if I understand your requirements correctly (and I may well not), you are trying to do something which is impossible, in which case you will need somewhere, somehow, to change your requirements.

2021-05-16 13:05:19

I strongly advise CTAS is never used, as it leaves Redshift to pick column encodings, and Redshift makes particularly poor and inappropriate choices.
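
A minimal sketch of the alternative (names hypothetical): create the table explicitly, with chosen encodings, then populate it with an INSERT ... SELECT.

```sql
-- Explicit DDL instead of CTAS, so the encodings are under your control.
CREATE TABLE my_schema.sales_summary
(
    region VARCHAR(32) ENCODE ZSTD,
    total  BIGINT      ENCODE ZSTD
);

INSERT INTO my_schema.sales_summary
SELECT region, SUM(amount)
FROM my_schema.sales
GROUP BY region;
```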

2021-05-16 19:07:57

I may be wrong, but you sound a little annoyed. I think there has been a misunderstanding. I was and was only making an observation about CTAS, since it is often used and in my view, should never be used. I was not commenting upon, positively or negatively, the proposed solution as a whole.

As it is, I'm not completely sure an SQL-only solution currently works, not with this particular view. I suspect I may have managed it in the recent past, but for a little while now Redshift, when using leader-node only tables, has been blocking queries due to leader-node only data types *even when the columns which use those data types are not used in the query*. Redshift did not use to do this.

You now have to actually issue a subquery on a system table with leader-node only data types, which picks out only the columns you wish to use, rather than just using the system table. This broke a number of my own views (as did the quite recent wholesale conversion of most system tables from being tables to being views) - either the Redshift dev team accepts breaking user code as a part of Redshift development, or they are unaware of the impact of their changes; both explanations are alarming.

The `information_schema.columns` view does not do this, at least, not last I looked (two or so months ago, I think).

However, that view is only composed of a query you can perfectly well issue yourself on the leader-node system tables, so the OP can in principle work around it.

2021-05-17 05:14:18

I may be wrong, but I think you've missed the critical point: it may not be possible to use an SQL query to copy this table. The query planner will block the query.

(I suspect you may not be aware of the issue - `PG_TABLE_DEF`, last time I looked, used a leader-node only function. It specifically cannot be used to create normal Redshift tables, because of the same issue which may mean that nowadays, no SQL query will be able to make a copy of `information_schema.columns`.)

2021-05-17 13:03:15

Redshift is full of surprises, alas! Glad you're finding out some new stuff here, as I am too - that solution is interesting, but I suspect it's not quite as nice as you'd like.

Normally a `SELECT` on `PG_TABLE_DEF` will be leader node only. What's happening here I think is that the results from querying that are going into a cursor, and those results are being iterated over to write into the destination table. I've not used that as an approach. I did read just recently, the last day or two, someone describe it as an approach, but it hadn't sunk in.

However, I have to say, in terms of efficiency, this approach, where there's one insert per row, is for Redshift death and ruin.

You're looking, for writing a single row, at 4mb of disk I/O per column, per row; I know, because I've investigated it. There's 2mb of reads, and 2mb of writes. 0.5m rows, four columns, so you're looking at 7.6 *terabytes* of disk I/O.

Redshift is very, very, *very* inefficient for small writes, and single row writes are the smallest writes of all.

2021-05-17 15:22:59

BTW, as an aside, `PG_TABLE_DEF` is a stupid view - it's actually based on a couple of Postgres tables (`pg_class` and so on) which only have one or two columns which cannot be used on worker nodes; it just uses one function which is leader-node only (and all of these leader-node/worker-node issues are almost completely undocumented, and this view is probably one of the first views most people use - it's a terrible first experience).

It's straightforward to produce exactly the information `PG_TABLE_DEF` emits, but in a way which can be used with worker nodes.

2021-05-23 16:46:18

I may be wrong, but I'm not sure I can concur that hyper-threading is meaningfully harmful to cache performance. Caches experience rapidly diminishing returns as they become larger. Halving the size of a modern, large cache is going to knock off, I would expect, at most a few percentage points from the cache hit rate. The gain in exchange for this, a second thread running, is on the face of it absolutely worthwhile.

This does however bring us on to the question of additional stress on the memory sub-system. This though is dependent on what you're doing with your threads.

As an aside, I have seen VM setups where the virtual cores were badly mapped to physical cores. Some physical cores had no virtual cores; some had many. On the machine in particular I was looking at, there were eight virtual cores, on top of eight hyper-threads (four cores, two threads each), but in fact only five of the underlying hardware hyper-threads were in use. It was quite shocking, since these were expensive VMs provided in a private cloud.

A problem these days though with hyper-threading is security, a la SPECTRE and MELTDOWN et al. By default, where possible, I disable hyper-threading.

2021-05-23 21:06:47

If you have say 1mb of cache and this gives you a 95% hit rate, and you then go to 2mb of cache and this gives you a 97% hit rate, the difference in performance is small, even though the miss cost is high, because such a small number of misses have been converted into hits.

A cache hit will take something like 10 to 30 cycles (longer for L3 cache - 100 cycles maybe), where an access to memory will take something like 300, except normally then you'll be needing to operate the virtual memory mechanism, so you need a bunch of memory accesses, which will total about 1500 cycles or so.

The problem caches face is not memory performance, which is as fast as you need it to be, but the travel time of the signal, actually covering the distance between the processor and the cache, and the distance which the cache circuitry itself occupies.

2021-05-24 03:46:32

> Hyper-threads aren’t real threads, they’re just a second set of registers that allow the O/S to emulate two threads, and rapidly swap between them.

I may be wrong, but I think that's understating hyper-threading a bit.

A processor has a given set of resources - ALUs, registers, pipelines, in-flight instruction state information, etc.

With hyper-threading, the hardware necessary to store the state of a thread is duplicated, and the processor then schedules and allocates its resources between those two threads, which are genuinely, to the extent permitted by available hardware resources (and so also to the extent permitted by their instruction flow, since that defines the requests for hardware resource), running concurrently.

> The problem is, they both share the same L2 (and L3, in the case of Xeon) caches. So unless the O/S is careful to make both hyper threads correspond to software threads in the same process, one thread will invalidate all the other thread’s cache entries, and vice-versa.

I'd need to go back and revisit cache design - it's been a couple of years. I know what you mean, but the thought occurs that a single hardware thread, running as it does any number of separate executables, is performing just as much self-cache-invalidation as two hardware threads, running the same set of separate executables.

2021-05-25 22:04:33

I may be wrong, but I am of the view Spectrum is intended to pick up where Redshift leaves off; once you get over a petabyte.

Using Spectrum with small amounts of data requires proper understanding of how Spectrum works, to avoid the problems you can rather easily run into (files too small, too many files, catalogue performance, too many queries, queries written incorrectly for Spectrum and so bringing into Redshift most or all of the data in S3, etc, etc).

Redshift itself is knowledge-intensive; Spectrum more so, since you need to understand about Redshift, but then need additional knowledge about Spectrum.

2021-05-26 01:24:29

You can do this in Postgres, and as such, you're getting nothing by going to Redshift, except the very considerable constraints and restrictions involved in its correct operation. Redshift is knowledge-intensive and has a particular and narrow use case; that, when and only when it is operated correctly, you will have timely SQL queries on Big Data. It gives you nothing else; not small data, not inserts, not updates, not lots of users, not timely small queries, etc, etc. Redshift is appropriate if you have Big Data and get requests to process that data from a WWW based interface, but it is absolutely, totally and utterly inappropriate for a web-site *per se*.

2021-05-27 15:51:34

FWIW, I've been told by support the system tables keep a maximum amount of data in terms of size; they don't keep data based on days. If you have a busy system, you have fewer days covered.

2021-05-29 13:48:42

You put this on SO and a couple of questions have been put to you about your setup, and about some of the things you've said there. I may be wrong, but I feel you should respond to them, rather than not doing so but reposting the question here.

Additionally, although I may be wrong, what's written seems unclear. Here you write you have a database of 100m rows, and the largest sample you want is 10m rows; on SO you write the database - in some way - "is broken out by country" and the largest county (sic - not country) is 10m rows, and you do not specify the maximum number of random records, but indicate that sample would come only from this set of 10m (which rather implies you really have something more like one database per county).

2021-05-03 19:58:10

Just a word about the decision to migrate.

I am a Redshift specialist. I've read the Snowflake white paper. My take on these two data warehouses is that Snowflake is better when you don't know how to use sorted column-store (which has been the case with every client I've worked with, in part because the docs are no good, and in part because if you ask AWS if you can use Redshift for something, they *always* say "yes"), and that Redshift is better when you do know how to use it, and this becomes increasingly true as you move up towards larger and larger data sets; Snowflake at the high end becomes disproportionately expensive. For "better" read cheaper for a given performance.

This is because of the automation in Snowflake. With automation, you get what the automation gives you; there is nothing else - and that's both good and bad. At the low end, it's much better than a human who doesn't know what they're doing. At the high end, it's much worse than a human who does know what they're doing.

AWS is currently investing a great deal into automating Redshift a la Snowflake, but I think they'll always be playing catch-up, held back by fundamental limitations in the core code base. I also think the devs have no idea about making systems *using* Redshift, as opposed to making Redshift themselves, and I worry about the changes they are and will be making.

Regarding your table DDL, I've no clue what Snowflake offers, but Redshift extremely recently introduced a command to obtain table DDL, which may be of use. (Before that you had to know how to get all that from the system tables - I spent I can't think how long getting such code working, and now it's obsolete :-) ) View SQL has always been available from `pg_views`. Procedure and function SQL likewise are in `pg_proc`.

2021-05-03 22:14:24

<https://forums.aws.amazon.com/ann.jspa?annID=8641>

> Amazon Redshift: Redshift now supports utility to display the exact User Specified DDL using SHOW TABLE and SHOW VIEW commands.

Note the docs were last updated late December last year, so the new commands are not yet documented.

Regarding getting data out, yes, UNLOAD is the way, and the only way, to go. It's scalable and emits to S3, which Snowflake will read from. The main issue is figuring out what format to emit (CSV, parquet, etc). Parquet can be fiddly in ways CSV is not, in that it has both physical and logical specifications for data types; so a timestamp, for example, is a logical specification, but it has a number of different actual physical implementations. You need to ensure the source and destination can both understand all the physical types in use.
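
A sketch of both steps, with hypothetical names: `SHOW TABLE` for the DDL, `UNLOAD` to emit the data to S3 (Parquet is shown, but CSV is also supported).

```sql
-- Obtain the user-specified DDL for a table.
SHOW TABLE my_schema.my_table;

-- Emit the table's data to S3.
UNLOAD ('SELECT * FROM my_schema.my_table')
TO 's3://my-bucket/unload/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
```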

2021-05-03 23:36:08

I have read the Snowflake white paper, and so I have an opinion with regard to comparing Redshift and Snowflake in one particular matter, as described above.

Beyond that, where I know nothing more about Snowflake, I have written nothing more. That's different to underestimating.

2021-05-04 17:32:56

Although I may be wrong, in my view, there are two key challenges to using Redshift.

The first challenge is to find someone who actually and genuinely knows enough about Redshift to use it correctly. Finding such a person is problematic, because it is problematic to come to know enough about Redshift to use it correctly. This is because neither the official docs nor the support team are of any use whatsoever for this.

Actually finding out for yourself is a full-time, multi-year effort. If you're a big team, you will I suppose find it easier to access sufficient budget to hire one of the very few qualified people.

The second challenge, once you have access to this knowledge, is the massive adjustment that has to occur in the expectations of how Redshift can be used. Redshift is a highly specialized, knowledge intensive, narrow use case data warehouse. It is *not* a kind of "big database" - it is *entirely different*. Where people normally have no idea about Redshift, and where AWS when you ask them will always tell you Redshift can do whatever it is you want to do, there has in my experience always been the widest gap between expectations and reality. There's no use hiring someone who knows what you can use Redshift for if you then ignore their advice because you have a firmly or politically entrenched set of expectations.

For a small team, unless you have a developer who is one of the very few people who know about Redshift, I would have to advise Snowflake. If your usage begins to become heavy, and the relatively lower scalability of Snowflake is beginning more and more to translate into excessive cost, then you may come to the point where it is worth it, financially, to find someone who does know about Redshift, and endure the cost and upheaval of platform change.

The underlying problem which is the cause of all this is the absolute lack of accurate, honest and comprehensive documentation from AWS about Redshift.

2021-05-06 16:33:51

Your wish is my command :-)

I'm currently constructing a web-site for an on-going series of white papers about Redshift internals, plus the usual suspects - blog, mailing lists, forum.

I've written the software for the blog, mailing lists and forum (I wasn't happy with anything that was out there now), and then I want to finish off an initial set of white papers, and then I'll go public. I'll be posting the white papers in r/aws, too.

2021-05-17 08:37:21

I can only speak with authority regarding Redshift.

Redshift has a very narrow use case; it, when and only when correctly operated, offers timely SQL on Big Data. Nothing else. Not lots of users, not fast/small updates, nothing.

To operate Redshift correctly is knowledge intensive. You cannot just stroll up to it and use it correctly.

Correct use, although covering quite a range of issues, essentially centers around selecting sorting orders for tables, which will define the queries which can be issued while being performant, such that the queries which can be issued while being performant are those necessary to meet your business requirements.

Snowflake is, you might say, a fully automated version of Redshift. The automation is much better than a Redshift admin/dev who does not know what they are doing, but much worse than a Redshift admin/dev who does know what they are doing. As such, Snowflake is much more efficient than Redshift in the former case, and much less efficient in the latter, where efficiency is how much it costs you to obtain a given level of performance.

2021-05-17 16:43:10

Snowflake is, as I understand it, fully automated. Note I have not used Snowflake. I have read the white paper, and I've discussed the matter with people who are using Snowflake.

As such you get whatever the automation gives you; and the performance efficiency automation provides is much better than you obtain on Redshift with an admin/dev who does not know what they are doing, but much worse than you get on Redshift with an admin/dev who does know what they are doing.

2021-05-21 09:53:18

Redshift is a very narrow use case, highly knowledge-intensive data warehouse. I would, in the strongest possible terms, advise you not to use it; firstly, it is absolutely the wrong technology choice for your use case. Secondly, you will need to learn too much to use Redshift correctly - it will take years. I would suggest you use Postgres.

2021-05-27 15:53:19

There's a lot to say here.

First, be aware that ingesting data is not entirely separate from using data. Data is usually loaded unsorted, but using data requires the data to be sorted, and only one `VACUUM` (which performs the sorting) can run at a time *on a cluster* (not on a database, or table - a cluster).

You need to have a cluster-wide vacuum policy, so that people who need to use vacuum are able to do so, when they need to do so; and your particular ingestion process needs to play ball with the system wide vacuum policy.

That means you need to be aware of your own vacuum needs.
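
By way of illustration only - a sketch of the kind of statement such a policy schedules, per table (the table name is hypothetical):

```sql
-- Sorts the table and reclaims deleted space; only one VACUUM runs at a
-- time on the cluster, so this has to fit the cluster-wide policy.
VACUUM FULL my_schema.my_fact_table TO 99 PERCENT;
```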

Second, just briefly, the second suggestion rather feels like you would do lots of small inserts. This is death and ruin to Redshift. Do not do this; it will not end well. Redshift has huge overheads in getting queries going; but, if you are operating Redshift correctly, once they get going they go like the blazes. You do not use a 10 ton lorry to move a letter; you do not use Redshift to perform small inserts.

Now, regarding the details of your initial approach. When it comes to classic archive formats, such as bzip2 or gzip, each slice in the cluster concurrently loads one file. Having just one file means just one slice is loading your data; you have lost all your parallelism. Do not do this. You should look to have a number of files which is a positive integer multiple of the number of slices in the cluster.
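
As a sketch, the slice count can be read from the system tables, and the file count then sized as a multiple of it:

```sql
-- Number of slices in the cluster; aim for a file count which is a
-- positive integer multiple of this.
SELECT COUNT(*) AS slice_count
FROM stv_slices;
```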

You mention you're thinking of unzipping the files before you upload them to S3. Why? Redshift can load gzip just fine, and keeping the data files compressed in S3 is absolutely the right thing to do. (Although there is a bug in the compression handling code - if you have a file which is tiny when compressed, but decompresses to a very large number of records, you get an out of memory error. It doesn't sound like this is your situation.)

Finally, back to the original point: loading the data is only the beginning; it has to make sense in the larger context of using Redshift correctly. The key issue with Redshift is that tables are sorted, you must pick their sorting orders, and having done so, the sorting orders then define what queries can be run in a timely manner - and you need to pick your sorting orders so that the business queries you need to run are among the queries which can run in a timely manner.

2021-05-27 17:52:27

When using a Big Data system, the ETL sub-system must also support Big Data, because the day will come when you need to reload your system from scratch; if that takes a week, because the ETL does not scale, you have a problem. Moreover, with Redshift, resizing clusters is problematic. The real solution is to have an ETL system which supports multiple concurrent clusters; you bring up a new cluster, of the new size that you wish for, get it running, move users over to it, and then delete the original cluster. To do this obviously you need to be able to bring up a new cluster in a timely manner. A step where you have to ungzip large single files and produce multiple gzipped files from them is only really viable if you can make it scale (and you could, I think, perhaps with say EMR), because although you could do that work on an on-going basis and cache the results, the day may come when you find there was a bug - you missed, say, the final line in every file - and you need to remake them all; and so once again, if a part of your ETL does not scale, you need to wait for a week.

2021-06-04 15:57:40

I am a Redshift specialist.

Data aggregation is not in any particular way connected to data warehouses. To the extent that there is a connection between the two, it is that data warehouses allow you to do away with aggregation, because they support very large volumes of data. Aggregation is something you do when your database is unable to cope with lots of rows.

2021-06-04 16:02:43

> I realised that OLTP dbs are not meant to be used for running frequent queries. And that we have data warehouse for that (Please correct me if I'm wrong).

This is wrong, or, at least, what's written is wrong. I suspect you may in fact mean something a bit different to that which you have written.

Redshift gets you one thing and one thing only; that when and only when correctly operated, you will have timely SQL on Big Data. You get nothing else. No lots of queries, no lots of users, no rapid/small updates/inserts, nothing. You do not get *frequent* queries.

Redshift is knowledge-intensive and has a pretty narrow use case. It comes with a considerable number of restrictions and constraints which must be honoured to operate it correctly. When not operated correctly, it is far worse than a normal database.

I advise people to use Redshift only when they have so much data, and they want SQL, that they cannot use a conventional database. We are talking here at *least* a terabyte.

You, at 120 GB, are *WELL* within the performance envelope of a normal database.

I may be wrong, but I think the problem here is possibly that your database is not set up correctly, and/or the table design is poor, and/or the queries being issued are not well formed; certainly and absolutely, with so little data, you can have timely queries.

Redshift is knowledge-intensive and has a lot of constraints : I would absolutely advise you, in the strongest possible terms, not to use Redshift for your use case. You do not need it.

I would advise you to investigate the setup of your existing system, rather than to replace that system.

2021-06-04 17:18:47

> But someone else suggested that I just use a PostgreSQL instance instead of going for a data warehouse. I kinda think that it's makes sense.

> Does that sound like a reasonable thing to do?

Yes. I could be wrong, but I would absolutely recommend that course. Get a normal database, which is in your control, get the data into there and then you can arrange it and query it efficiently.

BTW, for 120 GB you do not need a powerful instance. This is not a lot of data. Something normal will be fine.

2021-06-08 11:31:39

> Can I reorder columns in the table without modifying the parquet files, as long as I don't change the column names?

I would need to double check - it's been a year or so since I was investigating parquet file behaviour with Spectrum - but I think I recall the answer is yes. Redshift maps the names in your table to the names in the parquet file.

> If I want to change column names, do I need to update the schema in each relevant parquet file?

Again, I would need to double check, but I think so. If you have a column name in Redshift which is not present in the parquet files, Redshift will treat this column as if it exists, but with all values as NULL.

2021-06-10 12:14:01

I may be wrong, I'd need to check, but I think that setting is for ORC only.

2021-06-16 21:20:15

So, let me get this straight.

You have a table. You are loading data - using `COPY`? and this is taking a long time - so long that you have given up hope and want to abort? and to do this, you are dropping the table?

2021-06-13 16:07:27

I can speak with authority regarding Redshift.

Redshift is knowledge-intensive, and has a rather narrow, particular use case. I would *strongly* advise you not to use it; you do not know enough.

I've not used Snowflake, but I've read the white paper. I think it has the same narrow, particular use case as Redshift, but where it is fully automated, it performs much better than Redshift when operated by users who don't know what they're doing. (Conversely, Redshift is much better for users who do know what they're doing, and this difference becomes acute and expensive as data volumes become large).

Unless however you have at *least* a terabyte of data, then - for the love of God - stick to a conventional unsorted, row-store database, such as Postgres. If you have a terabyte, and less than ten, I would actually advise you to turn to an unsorted column-store database, such as (what used to be called) memSQL (I forget its new name).

Actual sorted column-store is something you only need if there is *no other way*, as all the Big Data solutions, by the choices they must make to support Big Data, come with a wide range of constraints and restrictions upon their use - but you are typically free to wholly ignore those constraints and restrictions, and then performance is poor and erratic and the cluster becomes expensive, since it is inefficient and so needs a great deal of hardware to improve performance.

2021-06-14 06:25:45

When I wrote "no other way", what I mean to say is, if you want SQL, and of course you will need it to be timely, and you have Big Data - then it's sorted column-store time. There is no other way to have timely SQL on Big Data.

A fundamental issue which clients in my experience absolutely are unaware of (in part because AWS obfuscate everything which is not a strength, and also tell clients Redshift can do anything they ask for) is that with sorted column-store you cannot use essentially arbitrary data with essentially arbitrary queries (as you can in Postgres et al); and so it can be your use case is in effect *not supported*, and you cannot in fact use sorted column-store. Clients imagine Redshift to be a "big database", and so to have the same operating characteristics as a normal database. It does not, not even remotely.

Additionally, the user base has to understand how to correctly operate sorted column-store, which is to say, how to write queries which adhere to those supported by the sorting order chosen for the tables used by those queries. Redshift cannot be used by users who know only SQL. Users must also know how to correctly write queries for sorted tables. This requirement is also in my experience completely unknown.

2021-06-14 09:59:41

Redshift elides a wide range of functionality found in Postgres - even in Postgres 8, which is its origin - and indeed part of what is no longer available is the ability to define types.

However, this is absolutely *not* the key issue.

The key issue is that sorting is required to handle Big Data, as sorting allows the use of min-max culling, which in turn is able, when correctly operated, to reduce the number of blocks read by a query to only those required by the query. However, to correctly operate min-max culling, the sorting orders of the tables involved in a query must be appropriate to the work being performed by the query; and so when you select your sorting orders, you are defining both the set of queries you *can* run and, necessarily, the set of queries you *cannot* run.

The challenge then is to define sorting orders such that all of your business queries can run; and this can be problematic, because defining sort orders such that *some* of those queries can run often means you have defined sort orders such that others *cannot* run. Solving this problem is the key to using sorted column-store, and it is an art, not a science. It often revolves around high-level data design, conducted with full knowledge of how that design will ramify down to table sort orders, which often leads to quite unusual data designs to allow the necessary queries to be viable.
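
A small sketch of the idea (the table and queries are hypothetical): the chosen sort key determines which queries min-max culling can help.

```sql
-- Sorted on event_ts; queries restricted on event_ts read only the blocks they need.
CREATE TABLE my_schema.events
(
    event_ts TIMESTAMP,
    user_id  BIGINT,
    payload  VARCHAR(256)
)
SORTKEY (event_ts);

-- Benefits from min-max culling on the sort key.
SELECT COUNT(*)
FROM my_schema.events
WHERE event_ts >= '2021-06-01' AND event_ts < '2021-06-02';

-- A query restricted only on user_id cannot be culled and scans the whole table.
```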

It also requires the user base to understand sorting, so they do not just go ahead and run queries which are completely hostile to the chosen sorting order, since this will hammer the cluster and lead to poor and erratic performance for all users.

2021-06-22 06:20:50

> your size is small you can do it with whatever db you want

Redshift is *extremely* inefficient with small data. Redshift will *work* of course, SQL is SQL, but it would be a poor choice; small data militates *against* certain choices, rather than indicating anything can be used.

2021-06-27 13:27:35

Computer hardware exhibits certain properties, for example with disk, seek is slow but contiguous read is fast.

Given these properties, when you come to design a database, of any kind, from scratch, you are faced with a series of choices, where you can have one property for your database, or another, but not both - because of the properties of computer hardware. There's no getting around it.

Having made your choices, you end up with a database. Of all the possible combinations of choices, you end up with about five or so orthogonal and optimal sets of choices - you could make choices other than these, but you end up with a database which is inferior in its capabilities to one of the five or so optimal sets of choices.

Those five or so optimal sets of choices basically give us;

1. map-reduce

2. key-value (graph is superimposed on this)

3. relational (three different flavours - unsorted row-store, unsorted column-store, sorted column-store) (technically, sorted row-store as well, but to my knowledge this has never been implemented, because sorting is needed when you have Big Data, and when you have Big Data, you want columns).

If we imagine a database which supports all of these types of database, they would still be pretty separate when it comes to querying. For example, you can't offer full SQL on map-reduce, because you can't do joins. You can't offer small/rapid inserts on sorted column-store, because you need to sort - and so on. You can get some cross-over, I think - for example, you could allow equijoins to key-value from relational - but certainly nothing really generalized.

What you really have then here is not *one* database, but five different databases superficially amalgamated, with three more-or-less separate query languages (relational databases all give SQL), where even the SQL offered by the different relational databases has to be used very differently between the sorted and unsorted types.

Additionally, even given all five, there are still workloads which cannot be handled by any of them; for example no database offers SQL on Big Data with large volume small/rapid data updates/inserts, because it's fundamentally impossible with current computer hardware.

I had a brief look around on the SingleStore web-site. I found no end of content telling me how great it was, but almost no content telling me what's actually been implemented, so I have no opinion about it, except to note it seems to support indexes, which you can't use with Big Data. SingleStore seems to support multiple table types, so maybe the sorted column-store types do not support indexes.

(As an aside, their site has a pop-up on what looks like every doc page asking if the page was useful. It's *incredibly* annoying. SingleStore, if you see this, I'm trying to read - I do *not* want mobile page elements hoving into view at random times, and on *every* page.)

2021-07-07 13:17:34

Loading 500 rows at a time is *completely* inappropriate for Redshift. It is a Big Data system. You should be looking to load gigabytes at a time.

The usual loading method would be to store the entire data set to S3 and load using `COPY`.

I can't tell you what's going wrong - there's far too little information for that - but I can advise you to use a completely different loading method in the first place.

2021-07-10 22:03:09

Not Redshift. It is knowledge-intensive, for both developers and users, and has a wide range of constraints and restrictions upon its use, and the minimum data volume is one terabyte. Anything less and you can, and so should, use a conventional unsorted row-store database.

2021-07-18 20:00:47

You are correct. The leader node is the same type as the worker nodes.

2021-07-19 17:58:47

I may be wrong, but I think your queries must be operating Redshift incorrectly. Terabytes of data is *not* so much, and broadly speaking any query which correctly operates Redshift with that amount of data should return in at most some tens of minutes or so.

If Redshift is being operated incorrectly, then it will be overwhelmed: you will have a lot of swapping to disk. Probably what is happening is that the query drags itself along the floor on its elbows for as long as it can manage it, and then finally the cluster gives up and the query dies.

In fact, I've just looked more closely at your error message. That is the wonderfully misleading message given when the cluster kills a query, and this normally happens due to resource exhaustion.

2021-07-19 19:07:41

Does the query run elsewhere, e.g. is it an RStudio problem? I doubt it is, but it's good to check.

Does it work over a limited subset of data?

/u/MaxGanzII's view is quite likely.

2021-07-20 06:56:50

First, I'm afraid that's a billion miles away from containing the information necessary to know if a query is operating Redshift correctly.

Second, what you've written here implies an `INSERT` with a `SELECT`. However, the error message you posted refers to an `UPDATE`. I guess you're doing more than just inserting from A to B. Updates are seriously bad news for Redshift.

I would also guess you're doing some work in the `SELECT` part of the `INSERT`, and it might be it is that work which is the issue.

Basically, you need to ensure that when you have two Big Data tables being joined you're using a merge join, and that if a hash join is in use, at least one of the tables is small (which is to say, fits into memory once hashed). With a hash join, only one of the two tables being joined can be a Big Data table. Obviously, nested loop joins are completely out of the picture.
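
As a sketch, the join strategy the planner has chosen can be checked with EXPLAIN before running the real work (the query and names are hypothetical):

```sql
EXPLAIN
SELECT f.order_id, d.customer_name
FROM my_schema.fact_orders  f
JOIN my_schema.dim_customer d ON d.customer_id = f.customer_id;
-- Look for "XN Merge Join", "XN Hash Join" or "XN Nested Loop" in the plan output.
```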

Additionally, if you're using `UPDATE`, all bets are off. Then it's a different set of issues and concerns.

I would say to that Redshift is a fairly narrow use-case, knowledge-intensive data warehouse. Both the developers and users have to know how to operate Redshift correctly, because it is only capable of handling Big Data in a timely manner when operated correctly.

2021-07-29 07:54:34

I very recently needed to figure out how to start Redshift clusters, via boto3, in a VPC.

It took me *forty man-hours* of work to figure out the basics; I'm still improving the script to make it robust. That's five days. It should have taken five *minutes*.

It was a nightmare.

I've written a white paper about how it is done; this is part of the introduction;

> Getting to grips with this complexity, understanding it, is problematic, as the documentation is, for a number of reasons which will be discussed, not useful or indeed actively harmful, and because the systems through which you work offer no, or very nearly no, feedback, status or error information : in short, a complex, undocumented system, with no or almost no debugging information.

AWS have a rather fragile veneer for the complexity, absence of documentation, and lack of debugging information; the default VPC.

Users can delete it; and they have no idea why it's there, didn't create it themselves, and either are not using it (if they're using Classic) or may well be unaware they're using it (since it is used silently, by default).

What could possibly go wrong?

I came over time to understand there are five different scenarios, each of which you must handle, to be able to launch Redshift into a VPC using boto3;

1. Classic

2. VPC in regions with Classic and with the default VPC

3. VPC in regions with Classic and without the default VPC

4. VPC in regions without Classic and with the default VPC

5. VPC in regions without Classic and without the default VPC

Before VPC, when it was Classic only; none of these problems existed. There was one scenario, Classic, and you simply launched a cluster. Life was simple.

Retiring Classic will reduce the current scenario count down to two (VPC with or without the default VPC). This is an improvement and I am in favour of it, but the actual underlying problems - the complexity of networking configuration, the documentation, the lack of debugging information - are unaffected.

2021-07-30 04:47:02

I don't know about *recently*; I specialize in Redshift, I've spent the last three years working on a book about Redshift internals, and I've only ever contacted AWS Support with regard to Redshift.

About two years ago, I stopped paying for Support, not because it was too expensive (it's not - it's a token fee) but because *it wasn't worth having*.

The support engineers did not understand Redshift, were unable to comprehend bug reports which were more subtle than causing an assert, at times provided flatly incorrect information, and I realised that when I asked questions they did not want to answer, they were *actively misleading me* to make me go away.

In the end I realised I wasn't getting anything from them; they weren't worth having *at all*, regardless of price, even if the price had been zero.

I had built up though something like 130 support cases, and I intended to go through them all to check for any useful information.

When I finally got round to doing this, some months ago, I discovered about half the cases were missing.

I asked Support about this. It was explained to me that Support cases last for one year only, and that this is documented (it is - one line in a huge FAQ, below a question about where I can find AWS documentation in Japanese), and that there was no way I could get my data back, and with the reply the case was *closed*, so there was clearly a desire not to listen to any reply I might make.

You'd have to come away from that thinking Support just don't care.

I note the difference between my experience and the slogan I constantly hear from AWS, that they put the customer first. I think it used to be true, in the earlier days, but now that's mainly AWS saying that to itself.

AWS I think has fallen prey to the classic problems faced by very large companies. They have no idea what their customers are actually experiencing, and even if they did know, no way to respond to it; internal politics is now a more powerful force than customer demand, and we can also see the usual insane large company management trying to make services like customer support fungible, so they can just "add more resource".

This is I think one of the outcomes of following metrics, rather than actually talking to staff and customers. I bet customer support has amazing metrics; it's just all the customers are having a terrible time.

That problem cannot be fixed, because to do so would be to say a significant number of senior managers are unfit for their positions. No one who works under them, who is close enough to the customer to see what's actually happening, is *ever* going to tell their manager's manager's manager that he's unfit for his job, and even if he did tell them, he would be ignored before he was eventually dismissed for rocking the boat ("not fitting in").

2021-07-31 13:38:27

What I mean to say is that you need to understand there *are* five scenarios, so you can then figure out a way to start clusters which always works.

If you don't understand all the possible situations the script can be in, you could end up writing a way to start clusters which doesn't always work.

For example, if you rely on the default VPC, your script is not always going to work, because the default VPC may have been deleted.

A second example: if you didn't know about the default VPC, you could actually be using it without realising that you are, and then you're really in trouble - you've been misled, you now misunderstand your situation, and your script is going to fail when you thought it was all going to be fine.

The way I adopted is to make my own VPC and its associated networking objects, configure them all, use the VPC, and then, when the cluster is done, delete them.
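
The skeleton of that approach, as a sketch only (names and CIDRs are invented, waits and error handling are mostly omitted, and a real script also has to create an internet gateway, a route to it, and a security group rule for port 5439 before the cluster is reachable), is roughly;

```
# Sketch: create a throw-away VPC, launch a cluster into it, tear it all down.
import boto3

ec2 = boto3.client('ec2')
redshift = boto3.client('redshift')

vpc_id = ec2.create_vpc(CidrBlock='10.0.0.0/24')['Vpc']['VpcId']
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.0.0/25')['Subnet']['SubnetId']

redshift.create_cluster_subnet_group(
    ClusterSubnetGroupName='temp-subnet-group',
    Description='temporary subnet group for a short-lived test cluster',
    SubnetIds=[subnet_id])

redshift.create_cluster(
    ClusterIdentifier='temp-cluster',
    NodeType='dc2.large',
    ClusterType='single-node',
    MasterUsername='admin_user',
    MasterUserPassword='ChangeMe1',
    ClusterSubnetGroupName='temp-subnet-group')

redshift.get_waiter('cluster_available').wait(ClusterIdentifier='temp-cluster')

# ... run the tests against the cluster here ...

redshift.delete_cluster(ClusterIdentifier='temp-cluster', SkipFinalClusterSnapshot=True)
redshift.get_waiter('cluster_deleted').wait(ClusterIdentifier='temp-cluster')
redshift.delete_cluster_subnet_group(ClusterSubnetGroupName='temp-subnet-group')
ec2.delete_subnet(SubnetId=subnet_id)
ec2.delete_vpc(VpcId=vpc_id)
```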

Figuring out how to do this was the programming equivalent of tooth extraction sans anaesthetic. No meaningful docs, no meaningful debugging information, complex system with lots of configuration. Every developer reading this knows exactly what that's like.

Regarding the VPC analyzer, I had no idea it existed; it never came up during the days of working and searching for information. Although I've not tried it, I wonder if it would have been of any use; looking at it, I get the feeling it presupposes a sanely arranged and configured VPC and associated objects in the first place - it then tells you what config is wrong; it doesn't tell you which objects you need to create in the first place, and what hooks up to what.

2021-07-02 21:29:33

u/MaxGanzII it's nice to see people bring this level of knowledge to a discussion! I am an engineer at SingleStore, and would like to clarify our architecture a bit and provide some followup resources to learn more.

But first, thank you for the candid feedback. Both issues are noted and I have escalated your comments internally. The team is already working on resolution.

Let's talk about SingleStore. In your comment you state "for example no database offers SQL on Big Data with large volume small/rapid data updates/inserts, because it's fundamentally impossible with current computer hardware". While there are certainly tradeoffs in every approach, I think you may be interested in how we handle this particular workload.

For anyone who is unaware, SingleStore is a scale-out multi-model database suited for both transactional and analytical workloads. We focus on providing the ideal solution for what we call "[data-intensive apps](https://www.singlestore.com/blog/data-intensive-applications-need-modern-data-infrastructure/)". The rest of this post is specifically highlighting features which help us handle Big Data alongside a large volume of small/rapid data updates/inserts - but that's not to say that this is all we can do. Jump to the end of this post if you want to learn more.

**Sharded data**

SingleStore is unlike many relational databases in that it is scale out. It achieves that by splitting up the data into a fixed set of partitions. We allow the schema to specify how table data is mapped to each of those partitions via a specified shard key, or the user can decide to randomly shard the data. Because we use a fixed set of partitions, every node in our distributed architecture can easily determine where a given row lives (assuming it's not randomly sharded) without coordination. In addition, our query optimizer can often accelerate joins and group bys by pushing down the work into each partition (i.e. the join is operating on the shard keys).

This feature provides the foundation to handle "Big Data" as well as allowing engineers to load balance "small/rapid updates/inserts" over many partitions. In effect, it lets us split up "Big Data" into many pieces of small data which naturally makes things more manageable.

**Network first durability**

Since we handle "system of record" OLTP workloads we need to provide data durability. In our architecture we decide to focus on network durability rather than disk durability in order to provide better availability guarantees to our customers. In effect, this means we acknowledge writes to the client once transactions have been written to a pair of partitions such that the pair of partitions are hosted in two different failure domains (traditionally servers on different racks). These partitions acknowledge the write once it's safely in memory, but before it's written to disk.

This particular set of tradeoffs allows us to provide low-latency durable transactions while also ensuring that if a node fails we can immediately serve data from the pair set of partitions for that node.

**Universal Storage**

SingleStore stores data column-oriented by default. This means that we store each column of data packed into vectors (called blobs) and organized into sets of rows (called segments). For a deep dive into how this works, check out the [whitepaper](https://www.singlestore.com/resources/research-paper_columnstore-engine-for-real-time-streaming-analytics/). As pointed out by u/MaxGanzII, column-store engines often tradeoff small/rapid inserts in order to maintain sorted runs of data - but this is not the case with SingleStore. To do this we build on top of the traditional LSM tree architecture of most column-stores with the following features:

*OLTP optimized segment*

In order to handle small/rapid inserts/updates/deletes, we store recent writes in a storage engine we call "rowstore". Rowstore is optimized for OLTP workloads and will be discussed in more detail later on in this thread. Leveraging this technology ensures that inserts, updates, and deletes offer minimum latency while also allowing us to efficiently serialize long sorted segments into the column-store.

*Seekable compression*

Most on-disk column oriented storage engines have excellent compression due to adjacent records in a column vector sharing the same type and often a reduced distribution. SingleStore takes advantage of this property by dynamically determining the best compression to use on a per column vector basis. Unfortunately, a common tradeoff to this approach is needing to decompress most (or all) of the column vector in order to retrieve any individual element. To deal with this, SingleStore implemented custom data encodings which offer a combination of good compression along with the ability to seek for an individual element without decompressing the entire column vector. This investment has provided us with numerous advantages over our competitors such as the ability to efficiently handle key/value workloads.

*Hash indexes (+unique!)*

While seekable compression provides us with the ability to lookup individual elements in a column-vector, we still need the ability to quickly find individual rows. To do this we built unique and non-unique hash indexes on top of column-store.

With the above features, Universal Storage provides high performance update, replace, and delete queries at low latency without giving up the analytical advantages of column-oriented storage. You can learn more about these features in the Universal Storage series of blog posts linked at the end of this post.

**Rowstore**

As mentioned above, Universal Storage leverages a storage engine called rowstore to handle small/rapid inserts/updates/deletes. Customers can optionally pin entire tables to this data structure allowing them to optimize entire portions of their workload for OLTP.

Rowstore stores rows contiguously in a memory-optimized data structure called a lock-free skip list. This sorted data structure provides highly concurrent low-latency reads and writes as well as efficient scans. Secondary indexes are also only stored in memory which allows us to only write raw data mutations to disk and network.

**Distributed query planner**

Finally, SingleStore ties all of these features (and many more I didn't mention) together in our distributed query planner. This subsystem is responsible for compiling an arbitrary SQL query into a highly optimized program leveraging all of our data structures and algorithms to produce the desired result as fast as possible. And, as icing on the cake - our plans are compiled down into a custom bytecode designed by and specifically for SingleStore. This bytecode is then both interpreted during initial query execution as well as compiled to x86 machine code via a quick journey through LLVM ([more codegen details here](https://docs.singlestore.com/db/v7.3/en/query-data/run-queries/query-concepts/code-generation.html)).

**In conclusion...**

For the reasons above, SingleStore provides a fantastic backend for data intensive applications. If you are interested in learning more, I encourage you to take SingleStore for a spin. We are currently offering [$500 worth of credits](https://www.singlestore.com/try-free) on our managed service which is more than enough to get a real feeling for our tech. Or if you want to see a reference architecture of how to use SingleStore for large scale logistics workload check out this project I recently built:

* [Logistics Blog Post](https://www.singlestore.com/blog/scaling-worldwide-parcel-logistics-with-singlestore-and-vectorized/)

* [Logistics Github Repo](https://github.com/singlestore-labs/singlestore-logistics-sim)

**Recommended reading to learn more about SingleStore:**

* [In-memory Skiplist Architecture](https://www.singlestore.com/blog/what-is-skiplist-why-skiplist-index-for-memsql/)

* [Columnstore Whitepaper](https://www.singlestore.com/resources/research-paper_columnstore-engine-for-real-time-streaming-analytics/)

* [SingleStore 7.5 Release Highlights](https://www.singlestore.com/blog/singlestore-db-7-5-now-available-for-preview/)

* [Universal Storage](https://www.singlestore.com/blog/memsql-singlestore-then-there-was-one/)

* [Universal Storage Episode 2](https://www.singlestore.com/blog/memsql-singlestore-memsql-7-1-episode-2/)

* [Universal Storage Episode 3](https://www.singlestore.com/blog/singlestore-universal-storage-episode-3-revenge-of-the-upsert/)

* [Universal Storage Episode 4](https://www.singlestore.com/blog/singlestore-universal-storage-episode-4/)

* [What is HTAP?](https://www.singlestore.com/blog/what-is-htap/)

* [Overview of Data Intensive Applications](https://www.singlestore.com/blog/data-intensive-applications-need-modern-data-infrastructure/)

2021-07-03 12:50:58

1/2

> But first, thank you for the candid feedback.

Thank you for your restrained and diplomatic response. Being alone and writing, I expressed my irritation more fully than I would have done in person or in company, and I am at fault for having done so.

> Sharded data

> SingleStore is unlike many relational databases in that it is scale out. It achieves that by splitting up the data into a fixed set of partitions. We allow the schema to specify how table data is mapped to each of those partitions via a specified shard key, or the user can decide to randomly shard the data. Because we use a fixed set of partitions, every node in our distributed architecture can easily determine where a given row lives (assuming it's not randomly sharded) without coordination. In addition, our query optimizer can often accelerate joins and group bys by pushing down the work into each partition (i.e. the join is operating on the shard keys).

Support for multiple nodes rather than a single machine is necessary for scaling, but it's insufficient; hardware alone is not enough to support SQL with Big Data - you must have sorting, and if you have sorting, you must operate it correctly.

The classic SQL databases - Postgres, MySQL, MS SQL - are all unsorted row-store, and they're all single machine. They represent an earlier generation of technology.

> This feature provides the foundation to handle "Big Data" as well as allowing engineers to load balance "small/rapid updates/inserts" over many partitions.

Distributing rows causes all queries to be load balanced over many nodes.

It's not specific to small/rapid update/insert queries.

It helps with all queries, but the basic problem with small/rapid update/insert is sorting, and that is not solved by having multiple nodes.

> Network first durability

> Since we handle "system of record" OLTP workloads we need to provide data durability. In our architecture we decide to focus on network durability rather than disk durability in order to provide better availability guarantees to our customers. In effect, this means we acknowledge writes to the client once transactions have been written to a pair of partitions such that the pair of partitions are hosted in two different failure domains (traditionally servers on different racks). These partitions acknowledge the write once it's safely in memory, but before it's written to disk.

It's a choice on the continuum between risk and durability. Quite a nice choice, I think, although for really critical databases the risk of a data-center-wide power failure might be too much.

You get some more performance up-front, but in the end everything will need to be written to disk, you can't *not* do that, and it's a producer-consumer scenario; if you add data more quickly than the disks can write, sooner or later the memory allocated to this task is full and then you're once again running at the speed of disk.

> This particular set of tradeoffs allows us to provide low-latency durable transactions while also ensuring that if a node fails we can immediately serve data from the pair set of partitions for that node.

One question is always how much performance is affected during recovery; if you're still up, but you can no longer meet the current load, you may as well be down. Redshift "solves" this problem by just going down immediately when a node fails =-)

> Universal Storage

> SingleStore stores data column-oriented by default. This means that we store each column of data packed into vectors (called blobs) and organized into sets of rows (called segments). For a deep dive into how this works, check out the whitepaper. As pointed out by u/MaxGanzII, column-store engines often tradeoff small/rapid inserts in order to maintain sorted runs of data - but this is not the case with SingleStore. To do this we build on top of the traditional LSM tree architecture of most column-stores with the following features:

Redshift, last I knew, does not use LSM. However, I may be wrong, and it may have changed; figuring out how VACUUM is working internally is challenging. The system tables do not reveal very much. My current understanding is that there is a k-way merge between sorted blocks of rows.

Broadly speaking, when it comes to needing to sort your data, but having a large number of small inserts, you fundamentally handle the problem - by whichever particular mechanism you use - by buffering up a number of inserts and then applying the buffered set in one go; and by breaking up the stored data into separate units/files/blocks, so that when re-sorting has to occur, you can look to minimize the number of units/files/blocks which need re-sorting (if you only insert one row, only one unit/file/block will need sorting, rather than the entire table).

So for the former continuum, at one extreme you take all the data and do a single sort and at the other extreme, you sort for every insert. In-between the extremes, you buffer and sort every now and then. (Note also sorting say 10 small blocks of data is inherently less costly than sorting 1 block of data the size of 10 small blocks.)

Broadly speaking, the fewer records you buffer, the quicker any given query runs (you make your insert and that query is then fully serviced and is done, so it's fast *from the point of view of that query*), but the slower the system is as a whole (if you buffered all the queries and then did one sort, the time per query would be much less, but each query would have had to have waited for all the work to be done).

For the other, you have one unit/block/file per row (ouch) or you just have *one* unit/block/file for all rows. In-between you might have say 1mb blocks, or 16mb files, or what-have-you. (Of course, if you insert data such that you manage to need to insert a row in most of the units/files/blocks, you're going to need to re-sort a lot, so it doesn't save you then - you can try to rescue yourself by buffering up writes so each sort consumes not one row but many, which brings us back to the trade-off between individual query performance and overall system performance.)
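
As a deliberately toy illustration of the buffering trade-off - nothing to do with how Redshift or SingleStore actually implement it - the general shape is; inserts go into a small buffer, a full buffer is sorted and written out as a run, and an occasional compaction merges the runs back into one large sorted run;

```
# Toy sketch of buffered inserts, sorted runs, and occasional compaction.
import heapq

BUFFER_LIMIT = 4           # tiny, so the behaviour is visible
buffer, sorted_runs = [], []

def insert(row_key):
    """Cheap for the caller; the sorting work is deferred."""
    buffer.append(row_key)
    if len(buffer) >= BUFFER_LIMIT:
        flush()

def flush():
    """Turn the buffer into one more small sorted run."""
    global buffer
    if buffer:
        sorted_runs.append(sorted(buffer))
        buffer = []

def compact():
    """Merge all runs into one - the big, occasional re-sort."""
    global sorted_runs
    sorted_runs = [list(heapq.merge(*sorted_runs))]

for key in [9, 3, 7, 1, 8, 2, 6, 4, 5]:
    insert(key)
flush()

print(sorted_runs)   # several small sorted runs: cheap inserts, slower reads
compact()
print(sorted_runs)   # one large sorted run: expensive to produce, fast to read
```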

The smaller the blocks, the more disk seeks you must perform when reading the data, and so the slower performance (and disk seeks are THE issue when it comes to performance), but the quicker you can insert new data.

It's all trade-offs, it's all choices on a continuum, and the more you choose to have one thing (say rapid insert) the less you have of something else (rapid read).

There's no magic bullet. You pays your money, you makes your choice. Redshift I would say went for maximum system performance, without worrying about individual query performance. SingleStore is elsewhere on that continuum.

> OLTP optimized segment

> In order to handle small/rapid inserts/updates/deletes, we store recent writes in a storage engine we call "rowstore". Rowstore is optimized for OLTP workloads and will be discussed in more detail later on in this thread. Leveraging this technology ensures that inserts, updates, and deletes offer minimum latency while also allowing us to efficiently serialize long sorted segments into the column-store.

Same issue, though. It's a kind of buffer. When it's full, you're back to the performance of the underlying system - in this case, how quickly you can shift data out of "rowstore" - and if you didn't have it, overall system performance would be higher (but individual query performance, lower or much lower). Still a trade-off.

2021-07-03 12:52:17

2/2

> Seekable compression

> Most on-disk column oriented storage engines have excellent compression due to adjacent records in a column vector sharing the same type and often a reduced distribution. SingleStore takes advantage of this property by dynamically determining the best compression to use on a per column vector basis.

I'm strongly against this. I don't like the databases making choices for me. It's a black box, controlled by someone else, and I have no idea what choice it will make, when it might change its mind (and changing compression is expensive), how it makes choices or when how it makes choices will change. You cannot knowingly design a correct system when that system contains black boxes controlled by third parties.

(This is a growing problem in Redshift, as they work to become more Snowflake-like.)

Moreover, what if those choices are no good? Redshift makes a terrible, terrible mess of its automated encoding choices (in part I think because of their internal politics - it's not even just a technical problem, although there are technical problems here too); I have to put in non-trivial effort to make sure it never makes choices for me, and certain functionality Redshift offers cannot be used as-is because it forces encoding choices on you (you could still use it, but by changing the encoding choices manually afterwards, which is crazy and expensive). If you do not provide a way for users to make a choice, you force any mistakes you make upon them.

Also, encoding choices are *not* hard for humans to make. Not always, but very nearly almost always the correct encoding method stands out a mile.

> Unfortunately, a common tradeoff to this approach is needing to decompress most (or all) of the column vector in order to retrieve any individual element. To deal with this, SingleStore implemented custom data encodings which offer a combination of good compression along with the ability to seek for an individual element without decompressing the entire column vector. This investment has provided us with numerous advantages over our competitors such as the ability to efficiently handle key/value workloads.

I could be wrong, but I've never seen any database, ever, need to decompress an entire column to get a single row. That would be crazy. Redshift compresses on a per-block basis (block is 1mb). Each block is fully stand-alone.

As it is, I'm not clear enough about what's being described here to understand it.

It may be you mean to say you have the entire column compressed, and without any decompression, can seek to an individual row and decompress it, without decompressing any other row.

In this case, you won't get this functionality for free. Basically speaking compression works by reading in a lot of data and figuring out the best alternative smaller representation for that data - but to go from that smaller representation, you have to read it all to be able to figure out what you did to make it in the first place. If you do not allow that, then compression cannot occur on a lot of data at once but only on one row at a time, and in that situation compression is terrible, compared to what you get if you can compress big chunks of data in one go.

It might however be that you compress on a "blob" basis (assuming a blob is like a Redshift block, independently compressed) and decompress that to get the row out (because here you're not decompressing the entire column, as described above).

In this case, you're doing what Redshift does, and it's a natural thing to do (would be strange not to).

> Hash indexes (+unique!)

> While seekable compression provides us with the ability to lookup individual elements in a column-vector, we still need the ability to quickly find individual rows. To do this we built unique and non-unique hash indexes on top of column-store.

Ah. I see. I may be wrong, but I think you've got a bit ahead of yourself. "Elements" is a new term - in fact you used it for the first time above - and I took it to mean rows.

So... you have blobs, and blobs contain elements, and elements contain rows? and you can seek to elements (that sounds fine - they may be the things that get sorted when new data is added, so they may even be separate files), and within an element you can index to a row.

That seems a bit puzzling. An element would need to have quite a lot of rows in it to make a hash worthwhile - up to a certain point it's faster just to read the block and all the rows. If you do have a lot of rows, and so a hash is worth having, I'd then think you would run into the costs of hash maintenance, which is why they're not used with Big Data.

> With the above features, Universal Storage provides high performance update, replace, and delete queries at low latency without giving up the analytical advantages of column-oriented storage. You can learn more about these features in the Universal Storage series of blog posts linked at the end of this post.

I'd say it is a particular set of choices on a number of continua and so offers a particular set of characteristics, which I would so far, very tentatively and broadly speaking, say sacrifices some OLAP performance to offer better OLTP performance.

> Rowstore

> As mentioned above, Universal Storage leverages a storage engine called rowstore to handle small/rapid inserts/updates/deletes. Customers can optionally pin entire tables to this data structure allowing them to optimize entire portions of their workload for OLTP.

That sounds basically like support for row-store tables, as well as the support for column-store. That's fine, I think; the row-store will I guess not be sorted, so if you join row-store to column-store, you'll make (or already have) a hash on the row-store and do a single pass over the column-store and look up every row from it in the hash.

> Rowstore stores rows contiguously in a memory-optimized data structure called a lock-free skip list.

Does it now? :-) lock-free data structures are fundamentally problematic to debug. There was a very nice piece of work a while back on using a modified GCC to partially explore all possible execution paths, and that to my (not very up to date) knowledge is the best method yet.

How did you debug that data structure? are you *sure* you're using it correctly? if it's from a third party, and you trust them, and they do know what they're doing, then they *may* have got it right; if you made it yourself, you better be really, *really* sure you're getting it right.

It's very easy to get race conditions which show up say one time per day when you have say 100,000 clients, and that's a problem for a database.

> This sorted data structure provides highly concurrent low-latency reads and writes as well as efficient scans. Secondary indexes are also only stored in memory which allows us to only write raw data mutations to disk and network.

But this is kinda irrelevant. CPU/memory access performance is not the issue. It's disk. It's always disk, for databases. If your hash is fully in memory, that's great - it's *never* going to be a problem, lock-free or not. But if it spills to disk, then you have a performance problem. If you say it's *always* in memory then you're really saying "tables can only be of a certain maximum size", because any larger than that and the hash spills to disk.

Of course, with OLTP you *do* want to say "don't make your tables large", but I am pretty sure offhand there are plenty of use cases where the hash is larger than the memory which is available for it (what happens when you have lots and lots of large tables? lots of disk, not much memory? or do we say "you must have n GB of memory per n GB of disk"?) (which is why hashes normally go to disk, as well as for persistence across reboots of course - I guess you are also persisting to disk, or start-up times will be slow while all the hashes are being rebuilt).

> Distributed query planner

> Finally, SingleStore ties all of these features (and many more I didn't mention) together in our distributed query planner. This subsystem is responsible for compiling an arbitrary SQL query into a highly optimized program leveraging all of our data structures and algorithms to produce the desired result as fast as possible. And, as icing on the cake - our plans are compiled down into a custom bytecode designed by and specifically for SingleStore. This bytecode is then both interpreted during initial query execution as well as compiled to x86 machine code via a quick journey through LLVM (more codegen details here).

Redshift does this. The problem is that it gives you a significant delay before a query can start, because it has to be compiled. It used to be, Redshift again, in problematic cases (BI tools) you could have a *ten minute delay for compilation*. AWS largely solved this problem recently by offloading compiles (queries are very parallel - each query ends up being many separate binaries) to supporting EC2 instances, so now a novel query (the compiled binaries are cached, which is fairly effective, but of course irrelevant for a new query) takes 5 to 7 seconds to compile.

Redshift has a limited number of concurrent queries, too, so one of the reasons it doesn't do small/rapid is that if the queries do need to be compiled, you used to end up with many (easily tens) of seconds of compile delay, and even now with 5 to 7, it's still a fundamental barrier. If you start doing lots of queries, query backlogs rapidly develop, entirely regardless of the issues of actually getting data down to disk.

I've no idea if SingleStore has limits on the number of concurrent queries.

2021-07-07 00:47:40

u/MaxGanzII, I appreciate you taking the time to write out such a comprehensive response. We are quickly approaching the limits of what can be easily discussed in a Reddit thread, so I will try to keep my responses brief. I plan on organizing a lot of the information we discussed into an upcoming blog post and the SingleStore documentation.

**Sorting, big data, and scale**

You have reinforced multiple times that sorting is required for big data. This is true in most cases. But it's very important to clarify that sorting, or the lack thereof, is a property of the schema - not the database. All of the relational systems you have mentioned (SingleStore, Redshift, Postgres, MySQL, MS SQL) provide sorted data structures as a mechanism of storing data. It's ultimately up to the schema designer to decide how or if they want to sort their data, and if they decide not to, they need to live with that decision.

**Benefits of distributed data**

In your comment you state "Distributing rows causes all queries to be load balanced over many nodes" which is false. In SingleStore, we distribute rows based on the table's shard key which allows us to restrict queries to a subset of partitions (and thus nodes) in many cases. For example, in most OLTP workloads, many queries deal with either reading or writing a row associated with a primary key. In SingleStore, those queries will be routed to the partition that row is contained in - allowing for horizontal scalability in key/value workloads.

**Durability**

In your comment you state "although for really critical databases the risk of data-center wide power-failure might be too much". This is addressed in most data centers via failure domains attached to dedicated power and network stacks. That said, for our customers who require cross-region durability, SingleStore helps customers architect their application with this requirement in mind. We also offer cross-region replication with all of the compromises it's associated with.

**Memory constraints**

A couple of your points mention memory being a limitation. Of course it's true that if you are using memory as a buffer and that buffer is full then the system's behavior will suddenly change. But that's true for any resource constraint. Consider what happens when a database runs out of disk - most fail while others start flushing out to much slower network attached storage.

The point I am trying to make is that when you run a database you need to size the system to your workload. SingleStore can be used to satisfy most performance profiles - but is unique in that it can be used to provide predictable performance for critical 24/7 workloads. In these cases, it's not acceptable for a system to suddenly be reduced to disk write performance. For SingleStore customers running high performance workloads, we work closely with them to ensure that they are right-sizing their clusters and have the right monitoring in place to know when to scale up.

**Performance during recovery**

You make an excellent point that any system like ours which stores many complex data structures in memory will have to figure out how to deal with slow recovery times. SingleStore solves this by spreading our partition pairs throughout the cluster such that the failure of any given node results in only a small increase in load on many individual machines. In addition, since the pairs are usually within a couple seconds of being fully recovered, the runtime load during a failure is barely noticeable in most cases.

Of course, there are some edge cases with this approach - but as most of our customers operate us at load 24/7, optimizing for the always online case ensures we have predictable query performance at the expense of higher memory usage.

Note - the above only applies to data that is pinned to memory using our row-store. The recovery time for Universal Storage is nearly instantaneous.

**Blackbox compression**

I agree with your comment: "I'm strongly against this. I don't like the databases making choices for me."

Databases should respect developers and give them the tools to optimize as they like (and possibly shoot themselves in the foot while they do it). At SingleStore we want to make sure that our technology is able to handle whatever workload you throw at us, and that means fully automated decision making is likely to make the wrong choice at least some of the time. Because of that we allow customers to optionally specify which encoding is used on a per column basis.

**Seekable compression**

In addition to general purpose compression algorithms like lz4 or gzip we also provide data type and distribution specific encodings which provide performance optimizations in addition to excellent compression. Please [refer to our documentation](https://docs.singlestore.com/db/v7.3/en/create-your-database/physical-database-schema-design/concepts-of-physical-database-schema-design/columnstore.html#encoded-data-in-columnstores) if you want to learn roughly how these encodings work and what kinds of optimizations are provided.

**Organization of Universal Storage**

Apologies for the confusion regarding the term "elements" with regard to the internals of Universal Storage. To be clear we:

- split up a table into "segments" - each segment contains one million rows by default

- split up a segment by column - each segment-column is called a blob

- thus, a blob is a vector of values - which I called "elements" in my original comment above

Please [read the research paper](https://www.singlestore.com/resources/research-paper_columnstore-engine-for-real-time-streaming-analytics/) if you are unclear about the architecture.

Going back to the comment in question, our column-store has secondary hash indexes (unlike solutions like Redshift). These indexes leverage our storage architecture to enable fast single-row lookups via offset indexing without needing to store a copy of the row data within the index.

Based on extensive testing, benchmarking, and real world customer usage, Universal Storage provides great OLTP performance without sacrificing any OLAP performance.

**Lock-free skip lists are hard**

You are completely right that building a lock free skip list is hard which is why we have spent over eight years perfecting ours. In general, most real world systems problems do not have the luxury of correctness proofs and must be verified via comprehensive testing and use in the real world. Considering that most of our customers store and manipulate terabytes of data using our lock free skip list and we run over a million tests a day, I would guess we have processed well north of an exabyte of data without a single failure attributed to our skip list. If you want to see how badass our internal systems are in person - I encourage you to [apply to work at SingleStore](https://www.singlestore.com/careers/).

**Table size limited by memory**

You are correct that if a customer elects to pin a table to memory using our OLTP optimized storage engine, they will be limited to what they can fit in the memory of the cluster. This is certainly a tradeoff which must be planned for by our customers, but by no means is it an issue. OLTP workloads generally are constrained in size unlike their OLAP cousins which helps our customers size their clusters correctly. As their workload grows in size, SingleStore conveniently supports scaling both vertically and horizontally allowing our customers to seamlessly handle more data without losing the predictability of in-memory query performance. Of course, for portions of a customer's workload which don't need the extreme performance of our OLTP engine, those workloads will run just fine on top of Universal Storage.

**Compilation delay**

Indeed, Redshift has optimized for large scale analytical queries rather than interactive or short lived dynamic queries. Because of this they haven't invested in a more sophisticated query compilation system. On the other hand, SingleStore has. Our query planner emits MBC within milliseconds of receiving the query. MBC is a custom bytecode developed by our compiler engineers. During the first run of a query, we interpret this bytecode in order to start processing data with zero delay. In parallel, we compile the same bytecode using LLVM into heavily optimized machine code. This final optimized version of the code is then hot-swapped into the running query without any interruption. In this way we allow our customers all the benefits of heavily optimized query performance without the downsides of losing interactivity. And yes, this is very hard to do which is why we have hired people who worked on LLVM and GCC.

**Concurrency**

SingleStore has no explicit concurrency limit out of the box, but we do allow customers to tune us to their workload. Customers want to control concurrency, resource usage, or other properties of query execution - all of which is possible in our system. Because of the unique properties of SingleStore, many of which I have described in this thread, our ability to handle highly concurrent workloads is best in class.

I hope that clarifies some of the followup questions and points you have raised.

2021-07-12 07:04:14

I am a Redshift specialist.

The minimum data for Redshift is a terabyte, and it comes with a wide range of constraints and restrictions upon its use. In particular, both developers and users need to know how to correctly operate Redshift; you cannot just issue arbitrary SQL.

Also, you almost cannot run BI tools directly on Redshift; the only way I know to use BI tools is to make an aggregate table of about 100k rows, per dashboard, and run each dashboard from and only from that table.
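
As a very rough sketch of that pattern (all identifiers and connection details here are invented, and I'm assuming a psycopg2-style driver), the aggregate is rebuilt periodically and the dashboard reads it and nothing else;

```
# Sketch: rebuild one dashboard's small pre-aggregated table from the fact table.
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='analytics', user='etl_user', password='...')
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS dash_sales_daily;")
    cur.execute("""
        CREATE TABLE dash_sales_daily AS
        SELECT   sale_date, region,
                 SUM(amount) AS total_amount,
                 COUNT(*)    AS sale_count
        FROM     fact_sales
        GROUP BY sale_date, region;""")

# The BI dashboard then queries dash_sales_daily, and only dash_sales_daily.
```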

There's a lot of other things to say, but I've said them so many times now; there's a huge user base out there who are, or want to use, Redshift in the wrong way, and I no longer have the motivation to keep repeating exactly the same points. I will in the future write up a decent intro document and distribute that.

BTW, Snowflake is also completely the wrong choice for this use case, both as it was originally misconceived and also as it is now seen, with small data.

I would advise you to begin with a conventional unsorted row-store database, such as Postgres.

2021-07-15 21:41:06

I am a Redshift specialist.

I know of no sources, except perhaps the white papers I will soon begin publishing.

I can tell you one thing, very simply : do not use `EXPLAIN`.

Explain plans omit *so* much information they are actively misleading. You can go from what is really happening to an explain plan, but you cannot go from an explain plan to what is really happening.

You need to look at the system tables which contain information about steps, such as `STL_SCAN`.
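
As a small sketch of that in practice (connection details and the query are placeholders, and the `STL_SCAN` column list is from memory, so check it against your cluster), you run the query, then read the per-slice, per-step scan information;

```
# Sketch: inspect what a query actually scanned, via STL_SCAN.
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='analytics', user='dev_user', password='...')

with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM some_table WHERE some_column = 42;")
    cur.fetchall()

    # pg_last_query_id() gives the query ID of the last query in this session,
    # which is the key into the STL_* step tables.
    cur.execute("""
        SELECT   slice, segment, step, rows, is_rrscan, perm_table_name
        FROM     stl_scan
        WHERE    query = pg_last_query_id()
        ORDER BY slice, segment, step;""")
    for row in cur.fetchall():
        print(row)
```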

Aside from that, you need to understand database kernels, learn a lot about what Redshift is doing internally, and you're good to go :-) problem is, there's almost no information on what Redshift is doing. You have to find it out by years of practical investigation. The system tables are unfortunately a catastrophic disaster, and it takes a year just to master them; they have no organisation at all, so you have to learn them all to know where what you want is, and they're *FULL* of bizarre, unhelpful, undocumented and sometimes inexplicable behaviour.

Oh - I can tell you one more thing : don't bother with the official documentation. I am of the view it obfuscates everything which is not a strength, and that it is *not* proof-read by technical staff. It's basically propaganda, combined with a smattering of flat-out factual errors and the omission of vital information, which you would never in a billion years imagine someone would leave out.

2021-07-16 07:23:05

I'm afraid it's really not right to say joins are less optimized.

In SQL we have as we know joins. Under the hood, Redshift has three methods to implement all SQL joins. The entire point of Redshift is to permit one of these under the hood methods, the merge join, which allows two Big Data tables to experience an SQL join. There is no other way to join two Big Data tables in a timely manner. This method is staggeringly efficient - vastly more so than Postgres allows - and is one of the three main methods by which Redshift supports Big Data.

Changing subject, note in `EXPLAIN`, in Redshift, the plan costs are arbitrary.

In Postgres, costs reflect a best-effort estimate of the actual cost of an operation - so for example reading a row from disk might cost 1, processing a row in memory might cost 0.00001.

In Redshift, this approach was dropped (and I can see no good reason for it - my guess is it was done to kind of crudely control the query planner, but I think it a vast blunder) and the costs for operations are arbitrary. For example, if I remember correctly, a sort has a cost of 1,000,000,000 plus one per row - which means a sort of a *single row* is considered more expensive than reading a billion rows.

This leads to the query planner making mistakes (choosing a slow plan because it mistakenly thinks it's the fastest plan), and it also means you can't compare explain plans across queries (which you can in Postgres).

Note also there's at least one bug in the costs, I think due to an uninitialized variable, leading to some staggeringly high values (billions of times more than the number of atoms in the Universe). I *did* try to explain this to Support, but someone had told them "query plan costs can be high" and it was *impossible* to get them to understand a number could be high *and be wrong*. They literally would not grasp the idea. Deflector shields were up. Support IME have always been quite rigid; I guess they do not have software engineering backgrounds, and they do not really understand Redshift.

2021-07-21 06:24:40

I can't see any mention of `VACUUM`. How do you deal with this?

2021-07-23 17:30:19

Do I understand you *never* run a `VACUUM DELETE`?

2021-07-24 09:24:26

BTW, I remembered something - `VACUUM FULL` is faster than `VACUUM SORT ONLY`. These are the bottom line results from my testing;

| Vacuum | Mean |
| :--- | :--- |
| Delete Only | 34001.674 ms |
| Full | 38764.919 ms |
| Sort Only | 51064.960 ms |

I think the delete work is being done at a point during the vacuum work such that the sort has less work to do, and that saves a lot of time. Of course, if you have no expired rows (rows which have been deleted and also can no longer be seen by any transaction), it won't make any difference at all.
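
For reference, a comparison along these lines can be reproduced roughly as follows (sketch only; the table name is invented, `VACUUM` must run outside a transaction, and a fair test re-creates and re-loads the table to the same unsorted state before each variant, which is omitted here);

```
# Sketch: time the three VACUUM variants against one test table.
import time
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='analytics', user='dev_user', password='...')
conn.autocommit = True   # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    for variant in ('DELETE ONLY', 'FULL', 'SORT ONLY'):
        # A fair comparison re-creates and re-loads test_table here.
        start = time.monotonic()
        cur.execute(f"VACUUM {variant} test_table;")
        print(variant, round((time.monotonic() - start) * 1000, 3), 'ms')
```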

2021-07-24 20:18:11

All of the answers are wrong. You can't use Redshift, or Redshift Spectrum, like this; making table designs which support Big Data is fundamentally antagonistic to ad hoc querying, and if you don't need to support Big Data, you absolutely and categorically do not use sorted column-store.

The proposal to use Spectrum for complex queries is in particular alarming because AWS does not fully document what Spectrum can or cannot do, so even if your users *do* know how to correctly write sorted column-store queries (and they won't), they *still* wouldn't fully know what Spectrum can or cannot do, and if Spectrum cannot do something, you're likely to be seeing all the data in S3 being brought into Redshift, for every query, which can perfectly well bring a cluster to a grinding halt.

I have long seen though that AWS support, account managers, etc, have been taught to believe Redshift can do anything; clients, when they ask whether Redshift can handle whatever their use-case is, are always told "yes".

I'm very much of the view AWS, at least as far as Redshift is concerned, obfuscates everything which is not a strength. This is done I think also in the material used to train staff, so AWS ends up believing its own censored, kind of "politically correct" view of Redshift.

2021-07-25 10:53:59

You can do complex just fine (and the more skill the developers and users have, the more complex you can handle) - it's *ad hoc* which is normally a major problem. If you want *ad hoc*, you often cannot take advantage of sorting, which means you're in effect back to being unsorted, which is the same situation as a cluster-based unsorted column-store, such as memSQL (or whatever it's called these days) - which is then a better choice, since it has things like indexes (possible because it does not handle Big Data) and has rather fewer restrictions and constraints than Redshift, and a *lot* less knowledge and skill is required of developers and users. AWS do not offer a cluster-based unsorted column-store database.

Generally though I find a lot of Redshift users have nowhere near enough data to justify using Redshift, regardless of use case or anything else, and would be fine on Postgres.

2021-08-04 19:34:02

I am a Redshift specialist.

I work as a contractor.

Every single client I have worked with has been using Redshift incorrectly, which makes it extraordinarily inefficient, and, not knowing that they are doing so, the only mechanism available to them is to add more hardware to the cluster, which leads to Redshift being (and having a reputation as being) expensive.

I may be wrong, but what you really need is someone who actually knows how to use Redshift. I am of the view that *all* AWS documentation, training and support with regard to Redshift is absolutely misleading and no use *at all*; that it obfuscates everything which is not a strength. I consider it a sugar-coated cheerleader squad; you can never learn what Redshift is *not* to be used for.

That person, should you find them, will then explain to you what Redshift can actually be used for, and it is in my experience the case that what Redshift is being used for is not what it *can* be used for; corporate expectations have always been absolutely incorrect, and I am of the view this occurs because, as described above, all answers and information are always "yes, Redshift can do what you want".

You will then need to re-assess your use of Redshift and find out if you actually have use cases where it is the correct solution.

2021-08-05 08:04:01

If you are issuing queries where the queries do not match up the sorting orders of the tables, and/or you are adding data to tables incorrectly, moving away from Redshift will give you an enormous efficiency boost. For much less money, you will obtain the same performance.

2021-08-08 18:44:04

I may be wrong, and I've not examined the behaviour of this functionality, but the choices Redshift so far makes with regard to column encoding are, in my view, *extremely* poor, and somewhat politically influenced, rather than purely technically influenced - there's a strong preference for AWS's proprietary encoding method, when it is not in fact an appropriate choice *at all*. Indeed, that method is even used with interleaved tables, where it is a catastrophically incorrect choice.

Moreover, any automated method is fundamentally limited in the information it can process to make its choice : it is not a human. It does not understand overall design, or what might come in the future. It may end up repeatedly switching between different choices, as query load varies over time.

In short, it is likely better - assuming the actual choices made are sound, which is *not* an assumption I would make - than a human who doesn't know what they are doing, and worse than a human who does know what they are doing.

My great fear is that it will not be possible, either now, or made so in the future, to disable this functionality, and so it will actively harm the clusters of people who do know what they're doing.

2021-08-08 23:09:48

> It's good for when you are dealing with a scale that one can't actually sit down and look at each column, when you have 100s of tables with 100s of columns that automatically generated from the data.

The use case for Redshift is timely SQL on Big Data. If you don't have Big Data, you don't use Redshift - too many constraints and restrictions.

If you *do* have Big Data, timely SQL is only possible if the sorting orders of tables match up with the queries being issued.

If there are so many tables they cannot be handled by a human, and they needed to be inspected for encoding, this implies their design is lacking and that means Redshift makes no sense, coming or going. Either you have so little data the lack of design is irrelevant (and then you shouldn't be using Redshift), or you have Big Data and then the lack of design makes Redshift unable to perform timely SQL because sorting is messed up - it basically degenerates to unsorted column-store.

I may be wrong, but I think this feature is part of the ongoing, basically nonsensical attempt to be Snowflake-like. It makes no sense; it looks to me like bad management decisions, made by people who don't really understand their own product or its place in the market.

2021-08-19 06:48:26

> First thing, your code, raw data, and results show that you dealt with 1k, 100k, and 10M rows. The white paper talks about 1k, 1M, 10M.

The source code of the stand-alone Python script which is used to generate evidence for the paper is published along with the paper, and the paper contains the raw and markdown-prettified versions of the results of its run of that script.

As such, the values used in the results are authoritative. What happened was that during development, I changed the number of rows being used, and by the looks of it, since it is just "number of zeros" and so easy to miss, I've not adjusted all the places in the text which refer to the number of records.

I'll check them after I post this and make corrections.

For reference, the values in the code are;

```
table_row_counts = [ 1000, 100000, 10000000 ]
number_columns = [ 2, 400, 800, 1200, 1600 ]
```

> Second, in this highly artificial, contrived scenario, the choice of queries may be skewing your results towards your "conclusion".

Mmm. I'm absolutely happy to consider that the test method is *flawed*, but not that it was directed. There was no selection of outcome. The test was designed before the results were known, and the design was intended to reveal behaviour, whatever that behaviour happens to be.

> By the time you run the query using first and last, you had already touched both, separately, in individual queries that scanned through the whole columns. However, when you run the query using first and second, it's the first time ever you're touching second.

Yes. I hadn't seen that.

Hmm. So, what happened was originally there were three tests, not four. There was no test involving the second column. When I came to write up the results, I could compare sum(first) with sum(last), no problem - one column each, I can say "last column appears to be faster than first", but when I came to using first and last, I had nothing to compare it to. So I then added first and second. It wasn't part of the original design, and when you change something in the course of use, you are more likely to overlook the consequences of that change.

The question then comes as to whether or not using a column for the first time matters. The result cache of course is turned off, but there's always seemed to be other layers of caching, so much as it would be lovely for testing to imagine each query is completely independent, probably they are not.

So, yes, I think what needs to be done is that a new table is created for each query, so each table is having one and only one query run on the table.

On the face of it also there should be a couple more queries, where two columns are used but the `sum` is `sum(last)`.

I need to think about this a bit, because the run-time for this test is already long.

> You're running 5 times, discarding the slowest and fastest, and then estimating a normal distribution out of 3 data points. You'd probably get better results by running this more times.

Already takes I think about two or three hours run-time. Part of the idea of publishing the script is that people can run it. The longer the test becomes, the less that goal is achieved. Of course, actual accurate results matter most, but it is still a consideration.

> Compare the results for 10M rows, 2 columns, on the "first and second" and "first and last" tests. They should lead to absolutely identical conclusions, as in these cases "second" is, by definition, identical to "last".

I'm not seeing what you're seeing.

(Results are means/stddev).

For `dc2.large`, I see 0.042/0.000 vs 0.041/0.000.

For `ds2.xlarge`, I see 0.033/0.003 vs 0.036/0.003.

For `ra3.xlplus`, I see 0.031/0.001 vs 0.030/0.002.

Those variations are within normal differences between test runs. In fact they're all within the standard deviation found for both test runs.

As an aside, sadly, posts in this subreddit appear to be deleted if they show AWS in a poor light, and the white paper prior to this has been deleted by the mods, I think because it revealed the implementation of cross-database queries is very different to that expected by users and puts AWS in a poor light.

I messaged the mods, but no reply.

As such I was this morning about to delete the white papers posted here and move to r/dataengineering, but now you replied, which of course by the time and effort you've taken requires a proper response (and indeed, corrections to and republishing of the white paper). So I'll leave this post up for a bit so you can see the reply, and delete it later.

BTW, do you want a credit on the white paper? your review of it is about to lead to large changes and improvements, so credit is due.

2021-08-19 18:28:00

I've made the following changes to the test method;

1. I've moved over to using blocks per slice per column, rather than total number of rows. I was aware of this before, of course, but I was reluctant to move over to blocks, even though they are the correct approach, because I knew I would run into problems with Redshift trying to closely control block use, and I did. It looks like `even` has off-by-one errors, and so I needed to roll my own `key` based solution. So the test now specifies blocks, and the blocks are exactly full.

2. The test table is now dropped, re-made and re-populated before every test query.

3. I've improved table building performance by optimizing the `insert` queries, but I also lost performance from the `key`-based work-around I had to make because `even` has issues. I've reduced the columns tested to 2, 800 and 1600, to help reduce overall test time, which is now I suspect going to be many hours - I wouldn't be surprised to see six hours.

I have one final bit of coding to finish, and then I'll kick off the test. Once I have the results, I'll update and republish the white paper.

2021-08-19 23:08:34

> Apologies, I thought it was a repost. I undeleted it.

This may be because it was my first post. The UI for posting was not clear and it took a number of attempts to make a post more or less as I intended it to be.

I may be wrong, but I think this may mean the post existed at the top of the new stream for a short time only, before being deleted, and now, being undeleted, it is submerged by being old; it will not be much seen. Is that correct? if it is, then if you don't mind, I will delete it and post it one more time, so it has the normal visibility accorded to a post. It takes about a week to produce one white paper, and that's effort I wouldn't want to see partially lost.

> Yes please.

What name do you want used? and any URLs associated with that name?

2021-08-20 06:56:00

> However, I mistakenly thought that the second post was a repost of the first, and so I removed the first post at the same time as I approved the second.

Ah, where the titles have a long identical starting phrase.

> When I re-approved the first post, it went back in the "new" listing in the same place it was before I removed it.

> It didn't get "buried" by being removed and then approved, because it had already been approved.

I'm not clear about this. So, question : what is the "new" listing ordered by? posting timestamp, or approval timestamp? or something else?

> However, traffic in this subreddit means that most posts are mostly invisible after 12-24 hours.

Yes.

> So, your first post was already mostly invisible at the time it was removed.

Ah. It had been approved and not deleted for some time, before it was deleted? then that's fine, it's had its fifteen minutes of fame.

> There was a delay before your first post was approved, which didn't help it.

I thought I understood, and then I realised I didn't :-)

So it looks like the "new" listing is ordered by posting timestamp, and so if there is a delay in approval (something like 20 hours in this case), when approval does come, the post is down the "new" listing by the duration of delay for approval. Is that correct?

> I am fine if you would like to repost and/or delete again. :)

Good. However, this can only occur if it's actually right, which is to say, if the post was disadvantaged in some way or ways and so unusually less visible than every other post. If it was seen as any other post would be, then it's done. I do not yet solidly understand the mechanisms here involved and so cannot yet know if this was the case.

> I will never remove negative feedback from any of my subreddits, reason being, those are frequently posts that help improve the field, no matter what it is.

I have been censored by AWS once before, for technical critique of Redshift, so I'm more sensitive to the matter than the average Joe, I suspect.

> Re credit, you don't need to do that.

This is a matter of fundamental correctness : *I did not compose the thought and work which led to the changes*. It cannot be then that by action or inaction it appears the work was done by me, unless anonymity is requested, of course. You did the work; your experience and wit, facilitated by a more objective position, allowed you insight unavailable to me. It is for you to receive the credit, just as if you did a day of work, it is for you to receive the wages for the day.

One of the great privileges and necessities of writing papers is making sure people receive every single bit of credit they deserve.

Right now you are named simply as an "r/aws moderator". You're sure you would not prefer a real name and URLs of some kind? in any event, if not, and you change your mind later, let me know; I can make those changes at any time.

2021-08-20 06:57:00

The method to work around the bug in `even` is so slow that I'm now working on ways to improve test table population times. If I cannot, then I think there can now be only one iteration. As it is, I have some hope of success.

2021-08-20 14:04:32

Got there in the end, enough at least. I estimate run time about five hours. I kicked it off an hour or two ago.

2021-08-27 10:05:07

> Hey, sorry for the delay in response, busy week!

No worries. I thought you were not going to reply, so I'm simply glad to see it :-)

> > I'm not clear about this. So, question : what is the "new" listing ordered by? posting timestamp, or approval timestamp? or something else?

>

> The "new" listing is ordered by posting timestamp.

Okay.

> BUT there's an exception, see next point.

Okay...

> > So it looks like the "new" listing is ordered by posting timestamp, and so if there is a delay in approval (something like 20 hours in this case), when approval does come, the post is down the "new" listing by the duration of delay for approval. Is that correct?

>

> No, due to the exception. IF the post is sent to the moderation queue by reddit's filters (eg. suspected spam), AND IF a moderator approves the post within 24 hours, THEN the post is placed at the top of the new queue (out of order) when it is approved. If you look carefully at the new queue you'll occasionally see posts out of order, this is why.

Got it. Then it is not correct to repost the article; it had the normal exposure of any article.

> The delay on your original post was just under 12 hours, so it would have gone to the top of the new queue when approved. That doesn't guarantee five minutes of fame though as there are other factors (eg. traffic on the site at the time). If you wanted to be sure others had seen it, you could perhaps include a link to it in your other posts. Or, just repost, I am familiar with this particular caper now, so all will be well. :)

Heh. Thankyou. I will not be posting it though, there is no justification.

The cross-database query document will be reposted, though, since it has been superseded by a later and corrected/improved version.

> > I have been censored by AWS once before, for technical critique of Redshift

>

> I feel that blocking this info breaches the spirit of customer centricity which is #1 on their list of principles,

I am (sadly, but it seems inevitable with large companies) strongly of the view Amazon jumped the shark a while ago, and that LP is what Amazon now constantly tells itself; but in reality actual customers and their actual experiences are not even a dim, remote blur. Amazon now, I aver, like all large companies, responds to its internal politics infinitely more strongly than anything else and is oblivious to customers, both in terms of possessing information in the first place, and even if that information were available, acting upon it.

Such appears to be the fate of all large companies and organizations.

> so counter-productive, and definitely not how we do things in this subreddit. I want to see numbers like that, and I'm sure engineers inside Amazon do too. Please do continue. Quote "LP #1" at the next person who says no.

To be clear, that kind of censorship has *not* happened here, I am not worried about it happening, and I cannot imagine needing to quote that to anyone.

> Credit - really I was just objecting to being considered a hindrance. That's not my purpose here. You don't even need to mention us. But r/aws is the name to use,

"r/aws" has as is your wish been used.

> I'm not here for accolades either. :)

I am absolutely and categorically of the view that leadership can never confer gift or gain upon the leader, for such inherently compromises their position. Leadership can only be a burden.

When such a leader can be found, although they need other attributes to succeed, one of the main ways in which leaders fail has been eliminated.

For you to be like this is of enormous importance and value to r/aws.

I will note though that the credit due to you for the improvements in the paper is *entirely independent of your leadership position*, and that I will always do my best to accolade you as appropriate and correct :-)

2021-08-05 14:37:55

I am a Redshift specialist.

> Downstream BI tooling that directly queries Redshift

You cannot use BI tools directly on sorted column-store relational databases, when you have Big Data. BI tools have no idea about sorting, and this means their queries cannot utilize the staggering efficiency made possible by sorting, and so when you have Big Data - which is to say, when the data overwhelms the hardware and the only way to obtain timely SQL is by the correct use of sorting - BI tools are death and ruin.

2021-08-05 18:31:24

Wow! I never believed that my quick, short reply to this topic would spawn such an amazing and enlightening technical debate.

A big thanks to both u/MaxGanzII and u/carlsverre for all the deeply insightful information.

Kind regards

2021-08-06 06:12:42

I may be wrong, but I think you have not really seen what I've written, and are thinking something very different to what I am saying.

You can of course set up a BI tool and use it on Redshift, or any SQL based database - and I think *this* is what you have in mind.

This works fine as long as you have so little data that the hardware overwhelms the data (in which case you do not need Redshift).

When you come to the point where the data overwhelms the hardware - when you have Big Data - then you run into a wall. BI tools are unable to operate sorted databases correctly, which is to say, issue queries which honour the sorting orders of the tables, and as such when Big Data is present BI tools are unable to issue timely SQL, and their use is no longer practical, because they are so slow, and because they hammer the cluster, which makes performance poor and erratic for everyone else.

2021-08-12 08:58:37

> I think it'll be super expensive to store images on redshift.

What data type did you have in mind to store images in Redshift?

There are no binary data types.

2021-08-18 17:00:29

> While the storage is very cheap on cloud data warehouses, is data duplication across is DB's still not a good idea?

Say you have ten databases. You then duplicate every table in every database into every other database. What you now have are ten copies of one database. Why not just have one copy, and keep the original data in S3 for backup?

Moreover, if we're talking about cost, we're talking here about ten times the amount of store. So twenty Redshift nodes instead of two. This is not cheap.

> In your tests, is there any sweet spot on the data size pulling from remote DB on cross-database queries?

The larger the remote table, then the more data which has to be copied, and so the longer it takes. There's no inherent sweet spot.

2021-08-22 18:38:20

Good question. I don't know the answer. Data sharing is already on the list to be investigated.

I'm about (finally) to post another white paper or two, I received a review for another white paper (not yet posted here, because of the review), which had valid criticisms of the test method, and I've spent the last four days improving the test method to resolve those issues, and repeating the test and re-writing the discussion and conclusions.

2021-08-22 21:26:07

Thankyou for your encouraging words!

I've just posted a white paper about the commit queue.

Will be posting one a day for the next few days, and then kicking back to about one a week, which I intend to be the long term pace.

2021-08-23 01:22:33

This feature is only on RA3. RA3 stores all data in S3; the local SSDs are cache only. I would expect the share to be reading from S3, and S3 I imagine is what is being considered "shared" between the clusters.

2021-08-23 01:39:02

The quoted space is S3 store. AWS do not publish the SSD cache size, but you can see it in the system tables.

Last I looked, it was like this;

| Node Type | Number of disks | Storage per Disk | Storage per Node | S3 store per Node |
| :--- | :--- | :--- | :--- | :--- |
| ra3.xlplus | 1 | 932 GB | 932 GB | 32 TB |
| ra3.4xlarge | 1 | 3.18 TB | 3.18 TB | 64 TB |
| ra3.16xlarge | 4 | 3.26 TB | 12.74 TB | 64 TB |

Remember you pay an additional charge for S3-based storage, based on data volume.

2021-08-23 02:28:28

Not offhand - I don't have it laying around and I can't remember exactly which table I used. I think it was probably `STV_PARTITIONS`. Note you have to know how to use the table - not documented of course, nothing is - as it lists all the k-safety disks as well as the normal disks. You need in the where clause to have "owner = host".
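
If it was indeed `STV_PARTITIONS`, the kind of query I have in mind is along these lines (a minimal sketch, not checked against a live cluster):

```sql
-- A sketch, assuming STV_PARTITIONS is the right table.
-- "owner = host" excludes the k-safety (mirror) partitions, so each node's
-- own disks are counted once; capacity is in 1 MB blocks, hence the / 1024.
select host,
       count(*)             as number_of_disks,
       sum(capacity) / 1024 as total_ssd_gb
from   stv_partitions
where  owner = host
group by host
order by host;
```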

2021-08-23 02:29:10

I'm not sure what you're asking. So there are three columns; how many SSD disks, how big each disk is, and then the total store (disks * size).

2021-08-23 02:55:15

I think you have not seen the reply below, which explains which table was (IIRC) used, and how to operate that table correctly.

2021-08-23 03:17:58

To be clear for future readers, I think you checked only for the node type you have? you're not confirming the stats for all the different node types.

2021-08-25 17:02:54

Sorting is the key. Columns are secondary.

2021-09-10 17:08:42

I might guess, and I'll check later once I spin up a cluster, that a late binding view exists in `pg_class` but it has no columns in `pg_attribute`, so there are no rows to store the comment.
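
The check I have in mind would be something like this (a sketch only, untested, and `my_lbv` is a made-up view name):

```sql
-- A sketch of the check; my_lbv is a hypothetical late binding view.
-- If the view is present in pg_class but has no rows in pg_attribute,
-- there is nowhere for a column comment to be stored.
select c.relname,
       count(a.attname) as number_of_columns
from   pg_class c
left join pg_attribute a
  on  a.attrelid = c.oid
  and a.attnum   > 0
where c.relname = 'my_lbv'
group by c.relname;
```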

2021-09-13 06:42:04

> Your "findings" on the floating point numbers seem to be due to misunderstanding IEEE-754.

It's *always* possible *any* of my findings are due to misunderstandings :-)

Peer review is vital for real quality, but I'm the only author.

I *very* much appreciate comments and assessment by readers, such as this.

It's all about trying to get the correct answers, and that's never easy. Software is kind to us - it can go wrong in very obvious ways - but data and investigation fail silently.

> It's recommended that numbers outside the range that can be represented should be treated as +/- infinity, so nothing unexpected on the behavior there.

Yes and no.

I did think about this.

So, yes - numbers outside the representable range should be +/- infinity.

But that's *not* what happens with `float4`. With `float4`, if you try to insert a number too large, or a number too small, `INSERT` baulks and fails the insert. Numbers are never converted, silently or otherwise, into `infinity`.

I could well be wrong, but I think that is the right thing to do. SQL databases are intended to have strong typing. If you exceed a range, getting an error seems more appropriate than silently modifying what is stored.

Now, we do *not* see this behaviour with `float8`.

With `float8`, as the number stored becomes larger (or smaller, if negative), you first store numbers normally (as you should), then, once the magnitude is large enough, the number is silently converted to infinity, and then, as the magnitude keeps increasing, `INSERT` finally baulks and fails the insert.

Obviously, if you're going to convert to `infinity`, you're always going to do that. You wouldn't do it for a bit, and then fall over to failing `INSERT`.

I can then (to the extent I've not messed up getting basic results!) at least say behaviour is inconsistent between `float4` and `float8`.

(Both types support storing +/- infinity by storing the strings `'+Infinity'` and `'-Infinity'`.)
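
As a minimal sketch of the kind of probe I mean, for `float4` (the values here are illustrative of the behaviour I observed, not verified boundaries; the actual test wrote the integers out in full, all digits):

```sql
-- A sketch only; values are illustrative, not exact boundaries.
create table t_f4 (v float4);

insert into t_f4 (v) values (3.4e38);       -- near the float4 maximum : stored
insert into t_f4 (v) values (3.5e38);       -- past the maximum : INSERT fails, no silent conversion to infinity
insert into t_f4 (v) values ('+Infinity');  -- the explicit infinity string is accepted
```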

> Also, the representation of numbers is inherently lossy - it's naïve, at best, to expect a number with a bazillion digits to be kept exactly as you typed it. In fact, the documentation is explicit about how many significant digits are to be expected for float4 and float8 - way less than what you're experimenting with.

Yes, floats are inherently inaccurate. I believe (but I didn't investigate and prove it) that this is why it is possible to store numbers somewhat larger than the theoretical maximum - the number is being converted, by floating point inaccuracy, into the theoretical maximum and so can be stored.

I'm not sure what you mean about the test method using less significant digits than those supported. The test method is actually using normal notation - plain integers, written out in full, all 38 digits for `float4` and 308 digits for `float8` - and is always testing with the maximum possible number of digits. It turns out you can pass such fully-written out integers to `INSERT` for `float4` and `float8` and it works.

> Regarding drivers - it's well known that different drivers do different processing before sending data to a datastore, and not only with Redshift. Also, the docs call out in multiple places that different drivers may produce different results.

The official Redshift docs?

> Heck, even sending queries with different drivers result in having to recompile the queries.

Yes, where the communication API compiled into the binaries Redshift produces from the SQL query varies.

> So, again, nothing unexpected nor undocumented here.

Mmm. My point was that the mechanisms for connecting to Redshift are buggy. That's something we expect from experience with all software, but still... speaking for myself, it caught me out, here. I did not expect or think about the particular connection mechanism actually distorting the results I was getting when trying to find minimum and maximum values. I thought they would be more passive, and also more reliable.

I was particularly surprised by `psql` *modifying* the data stored - converting for example the date `4714-12-31 BC` to `4713-12-31 BC` - that to my mind is certainly out of the ordinary and I'm pretty sure is undocumented :-)

2021-09-16 06:27:37

I may be wrong, but I think maybe you are not aware of how much local SSD there is, or how it is operated.

On a single `ra3.xlplus` node for example there is 932 GB of local SSD, but the node allows 32 TB of store in S3.

All data on the SSD is a duplicate of data in S3; the only time there is data on the SSD which is not in S3 is the very brief period when there has been a write and the data has not yet had a copy made in S3. That copy is always made, regardless of whether the data is remaining on SSD for immediate continued use.

In short, the SSD is a local cache of S3.

2021-09-17 07:12:58

> When data is loaded into Redshift with ra3 nodes, the data isn't offloaded to S3 managed storage until it becomes cold. If it stays hot, it stays in the ssd storage. Data in our cluster is generally hot for close to a day with constant ingestion and querying.

On one hand I think this may well be so, because it's possible to write to SSD much more than you can write to S3, so trying to keep S3 up to date with the SSD on like a per-write basis obviously isn't going to work, so you have to buffer up; on the other hand, the k-safety for RA3 nodes is *entirely* provided by S3, and so I think it will be constantly making consistent copies, even if they become out of date immediately.

I think it is the automated snapshots (mandatory for RA3) which are the basis for k-safety, and I would think those snapshots continue to be produced on an ongoing basis, no matter how much disk activity you have.

I very much want to examine the RA3 nodes to determine their characteristics. It will be a lot of work - I'll need to fully populate the SSDs on two ra3.xlplus nodes, and that's 932 GB each. I'll then need for example to start loading data into them as quickly as I can, and I'll need to see if I can find a way to know how quickly data is moving from SSD to S3 - I'm afraid that information is not available in the system tables, so I'll need to find a way to find out indirectly. It's not going to be easy, and it's going to be a lot of work.

> Our tables are so large, that vacuum operations cannot complete, and we've been advised by AWS to break out tables into a time series conveyer belt of daily tables. Those daily tables most likely don't hit S3 until the next day.

But automated snapshots continue being made, right?

2021-09-27 18:11:07

Not sure what to make of the upvotes - are people approving of the project, or are they signalling approval of its departure? =-)

2021-10-02 21:52:34

PL/pgSQL is a *nightmare* to work with. It's the worst programming language I've used. I am of the view that for this reason *alone* you would *never* use procs, *ever*, unless you actually *had* to; so we're talking mandatory reasons only.

Security is one possible mandatory reason. Another would be functionality only available in Redshift via a proc (can't think of examples off-hand for data processing, as opposed to admin work).

As a relevant aside, note proc performance if I remember correctly is about twenty times slower than a query (and Python function performance about twenty times slower again).

2021-10-02 22:04:10

About two or three years ago I used Matillion for six months, with a client.

Now, it may have changed since then, but based on my experience at that time, I would advise you in the *strongest possible terms* to avoid Matillion *at all costs*. It was categorically unfit for use. I wrote a document about some of the major issues - let me keep it short and tell you just one; the files Matillion saves are JSON, and they are *fully re-ordered on every save*. Every single line changes. This means only *one person* can work on a given Matillion document at a time, because it's impossible for source control to merge work from two people - because every line has always moved. We had to have the "Matillion Hat", and whoever had the Matillion code checked out was the only person who could work on Matillion.

2021-10-02 22:08:37

> The data from redshift is referenced by various BI platforms

You can only do this if you have an aggregate table per dashboard, about 100k rows tops, and a dashboard uses and only uses that one table.

If you do not do this, and you let BI tools see many tables in Redshift and issue queries, you will be issuing queries which are for Redshift - a sorted relational database - absolutely insane. Completely and utterly hostile to the technology.

Redshift is *NOT* repeat *NOT* a general purpose database.

It has a pretty narrow use case; it offers timely SQL on Big Data *when and only when operated correctly*. It offers nothing else - no lots of users, no lots of queries, no small data, nothing. Certainly not BI tools. I've yet to meet a client who knew this. The problem, in my view, is Amazon; there is no meaningful information about Redshift, and AWS staff are trained to think Redshift can do anything.

2021-10-25 04:22:38

I actually have a white paper about query compilation just about ready to go; it may well be out next week.

In some regions (I've tried a bit to find a list in the docs/press releases, but no luck), query compilation is offloaded from the leader node, and in those regions, that issue for BI tools has gone away, at least as much as it can; query compilation there takes 5 to 7 seconds, regardless of the number of compilations. In other regions it's a few seconds per compilation, and BI tools can as you know issue pretty crazy queries which require a lot of compilation.

The key problem with BI tools is the SQL generator.

The SQL issued is no good *normally*, and it only works in two situations, the first being that the data is so small it is overwhelmed by the hardware, the second being the use of extracts (which is a Tableau concept, I'm guessing other BI tools may well offer similar functionality).

That SQL, already normally no good, is death and ruin on sorted relational databases - the data volumes are now so large the data overwhelms the hardware, and the only way to get timely performance is to issue queries which are correct for the sorting orders of the tables, and the SQL generator has absolutely no idea this issue even exists.

The other problem with BI tools is that they're running on a single machine, with a single network connection to the database; they also need to make sure they don't try to bring too much data back from the database.

So the problems are; avoid the SQL generator, and ensure not too much data comes over the network (which also means not too much data will have to be handled by the client).

1. If you use normal BI tools, for each dashboard make an aggregate table with about 100k rows, and have the dashboard use and only use that one table (there's a sketch of what I mean just after this list). You lose a lot of flexibility, but actually you did not really have that flexibility in the first place, because performance was so poor the system wasn't viable. (Also, if you're using Redshift, you *must* run in a region which supports query offload).

2. If you use normal BI tools, hand-craft an SQL query which the tool will use to generate the output it uses for the dashboard. That way you can ensure the query is correct - but this of course now brings you properly into the constraints and restrictions involved in sorted relational data, which is that when you select the sorting orders for your tables, you are also inherently selecting those queries you can run and those you cannot, since the queries must match up to the sorting orders of the tables; the problem being making it so you can run *all* the queries you want.
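
As a minimal sketch of option 1 (all names here are made up; this is the shape of the thing, not a recommendation of these particular columns): the aggregate is rebuilt by the ETL from the large, sorted fact table, using hand-written SQL which honours that table's sort order, and the dashboard reads only the aggregate.

```sql
-- A sketch of option 1; table and column names are hypothetical.
create table daily_sales_agg (
    sale_date        date          not null,
    region           varchar(32)   not null,
    number_of_sales  int8          not null,
    total_sale_value numeric(18,2) not null
)
sortkey( sale_date );

insert into daily_sales_agg
select sale_date,
       region,
       count(*),
       sum(sale_value)
from   sales                                     -- large fact table, sorted on sale_date
where  sale_date >= dateadd(day, -365, sysdate)  -- honours the fact table's sort order
group by sale_date, region;
```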

If you have Big Data, and you want SQL, then you must have sorting; there's no other way. If you have sorting, you cannot have SQL generators, because they don't know about sorting.

As such, with relational data sources, BI tools are fundamentally small data applications. To make them work, you must find a way to present them with small data, or control their SQL such that they emit SQL which correctly operates a sorted relational database.

There's no way around that, that I know of. There is no way, with current technology, to directly use Big Data in a timely manner with SQL without sorting.

2021-10-25 10:39:38

I would advise you in the strongest possible terms *not* to use materialized views. You cannot select column encodings, and the only VACUUM they receive is through auto-vacuum, and auto-vacuum appears wholly ineffective.

https://www.amazonredshiftresearchproject.org/white_papers/downloads/materialized_views.pdf

https://www.amazonredshiftresearchproject.org/white_papers/downloads/auto_vacuum.pdf

2021-11-06 14:25:58

I may be wrong, but it seems not unreasonable to take issue with the fact that new accounts need so much work done to them to make them secure, while the account owners are then blamed for not securing their accounts.

2021-11-06 14:36:10

I activated AWS 2FA and came within a hair's breadth of permanently losing access to my account. The AWS recovery process did not work for my case, and there was no other way forward. I was able to get back in purely by luck, with no help of any kind from AWS, and I instantly disabled 2FA, as it was for me profoundly more dangerous than the threat it was meant to protect against.

2021-11-27 16:53:59

Be aware `bool not null` is 1 bit per value.

`int4 not null` is 32 bits per value.

2021-11-05 19:29:21

I've never used Python for this. I use psql, from the command line. You can give it a script, and it will execute the contents of the script.

2021-12-30 18:02:25

There are two caches I know of; the query result cache, and the compiled segment cache (which, as far as I can tell, has two layers - one local to your account, and one more distant, slower, and presumably larger - which fits informal information I've been given in the past, that originally the cache was on a per-account basis (so one cache for all Redshift clusters in your account), but then became global).

I'm *guessing*, since you mention database queries, you have in mind the former, and from how you write, are perhaps not aware of the latter.

The query result cache is surprisingly small. When I investigated it (maybe a year ago) it was only about 16 MB, on dc2.large nodes. I guess it's stored in memory on the leader node.

You cannot control the cache at all. When you issue a query, if the results are considered "small enough" (no definition given), and if they meet a range of other criteria (some of which are given in the docs), the results are cached.

You cannot use it to store your own data or key/value pairs. There is simply no mechanism to do so.

2021-12-30 18:04:08

No. There is no mechanism to do so. If you want to do this, use a table. However, single inserts/updates are expressly and explicitly extremely bad news for Redshift, given how it works, so this is not a good idea.

2021-12-31 01:56:35

> Materialized views are implemented as a normal table, a normal view and a procedure, all created by the CREATE MATERIALIZED VIEW command, where the procedure is called by the REFRESH MATERIALIZED VIEW command and performs refresh. The table is created by CREATE TABLE AS, which is why column encodings cannot be specified. The encoding choices made by Redshift are extremely poor. A full refresh makes a new table, populates it, and uses table rename to replace the existing table. An incremental refresh uses an insert followed by a delete, using the system columns deletexid and insertxid to keep track of which rows have changed, and as such runs a full refresh when any of the tables used by the materialized view have either manually or automatically been vacuumed, as vacuum resets the values in the deletexid and insertxid columns and so invalidates the book-keeping information held by the materialized view, this information being stored in extra columns in the materialized view, one plus one for every table used in the materialized view SQL. The table underlying the materialized view is never vacuumed, except by auto-vacuum, which I suspect runs so infrequently as to be inconsequential. Auto-refresh is an undocumented black box, likely subject to ongoing unannounced change, and its behaviour is unpredictable. On a small single column table on an idle cluster refresh occurred normally after about 55 seconds; with one row inserted per second, refresh occurred after between 54 to 1295 seconds (twenty-one minutes).

https://www.amazonredshiftresearchproject.org/white_papers/downloads/materialized_views.pdf

I advise you, in the *strongest* possible terms, *never* to use materialized views.

2022-01-01 14:10:50

The OP is about materialized views, and so is the reply.

Where did caching come from, and what caching do you have in mind?

2022-01-01 21:09:00

As far as I see it, that - create your own table and populate it - is what has to be done.

2022-02-04 13:19:59

It's a bug in EXPLAIN. In some circumstances, the cost is uninitialized, and so you see staggeringly large values (trillions of times the number of atoms in the Universe). I tried to tell Support. Someone however had told them explain plan costs could be large values, and I couldn't get past that : they couldn't entertain the idea that it could be a large value *and also be a bug*, because before coming to that point they already had an explanation. Support however in my view do not understand Redshift, or really *think*; this is by no means the only issue of this type which I've run into with them.

2022-02-04 13:20:47

BTW, don't use EXPLAIN. It omits so much information that you will be misled. You can go from what's really happening to an EXPLAIN plan, but you cannot go from an EXPLAIN plan to what's really happening.

2022-02-04 13:23:39

Ah finally just FYI, the costs in Redshift are arbitrary and usually large numbers. They are not repeat NOT connected to the actual time taken to perform an operation. For example, reading one row has a cost of one, but a sort step has a cost of I think it was one billion, plus one per row - so if we took costs as real, then we'd be saying reading one billion rows costs less than sorting one row, which is not the case.

Postgres does its best to make costs realistic, but Redshift abandoned that approach.

2022-02-04 13:23:56

No indexes in Redshift.

2022-02-04 13:25:20

Seconded. Redshift is superb, but when and only when operated correctly, and you need a lot of knowledge to operate Redshift correctly, and that knowledge is not available, because AWS do not publish it. Snowflake is the alternative, which is fully automated, but, by this, is also less efficient than Redshift *when you know what you're doing*. However, when you *don't* know what you're doing, Snowflake is miles better.

2022-02-04 19:57:31

I hope you hired carefully. Where AWS publish no meaningful information about Redshift, not all, but very nearly all, people who claim to know Redshift have superficial and incorrect knowledge. They have *used* Redshift, and built ETL on it and so on; but that's utterly and completely different to understanding Redshift and knowing how to operate it correctly.

However, as long as you have small data, rather than Big Data (and assuming your queries are not pathological), none of it will matter, because the hardware you have will overwhelm the data. It is only when you move into Big Data, and the data overwhelms the hardware, that correct operation becomes necessary and its absence obvious.

2022-02-06 18:35:05

I posted with two images attached. They appear not to be present in the post. Any ideas why not?

2022-02-13 13:44:40

Redshift will not help you.

It is a very specific, narrow use-case database. It is not a "big Postgres"; it is an *utterly different* type of database. It is for and only for Big Data, and it provides timely SQL on Big Data when and only when operated correctly. It is *not* for many users, small data, many queries, BI tools.

To operate Redshift correctly is extremely knowledge intensive, for both the developers and the users.

If you do not operate Redshift correctly, and by the sounds of it you will not, it will mean sorting is not being used correctly and so Redshift performs in the same way as Postgres. All you get which Postgres does not have is the cluster, but you also do not get indexes or the enormous additional functionality in Postgres.

If you wanted a sorted relational database, but you do not have the necessary knowledge to use Redshift, then use Snowflake.

2022-02-13 13:45:26

What are you doing about VACUUM?

2022-02-26 16:09:09

Redshift is not inherently slow. The problem is that sorted relational databases offer timely SQL on Big Data when and only when operated correctly, and no one knows how to use Redshift correctly, because AWS publish no information about how to do so.

Using Redshift correctly is *EXTREMELY* knowledge intensive. It is also a narrow use-case database; it is not, in any way, shape or form, a general purpose database. I would advise the OP, in the *strongest* possible terms, since they lack this knowledge, to use Snowflake, which is largely automated.

Redshift, when operated correctly, is much better than Snowflake; but Redshift, when operated incorrectly, is much worse than Snowflake.

2022-02-26 22:38:11

The article linked to is - like all the blog articles - utterly superficial. It says nothing about the core issues, only about trivial stuff which anyone can understand in a paragraph. All of the blog posts are like this, and many are relentlessly positive, and many obfuscate all kinds of issues and weaknesses. IMO, those posts are harmful, *particularly* so to people who are unfamiliar with Redshift.

2022-02-26 22:40:09

Not to my knowledge.

I run a web-site where I publish my investigatory work. URL in my profile.

Fairly soon now I'm going to write a white paper, "Introduction to Redshift Fundamentals", which will lay out the actual, real core issues, which are all to do with sorting.

2022-02-27 23:09:24

Don't hold your breath though. It will be a month of work to write, and I have an AMI to release first.

2022-02-27 23:14:09

I'm an IT contractor, so I work about three months every two years; the rest of the time, for the last four years, I've been investigating Redshift, full-time, by writing code to actually see what it does (rather than, for example, what the blog posts say it does). I'm pretty certain no one else has done that. I've also been using Redshift since the day it was released, and I've written my own database from scratch, so I'm in a good position to figure out what's going on.

2022-03-13 18:21:10

Redshift is a sorted relational database. It provides and only provides timely SQL on Big Data - nothing else, not lots of users, not lots of queries, not small data, not fast data and in particular *not* BI tools - and it does so when and only when operated correctly. When operated incorrectly, Redshift doesn't even provide timely SQL on Big Data.

It is only possible to use Redshift with BI tools if you create an aggregate table of about say 100k rows, and have each dashboard using and only using that single table.

2022-03-13 18:59:49

Broadly speaking, when it comes to data, we are in one of two situations; first, the situation where the hardware (processor, memory, network, disk) overwhelms the data.

In this situation you can do basically anything you like; it just doesn't matter. Everything takes at most seconds.

Second, though, is the situation where the data overwhelms the hardware; and in this situation, only sorting can save you, and for sorting to save you, it must be operated correctly.

2022-03-14 07:40:55

This is incorrect.

Redshift is a relational database. It uses relations - tables - in just the same way every other relational database does.

I have the feeling you may be mixing up normalization with relational.

As an aside, by and large, normalization is still widely practised in a sorted relational database. It can be done perfectly well, and is vital for data quality.

2022-03-14 10:08:47

Sorry, _Oce_, but this is incorrect, and fundamentally so.

That's not a problem, I'm not here to try and save the world from incorrectness :-) but I do by this reply want to try to ensure other people who read this do not end up thinking it is true, and so be misled.

2022-03-24 20:12:06

750k records every 30 minutes - means on mean average 416 queries per second (assuming one query per record).

Redshift is designed to execute at most about ten concurrent queries, each of which is working with Big Data.

It is not designed for hundreds of queries per second with tiny data.

You are, I would say, using the wrong type of database.

2022-04-20 22:48:34

When writing SQL, depending on what you want to do, you could be interested in a number of different "current times". You might want the time the current transaction started (which would always be the same for every query and for every occurrence of asking for that time). You might want the time for the start of the current query; or you might want the time when you actually call the function which returns that time.

These "current times" are orthogonal. You cannot replace one with another, and so all must be provided.

`sysdate` provides the time of start of the current transaction (and so, here, note your description of calling `sysdate` many times and receiving a different timestamp each time is inaccurate; this is not what happens).
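
A minimal sketch of what I mean by this (run it yourself to see):

```sql
-- Both selects run in the same transaction, so sysdate returns the same
-- timestamp for both, however far apart in wall-clock time they execute.
begin;
select sysdate;   -- transaction start time
select sysdate;   -- same value again, even minutes later
commit;
```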

My general point in the blog post is that these three orthogonal current times are not provided. We have, as it were, one and a half of them, and that's all (and then we have the undocumented functions, which do not help, but it's good to know what they are).

2022-04-20 23:43:17

> in my years working as a DBA and with Redshift I cannot recall a single instance when I needed the current time in my query. Can you provide an example of what you would use this for?

I maintain a set of replacement system tables. There are a couple of cases where I need the current time, for example, to figure out how long a currently running query has been running.

Most recent client had one or two use cases where in a similar way there was a need to compute a duration from a start up to the current time.
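
A minimal sketch of the kind of thing I mean, using `STV_INFLIGHT` (with all my usual caveats about the system tables); `sysdate` stands in for "now", since this runs as a single-statement transaction:

```sql
-- A sketch only : how long each currently running query has been running.
select query,
       min(starttime)                            as starttime,
       datediff(second, min(starttime), sysdate) as running_for_seconds
from   stv_inflight
group by query
order by starttime;
```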

> Redshift is meant to be an analytical warehouse-type system, not like an RDBMS. If you want times when queries started there are other system views for that.

I am very strongly of the view that the Redshift system tables are an *unbelievable* mess. A total catastrophe. I think they are not safe to use, because the information they carry is barely - and often incorrectly - documented, and they display a great deal of strange, unexpected and often inexplicable behaviour, which the inexperienced will not know of and will be led to failure by. I regard the system tables as completely unfit for purpose. They are the reason I've spent years developing a replacement set of system tables; because I did need to know what was going on, and I could not find out from the system tables.

> If you have an actual requirement, I'd love to hear it. And if you are working with Redshift in your position of employment you can always contact your AWS TAM and raise a PFR.

This assumes competency on the part of the dev team and its management. I am very strongly of the view both lack real-world experience, and as a part of that will be unable to effectively judge feature requests. I came to view some time ago that talking to AWS in any way was futile.

2022-04-23 09:18:15

Note that Redshift is a sorted relational database and as such, although it like all relational databases offers SQL, the internal implementation of the database is *profoundly* different to an unsorted relational database and as such although you can *issue* any query you like, the set of queries which will complete in a timely manner are *profoundly* different to the set you are accustomed to.

In fact, I am of the view Redshift has quite narrow use-cases, and requires extensive knowledge of how it works on the part of both the developers *and* users.

To put it another way; Redshift offers and only offers timely SQL on Big Data, nothing else - not lots of users, not lots of queries, not small data, not fast data - and it offers this when and only when it is *operated correctly*, which is an extremely knowledge intensive task.

When operated incorrectly, Redshift is essentially identical to Postgres, but without indexes, with the cluster, and with about 10 years of new Postgres functionality missing.

I may be completely wrong but the impression I have from your post is that there is no awareness of these issues. Redshift is - as is *always* the case - simply being seen as a higher-performance conventional database.

IT IS NOT. It is COMPLETELY DIFFERENT to a normal database.

BI workloads are utterly hostile and inappropriate for a sorted relational database (there's one method where you can to a certain extent manage this, but at high cost). ML I'm not familiar with; it might be something which could work well, if the knowledge to implement them correctly on Redshift is available.

2022-05-05 15:34:44

"Redshift converts SQL queries to C++ and this compiles to multiple binaries (known as segments) which are then distributed to, and executed by, the worker nodes. Historically, and still in some regions, this work is performed on the leader node and then compiling a segment on an unloaded cluster typically takes about 4 to 8 seconds, where a complex query might have ten segments. More recently, a compiled segment cache was introduced, which appears by its performance and rapid expiry of segments to be on the leader node, and if a segment is in this cache, it is not compiled but served and extremely quickly (thousandths of a second) from the cache. More recently still, an ancillary system was introduced, the compile off-load system, which off-loads compilation work from the leader-node. This system appears to have its own cache and when a query is compiled, if one or more segments are not in the cache, then there is a slow first segment (typically the usual 4 to 8 seconds), during which all segments are compiled in parallel, and after this the leader node the obtains segments from the compile off-load system at the rate of about a tenth of a second each. However, if all segments are in the compile off-load segment cache, then the slow first segment does not occur. In all cases, increasing the number of concurrently compiled queries increases compile time, but in a somewhat better than linear fashion. Finally, a serious vulnerability was discovered, where by a normal unprivileged user can instantly crash Redshift. This is achieved by creating an empty table with a sufficiently large number of columns, and then issuing a query which has one window function per column, the core issue I suspect actually being simply to issue a query with between (inclusive both ends) 980 and 1664 segments."

https://www.amazonredshiftresearchproject.org/white_papers/downloads/query_compilation.pdf

2022-05-11 14:14:25

Redshift can be used when you have Big Data and a UI offering *controlled SQL queries only* (so you write them and in particular you control the joins), which need to process that Big Data and return a result in a timely manner to the user.

Remember that you will only be able to allow queries which are viable giving the table sorting order you have chosen.

If none of this is familiar, you should not be using Redshift, because you will not be able to operate it correctly and in that scenario it is *never* the right choice - there are other solutions which do better.

Also remember though you can only run a few queries at once, so this is a not a many-user, many-query solution.

I am not familiar with Athena.

2022-05-11 14:16:35

That interface may well be ideal but the underlying database engine is designed for different use cases. You can stretch it to some web-serving, but with particular constraints and restrictions on use.

2022-05-11 14:18:23

But note that serverless is expensive.

I've not investigated serverless, but my expectation is that it's much more show than substance. Redshift stuff almost always is. To my eye the only really sound new functionality introduced in the last five years was compile offload.

2022-05-11 14:19:17

Spectrum is tricky to use. If you mess up the query - issue SQL Spectrum cannot handle - you'll get a lot, even all, of the data in the query being brought into Redshift proper, for the query to execute there. This can kill a cluster, no problems.

2022-05-11 16:09:07

Data sharing probably means a full local copy of the remote table. I've not investigated yet, but that's how cross-database queries work. This is crazy.

Concurrency scaling is crazy, from first principles. If it's needed, Redshift is being operated incorrectly in the first place - Redshift scales by the queries being efficient - and CSC is an expensive band aid. It should *never* be in use.

Query editor, production systems do not use this functionality =-)

2022-05-29 22:16:50

I used workspaces with a client.

1. Copy'n'paste was disabled. This led to idiotic work-arounds using a local Teams, and a Teams in the workspace.

2. The workspace would reboot itself, without warning, every week or so. Any work you have open is lost.

3. The Linux you get is to my eye very sparsely provided with packages.

4. I don't like MATE at all.

All in all, you can use it, but expect a non-trivial productivity hit.

2022-05-02 23:48:01

Redshift is not suitable for all of the data sanitization / validation work.

You need to do at least some of it on the CSV files *before* you load them.

Simple example : Redshift only understands a single case-sensitive string for "NULL". What happens if you start getting a CSV where someone is using "NULL" and "null"?

You have to pre-process before loading.
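
To illustrate (bucket, path and IAM role here are made up): `COPY` takes a single, case-sensitive null string, so rows using any other spelling of null will not be treated as nulls, which is exactly why pre-processing is needed.

```sql
-- A sketch only; names and the IAM role ARN are hypothetical.
copy my_table
from 's3://my-bucket/incoming/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
delimiter ','
null as 'NULL';   -- "NULL" matches; "null" does not, and will fail or load as text
```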

It's also generally a lot easier. SQL is not great for arbitrary data validation, not least because it requires data which is valid for SQL *in the first place*, so that you can load it at all and then use SQL for validation...

2022-06-18 06:35:12

Redshift provides and only provides timely SQL on Big Data.

No lots of users, no lots of queries, no lots of inserts, no lots of small data, etc. Nothing else, in fact.

2022-06-02 21:17:44

No use if the query takes longer than 15 minutes, as it might, given this is a Big Data database.

2022-06-09 22:21:42

This is absolutely and categorically the wrong way to use a sorted relational database; Power BI has no idea about sorting and is issuing queries which are incorrect for the tables.

There's also the question of what's being done about `VACUUM`; you do not mention it, so I guess you're relying on auto-vacuum, which as far as I can tell does nothing. This doesn't actually matter, because the BI tool is issuing incorrect queries anyway; but it would be another and profound design consideration if Redshift was being used correctly.

You are getting away with it because the data volume is small compared to the hardware; the hardware is overwhelming the data. It will not scale, and as data volumes rise, you will run into a brick wall.

2022-06-20 18:40:05

To be sure, you appreciate that Spectrum supports a fairly limited set of SQL functionality? any work Spectrum cannot handle (and that may perfectly well mean the entire query) will see *all* the data being brought into the cluster, for Redshift proper to perform the processing.

This then for timely SQL requires a cluster sized appropriately to your data.

Also in general it means you need to understand how to write SQL such that it executes primarily or wholly in Spectrum, not in Redshift.

2022-07-04 14:48:40

A VPC is a set of IP addresses. A VPC - and so its IP addresses - belong to a region. If you want to have your cluster using those IP addresses, it necessarily must then be in that region.

2022-07-05 17:49:06

VPC setup in AWS is a living hell. It is a nightmare. It is a stinking, sinking swamp miles from anywhere in the pitch black of night.

The documentation is worse than useless, because it contains errors and misleading, confusing information and statements. Do not read it.

It took me a *week*, FULL TIME, all day every day, to figure out how to start Redshift in VPC.

If you use the default VPC stuff life is easier, but this is a dangerous solution because it is fragile veneer over a catastrophe. Moreover for me it was no use, because I was writing code which needed to bring up VPC for arbitrary clients. It turns out there are no less than *five* fundamentally different VPC situations client accounts can be in, and you have to know of them all, and handle them all, to have a reliable solution.

VPC networking in AWS is a complete and utter catastrophe.

2022-07-12 15:31:50

Sort-of rhetorical question : imagine you find a bug in your ETL code, or in your data, or in Redshift, and you need to reload either 25%, 50% or 100% of your data.

How will you handle these scenarios?

The general point here is : are you making a one-way ETL system? data flows inwards, and that's all that's done? such systems to my eye are fragile. They do not cope well with any need to reload data, and rather assume this will never occur (which is to say, they assume there will be no bugs).

2022-07-12 19:02:54

You can't validate or sanitize properly inside Redshift. Firstly, if the data is borked, you won't be able to load it to sanitize it (which means you can't ignore only the rows which are broken), secondly, SQL is not strong at sanitization. You need a general purpose programming language for that.

2022-07-12 19:04:22

I may be wrong, I've hardly touched Glue, but I don't think Glue handles (or is intended to handle) this at all. Glue takes input data and emits output data, all in S3. *You* have to handle loading (or reloading, or partially reloading) to Redshift.

You mention using Glue and Spark, but - and I could be completely wrong, hardly used Glue - that seems a duplication.

Glue will emit to S3, so you need to pre-process to sanitize and validate, before you can use COPY to load.

But to pre-process and validate you need a Big Data capable system (in case you need to reload all your data from scratch because of a bug - if it takes a month to do so, because you have Big Data but a small data pre-processor, you're toast).

Spark or EMR will be fine for this, both scale to Big Data - but then if you're using them, why use Glue? you can do what Glue does in Spark or in EMR.

2022-07-15 08:23:27

Redshift is centered around max-min pruning. Make the year the first sort key and put all the data in a single table. When you query, specify the year. You will read and only read the blocks which are in that year.
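
A minimal sketch of what I mean (table and column names made up):

```sql
-- With the year leading the sort key, a predicate on the year lets min-max
-- pruning skip every block outside that year.
create table orders (
    order_year  int2          not null encode raw,   -- leading sort key column
    order_date  date          not null,
    order_value numeric(12,2) not null
)
diststyle even
sortkey( order_year, order_date );

select sum(order_value)
from   orders
where  order_year = 2021;   -- only blocks whose min-max range covers 2021 are read
```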

Breaking these rows out into separate tables is completely and utterly unnecessary, from a performance point of view.

2022-07-27 20:08:14

You're performing a cross-database query. This will make a complete copy of the remote table in your local database and then query it. A good start would be to avoid cross-database queries.

2022-07-15 08:26:11

Redshift is for data volumes of *at least* 1 TB. If you do not have this, then for the love of God do not use Redshift; it requires deep and extensive knowledge on the part of both its developers *and* its users, to operate correctly, and when operated incorrectly, it makes no sense; there are better choices in that situation.

As it is, I suspect your data volumes are low. Use Postgres.

2022-07-16 21:49:33

I'm puzzled by the use of "data warehouse" here.

You need a database, no? with SQL. I'm wondering what you have in mind when you say data warehouse, rather than database.

2022-07-16 21:51:46

Sanitization and validation in Python (SQL is not great at this - obvious problem; if data is broken enough you can't load it into a table, with its hard data types, how do you then sanitize/validate?)

Actual data processing in SQL, once you know the data is sane. Do not use a programming language for data processing, you end up with all of its issues in your data processing. SQL has far fewer issues.

2022-07-16 21:53:10

You can turn a cluster on and off at will.

You can also save a snapshot, and bring a cluster up from that.

Bring up times vary by region and by node type, from about 1 minute to about 15 minutes (means).

https://www.amazonredshiftresearchproject.org/bring_up_times/index.html

2022-07-19 17:12:24

I may be wrong, but I think you've got the wrong end of the stick.

Redshift does not offer indexing because indexing is not viable for Big Data; it's too slow, both to update and to operate.

Instead, Redshift uses min-max culling, which is a much more restrictive technique and a lot more knowledge intensive; it is used with Big Data, because indexing can't be used. It must be operated correctly, or there is no point in using Redshift - there are better solutions, such as Postgres.

2022-08-04 17:26:22

Snowflake last I knew were opaque about pricing. They charge you in an abstract unit, and then there's a ratio of that unit to USD, and it varies by customer.

My general view is that Snowflake costs quite a bit more, but you don't need the massive knowledge in devs and users required by Redshift.

Also I think Snowflake does not scale as effectively as Redshift when Redshift is being operated correctly, so Snowflake's relative inefficiency makes it increasingly expensive as the data loads become much larger.

2022-08-04 17:33:10

Nice article. Enjoyed it.

2022-08-04 17:35:16

If you have less than 1 TB you should not be using Redshift.

Also you can only use Tableau directly on Redshift when you have small data. Tableau does not scale, because it cannot operate Redshift correctly, and for Redshift to query Big Data in a timely manner, it must be operated correctly.

2022-08-04 20:46:10

With Redshift, it's not about the DBA; rather, the table designers *AND* the table *users* all must understand sorting and the impact it has on their queries. Redshift is very knowledge intensive. You *cannot* issue arbitrary queries; knowing SQL is *not enough*. You *must* understand sorting and its ramifications.

2022-08-05 12:08:39

No indexes in Redshift. They do not scale to Big Data. Ordering is *the* method which permits timely SQL on Big Data. Without it, you're unsorted (obviously :-) and your performance is then of the same type as Postgres et al, i.e. max of about 1 TB.

2022-08-05 15:21:22

Much the same as any database, I would think - I can't see implementations varying widely.

The one exception to this is that if the sorting order of the table matches the concatenated partition by and order by, the window function is free, because the table is already sorted in the way the window function needs.
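
A minimal sketch of what I mean (names made up): the table's sort key, and the window function's partition by plus order by, concatenated, are the same, so no sort step is needed.

```sql
-- A sketch only; txns is a hypothetical table.
create table txns (
    customer_id int8          not null,
    txn_date    date          not null,
    txn_value   numeric(12,2) not null
)
sortkey( customer_id, txn_date );

select customer_id,
       txn_date,
       sum(txn_value) over (
           partition by customer_id
           order by txn_date
           rows unbounded preceding
       ) as running_total
from   txns;
```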

2022-08-22 21:28:12

1. What are you doing about VACUUM?

2. What happens if/when you discover a bug in the data, or data processing, and need to reload all data? what happens if the processing time for this exceeds the 15 minute limit of a lambda function?

2022-08-23 03:56:16

Regarding #1 - you can run one VACUUM at a time, per *cluster*. For sorting to work, tables must be up to date with VACUUM. You are in a producer-consumer scenario : all the data you load is producing unsorted blocks, and VACUUM sorts them. You cannot produce more unsorted blocks than you consume. VACUUMing a single large unsorted table (say, 1 TB) can take a day. The extent to which you produce unsorted blocks depends on quite a few factors, and is related to how you load data and how you modify data.
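
A minimal sketch of how I'd keep an eye on the producer-consumer balance, using `SVV_TABLE_INFO` (with all my usual caveats about the system tables); a steadily rising `unsorted` figure means unsorted blocks are being produced faster than VACUUM consumes them.

```sql
-- A sketch only : percentage of unsorted rows per table.
select "schema",
       "table",
       unsorted,     -- percent of rows unsorted
       tbl_rows
from   svv_table_info
where  unsorted is not null
order by unsorted desc
limit 20;
```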

Regarding #2 - is there a situation where due to the amount of data being large a lambda will time out? if that is so, then the ETL system cannot perform reload. Also, will a full reload occur in a timely manner? can your business survive the amount of time required to perform a full reload?

2022-08-23 07:32:05

I've only used Support for Redshift, which I did for a year before giving up the support contract, and I'm afraid this was not my experience.

2022-08-23 19:58:11

Not sure what you mean?

I was working with Redshift and contacting Support with questions and observations and at times bug reports. I had bug reports which could not be made because the support chap I was speaking to could not grasp the issue. After the year, I came away with the view Support do not really understand Redshift. I felt they knew about it, but they didn't understand it, if you get my drift.

2022-08-27 14:52:22

Only when used correctly. Redshift is never used correctly, because AWS tell clients it can do anything, so they use it like a general purpose database, and have a terrible time, and then move on to Snowflake.

2022-08-27 14:53:02

> run hundreds of complex, time sensitive workloads in parallel and it would fall flat on it's face trying to get it all done

Jesus =-) see my previous post in this thread.

2022-08-27 14:54:10

Interesting comment about pricing. I hear the opposite. In particular I hear (and I can understand why) Snowflake scales less effectively than a correctly and skillfully operated Redshift, and so becomes increasingly more expensive as data volumes become increasingly large. Snowflake will always beat an incorrectly operated Redshift (and everyone operates Redshift incorrectly).

2022-08-14 12:43:00

Loading 10k files in a timely manner is problematic due to S3 access overheads.

You will need to amalgamate these files - ideally into one equally (or as equally as possible) sized file per slice - before loading. \*Not\* into one file, as all slices load in parallel. You do not want one slice only loading your data.
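A minimal sketch of what such a load might look like - the bucket, manifest, role ARN and options here are all hypothetical:

```sql
-- Hypothetical load : files pre-amalgamated into one object per slice,
-- listed in a manifest, so every slice loads in parallel.
COPY my_schema.my_table
FROM 's3://my-bucket/load/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
MANIFEST
GZIP
DELIMITER '|'
COMPUPDATE OFF
STATUPDATE OFF;
```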

Also consider the situation when you discover bugs in your ETL or data processing, and need to reload *everything*. If you then have many hundreds of thousands of files to load, you're toast. You will not be able to do it in a timely manner, and that might be the end of your business.

2022-08-14 13:05:45

There's more to it than you realise, and what's actually happening is different to what you think.

AWS publish no meaningful information about Redshift, so it's impossible for you to know.

I have to head out, I'll write a post later or tomorrow.

2022-08-22 07:27:57

I would advise you in the *strongest possible terms* not to use Redshift.

Redshift is *extremely* knowledge intensive, for both the developers *and* the users, and in the absence of that knowledge you will have a terrible time and in particular you \*will\* construct a system which cannot scale, which defeats the entire point of using Redshift. If you want SQL which does not scale, use Postgres; it's a billion times better.

Use Snowflake.

If the knowledge is available to operate Redshift correctly, Redshift is better than Snowflake (more efficient, so cheaper pro rata and scales better) but if that knowledge is lacking, as it appears to be here, Snowflake is infinitely better.

You've been warned :-)

2022-08-22 10:32:16

Sorted relational databases provide timely SQL on Big Data when and only when operated correctly.

To operate a sorted relational database correctly you must understand the impact and consequences of sorting : it is *not* enough to know only SQL.

If you do not understand the impact and consequences of sorting, your queries will of course execute - SQL is SQL - but they cannot be timely on Big Data.

Using an incorrectly operated sorted relational database makes no sense. There are better choices in that situation.

Also note Redshift provides and only provides timely SQL on Big Data (and then, when and only when operated correctly). It does not provide lots of queries, lots of users, small data, fast data, etc. It is in *absolutely no way whatsoever* a general purpose relational database, as say Postgres is. It is a highly specialized, rather narrow use case database.

2022-08-22 14:27:24

That's a major concern, yes : for when you select a sorting order, you are defining the queries you can run in a timely manner, and therefore also defining the queries you *cannot* run in a timely manner.

You must define your sorting orders to allow all existing business use cases to run, as well as the known near-future cases, as well as trying to support the general direction the business will take in the future.

Then another major issue is VACUUM. For sorting to work, tables must be up to date with vacuum, but you can only run *one* vacuum at a time *per cluster*.

You are in fact in a producer-consumer scenario : by loading or modifying data you are continually producing unsorted blocks, and you consume them with VACUUM. You must ensure you can consume more quickly than you produce.

Then there is the question of loading and modifying data, as this usually produces large numbers of unsorted blocks. How do you go about this, in ways which minimize the number of unsorted blocks, such that you continue to consume unsorted blocks more quickly than you produce them?

2022-08-22 14:30:55

I would advise you in the strongest possible terms *not* to use the official Redshift documentation. I consider it relentlessly positive, badly written, often highly ambiguous, studious and diligent in its avoidance of *any* negatives *whatsoever*, and quite well supplied with flat-out factual errors. I consider it hazardous.

There is to my knowledge as yet no well-informed and realistic overview of Redshift. I wrote one for a previous client, and I need to re-create that document and then make it public.

I've published various white papers on Redshift (you can find them via the web-site link on my profile) but they're all on quite specialized topics.

A basic problem with Redshift is that it is knowledge intensive, AWS do not supply that knowledge, and there is no other source for that knowledge.

2022-08-25 06:09:23

Be cautious with EXPLAIN. It removes too much of what's going on and by this it can often end up being misleading. It's possible to go from what's really going on to an explain plan, but not from an explain plan to what's really going on.

Note also the costs are arbitrary. In Postgres, costs are a best-effort approximation of the actual cost of the given operation. In Redshift, this is no longer the case. Costs are arbitrary numbers. So for example a sort has a cost of, I think it is, 1 billion, plus 1 per row. A scan has a cost of 1 per row. So a sort of *one row* is considered more expensive than a read of *one billion rows*. I guess this was done to rather brutally control the optimizer, and I consider it a mistake.

Also note there are bugs in the costs. I know of at least one case where an uninitialized variable is used and the cost is trillions of times the number of atoms in the Universe. I did try to tell Support, but they don't really understand Redshift, and someone had told them "large numbers are expected". I literally could not get them to grasp the idea that a number might be large *and also be a bug*.

2022-08-25 08:55:58

. Are you up to date with VACUUM? do you run VACUUM manually?

. You need to provide the table DDL - otherwise I basically have to ask you questions exactly the same as you providing the table DDL :-)

Note you're not doing a join so the distribution key is unimportant, as long as rows are reasonably equally shared between slices.

The sortkeys you have will not help you, because `column1` is first in the group by. What will help you is the first sort key being the `filter_column`. Then min-max culling will mean you only scan blocks with the filter value, and so the group by only has to work on those blocks.

Right now I expect the query is reading every block in the table, more or less.
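A hedged sketch of the kind of change being described, using hypothetical column types and encodings:

```sql
-- Hypothetical re-make of the table, with filter_column first in the
-- sort key, so min-max culling (the Zone Map) can skip blocks.
CREATE TABLE my_table_sorted
(
  filter_column  varchar(32)  NOT NULL ENCODE raw,
  column1        int8         NOT NULL ENCODE zstd,
  value_column   int8         NOT NULL ENCODE zstd
)
DISTSTYLE EVEN
SORTKEY (filter_column);

-- Only blocks whose min-max range for filter_column can contain 'X' are scanned.
SELECT column1, SUM(value_column)
FROM   my_table_sorted
WHERE  filter_column = 'X'
GROUP  BY column1;
```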

2022-08-25 08:56:38

Strings only use their first eight bytes for sorting. They need to be distinct within those first eight bytes, not over their whole length.

2022-08-25 08:57:37

Looks like you're using the default encoding choices.

NEVER DO THIS.

The default choices are awful.

LZO is obsolete, BTW. It's fast to encode, but slow to decode, which is the opposite of what you want with Big Data.
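If you want to see what encodings a table actually ended up with, one way (a sketch, with a hypothetical schema and table name) is `pg_table_def`:

```sql
-- Hypothetical check of the encodings a table actually has
-- (pg_table_def only shows schemas on the search_path).
SET search_path TO my_schema;

SELECT "column", type, encoding, sortkey
FROM   pg_table_def
WHERE  tablename = 'my_table';
```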

2022-08-25 08:58:26

I may be wrong, but I think there's no efficiency problem as such with the SQL - it's not doing say a crazy join - it's just that the table the query is using is set up incorrectly.

2022-08-25 09:00:39

I'm afraid I think the Redshift docs are positively hazardous. I recommend not reading them at all. They are written to obscure any weaknesses, never indicate what *cannot* be done, contain a wide range of flatly incorrect statements, I believe are *never* reviewed by technical staff, and I believe have not been updated for the fairly major internal design changes which happened about a year ago.

2022-08-25 09:01:28

Never use CTAS, because it uses the Redshift default column encoding choices, which are appalling.
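A minimal sketch of the alternative, with hypothetical names and encodings - create the table explicitly, choosing encodings yourself, then populate it:

```sql
-- Hypothetical replacement for CTAS : explicit DDL, explicit encodings,
-- then an INSERT ... SELECT to populate.
CREATE TABLE my_schema.my_copy
(
  id      int8         NOT NULL ENCODE raw,
  label   varchar(64)  NOT NULL ENCODE zstd,
  amount  int8         NOT NULL ENCODE zstd
)
DISTSTYLE KEY
DISTKEY (id)
SORTKEY (id);

INSERT INTO my_schema.my_copy
SELECT id, label, amount
FROM   my_schema.my_source;
```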

2022-08-25 11:50:52

Just a quick reply, I have to go out - did you issue a `VACUUM FULL [table] TO 100 PERCENT` after changing the sort key? and then an ANALYZE? you will need to do both for the sortkey change to matter.
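For reference, a sketch of the sequence being described, with a hypothetical table name:

```sql
-- Hypothetical table name; both steps are needed after a sort key change.
VACUUM FULL my_schema.my_table TO 100 PERCENT;
ANALYZE my_schema.my_table;
```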

Also there's the question of whether the strings in the filter column differ in their first eight bytes.

Back later!

2022-08-25 15:07:25

So, take runlength encoding.

For this to work well, there must be long sequences of identical values.

If it's used where every value differs, the column will get larger, not smaller.

So we must know how an encoding works, to know what kind of data to use it with; does it need sorted data? does it need all values to be within a given range (the "mostly"s), does it work particularly badly with unsorted data?
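A hedged illustration, with hypothetical columns, of where runlength does and does not make sense:

```sql
-- Hypothetical columns : country has long runs when the table is sorted
-- on it, so runlength suits it; visit_id is unique per row, so runlength
-- would make that column larger, not smaller.
CREATE TABLE visits
(
  country   char(2)      NOT NULL ENCODE runlength,
  visit_id  varchar(36)  NOT NULL ENCODE zstd
)
DISTSTYLE EVEN
SORTKEY (country);
```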

One issue - there are a couple, but this is an important one - is that it uses AZ64 for everything where AZ64 can be used. The issue then is this : how does AZ64 work? We need to know, so that we know when it is a good choice, and when it is a bad choice.

And no one knows, because AWS have never said a word about how AZ64 works.

So it could be a general purpose encoder, like ZSTD. Or it could be a runlength type encoder - and it is this which decides which columns we use it on, NOT their data type.

The whole approach of selecting encoding by data type is incorrect, and it is doubly incorrect because AWS have not described how AZ64 encodes, so it can NEVER be used, ever, anyway.

As it is, I investigated AZ64, and it in fact turned out to be Gorilla encoding, which was invented by Facebook in 2015, but with customizations to runlength-type encoding behaviour (and to my eye it is upon these customizations that AWS base the claim "it is proprietary"). In fact those changes are absolutely insane. They are ALL less efficient than the simple one-value-and-number-of-repetitions method, and they also have extremely dangerous failure modes where they pack billions of rows into a single block.

Gorilla encoding is clever and orthogonal to existing methods, and would be very useful in some specific cases, if it were not lumbered by these deadly incompetent changes AWS made to it. If you know exactly how it works, and so can be sure you will not trigger the design flaws, you could use it with data which has certain characteristics.

It is however absolutely *not* a general purpose encoder and it is *not* safe for general use, because of the design flaws AWS introduced with their changes.

2022-08-25 15:19:53

That choice depends on quite a lot of information to do with the character of the data in the column. You need to know how each encoding method works, so you can know which is the right choice. Usually there is one choice which is much better than the rest, and stands out.

I cannot answer this question, because I do not know about your data.

2022-08-25 15:24:01

The bytedict dictionary contains up to 256 entries, but the final entry is always used as an end-of-dictionary marker and so does not hold a value from the block. The dictionary holds a copy of the values which are placed in the dictionary and so its size varies according to the size of the column data type; varchar are stored with their full DDL length, regardless of the string length. The dictionary is stored in the block, so a large dictionary means less storage space for values. The dictionary is populated by up to the first 255 distinct values in the block (and so is affected by sorting), but is only as large as it needs to be, rather than always being the size of 256 entries. A value which is present in the dictionary is replaced by a one byte index into the dictionary. A value which is not present in the dictionary is written out in full, plus a one byte overhead. A NULL (as opposed to NOT NULL) bytedict column consumes an additional 1 bit per value, even for varchar, which normally consumes an additional 1 byte per value.
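A hedged worked example of that arithmetic, for a hypothetical column:

```sql
-- Hypothetical column : varchar(100), with 50 distinct values in a block.
--   dictionary : 50 entries * 100 bytes (full DDL length) = 5,000 bytes, held in the block
--   values     : 1 byte per row whose value is in the dictionary
--   values     : full length + 1 byte per row whose value is not
--   NULLable   : + 1 bit per value
CREATE TABLE lookup_example
(
  status  varchar(100)  NOT NULL ENCODE bytedict
);
```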

2022-08-25 15:25:36

I'm of the view the devs are not great. In particular I think they very strongly lack real-world experience. I am of the view this was simply a blunder. (However, in fairness, there is the possibility it pre-dates Redshift; it might have been something ParAccel chose to do, ten or more years ago.)

2022-08-25 15:36:34

https://www.amazonredshiftresearchproject.org/white_papers/downloads/az64_encoding.pdf

2022-08-25 15:38:49

Did you run ANALYZE as well? this is also needed.

The DDL for the table is needed. Can you post it?

2022-08-25 15:40:22

You're very kind to say so!

BTW, are you Tunisian? I used to live in Tunis, in the Bardo. Went to the museum, before the attack. Awful to imagine what happened there.

2022-08-25 16:52:37

Mm. Be careful. The docs have a number of flat-out actual factual errors in the encoding pages. Part of why I think technical staff never review the docs; if they did, those errors would immediately have been noticed.

2022-08-27 08:02:19

Aggregation means you broadly speaking need to scan a bunch of rows; in that sense order doesn't make a difference. You need to read all the rows anyway.

However, if you have a `where` clause, and so you are including only certain rows, then sorting can help enormously, as it allows min-max culling (the Zone Map) to skip the blocks which contain only rows you will not need.

Your second question ("Also is it possible to use (filter_column, column1,timestamp_column2,timestamp_column3) to first satisfy the filter predicate...") is *exactly* what Redshift will be doing internally.

2022-09-01 19:06:01

I could not agree more. Oddly enough, just working with a client now who are having exactly this issue. AWS did not tell them Redshift is the wrong choice for BI, and now they have the usual problems.

2022-09-01 19:14:02

This is a terrible idea.

There are two main problems.

1. Redshift compiles queries. Now, it used to be that compilation had been off-loaded and so compilation time was 4 to 8 seconds per query; but it looks like the devs have messed this up, and it is now often multiples of 4 to 8 seconds per query.

2. Redshift is a sorted database. Sorting is used to provide timely SQL on Big Data. However, for this to occur, sorting must be operated correctly. This means selecting sorting orders for tables such that the tables are correctly sorted for the queries you issue. When you select the sorting orders for your tables, you are defining the queries you can emit, and also the queries you cannot emit - of course, Redshift won't stop you issuing anything at all; it just won't be timely. If you are using Redshift and not using sorting correctly, you are using the wrong database; there are better choices in that situation.

In short : you've been warned.

If you do this, you will have a difficult and unpleasant experience and in the end give up.

I was approached last week by a company having exactly this problem. I've investigated their system, and explained how Redshift works and where their problems are coming from. Of course, they cannot change database overnight, so now they must move to their next product release with significant performance issues for end-users, and then redo a good part of their work.

2022-09-02 07:48:03

Less than 1 TB, Postgres.

1 to 10 or so, Exasol (unsorted column store). AWS do not sell an unsorted column store database, which to my eye is why they push Redshift so much.

Over 10, sorting is necessary, and then it's Redshift if you have the necessary skills available (they're not available), Snowflake if not.

Snowflake is just about fully automated. That's good and bad - you get what the automation gives you, but that's also the problem; you can *only* get what the automation gives you. There *are* no controls.

Snowflake is better than Redshift if you do not know how to operate Redshift, but Redshift is better if you *do* know how to operate Redshift; and as data volumes become large, the relative inefficiency of Snowflake (and its higher cost, anyway, I think) makes it an expensive choice.

If your data volume is high enough, it would be worth the investment in devs and users to use Redshift correctly, but you can't get those skills in the marketplace, so you can't do it anyway.

2022-09-02 11:48:41

Sorting as a method has potential.

The best implementation I knew of was Vertica, *prior* to its purchase by HP. After purchase by HP, avoid like plague.

Redshift is not great.

The leader node / worker node dichotomy is a significant issue, but more pressingly my impression is that the code base is large and unmaintainable, as Redshift in my experience is full of bugs. The commonly used behaviour has been debugged, because users report the problems; but if you stray from what is common, you run into issues. This is not quality software. Along with this, also, to my eye, the direction RS management and devs have been going in for many years is completely wrong; they keep adding non-Big Data functionality, usually poorly implemented, without telling anyone it doesn't scale. This all combines with the complete absence of meaningful information from AWS about RS, and AWS's continual near-fraudulent hyping of Redshift - AWS obfuscates anything which is a weakness and pretends Redshift can do everything always, without even trade-offs. I call AWS the sugar-coated cheerleader-squad.

All in all, it's really not great. With new functionality, the devs have repeatedly come *that* close to making it impossible to knowingly make correct system designs : they force poorly implemented, non-scaling functionality onto users, while marketing hypes it as automation and an improvement.

2022-09-04 19:39:52

Yes. Postgres, then multi-node Postgres to scale read. A higher capacity option is Exasol (unsorted column-store, clustered), but the skills for that will be rarer, and it's a commercial product, so you pay.

2022-09-15 09:27:27

I've never used them, but datashares?

Also, make a snapshot, bring up a new cluster, drop all the schemas you do not care about.

Also also, reading between the lines, your use case sounds like it could be completely inappropriate for Redshift. What are you doing about VACUUM?

2022-09-15 09:28:35

As I recall, functions exist in a schema. When you create the function, specify a schema name, e.g. `CREATE FUNCTION schema.name`.
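A minimal sketch, with hypothetical names, of creating and calling a SQL UDF in a named schema (parameters in Redshift SQL UDFs are referenced positionally):

```sql
-- Hypothetical SQL UDF, created in (and called via) a named schema.
CREATE FUNCTION my_schema.f_add_one (int8)
RETURNS int8
IMMUTABLE
AS $$
  SELECT $1 + 1
$$ LANGUAGE sql;

SELECT my_schema.f_add_one(41);
```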

2022-09-16 15:28:25

I've seen EC2 and Redshift clusters run out of availability. It does happen, but speaking for myself, very very very rarely. It's almost a one-off event.

2022-09-03 16:56:05

Query result cache is on the leader node and is small, I estimate 16mb on `dc2.large`. A black box, and in any event, only useful for aggregated (low row-count) results.

2022-09-03 17:04:26

I know nothing about ClickHouse, but I am a Redshift specialist.

Redshift is a sorted database and as such requires extensive knowledge - much more than only SQL - on the part of *both* the table developers *and* the users issuing queries.

The designer can create a brilliant design for sorting, but if the users then issue queries which are incorrect for the sorting design, the design is completely wasted : under the hood, Redshift then behaves in the same way as Postgres, except you have no indexes but you do have the cluster (you'd be better off with multi-node Postgres, or an unsorted column-store).

In your case, you are in every respect in an ideal situation - the user counts are low, your queries are controlled, data load is infrequent, etc - but you will still need to provide this training, it is a considerable body of knowledge and then experience will need to be built up, and the information you require is not available : AWS publish nothing of any value about Redshift.

I am in due course going to write a white paper, "Introduction to Redshift Fundamentals", but that's not happened yet. You can find other white papers on the site I maintain (link in profile) and they will give you some idea of the complexity and knowledge required.

By the way... you're not thinking of using BI tools, are you? I was approached the week before last by a client doing this, and I've spent the last week sorting them out; they will be moving to Postgres in due course, as Redshift is not suitable for BI tool use.

2022-09-03 17:08:05

You cannot use Redshift with BI tools for customer facing dashboards.

Compilation times dominate query execution, and these times are too long to be acceptable.

There are lots of other issues relating to Redshift, but they are irrelevant, as this issue by itself means it is not a suitable choice for your use case.

2022-09-03 18:24:25

So, there is query queue time, query compile time, and query execution time.

With your low user count you should have no queues, and query execution time, when and only when Redshift is operated correctly, is superb (BI tools cannot operate Redshift correctly).

However, if you are expecting total query times fast enough to be immediately useable by humans (no more than say 10 seconds), then query compilation will likely be problematic.

2022-09-07 06:08:40

Postgres is the best of the unsorted, row-store.

That's the best place to begin. You can go multi-node for read, which will give you scale on read.

When you have too much data, look at unsorted column-store with clustering, which means Exasol.

2022-09-10 06:23:43

A little knowledge is a dangerous thing :-)

Here's the history;

"There is a popular and capable open source unsorted row-store database called PostgreSQL.

PostgreSQL comes from a long lineage of earlier projects, each evolving into more advanced successor projects, but version 1.0, the first release of Postgres as it is known now, was released in May 1995.

PostgreSQL is as of the time of writing in mid-2019 at version 11.4.

In about 2008 a company called ParAccel branched PostgreSQL version 8.0.2 (which was released in 2005) to develop a product first named Maverick and then named ParAccel Analytic Database, ParAccel for short.

ParAccel heavily modified PostgreSQL so that it became a cluster-based sorted column-store database.

Amazon invested about 20m USD in ParAccel in 2011, and a year or so afterwards purchased for a single lump sum an unlimited license to the ParAccel codebase.

From this codebase, Amazon developed Redshift.

ParAccel have not been involved in the development of Redshift; Amazon bought the license and that was it.

(ParAccel were bought in the end by Actian, the graveyard where database companies go to die.)"

2022-09-10 06:27:18

I would urge you again to stay away from Redshift. It is \*utterly\* different to Postgres. Their use cases are \*completely different\*. If I point out the key issue(s) and they make no sense, you would most likely not be making a well-informed choice.

2022-09-28 16:10:50

Are you sure Redshift is the right choice?

I have the feeling it will be impossible for you to benefit from sorting, in which case Redshift is *always* the wrong technology choice.

2022-10-03 18:44:12

Cluster bring-up time varies by region and node type.

https://www.amazonredshiftresearchproject.org/bring_up_times/index.html

What's your idea of "quickly"?

2022-10-04 04:56:28

So, the next thought is how much data there is.

Redshift makes sense only when you have Big Data. If you do not, you should not be using Redshift; there are *always* better choices in that situation.

If you have Big Data, and you want to get a system running inside an hour, you're going to need a *very* large cluster (and so *very* expensive cluster) to load that data so quickly.

I get the feeling you do not have Big Data?

2022-10-07 21:22:33

Note serverless has eviscerated the system tables. You now have almost no idea what's going on in the cluster. I advise against serverless, because in my view AWS, at least as far as Redshift is concerned, are all hype and no substance, and now, along with that, there is almost no visibility into what's going on in the cluster; hyping is much easier when the clients have no information of their own.

2022-10-08 07:33:03

Redshift specifically. I do not possess the expertise in other AWS databases necessary to hold opinions about them.

The basic problem with Redshift is that it must be correctly designed and operated (both) to be scalable, and no one correctly designs or operates their Redshift system, because the information necessary to do so is not provided by AWS.

Serverless does nothing to fix this; it does not address the issues involved in correct design or operation. I've not investigated it yet, but at the moment, to my eye, it is a marketing exercise to compete with Snowflake; it has almost no meaningful substance at all. It is a collection of already existing functionalities (some of which I think improper, such as concurrency scaling clusters, some of which I think appallingly badly done, such as automated sorting and encoding choices, and some of which I think wholly ineffective, such as automated vacuum), plus automated selection of node counts (and I have no faith in the Redshift devs to have done a good job of this).

What happens in practice is that people build non-scalable systems, but they have only small data, so the hardware overwhelms the data and it just doesn't matter; but in that situation, they should not be using Redshift. There are much better choices, such as Postgres (single node or multi-node).

2022-10-16 11:49:34

You mean `ra3.xlplus` ;-)

1. How many files do you have in S3, and what sizes are they?

2. Have you defined column encoding choices in your table in Redshift?

3. Have you disabled compression choice update in the `COPY` command?

2022-10-17 09:54:15

I have an Amazon account.

Now AWS and Amazon are separate, can I delete (as I wish to do so) my Amazon account, without affecting my AWS account?

2022-10-19 18:01:17

I vaguely remember earlier versions of the JDBC driver (which I never use) could not handle proc calls. What version are you on?

Also, you need to post your code for calling the proc. There could be something wrong with it, but by posting *other* code, how can we see?

2022-10-20 16:28:22

https://www.amazonredshiftresearchproject.org/white_papers/downloads/AQUA.pdf

AQUA is a solution to the problem that min-max culling (aka the Zone Map) is wholly ineffective when searching for strings with wildcard patterns, forcing full column scans, and as such all scan steps and only all scan steps which use the LIKE and/or SIMILAR TO operators are routed to AQUA, as these are the two operators offering search in strings with wildcards. AQUA is implemented as a multi-node distributed set of independent data processors, separate from RMS and from Redshift, and as such the basic design is the same as Redshift Spectrum, with each processor scanning a portion of the data in a table and returning results to the Redshift cluster, and appears to implement most if not all of the SQL functionality possible with this design, which is to say, that functionality which can be performed by examining a single row at a time in a single table, and by maintaining aggregate information from those rows; so there exists functionality which AQUA cannot implement, such as distinct. When AQUA is presented with a scan step with work AQUA cannot perform, necessarily some, most or all of the rows in the table must be returned to the Redshift cluster, such that that work can there be performed. As such, queries must be written correctly for AQUA, just as with Spectrum. AWS do not provide the information to do this, and so guesswork and experimentation must be used. When a scan step is routed to AQUA, AQUA downloads to itself, not to the Redshift cluster local SSD cache, the table being scanned from RMS. With the 12,988 block test table, on an idle two node ra3.xlplus cluster, this took 41 seconds. After this, AQUA performs more quickly than the cluster, with the time taken with test queries used for the part of the work routed to AQUA going from about 9.4 seconds to about 1.7 seconds. It is necessary then to issue a number of LIKE or SIMILAR TO queries on a table to reclaim the large initial cost of the first query, otherwise AQUA will actually reduce performance. There are no findings about how long must pass before a table needs to be re-read by AQUA. In the course of investigating AQUA, it was discovered that RMS is a block-based store, not a table-based store; Redshift brings into local SSD cache only those blocks which are needed to service queries. With an ra3.xlplus node, bandwidth to RMS is about 160mb/second, and bandwidth to the local SSD cache is about 720mb/second.

2022-10-20 16:31:20

Not between. Off to one side. AQUA loads tables to itself, and that load is independent of Redshift; if Redshift needs to access that table, it has to repeat the load from RMS. (RMS and S3 may or may not be the same, it's not clear.)

No pre-scanning that I saw. I don't think there is any; I've not seen anything anywhere to suggest this is done.

As for the rest, check the abstract (quoted in reply above) and check the paper.

2022-10-20 16:32:29

You know, if you'd searched for "AQUA Redshift", you would have found the white paper.

2022-10-20 17:48:43

I may be completely wrong, I'm no expert on AWS configuration, but can't you just connect directly to the Redshift cluster anyway, regardless of the account?

When I connect, I specify an IP address, a port, database, Redshift username and the password of that Redshift user.

2022-10-20 21:20:01

I thought so too, and I do know clients who do this.

However, I also see a lot of clients working with third-party services, which need connections to the cluster, so they are public.

2022-10-24 17:40:27

https://www.amazonredshiftresearchproject.org/cross_region_benchmarks/index.html

`dc2.large` performance varies by region and very much over time, with as I recall there being now I think three broad groups of regions, fast, medium and slooooooow.

The other node types are much more consistent, both across regions and over time.

About a month ago a couple of regions (for the first time) had major performance problems across the board, but that was sorted out as of the most recent benchmarks.

2022-10-26 16:48:07

Are you certain the queries are identical? have you examined the query texts, and seen that this is definitely so?

Also, are the US and EU users connecting using the same method, which is to say ODBC, JDBC or via the postgres library? this will influence compilation.

Also also, when you say execution time, what *exactly* do you mean? how are you measuring execution time?

2022-10-31 11:20:54

Somebody - but I can see no way to know who, and I wonder if it was the OP - sent me a Gold Coin for this work.

Thankyou :-)

2022-10-31 18:55:54

This is madness.

Redshift is a sorted relational database - OLAP, as you say.

Sorting does not happen by itself; it must be operated correctly - sorting orders must be chosen for tables, and they must then be honoured by the users issuing queries. Moving from an unsorted system to a sorted system requires at *least* a complete data organization review, and then usually significant and often complete data re-organization.

To move and only move data from Postgres to Redshift, without this work, does *not* operate sorting correctly, and so Redshift is in fact behaving in *exactly* the same way as Postgres. There is no benefit, other than you are now on a cluster, but you could have done that with Postgres anyway.

This article is harmful to readers by the critical issues it omits.

2022-10-24 17:46:33

IMO, it does not.

It's a slightly unholy collection of features, most of which are palliatives for mis-use of a sorted relational database.

The correct answer for most clients is to use a different database, either from first principles, due to their use case, or because the skills they need to use Redshift are not available, either in their devs or their users (both need to understand sorting, to use Redshift correctly, and Redshift used incorrectly is always the wrong choice; there are better solutions).

AWS are ultimately responsible for this, as they publish no meaningful information about using Redshift (and their support and TAMs say "yes" to everything), and so extraordinarily few people know what is needed to operate Redshift correctly.

2022-10-24 18:42:04

Redshift does have a use case.

When correctly operated, it is more efficient than Snowflake (I can't speak for BigQuery, as I know nothing about it). As such, with large data volumes, the cost of obtaining the necessary expertise and training devs and users in the use of Redshift, may be less than the cost of how much you'd need to pay to make Snowflake perform acceptably.

2022-10-31 18:57:36

I have to ask : what are you doing about sorting, and what are you doing about VACUUM? without these matters being correctly operated, Redshift is performing no differently to Postgres (except you now have the cluster, which you could have with Postgres too).

2022-10-31 19:00:51

Redshift normally requires extensive knowledge, of both the devs *and* the users, to operate correctly.

Redshift Spectrum requires all this knowledge, *plus more*.

There's probably half a book of stuff I would need to teach you so you could use Spectrum in the way you have imagined it would or could be used.

Explaining all this in a Reddit thread is impossible.

2022-10-31 19:01:56

Never use single-node Redshift systems in production.

They are very odd systems, well off the beaten track, and Redshift has enough test and reliability issues in the first place.

Also, Redshift is extremely knowledge intensive. It is *not* a general purpose, simple to use database, such as Postgres. It has a fairly narrow range of use cases, and must be operated correctly, and it is particularly inappropriate for streaming.

2022-11-03 19:00:31

Do not use Redshift.

Redshift is extremely knowledge intensive. You do not have the knowledge necessary to operate Redshift correctly, and your proposed use case is profoundly inappropriate.

Stick with Postgres, or a noSQL.

2022-11-09 12:21:34

Good, but I'll have to see if Redshift works there (assuming it's provided).

It's supposed to work in the new ME region - but although you can start clusters, you cannot connect to them.

2022-11-09 19:59:25

Thankyou, OP. Saved me some work. Appreciated!

2022-11-21 19:34:28

I read a bit, then stopped.

1. CSC is expensive. It is a band-aid, too, for incorrectly operated clusters. Do not use it; rather, operate your cluster correctly.

2. "Use sortkeys and distkeys". Yes. But this is the incredibly astoundingly staggering *difficult* task which is the very core of using Redshift correctly, and which takes extensive skill, knowledge and experience to do effectively and correctly. This is not something where you say "just do this".

2022-11-22 22:36:55

AQUA has a high initial cost, downloading tables from RMS. It then accelerates LIKE/SIMILAR TO queries on the tables it has downloaded from RMS, where a reasonable number of queries must be issued to recoup the initial cost. If this is not done, you are *slower* with AQUA, not faster.

https://www.amazonredshiftresearchproject.org/white_papers/downloads/AQUA.pdf

Abstract below.

"AQUA is a solution to the problem that min-max culling (aka the Zone Map) is wholly ineffective when searching for strings with wildcard patterns, forcing full column scans, and as such all scan steps and only all scan steps which use the LIKE and/or SIMILAR TO operators are routed to AQUA, as these are the two operators offering search in strings with wildcards. AQUA is implemented as a multi-node distributed set of independent data processors, separate from RMS and from Redshift, and as such the basic design is the same as Redshift Spectrum, with each processor scanning a portion of the data in a table and returning results to the Redshift cluster, and appears to implement most if not all of the SQL functionality possible with this design, which is to say, that functionality which can be performed by examining a single row at a time in a single table, and by maintaining aggregate information from those rows; so there exists functionality which AQUA cannot implement, such as distinct. When AQUA is presented with a scan step with work AQUA cannot perform, necessarily some, most or all of the rows in the table must be returned to the Redshift cluster, such that that work can there be performed. As such, queries must be written correctly for AQUA, just as with Spectrum. AWS do not provide the information to do this, and so guesswork and experimentation must be used. When a scan step is routed to AQUA, AQUA downloads to itself, not to the Redshift cluster local SSD cache, the table being scanned from RMS. With the 12,988 block test table, on an idle two node ra3.xlplus cluster, this took 41 seconds. After this, AQUA performs more quickly than the cluster, with the time taken with test queries used for the part of the work routed to AQUA going from about 9.4 seconds to about 1.7 seconds. It is necessary then to issue a number of LIKE or SIMILAR TO queries on a table to reclaim the large initial cost of the first query, otherwise AQUA will actually reduce performance. There are no findings about how long must pass before a table needs to be re-read by AQUA. In the course of investigating AQUA, it was discovered that RMS is a block-based store, not a table-based store; Redshift brings into local SSD cache only those blocks which are needed to service queries. With an ra3.xlplus node, bandwidth to RMS is about 160mb/second, and bandwidth to the local SSD cache is about 720mb/second."

2022-11-24 00:38:12

Thankyou very much! you're very kind to say so :-)

It's an eclectic mix - war in Ukraine, and Amazon Redshift research :-)

No RSS, I'm afraid. It's my own blog code; never thought about RSS. I'll have a look and see what it involves.

2022-11-28 21:25:12

Check out just about anything from u/MaxGanzII and https://amazonredshiftresearchproject.org. I could never summarize as well as they can. Basically, the use case for it is extremely narrow and it's very difficult to set up correctly. If your plan is to use a BI visualization tool in front of it, don't use it. If you plan to have more than a few users on it at a time, don't use it.

2022-11-01 06:35:26

Yes and no. The fundamental problem with Redshift is AWS publish no meaningful information about it. It is basically impossible to use, because the knowledge needed to use it is not available.

I am a Redshift specialist (not affiliated with AWS in any way), I run a web-site where I publish investigations (link in my profile), but I've yet to write an "Introduction to Redshift" white paper, which is what you need to understand the situation you're in. However, this would only let you know where you are; it cannot give you the experience and training necessary to use Redshift correctly. This takes a lot of time and practice.

I advise you, in the strongest possible terms, not to use Redshift.

2022-11-09 17:07:08

I am a Redshift specialist.

In my experience, *EVERY* client I have worked with has been operating Redshift incorrectly, and Redshift operated incorrectly is a nightmare.

This happens, I believe, because AWS provide no meaningful information about how to operate Redshift correctly, and, in particular, there is an intense company culture of obfuscating all weaknesses. All you hear from AWS, be it the docs, Support, the TAMs, is "yes, Redshift can do that".

In my opinion, Redshift is an excellent choice for Big Data work, but when and only when operated correctly; correct operation requires extensive knowledge on the parts of both the devs *and* the users, and it also excludes a range of use cases (such as BI tools).

Speaking more broadly, and comparing Redshift to, say, Snowflake, if you do not know how to operate Redshift correctly, Snowflake is a much, much better choice. If you *do* know how to operate Redshift, Redshift is a much better choice. Snowflake is inefficient relative to correctly operated Redshift, and so becomes increasingly more expensive as data volumes increase (but, again, be clear - this holds true only for correctly operated Redshift).

2022-11-09 17:09:23

Couple of starter questions regarding Redshift performance;

1. What are you doing about VACUUM?

2. How has table design taken sorting into account?

3. Do the users understand the impact sorting has on their queries? are they writing their queries accordingly?

4. Are you using Redshift's default encoding choices?

5. Are you running BI tools on the cluster?

2022-11-21 07:42:40

Note that, although I may be wrong, I think Spectrum has a maximum of 10 concurrent queries (at which point, your per-query I/O to S3 is down to 10% of the speed when you run a single query).

Also note using Redshift correctly, so that it scales, is extremely knowledge intensive for *both* the system developers *and* the system users; every single client I've worked with has been using Redshift incorrectly (which is always the wrong choice - in that situation, there are better solutions).

I suspect the OP is in this situation, as Redshift when correctly operated is able to handle large volumes of data in a timely manner.

Redshift Spectrum requires all the knowledge of Redshift, *plus more*, as you need to understand Spectrum as well.

2022-11-21 07:44:47

I've not investigated it yet, so I could be wrong, but I currently regard Redshift serverless primarily as a marketing exercise, to compete with Snowflake. From a technical point of view, I do not consider it meaningful.

2022-11-21 07:46:07

You need access to your wallet and permission to spend it ;-)

2022-11-21 19:31:50

I mean to say that there is no advantage to using it; it's not really a product which makes sense, but a bundling together of a number of existing functionalities, none of which to my eye should exist in the first place on Redshift - they are all coping mechanisms for users operating Redshift incorrectly - plus dynamic server sizing, which I've not investigated but about which I have severe doubts.

I suspect it makes no sense to use it; it is not a solid option for anyone.

2022-12-01 18:47:42

Thankyou! I'm honoured to have been mentioned. I hope I can live up to your expectations :-)

2022-12-01 18:51:46

Redshift is a vehicle for sorting. Sorting is what it is - if you have Big Data, and you want timely SQL, you have to have sorting, with all the constraints and restrictions it brings. That's nothing on AWS or Redshift; all sorted relational databases will be the same.

Redshift itself to my eye is fundamentally obsolete, the code base is unmaintainable, and profoundly mis-managed. I am of the view that practically all the new functionality for years is not scalable to Big Data and has no place in Redshift.

I am also very *very* strongly of the view that AWS provide *no* meaningful information about Redshift *whatsoever*, and say "yes" to every question from every client about whether or not Redshift can do something.

2022-12-03 12:06:09

Well, I may be wrong, but I think everyone or almost everyone assumed Redshift clusters were identical in performance across regions.

It turns out this is not the case, and the difference between the faster and slower regions is large; and the slower regions tend to be more expensive, as well.

Getting something like a 2x improvement in performance *and* a reduction in cost for only the effort involved in typing the region name does, on the face of it, seem valuable, almost regardless of whether or not there are other factors involved in performance.

Moreover, the benchmarks are, to my knowledge, the first to compare performance across node types, and the first to compare performance across regions.

Additionally, people who are running clusters probably benefit from having some awareness of general changes over time in region and node performance, as it affects their systems.

A month ago, for example, a number of regions had serious performance problems. I expect people running clusters noticed, but AWS publish almost no information about problems with Redshift (I would say they publish as little as they can get away with).

2022-12-03 21:37:15

> Of course they won't be identical. You can have two hardware-identical machines that will perform differently based on the state of various processes running on them.

The cross-region differences are consistent and significant, where the benchmarks are run every two weeks. This cannot be explained by chance.

> You aren't using a broad enough sampling across a Region to make any declaration about broad performance impacts. All you're doing is taking specific measures and implying all machines everywhere share that characteristic. It's not accurate.

I'm not sure what you mean here by broad.

> Also, different node types are optimized for different types of data volumes and partitioning based on the types of queries that are most common. You can't put the same test on a dc node vs ra and declare one is better than the other. It is 100% dependent on how you model the data and the volume of data being scanned for the queries you run.

I may be wrong, but I think the benchmarks avoid these issues.

The disk-read and disk-read-write tests both execute on the local SSD; the work done is the same on any node type.

The network test measures data transfer speeds between nodes. This also is the same, regardless of node type.

Finally, the processor test, which tests single threaded performance only, by running a tight loop many times and timing the loop, is the same on all node types.

In all cases, as far as I can tell, and it took some thought to design it so, the work done is the same across all node types.

I have to say, the node types are not really optimized for any one thing or another; the `ra3` types have the local SSD cache for RMS, which is a different design approach, rather than being optimized for some particular workload.

> The only thing that matters is a specific business measuring the performance of their specific cluster. If it meets their needs, then nothing to worry about. If they are encountering hardware or network issues, they can cut a support ticket and/or just failover to another cluster.

I'm not sure if you mean to say so, but on the face of it, this argues that all generalized benchmarking has no value.

I must say, Redshift Support in my experience are useless. They have many facts, but no *understanding*. You cannot turn to them, except for mechanical tasks, such as rebooting a stuck cluster.

2022-12-03 21:47:28

I am of the view this assessment is exactly correct.

I regard AWS as profoundly mis-selling Redshift. Every client I've worked with has been operating Redshift incorrectly, and so has been having a terrible time, and should have been moved onto different database technologies.

2022-12-03 21:48:07

What's the driving factor for your cluster size? is it query performance, or disk space?

2022-12-03 22:04:09

Thankyou very much :-)

I'm very glad to be able to help with the problems people are running into.

2022-12-03 22:52:38

On `ra3.xlplus`, RMS access runs at 25% of the speed of local SSD access. Serverless uses RMS. It might be a factor in the performance you're experiencing.

I've not benchmarked serverless performance across regions, in part as I think it will be expensive, and in part because I suspect where the system tables have been largely removed, it may no longer be possible to produce benchmarks, which obviously is a concern.

I am of the view AWS have a long history of overhyping RS functionality, even up to the point where we must begin to think of fraud (the AZ64 blog post comes to mind), and now to my eye there is much less opportunity to validate, or even understand, their claims.

This is making Redshift a problematic choice; you have to take AWS on blind faith, and I do not consider AWS in any way safe to trust.

2022-12-05 16:36:19

I'm sorry this and the site are unpleasant for you.

I may be wrong, but I think you are being unreasonable in your assessment of the technical value of the work done and being unpleasant about it.

I looked over your profile, and your posts and replies. You seem to me a reasonable and intelligent person except in this matter, which seems to push a button.

I note your intent to contact AWS legal. This is interesting - someone outside of AWS could not possibly do so; I wonder if you work for AWS, and if you do, if you are offended by my work because I am critical of Redshift.

I love dispute and contrary opinions, but this is too much like, well, let's say nastiness - "your worthless benchmarks".

Consequently, I'm afraid I will now make you only the second person I've blocked. I'll clear the blocked list after a while, as things can change over time.

2022-12-08 08:03:03

Postgres is fine for such small data. Do not use Redshift until you have at LEAST a terabyte *and* both the devs and the users understand the impact of sorting on their data designs and queries.

2022-12-08 17:35:03

I know nothing about Athena, so I cannot answer.

I've not used PowerBI, so I cannot say for sure, and I think the BI tools can do some data processing themselves, but the data is stored in the database and so the database must be queried.

Given these questions, I'm concerned that you may well be out of your depth here, and end up running into problems you cannot fix, and becoming stressed.

2022-12-13 09:37:10

I may be wrong, but I understand Glue does not perform data validation.

You will still need to create a system which validates data prior to loading it into a relational database (not least because data can be broken in such a way that it cannot be loaded into a relational database).

If you create such a system, it will load all data, validate it, and emit validated data.

So... where does Glue fit into all this? you're already doing everything it could do for you.

2022-12-17 21:27:09

I've not investigated this functionality yet, but in my experience, for many years new functionality in Redshift has been all hype and poor or very poor implementation (and has never been Big Data capable - it's pretty much all been functionality which should not *be* in Redshift). As such, I do not have high hopes for this; but I've not investigated yet, so there is as yet no information available.

2022-12-01 18:47:02

To my knowledge there is no meaningful content on-line about this matter. I am going to write a white paper, Introduction to the Fundamentals of Amazon Redshift, which would be what you should read, but I've not written it yet.

The central issue is sorting. Distribution is critical, but infinitely simpler.

2022-12-05 18:54:51

I am a Redshift specialist.

I would advise you absolutely not to use Redshift in this case.

Redshift has one purpose only, which is to offer timely SQL on Big Data. To do so it must be operated correctly, which is a highly knowledge intensive task, which also requires honouring a wide range of technical constraints and restrictions, all of which is done and is only done because there is no other way to use SQL in a timely manner on Big Data.

To use Redshift on small data makes no sense at all, ever. You are in a situation where the hardware overwhelms the data, so you can do anything you like, and it just won't matter; but your system will not scale, because you will not have designed it to use sorting correctly.

Additionally, if you use Redshift, you miss out on ten years of Postgres development.

2022-12-25 13:08:32

Be aware re-sizing clusters is not a wholly reliable operation. Sooner or later, it will fail.

Cluster re-size is best handled through the ETL system (spinning up a second cluster and populating it), not via actually changing the number of nodes in a cluster.

2022-12-26 23:57:26

Mm. Just to be clear - it is resizing *in general* which is not wholly reliable. The method you use to trigger resizing does not matter. This is not a problem with Cloudformation.

2022-12-30 20:39:26

Originally, it was a 10k table limit =-)

I think VACUUM has start up and shutdown overheads, so I'm guessing with so many tables you're not running vacuum? if that's the case, I guess data is not sorted? and if that's the case, Redshift is categorically the wrong technology choice.

Redshift is used when and only when you have Big Data and need timely queries on that data. This is because timely queries happen when and only when sorting is correctly used, and the requirements and constraints of using sorting correctly are manifold and onerous.

As an aside, don't forget the schema limit.

Last I looked, it was set to 9900, IIRC.

2022-12-30 20:41:31

I do not work at AWS, so maybe they have inside knowledge, but I see no reason in and of itself to change node type.

IME, nobody uses sorting correctly, and so everyone is getting performance purely by brute force of hardware, and I suspect the lean toward "bigger nodes" comes from that, and of course that it is in AWS's interest.

I'm of the view probably the very large majority of Redshift users should be using different databases.

2023-01-04 09:43:32

The docs some time ago began blocking the Tor browser. Probably that and your problem have some connection. I now have to spin up another browser to examine a doc page.

As an aside, the docs pages for Redshift are malformed. If you look at the HTML, the docs pages are actually an *entire* HTML doc (HTML tags, head, title, second body, the lot) dumped straight into a first HTML doc - not in a frame, just dumped in there.

(I was parsing the pages to automatically extract information, found I couldn't do this in the normal way, this being why.)

2023-01-19 22:57:00

I appreciate your position regarding cost, but a word of advice : single node clusters are asking for trouble. Redshift is, in my view, an unmaintainable code base, and single node clusters are not a normal use of the code base - they were brought into existence purely so people could, at the lowest possible cost, *look* at Redshift. You should NEVER - *NEVER* - use single node clusters for production work.

2023-01-19 22:58:06

Ah, one other note.

Resize in Redshift is not wholly reliable. Sooner or later, it fails. Design your ETL system so you do not need to resize clusters.

2023-01-19 23:07:25

The idea of UNLOAD and then COPY seems good to me.

2023-01-19 23:07:42

Hmm. I replied to this, and Reddit has lost the post.

2023-01-20 09:46:53

1. Are you sure RA3 is what you need? SSD access is about 4x faster than RMS.

2. The system tables in Redshift are a total and utter disaster. You have run into one particular issue, which is that each node contains its own store, plus enough store for a copy of one other node - this is how k-safety is performed. To query that system table correctly, you need a where clause of "host = owner", IIRC (see the sketch below). It's not documented, and without it, the table is utterly useless. The entirety of the system tables is like this, or worse.
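A hedged sketch of what I mean, assuming the table in question is `stv_partitions`:

```sql
-- Hypothetical sketch : per-node disk use from stv_partitions, keeping
-- only each node's own store (host = owner excludes the k-safety mirror
-- of the other node).
SELECT owner          AS node,
       SUM(used)      AS blocks_used,
       SUM(capacity)  AS blocks_total
FROM   stv_partitions
WHERE  host = owner
GROUP  BY owner
ORDER  BY owner;
```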

2023-01-20 09:49:03

Redshift is flaky. The core functionality, which is to say, that used by large numbers of users, is reasonably debugged. Step outside of that, and you will always find bugs.

Redshift does not obviously have a test suite, given the kind of issues that emerge into the field.

Redshift is designed to have a leader node, and then worker nodes. The single node cluster is an aberration - a leader node, and one worker node, jammed into a single node of hardware.

Redshift is flaky enough as it is, without doing this as well, especially given what seems to be profound weakness in testing.

2023-01-20 09:49:51

In your cluster, what are you doing about `VACUUM`?

2023-01-30 05:32:09

> I have WLM queues enabled

That's the problem. What you describe is what other people experience, too.

The latest I know about AutoWLM is that every 200 queries it looks at the last 200 queries and reconfigures the queues.

The guess then is that if you have a bunch of lightweight queries, and so end up with lots and lots of small slots, and then a big sneaky ETL jumps in - pow. Trouble.

Speaking more broadly, you can never have a knowingly correctly designed system, when that system contains undocumented black boxes controlled by third parties, and this is even more true where those black boxes are subject to ongoing, unannounced change.

2023-01-06 08:41:41

Are you ensuring that the queries being issued match up to the sorting orders of the tables?

If not, why is Redshift in use? the reason Redshift is used is sorting, and if sorting is not in use, Redshift is always the wrong technology choice - there are better alternatives in this situation.

2023-01-06 12:58:07

When you say transformation, you mean the ETL system queries?

But what about the BI tools, and the end-users? do they know about sorting? it was them I had in mind.

2023-01-06 13:15:49

With BI tools, and all systems which have automated SQL generators, those generators have no knowledge of sorting at all.

2023-01-30 05:21:49

Users and privs are cluster-wide, not database-wide.

There's no reason that a database per customer is more secure than a set of schemas per customer.

The main advantage is that it is more convenient (schemas are per-database, so you can use the same schema names) but remember there is a maximum of 60 (sixty) databases, per cluster.
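A minimal sketch of the schema-per-customer arrangement, with hypothetical names:

```sql
-- Hypothetical schema-per-customer arrangement : one schema and one group
-- per customer, with access granted only to that customer's schema.
CREATE SCHEMA customer_acme;
CREATE GROUP grp_acme;
GRANT USAGE ON SCHEMA customer_acme TO GROUP grp_acme;
GRANT SELECT ON ALL TABLES IN SCHEMA customer_acme TO GROUP grp_acme;
```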

2023-01-30 05:22:21

I've not looked into the implementation, so I don't know for sure yet, but I suspect row level security does not scale for performance.

2023-01-30 05:22:59

Define "super slow". How long are we talking?

2023-01-30 05:24:40

MVs are palliative only. If you're operating Redshift incorrectly, and so are having a terrible time, MVs will help to some extent. You will still be operating Redshift incorrectly, and will still not have a scalable system, and as such Redshift will still be the wrong choice.

2023-01-30 05:26:32

> In this case, what happens is that each time a query runs, it must read all the data in the whole table. Unless the sort key supports the WHERE clause, then the database can't know which records meet the WHERE clause until it reads and evaluates each one of them.

This is not quite correct. The Zone Map always operates, for every column, regardless of sorting.

Sorting makes the Zone Map effective enough - when and only when the queries issued match the sorting order of the table - to allow scaling to Big Data.
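
A toy illustration of the idea (invented numbers, nothing to do with Redshift internals as such):

```python
# Each entry is a per-block (min, max) for one column. A block can be skipped
# when its range lies wholly outside the predicate range.
blocks_sorted   = [(0, 99), (100, 199), (200, 299), (300, 399)]  # narrow, disjoint
blocks_unsorted = [(0, 390), (5, 399), (2, 395), (1, 388)]       # each spans nearly everything

def blocks_to_read(zone_map, lo, hi):
    return [b for b in zone_map if not (b[1] < lo or b[0] > hi)]

print(len(blocks_to_read(blocks_sorted,   150, 160)))  # 1 - most blocks skipped
print(len(blocks_to_read(blocks_unsorted, 150, 160)))  # 4 - nothing can be skipped
```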

2023-01-30 05:28:15

https://www.amazonredshiftresearchproject.org/white_papers/downloads/materialized_views.pdf

"Materialized views are implemented as a normal table, a normal view and a procedure, all created by the CREATE MATERIALIZED VIEW command, where the procedure is called by the REFRESH MATERIALIZED VIEW command and performs refresh. The table is created by CREATE TABLE AS, which is why column encodings cannot be specified. The encoding choices made by Redshift are extremely poor. A full refresh makes a new table, populates it, and uses table rename to replace the existing table. An incremental refresh uses an insert followed by a delete, using the system columns deletexid and insertxid to keep track of which rows have changed, and as such runs a full refresh when any of the tables used by the materialized view have either manually or automatically been vacuumed, as vacuum resets the values in the deletexid and insertxid columns and so invalidates the book-keeping information held by the materialized view, this information being stored in extra columns in the materialized view, one plus one for every table used in the materialized view SQL. The table underlying the materialized view is never vacuumed, except by auto-vacuum, which I suspect running so infrequently as to be inconsequential. Auto-refresh is an undocumented black box, likely subject to ongoing unannounced change, and its behaviour is unpredictable. On a small single column table on an idle cluster refresh occurred normally after about 55 seconds; with one row inserted per second, refresh occurred after between 54 to 1295 seconds (twenty-one minutes)."

2023-01-30 17:46:29

Are we talking about Redshift here? no indexes in Redshift.

2023-02-14 21:50:57

Thank you very much, Truelike!

It will be a paid-for AMI.

What I'd like to do is charge 1% of Redshift monthly spend up to a cap of 100 USD a month, but I have no way to know monthly spend, so I'm thinking of either billing 100/month (which cuts off all smaller users, but what can I do?), or at 13.8 cents per hour, which comes to 100/month if on 24/7.
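
The arithmetic behind the 13.8 cents figure, for what it's worth:

```python
# 0.138 USD/hour, running 24/7, over an average month:
hours_per_month = 24 * 365 / 12              # ~730 hours
print(round(0.138 * hours_per_month, 2))     # just over 100 USD
```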

Multiple clusters in one page - interesting. So, first, to answer the question, right now, nope - never thought of it. As it stands, you log in to a cluster, and you see that cluster. You can of course log in to any number of clusters, but each will be in its own browser tab.

If we think of multiple clusters, what information do you have in mind, to view over the clusters? I mean, their uses are, I'm thinking, going to be very different, right? the users would be different, the groups, privs, etc. I can see some information - higher level - being useful; I can easily imagine overview information being very useful to see all at once.

This feels rather like a status dashboard.

Right now, what's really on my mind is that I need to archive the system tables on an ongoing basis.

If I look at the tables page, and the history of a table, and see the events - insert, delete, vacuums, etc - well, they only go back as far as the system tables go back. Not very useful.

What I have in mind that I really want to do is look at a table, and see for example the size of the table each day on an ongoing basis. Now *that's* useful.
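
A minimal sketch of what I mean by an ongoing size snapshot - assuming a pre-created history table, and with names invented:

```python
# Run once per day (cron, or whatever). svv_table_info.size is in 1 MB blocks.
SNAPSHOT_SQL = """
insert into history.table_size
select current_date  as snapshot_date,
       "schema",
       "table",
       size           as size_mb,
       tbl_rows
from   svv_table_info;
"""
```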

As an aside, I spent today kicking back, having got the Combobulator up for people to look at, and writing a `VACUUM` manager.

It does two things; first, it gets the list of tables which are not fully sorted, and makes its way through them. Second, you can `cURL` to it and tell it to vacuum a table - either immediately, or queued.

Being able to queue a vacuum is critical to distributed ETL, because you can only run one vacuum at once, and so just trying to do it at the time can fail, and also because people tend to use transactions (ETL systems often do) and you cannot vacuum inside a transaction.

I was talking to a DBT user today, who has ten processes and would like to vacuum in each, but cannot, because they all try to vacuum at about the same moment.

So having a system where you tell it "queue a vacuum on this table" and it will then go and do it, solves both problems.
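
To give a feel for the queueing idea - and this is a sketch of the idea only, not the actual implementation; the endpoint, names and connection details are all invented:

```python
import queue
import threading

import psycopg2
from flask import Flask, request

app = Flask(__name__)
vacuum_queue = queue.Queue()

def worker():
    # Single worker thread, so only one VACUUM ever runs at a time.
    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="...")
    conn.autocommit = True   # VACUUM cannot run inside a transaction
    while True:
        table = vacuum_queue.get()
        with conn.cursor() as cur:
            cur.execute(f'vacuum full "{table}";')
        vacuum_queue.task_done()

@app.route("/vacuum", methods=["POST"])
def enqueue_vacuum():
    # e.g.  curl -X POST http://localhost:8080/vacuum -d table=lineitem
    vacuum_queue.put(request.form["table"])
    return "queued\n"

if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    app.run(port=8080)
```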

Also the manager will *actually* vacuum all tables in the cluster, unlike auto-vacuum, which looks to do pretty much nothing at all (I have reason to believe auto-vacuum basically never kicks in).

Cross-account, mmm, I think that works fine right now. The Combobulator needs a Redshift cluster IP and a Redshift username and password. It knows nothing about AWS accounts. You can point it at any cluster you like, just so long as you can actually reach it on the network - it's an AMI, you run it locally, it's designed to be run fully firewalled, for PII and security. It makes no external connections.

Serverless, no. Very nearly all of the system tables were removed. Serverless is the death of cluster admin, because you can't know what's going on.

(Anyway, I could be wrong, I've not investigated serverless, but I suspect it's basically hollow, from a technical point of view. I have a nasty suspicion all it really is, is the rapid addition and removal of compute slices, which would be pretty much worthless - and pretty much all the engineering I've seen in Redshift for many years has, under the hood, in my view been poor or very poor - but I could be wrong; I've not investigated).

> Selfishly as I'm in FinOps, I would love to see a cost page with monthly costs per cluster + query :)

Wait - I don't quite get this.

I can see a page with a list of clusters, and their costs, but what do you mean by "query"? cost per query? but what would that mean, where you pay for the cluster?

As an aside, clusters don't know what type they are. There's nothing in the system tables which lets you know what node type they have. You could infer it, I think, by looking at the disks in the node. Otherwise you have to go to `boto3`, and get Redshift status information, which means knowing about AWS accounts. You also need `boto` to get pricing (although pricing has never changed since I've been monitoring it, so you could reasonably hard-code it).
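
The `boto3` call in question, roughly - region and credentials assumed to be set up; this is the kind of thing I mean, not Combobulator code:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Node type and count per cluster - information the system tables do not expose.
for cluster in redshift.describe_clusters()["Clusters"]:
    print(cluster["ClusterIdentifier"],
          cluster["NodeType"],
          cluster["NumberOfNodes"])
```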

I don't want to involve AWS accounts, because it adds a layer of setup complexity. Right now it's real easy - Redshift IP, user and password.

2023-02-14 22:29:51

Earlier on, during development, the login page could be configured with an AWS account and you would see a list of clusters, from `boto`.

The only way to get a list of clusters, of course, is from `boto`.

Cross account there then has meaning; you would need to enter every account to use, and each would be enumerated for Redshift clusters.

Since `boto` would then be in use, you'd also use it to get cluster pricing - although, mmm, yeah, you'd need to enumerate for each account the status of nodes - reserved or on-demand, etc.

There's no history of cluster provisioning - no history of the number of nodes over time - but there is a history of *events*; you can know when a cluster resized (but I'm pretty sure the event does not indicate the cluster size, only that there has been a resize).

You'd need to poll events for each cluster, so you could know when to go and check the cluster sizing, and get the new number and types of nodes, to keep pricing up to date.
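
Something along these lines, I'm thinking - a sketch only, with the resize detection being a guess at matching on the event message text:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Poll recent events; when one looks like a resize, re-read the cluster's
# node type and count so the pricing history can be updated.
for event in redshift.describe_events(SourceType="cluster", Duration=60)["Events"]:
    if "resize" in event["Message"].lower():
        cluster = redshift.describe_clusters(
            ClusterIdentifier=event["SourceIdentifier"])["Clusters"][0]
        print(event["SourceIdentifier"],
              cluster["NodeType"], cluster["NumberOfNodes"])
```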

I'd need to keep the history of those changes, and prices, to correctly compute pricing over time.

(Actually, Redshift should not be resized. It's not a wholly reliable operation. Sooner or later, it goes wrong. The ETL system has to be designed to run multiple clusters concurrently, so you bring up a new cluster of the desired size, load data into it, then move users over to it.)

2023-02-15 23:05:03

> I think a percentage pricing would be better so you don't cut off smaller users, but you still need to be able to get income for the larger users. DC2.Large #1 node is like $180 a month so if you wanted to try and get those users you absolutely couldn't do $100.

I completely agree, but the problem is getting that information. As it is, the Combob makes *no* network connections *at all* - which is exactly what is best for PII/regulated/medical.

I say to users -> firewall completely; you need to reach it, it needs to reach the clusters. That's it.

That's real simple, and that's great.

One thought related to this is that as it stands, right now, what exists is a product which is only useful to advanced, highly trained Redshift devs. This will tend to exclude the small clusters.

If I were to make a mass-market RS product, it would be a product aimed at doing what can be done for an incorrectly run cluster, where the users have no idea about RS and are having a terrible time. Palliative care. Making life less painful for a terminal patient.

> So for my company, we have something like 25 or 30 total clusters across multiple accounts. We - per client requirements - are required to have separate clusters for certain clients. Other clients we can combine into shared clusters. For the client specific clusters we end up getting the similar workloads and queries being run, but like you mentioned it'd have to be mostly high level.

Wowsers, that's a lot of clusters - and I can see how AWS is failing here; their world is based on the notion of an account. Cross-account tooling is not a thing for AWS.

> This specifically relates to Spectrum Queries and scanning data in S3. We have a giant ra3.16xl 8 node cluster that's used for multiple workloads. Ideally most data is in RMS but we have a custom report that gets sent out when a user queries something in S3 that provides the query, data scanned (Which you have) but then an estimated cost of $25 per 5TB Scanned per Month.

Ah, so!

Yes. That makes sense and is clear, and easy to price for.
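
Something like this, per query, from the Spectrum summary table - assuming `svl_s3query_summary` is the right source, and using the 25 USD per 5 TB figure from your message:

```python
USD_PER_TB_SCANNED = 25 / 5   # the quoted rate: 25 USD per 5 TB scanned

SPECTRUM_COST_SQL = """
select   query,
         sum(s3_scanned_bytes) / power(1024.0, 4)       as tb_scanned,
         sum(s3_scanned_bytes) / power(1024.0, 4) * 5.0 as est_cost_usd
from     svl_s3query_summary
group by query
order by est_cost_usd desc;
"""
```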

However, for this to be useful, the system tables have to be archived.

Having only the last seven days of data is no use at all.

Everything right now is pointing to the need to archive the system tables.

That's the next piece of work.

2023-02-18 18:49:17

> Ah, copy that. It might be possible to make payment conditional through the AWS Marketplace Metering Service: 1) Pay flat monthly fee with no network access OR 2) Allow network access just for the MMS for percentage based billing. For smaller clusters I'd think it's less likely they are using Redshift for PII/HIPAA/similar.

It's actually a bit of a rabbit hole.

First, if a client has RS, how can you know what their budget is?

You would have to enumerate their clusters - and as far as I can see that means boto3 (and in fact, it is clear from our conversation that it means cross-account).

Boto3 will enumerate cluster types and node counts - but they can change, and I also need to be aware of reserved nodes.

So I need to regularly poll all the accounts for all their Redshift clusters, and the Combob needs network access to all the boto3 endpoints (cross-region support) and to AWS billing servers.

It's a nightmare!

What I'm considering now is not charging monthly, but charging per hour, at a rate which at 24/7 comes to 100 USD per month.

It's not ideal, but it does allow low-budget users to fire up the Combob when they need it and examine their clusters - say to diagnose problems, or to check how their tables are doing and so on.

> Understood on all of the other points though, that all makes complete sense. Best of luck though, seriously it's extremely impressive what you've done!

You're tremendously kind and it is very encouraging to read :-)

The Combob will be transformed once I archive the system tables.

Archiving the system tables though adds a consideration to the pricing model, since it's something that needs to run say once per day.

2023-02-21 19:01:41

I concur with the OP. Simple or not, it's a complication I would not want to have to consider, in any way, shape or form - and particularly so when the need to consider it is unreasonable.

2023-02-21 19:05:31

I can't delete this comment (delete is not working), so I've wiped it.

2023-02-10 21:16:06

I may be wrong, but Glue performs no validation, right?

If that is correct, are you performing any data validation?

2023-02-10 23:37:43

Data usually contains errors. Invalid values, missing values, even extra rows, you name it. These need to be detected, and removed or fixed, before the data is presented to the people using the data. The process of data validation needs to occur during the ETL process.
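
As a trivial example of the kind of check I mean (names invented), run against the staging data before it is published:

```python
# Fail the load if any row violates basic expectations.
VALIDATION_SQL = """
select count(*)
from   staging.orders
where  order_id   is null
   or  amount     <  0
   or  order_date >  current_date;
"""
```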

2023-02-25 16:32:05

The problem is you have too many files. The overhead of S3 access is killing you. Compact the files.

