This page provides information about materialized views in the database.
Note the creation and age columns are missing, as Redshift does not store the creation time for a view, but they are present for the underlying table.
Name | Type |
---|---|
schema_id | int8 |
schema | varchar |
view_id | int8 |
view | varchar |
table_id | int8 |
table | varchar |
owner_user_id | int8 |
owner | varchar |
length | int4 |
view:failures | varchar |
view:auto-refresh | varchar |
view:refresh type | varchar |
table:creation | timestamp |
table:age | interval |
table:idle | interval |
types:sort | varchar |
types:dist | varchar |
counts:view:columns | int2 |
counts:table:columns | int2 |
counts:table:sortkeys | int2 |
counts:table:slices | int2 |
store:blocks:sorted | int8 |
store:blocks:unsorted | int8 |
store:blocks:total | int8 |
store:rows:sorted | int8 |
store:rows:unsorted | int8 |
store:rows:total | int8 |
store:sorting % | float4 |
i/o:bytes read:disk | int8 |
i/o:bytes read:network | int8 |
i/o:rows:inserted | int8 |
i/o:rows:deleted | int8 |
i/o:bytes processed:in memory | int8 |
i/o:bytes processed:on disk | int8 |
related pages | varchar |
The schema ID. This column is emitted in CSV exports only.
The schema name.
The view ID. This column is emitted in CSV exports only.
Surprisingly, view IDs turn out to be unique across all databases.
The view name.
The table ID. This column is emitted in CSV exports only.
Surprisingly, table IDs turn out to be unique across all databases.
The table name.
The owner user ID. This column is emitted in CSV exports only.
The owner user name.
The length of the view text, in bytes.
If any underlying table or view has a column dropped, a column's data type changed, or a table, view, column or schema renamed, the materialized view stops updating.
Failure | Description |
---|---|
none | No problems detected. Note this has nothing to do with whether the view is actually up to date. |
dropped column | A column somewhere has been dropped |
changed data type | A column somewhere has changed its data type |
renamed table | A table or view somewhere has changed its name |
renamed column | A column somewhere has changed its name |
renamed schema | A schema somewhere has changed its name |
Type | Description |
---|---|
automatic | Automatic refresh |
manual | Manual refresh |
Type | Description |
---|---|
unavailable | The column in stv_mv_info which provides this information is now being used to report that the materialized view is in a broken state |
full | Materialized view is fully recomputed on refresh |
incremental | Materialized view is incrementally recomputed on refresh |
Note that if the materialized view moves into a broken state (see the view:failures column), the information for this column becomes unavailable, as both sets of information are provided in the original system table in a single column, which can hold only a single value, and so when it shows an error, it stops showing the refresh type.
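A sketch of how this single-column behaviour can be unpicked: stv_mv_info carries one state column, and CASE logic along the following lines separates it into refresh type and failure. The numeric state values used here are assumptions, not taken from this page - check the stv_mv_info documentation for your Redshift version.

```sql
-- Sketch: deriving refresh type and failure from the single state
-- column in stv_mv_info. The numeric values are assumptions.
select name,
       case state
         when 0 then 'full'
         when 1 then 'incremental'
         else        'unavailable'   -- state is now reporting a failure
       end as refresh_type,
       case
         when state in ( 0, 1 ) then 'none'
         else                        'broken (dropped/renamed/retyped object)'
       end as failure
from   stv_mv_info;
```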
Remember that an incremental refresh table will experience a full refresh if a VACUUM occurs on any of its underlying tables.
The creation timestamp of the table.
The age of the table (i.e. now() minus the creation timestamp).
Duration since the start of the most recent table access in stl_delete, stl_insert and stl_scan, which is to say, the time since the table last had rows deleted or inserted, or was scanned. If empty, there are no records in any of these system tables within the time period being examined by the page.
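A sketch of the underlying computation, using a hypothetical table ID of 123456:

```sql
-- Start of the most recent access to the table, across deletes,
-- inserts and scans; idle is then getdate() minus this. If none of
-- the three tables holds a record for the table, the result is NULL
-- and the column is empty.
select max( starttime ) as most_recent_access
from   (
                   select starttime from stl_delete where tbl = 123456
         union all select starttime from stl_insert where tbl = 123456
         union all select starttime from stl_scan   where tbl = 123456
       ) as accesses;
```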
Value | Sorting Type |
---|---|
compound | Compound |
interleaved | Interleaved |
unsorted | Unsorted |
Note that it is not possible for me to write a view to tell if a table has auto sorting. This is because the system table which contains this information is owned by rdsdb, the root user controlled by AWS, and so is not accessible.
The only way I can find out is to use svv_table_info, a view provided by AWS which provides information about tables.
However, firstly, the view is flawed. It only displays information about tables which have at least one row; empty tables are not listed.
Secondly, the view is huge. It’s a lot of SQL and work, and indeed repeats many of the table reads I am performing in my own view. It also looks complex, and I think therefore probably buggy.
As such, I’ve opted not to use it, and so this page does not tell you if auto sorting is active.
As it is, I would advise you never to use AUTO, anyway.
You cannot have a knowingly correctly designed system when the system
contains undocumented black boxes provided by third parties, and where
those black boxes can and do silently change.
Finally, note that auto is the default setting when creating a table - if you do not specify sorting, that’s what you get - and so to get an unsorted table you must now, after creating the table, issue an ALTER TABLE command to deactivate AUTO.
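A minimal sketch of doing so; the table name is hypothetical, and ALTER SORTKEY NONE is the clause which removes sorting.

```sql
-- With no sort key specified, the table is created with AUTO sorting.
create table example_table ( id int8 );

-- Deactivate AUTO, leaving the table unsorted.
alter table example_table alter sortkey none;
```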
There are a number of distribution types which are not available to users.
Value | Distribution Type |
---|---|
all | All |
even | Even |
key | Key |
all (auto) | All, selected by auto. |
even (auto) | Even, selected by auto. |
key (auto) | Key, selected by auto. |
unknown | The else clause - if nothing matched, you get this. |
Now, a few words about auto.
If I remember correctly, an empty table begins with all.
Once the total size of data in the table (not the number of rows - the data size) reaches a certain limit, which is computed based on the node type and the number of nodes, the table changes to even.
That’s a one-way change; a table never goes back to all.
There’s a white paper AWS published (“Fast and Effective Distribution-Key Recommendation for Amazon Redshift”) explaining how auto selects key distribution.
It’s an NP-hard problem, so they basically have a bunch of algorithms which run and they pick the best result, where “best” is defined in terms of their method of judging how much work being done by Redshift costs.
My take on this is that it’s miles better than a user who knows nothing, and a fundamental blunder for a user who does know what they are doing.
I’m concerned that where real-world usage is complex and surprising, the actual effects of auto are going to be unintended. I’m particularly concerned about the high costs of changing key, and that a system may flip-flop between different optimal keys, as the estimated cost of the keys in use will vary as the issued queries vary - so we might see that overnight, with ETL queries and no user queries, keys change, and then during the day, with user queries, keys change again - only key change is slow and expensive, and so the system never actually settles down.
Again, also, there is the fundamental issue that you cannot knowingly design correct systems when they contain undocumented black boxes provided by third parties, and where those black boxes can and do silently change.
(In this case we have documentation, but it’s by now out of date - the implementation will have changed - and it’s not actually documentation which allows you to predict how the system will operate, as the way this black box works means that although you can understand what it is doing, you as an admin or developer cannot predict its outcome.)
The number of columns in the materialized view.
The number of columns in the underlying table.
A materialized view table contains more columns than the materialized view, as the table needs to keep track of state. The overhead is one column, plus one per table which participates in the SQL of the materialized view.
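As a worked example of the overhead rule: a materialized view joining two tables and emitting five columns would have 5 + 1 + 2 = 8 columns in its backing table. A sketch for checking this via the catalogue follows; the view name, and the backing table name format, are assumptions here.

```sql
-- Sketch: compare column counts for a hypothetical materialized view
-- and its backing table (the mv_tbl__<name>__0 naming is an
-- assumption - verify on your cluster).
select relname, relnatts as column_count
from   pg_class
where  relname in ( 'mv_example', 'mv_tbl__mv_example__0' );
```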
The number of sortkeys.
The number of slices which hold blocks from this table.
If the number of slices holding blocks is less than the number of slices in the cluster, then obviously, not all slices are participating in queries which read this table. This is improper, and you should look to fix it. However, tables with fewer records than there are slices in the cluster will naturally use fewer slices than are present in the cluster. Finally, note that tables with all distribution have a full copy of themselves on a single slice on each node. Their rows, when they are read, are broadcast to all slices. Remember to only use all for small tables; if you use it with large tables, you end up producing a system which does not scale, because only one slice per node can write to that table.
The number of sorted blocks.
The number of unsorted blocks.
The total number of blocks.
The number of sorted rows.
The number of unsorted rows.
The total number of rows.
Sorting is not a per-table concept, but in fact a per-table, per-slice concept - indeed, it is actually per-table, per-column, per-slice, but there isn’t much to gain by adding in the per-column level, as will become clear.
What I mean by this is that sorting in the system tables and the docs is presented as a single number for a table.
This is wrong, and is misleading.
The key issue is that VACUUM does not operate on an entire table at once, but in fact on a subset of slices.
If you watch a table in real-time while VACUUM runs, you see one group of slices go through the process of being sorted - this is quite an involved process, and the slices being vacuumed go through various contortions with how many sorted and unsorted blocks they possess (both go up and down), until eventually it all sorts itself out and the slices are sorted.
This takes quite a while, and the user can interrupt the process at any time.
This leaves the slice in a partially vacuumed state - and, critically, it means that slices can vary in how vacuumed they are.
A table is as fast as its least sorted slice.
What we care about, then, is the least sorted slice. This column gives that percentage, which is computed on a per-slice basis by dividing the number of sorted blocks on the slice by the total number of blocks on the slice.
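A sketch of the per-slice computation, using stv_blocklist with a hypothetical table ID; unsorted = 1 marks an unsorted block, and the least sorted slice comes first.

```sql
-- Sketch: sorted fraction per slice; the table is as fast as its
-- least sorted slice, so order ascending.
select slice,
       sum( case when unsorted = 0 then 1 else 0 end )::float4
         / count(*) as sorted_fraction
from   stv_blocklist
where  tbl = 123456
group by slice
order by sorted_fraction asc;
```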
This leads us to the per-slice nature of sorting.
It can be that row distribution between slices is not approximately equal. One slice may have significantly more rows than the other slices - we can imagine, for ease of reasoning, a fairly extreme scenario where all slices have, say, three sorted blocks, except for one slice, which has ten sorted blocks.
If we then update one row on each slice, the three-block slices become four-block slices, with three sorted blocks and one unsorted block. They are 3/4 = 75% sorted.
The ten-block slice becomes an eleven-block slice, with one unsorted block, and is therefore 10/11 = 90.9% sorted.
This however is I think normally not much of a consideration - usually tables are well balanced between slices.
A final note: the automatic VACUUM, which Redshift has running in the background, is described as backing off, and stopping working, when the user load on the cluster is too high, or when the user issues a VACUUM. Obviously, if this is true, then this is a source of partially vacuumed slices.
(As it is, auto-vacuum seems to be negligible in practice. It runs too infrequently, and does too little work, to have a detectable impact upon clusters. Also, in my testing, I only ever saw it run delete-only VACUUM.)
This column then shows the total number of bytes read from disk, as best I can judge from the types indicated in stl_scan.
This column then shows the total number of bytes read from the network, as best I can judge from the types indicated in stl_scan. Importantly, it only counts the receive side of network activity - the step is scan, after all, not broadcast - so we’re not counting bytes twice.
The number of rows inserted into the table.
For tables with all distribution, this is the physical number of rows (i.e. one per node), not the logical number of rows.
The number of rows deleted from the table.
For tables with all distribution, this is the physical number of rows (i.e. one per node), not the logical number of rows.
This column then shows the total number of bytes processed by the stl_aggr, stl_hash, stl_save, stl_sort and stl_unique steps, when running in memory rather than on disk.
This column then shows the total number of bytes processed by the stl_aggr, stl_hash, stl_save, stl_sort and stl_unique steps, when running on disk rather than in memory.
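A sketch of the aggregation across the five step tables, splitting on each table's is_diskbased flag ('t' for on disk, 'f' for in memory); that every one of these tables carries a bytes column is an assumption here.

```sql
-- Sketch: bytes processed by the step tables, in memory vs. on disk.
select is_diskbased, sum( bytes ) as total_bytes
from   (
                   select is_diskbased, bytes from stl_aggr
         union all select is_diskbased, bytes from stl_hash
         union all select is_diskbased, bytes from stl_save
         union all select is_diskbased, bytes from stl_sort
         union all select is_diskbased, bytes from stl_unique
       ) as steps
group by is_diskbased;
```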