several things you need to know about Materialized View in Clickhouse
Clickhouse is a realtime OLTP (Online Transaction Processing) engine which uses SQL-like syntax. Talking about SQL, we can create Tables and Views to retrieve data. Usually View is a read-only structure aggregating results from 1 or more Tables — this is handy for report creation which required lots of input from different tables. Sometimes we do need to update the view data and this could be achieved if the view is a Materialized one.
Differences between View and Materialized View
- View is read-only and Materialized View is updatable (however depends on RDBMS product’s implementation as well)
- View is in-memory and hence everytime you access it, you are triggering a select statement and aggregations (if any) to build the content. View contents could be cached to increase performance.
- Under Clickhouse, Materialized View also works in memory, but the results are actually written to a Table. In some sense, we can say that a Materialized View contains the process and instructions to convert data from various sources (i.e. Tables) into a consolidated set and write them to a designated Table as destination. Do note that different RDBMS would have different implementations on how a Materialized View works.
- Under Clickhouse, another use case for Materialized View is to replicate data on Integration Engines. Take an example, Kafka integration engine can connect to a Kafka topic easily but problem is every document is read-ONCE in nature; hence if we want to keep a replicated copy that is searchable, one solution is to build a Materialized View and populate a target Table.
Once we have a ground knowledge on what View and Materialized View are, a question arise “if both of them generates the final data through in-memory operations and table joins… then why should we use Materialized View?”
The short answer is Materialized View creates the final data when the source table(s) has updates. Think about it as Table Triggers, once a Table has been updated (add / edit / delete), the Materialized View instructions are activated and hence updating the destination Table’s content. This might not seem to be advantageous for small datasets, however, when the source data volume increases, Materialized View will outperform as we do not need to aggregate the huge amount of data during query time, instead the final content is built bit by bit whenever the source Tables are updated.
case 1. building our first view
Let’s create a transactions table (MergeTree engine) and populate some data to it. Transactions consist of an ID, customerID, the payment method (cash, credit-card, bitcoin etc), the productID involved as well as the quantity and selling price; finally a timestamp indicating when the transaction happened.
After inserting some data, let’s run a SELECT with aggregations; do note that Clickhouse supports SQL-like syntax and hence aggregation functions like sum, count, avg could be used, also remember to “group-by” whenever aggregations are involved.
Alright, this SELECT acts as the grounding query for building our 1st Materialized View. The first step is actually creating the designated target Table containing the contents built through the Materialized View (confused?? :)) The second step is then creating the Materialized View through a SELECT query.
Do note that the target Table’s definition (columns) is not required to be identical to the source Table. Take an example the target Table — transactions4report defines all columns EXCEPT the “id” and “productID”.
The syntax for Materialized View contains a SELECT statement,remember the view acts as an instruction / process to populate the data for the target Table.
transactions (source) > mv_transactions_1 > transactions4report (target)
Alright, till this point, an interesting question arises - “would the Materialized View create entries for us from the beginning of the source Table?” The answer is NO~ We usually misconcept on this very important point. Materialized View only handles new entries from the source Table(s). So that’s why we need to insert “new” data on the source to validate how our View works.
After that, our target Table should have data populated and ready for SELECT.
Hm… again till this point, another interesting question arises - “all these workloads seem to be pointless as the results of the target Tables are nearly identical to the source Tables???” Indeed, if the Materialized View is maintaining a 1:1 relationship between source and target; then it simply is just performing data replication~ Again such replication is essential for certain integration engines like Kafka and RabbitMQ (check above).
PS. If you want a “clean sheet” on the source table, one way is to run an Alter-DELETE operation. Unlike conventional SQL supporting the “DELETE from table” syntax, Clickhouse supports data removal through the Alter syntax instead
ALTER TABLE transactions DELETE WHERE 1 = 1;
# or even easier…
ALTER TABLE transactions DELETE WHERE 1;
case 2. a View joining multiple source Tables
Usually, Views or Materialized Views would involve multiple Tables’ integration. Take an example for the transactions Table, it might require us to join PaymentMethod Table.
Next is to create the target Table - transactions4report2. The definitions are pretty much the same as the former one, but 1 major difference is this time the payment method’s name would be gathered instead of its ID value (e.g. 0 = cash, 1 = credit card).
Alas, the Materialized View (mv_transactions_2) definition is slightly different from the former in which a table join is required to capture the payment’s name.
transactions t > join by t.paymentMethod = p.id > paymentMethod p
Let’s add a few records in the source Table and let Table transactions4report2 populated as well. Finally we can make use of the target Table to run different kinds of SELECT queries to fulfil the business needs. Views (or Materialized Views) are handy for report creation as 1 simple SQL would be enough to gather enough data to populate fields on the report (e.g. Crystal Reports or Jasper Report).
even though 1 use-case of Materialized Views is for data replication. However, this is not a perfect solution for High-Availability. For production environments, we should look at Replicated Engines instead. Also don’t forget to look for Shard Distributions to avoid single-point-of-failure.
Cool~ We have just gone through some adventures in Tables and Materialized Views.
- Materialized Views is like a set of procedures / instructions to extract data from source Table(s) and aggregate them into the target Table
- Different from Views, Materialized Views requires a target Table. Remember that the target Table is the one containing the final results whilst the view contains ONLY instructions to build the final content.
- Materialized Views could act as a replica for certain integration engines such as Kafka and RabbitMQ. For a more robust and reliable replication solution, look for Replicated Engines and Distributed Engines instead.