5 things to know about Clickhouse SummingMergeTree
1. What is a SummingMergeTree?
In Clickhouse, there are several Table Engine families and MergeTree(s) is the most popular one. If you are looking for OLTP features, the good old MergeTree should be good enough. However if you are looking additional features like the following:
- data deduplication AND
- summarization of numeric columns
then SummingMergeTree is the one to go.
2. How is the data being deduplicated?
In Clickhouse, every Table should be configured to have a Primary-Key (also by default the Sorting-Key shares the same configuration with the primary-key). So when rows of data is ingested, the ones sharing the same Primary-Key would be summarized eventually.
Why is it “eventually”? Data being ingested are stored as data-parts — simply treat it as small files containing certain batch of data. Periodically, Clickhouse would merge the data-parts to form a bigger one. Which means if we are using a SummingMergeTree, there is a chance that several entries would be available under the same Primary-Key simply because the data rows (data-parts) are not merged yet~
How frequent is this periodic merge then? Typically 10 to 15 minutes. A small tutorial session would be illustrating how to solve this by running an “OPTIMIZE” query.
3. The only supported aggregation function is SUM
Yes~ The name of the Table Engine has already told you what aggregation function is available for summarization -- SUM. That simply means the rows of data sharing the same Primary-Key (or Sorting-Key to be exact) would only be “summed” up on the numeric columns. There is no way to provided an alternate aggregation function.
4. What is happening during the Summed operation?
continued from the last point, the numeric columns would be SUM-ed, but what about those non numeric fields? The answer is — the FIRST row of data matching the Primary-Key would be the skeleton of the summarized entry.
Take an example, if a Table contains 3 fields and the Primary-Key is (brand):
- brand (String) — the brand name
- qty (UInt32) — amount of sales AND
- country (String) — country of origin
and assume 2 entries from the same brand is inserted:
- brand (MadBook), qty (100), country (Brazil)
- brand (MadBook), qty (300), country (India)
the sum-ed data would be…
- brand (MadBook), qty (400), country (Brazil)
Note that the country column would be “Brazil” instead of “India” as the 1st row matching the Primary-Key has the country column valued “Brazil”, hence all following rows’ country values would be ignored.
5. Best Practice
Since SummingMergeTree SUMs up rows with the same Primary-Key, hence if raw data is required (instead of the summarized ones); it is suggested to work in pair with MergeTree.
- MergeTree — containing all the raw data (millions of rows)
- SummingMergeTree — the summarized version of those millions of rows
Remember that for report generation or OLAP operations, SummingMergeTree would be a good choice; whilst for search-based operations, MergeTree would be the one to go.
Does that mean a double on my storage size? Not exactly, remember that SummingMergeTree is an aggregated set of the raw data; hence if quite a lot of the raw data share a few Primary-Key, then the number of rows dropped significantly at once (well, after 10~15 minutes maybe :) ). For most cases, would not be a double for sure.
tutorial — a study on how the SummingMergeTree works
The above snippet illustrated how to create the SummingMergeTree table. The targeted columns to be summed is provided as parameter, note that it is in a “tuple” ( (xxx, yyy) ) form and not an “array” ([xxx, yyy]) form. After inserting some data rows, the final result would be the summarized version.
An interesting point is the last query, we already know that sometimes table data are not 100% summed; hence the suggested way to query for data is apply the sum() directly inside your query. Of course, this might not be perfect and the next example will illustrate how to force-merge data parts to form the final results at once.
We added a new column named “country” and then clear out the existing records to avoid confusions. Next, we re-enter the data rows, do note that for MadBook, we have 2 country-of-origin: Brazil and India.
Based on our understanding, the picked country value would be the 1st row matched the Primary-Key and so the winner is “Brazil”. To verify what we’ve learned, insert one more MadBook entry whose country-of-origin is Africa.
Interestingly… there are 2 entries under MadBook~ This is the partial summed scenario. Ways to solve the issue is to run a SUM() aggregation and ignore the “country” column, this for sure works~ However, what if I also need to retreive the “country” column with the summed results??
The solution is to run an “optimize-deduplicate” query. Now we can see, MadBook is produced by “Brazil” again, no India or Africa. Optimize query is powerful BUT also costly. In our tutorial, there is only a few records and should have no impact on summing and deduplicating them. But what if the dataset consists of millions and billions of record? That would take too long to finish the optimization. That is why — “use it with care”.
We have gone through a few intersting things here:
- knowing what the SummingMergeTree is and how it operates
- knowing how the de-duplication (or summarization) works
- tackling issues on partial summed entries
- also the best practice on using SummingMergeTree side-by-side with ordinary MergeTree.
Happy data-mining~ :)