know your data’s “Time to Live” in Clickhouse — 4 essential settings about TTL
Clickhouse is a powerful realtime OLAP (Online Analytical Processing) system for your big data. Since it is “BIG” data, we would need to housekeep them in an efficient way. In this blog, we are going to explore a TTL (time-to-live) setting in Clickhouse where the data would be automatically housekept based on a per table or per column configuration.
1. Column level TTL
TTL in general is an indicator to determine how long the data can exists; on columns level, it simply means how long for that column’s value should be maintained.
Take an example, for a column named “earnings” of type Int, if TTL has reached, the value of earnings would drop to the type’s default value which is “0”.
Let’s start by creating a table:
The create-table syntax is very similar to what normal SQL defines, the only new thing to us is probably the TTL settings for “earnings”. What we try to accomplish is for all records, if they expired (30 seconds in this example), the earnings column would be reset to “0” — the default value for Int type.
The following is a snippet on how it works.
Clearly, applying TTL on columns might not be very useful; as the ONLY thing we could do to the affected columns is to supply the corresponding types’ default value.
2. Table level TTL — DELETE
Now, let’s explore on Table level TTL instead. One of the valid options for Table Level TTL is “DELETE” — removing expired rows physically. If not specified, the default action on expired data is also DELETE.
Deleting data in production systems is always a hot debate. A Simple rule is… if your targeted tables are essential to your business logics (e.g. transaction tables or client-profile tables) DON’T delete them (of course). Whilst for log data, it might be worthy to delete some historical data, the rationale is such aged data is nearly not accessed at all~
Let’s modify our existing table to add in TTL:
As usual add a record to verify the result:
insert into ttl_column_example values (now(), 21000, ‘north-wind-furniture’);
The expected result is the inserted record should be DELETED after 30 seconds being added to the table… and it works~ The following is the full snippet:
Hm… everything sounds good, would it be nicer if the records are deleted not ONLY because of expired but also matching a condition? Your wish is granted and we would introduce the “where” clause under TTL.
Do note that we only removed expired records whose status is in either
- bankrupt OR
PS. there is an interesting line starting with “optimize” which we would have some coverage later in this blog.
3. Table Level TTL — Aggregation
If removing expired data is not your cup of tea, what about setting the values of targeted columns based on an aggregated result? For example, it might make real sense for expired ‘earnings ’entries to set by their average earnings within that particular month. To accomplish this, let’s run the following snippet:
The new thing here is the TTL syntax, a “group-by” is added to categorise the data rows and then per group apply an aggregation on the ‘earnings’ column, if you have other columns involved in the table, then the 1st row’s values, per group, would be chosen.
One more tricky thing is … for mutations on the table rows; this is no guarantee the changes are update immediately. Quoted from the official documentation:
When ClickHouse detects that data is expired, it performs an off-schedule merge. To control the frequency of such merges, you can set merge_with_ttl_timeout. If the value is too low, it will perform many off-schedule merges that may consume a lot of resources.
If you perform the SELECT query between merges, you may get expired data. To avoid it, use the OPTIMIZE query before SELECT.
Hence we would need to run an optimize query on the target table. Typically a simple deduplication would be enough to refresh the mutation / merge. Once the optimization is done, our SELECT query should be retrieving the expected results — averaged earnings during a specific month.
4. Table Level TTL — Moving data
Another option for Table TTL is to move expired data to other discs. Moving data around sounds a weird thing? If you heard of something named Hot-Warm architecture, data mobility is an essential feature to make it happen.
PS. before moving data around, we would need to setup “storage-policy” and define “disk” and “volume” within config.xml. The configuration on storage-policy would be covered in another medium blog.
The following is a snippet on how the moving works:
So on the database-wise, there is no impact on our SELECT syntax. On performance-wise, if “disk_2” is a HDD whilst “disk_1” is a SSD; then we should experirence a slooooower query. Based on the above configuration, all expired / historical records would be moved to the slower disk_2 whilst new records would be inserted into the fast SSD — disk_1. This is exactly how Hot-Warm architecture works~
There are a few things to note for as well…
- our SQL syntax would be re-written for optimization purposes. Take an example, the syntax for create-table with column level TTL would be something like this
` earnings ` Int TTL ts + toIntervalSecond(30)
is the re-written version of the following:
` earnings ` Int TTL ts + INTERVAL 30 SECOND
hence the final SQL might be slightly different from our crafted ones.
- remember that Clickhouse works well with read-only data (similar to many NoSQL databases). We can still update and delete data within the tables; however ONE important thing to note is that all mutations would be triggered as an off-schedule merge. Dirty reads might be possible unless we “wait” for sometime OR run an optimize query before a SELECT.
This is what we have gone through today:
- understanding TTL settings on the column level -> default values based on the column’s data type.
- Table TTL — DELETE which is the default option. Once data has been expired, they would be removed.
- Table TTL — Aggregation. If you found it meaningful to aggregate grouped records (expired ones), this is the right choice.
- Table TTL — Move. Where mobility of data based on age (expiry policy) is important to your architecture.