Entity-Centric Elasticsearch Events

devops terminal
7 min read · Aug 17, 2020
Screenshot of the “Transform” UI — Kibana

Video tutorial available here: https://www.youtube.com/watch?v=WwPuWYNelpY

Elasticsearch is a great tool for ingesting data and performing fast aggregations (also known as analytics). All ingested data logs are treated as individual events which, in theory, have no dependence on other entries. Logically, however, this might not be the case.

On an eCommerce platform, when a user checks out the shopping cart and pays, a corresponding event (or a set of related application logs) is ingested by Elasticsearch. Treating this entry (or entries) as an individual event sounds reasonable; however, this particular user might have purchased on the platform more than once in the past. Then… wouldn’t it be a fantastic idea to gather all of this user’s purchases and start checking the spending metrics? With such data on hand, it is quite easy to plot a trend analysis of spendings on a weekly / monthly basis, or to plot a category-based spending pie chart.

PS. I am pretty sure the marketing team will fall for you (I mean, fall for your analysis) at once :)

The process of creating consolidated data out of individual events (an event-centric architecture) is named entity-centric processing — which is the main purpose of this blog / tutorial.

How?

Now you have heard of the benefits of grouping / consolidating related events, but how? For a long time, the only way to perform entity-centric processing was through a program. Basically we would run a query to gather all events related to an entity (in this case a user_id, representing a particular user on the eCommerce platform). Once we had the events, a simple for-loop would go through them and compute metrics such as the summation or max of a numeric field. Finally, the consolidated results would be ingested back into Elasticsearch under a different index.
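To make that concrete, below is a minimal sketch of such a program using the official Python client (elasticsearch-py 7.x). The index, field and destination names (“orders”, “user_id”, “price”, “user_spending”) are purely illustrative assumptions, not part of the tutorial dataset:

```python
# A minimal sketch of the "manual" entity-centric approach with the official
# Python client (elasticsearch-py 7.x). Index / field names ("orders",
# "user_id", "price", "user_spending") are illustrative assumptions.
from collections import defaultdict

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# 1. Run a query to gather every event belonging to each entity (user_id).
totals = defaultdict(lambda: {"sum_price": 0.0, "max_price": 0.0, "orders": 0})
for hit in helpers.scan(es, index="orders", query={"query": {"match_all": {}}}):
    doc = hit["_source"]
    user = doc["user_id"]
    price = float(doc["price"])
    # 2. The "simple for-loop": summation and max of a numeric field.
    totals[user]["sum_price"] += price
    totals[user]["max_price"] = max(totals[user]["max_price"], price)
    totals[user]["orders"] += 1

# 3. Ingest the consolidated results into a different index.
helpers.bulk(
    es,
    (
        {"_index": "user_spending", "_id": user, "_source": {"user_id": user, **metrics}}
        for user, metrics in totals.items()
    ),
)
```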

Sounds straightforward, doesn’t it? Hm… only if you are a developer~ Plus we would need extra effort to maintain the program logic too.

Fortunately, if we are using Elasticsearch 7.7 or above, there is an X-Pack feature named “transform” available to save us from all of the above.

Job creation

The very first step is to create a “transform” job. Navigate to “Management”, then under the “Elasticsearch” section choose “Transforms”; this brings us to the correct page.

Click “Create a transform” and choose an index for consolidation. A wizard page should now be shown, as below:

Pretty straightforward: we get a preview of the data index on the top right-hand side — great for understanding which fields we could employ for data partitioning and analysis. After making our choices, we first “group by” (or partition) the data index on at least 1 field. Basically the “group by” feature works exactly the same as a bucket aggregation in normal analytics — it simply defines how to categorise data by a field’s nature. In this tutorial, we pick a date field to bucket the data into a date histogram; then, on the 2nd level of “group by”, we pick the customer’s full name.

Once we have the groupings set up correctly, it is time to set up the corresponding metrics. In this tutorial we just want to know the max and sum of purchases for a particular user, hence the following would be set up:
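For reference, these wizard choices roughly translate into a transform “pivot” definition like the sketch below (assuming the field names of the Kibana sample eCommerce dataset: order_date, customer_full_name and taxful_total_price):

```python
# A sketch of what the wizard's "group by" + "aggregations" choices translate
# to in a transform pivot, assuming the sample eCommerce dataset's field names.
pivot = {
    "group_by": {
        "order_date": {
            "date_histogram": {"field": "order_date", "calendar_interval": "1d"}
        },
        "customer_full_name": {
            "terms": {"field": "customer_full_name.keyword"}
        },
    },
    "aggregations": {
        "taxful_total_price.max": {"max": {"field": "taxful_total_price"}},
        "taxful_total_price.sum": {"sum": {"field": "taxful_total_price"}},
    },
}
```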

Feel free to add more metrics for experimentation too :)

Alright~ Once we have set up the “group by” and “aggregations”, the pivot preview table will show us some sample data:

Note that the data consists of 4 columns — full name, order date, max of price and sum of price. These will be the contents of our final index. Click “Next” to proceed.
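By the way, the same preview can also be fetched through the transform preview API before clicking “Next”; here is a hedged sketch using the Python client (7.x), reusing the pivot dict from the earlier snippet. The index name kibana_sample_data_ecommerce is the demo dataset’s index:

```python
# Preview what the destination documents would look like, without creating
# the transform yet (reuses the `pivot` dict defined above).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
preview = es.transform.preview_transform(
    body={"source": {"index": "kibana_sample_data_ecommerce"}, "pivot": pivot}
)
for row in preview["preview"][:5]:
    print(row)
```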

Step 2 involves the job’s ID (name) and the most important setting — the destination index where the consolidated data will be available. If you want an index pattern to be created for this destination index, check the “Create index pattern” toggle button (a small discussion of this setting comes later on). “Continuous mode” makes sense only when the data index contains live data — that is, data that keeps being updated — so that the consolidated data keeps being updated as well. For this tutorial, the data index picked is a static dataset (actually the demo dataset provided by Elasticsearch, the “eCommerce orders” sample); hence we can just leave this option unchecked.
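For completeness: “Continuous mode” corresponds to the “sync” section of a transform definition. Here is a hedged sketch of what it would look like if our source index were receiving live data (we skip it in this tutorial since the demo dataset is static):

```python
# Only relevant for live data: a "sync" block tells the transform how to pick
# up newly arriving documents. Field name and delay are illustrative.
sync = {
    "time": {
        "field": "order_date",  # timestamp field used to detect new events
        "delay": "60s",         # allow for ingest latency before processing
    }
}
```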

Click “Next” and proceed to the final step.

3 options are available. “Copy to clipboard” copies the API call required to create the transform job; this is useful if you want to learn the APIs. “Create” will create the job but NOT start it. “Create and start”, on the other hand, will create the job and kick it off at once. Suppose your data index contains a lot of data, like trillions of events… it might be a good idea to trigger the transform job only at midnight when the system is not servicing anybody.

PS. there is a very meaningful sentence on the “Create and start” option — “A transform will increase search and indexing load in your cluster.” This makes sense, as consolidation needs to search all related events in the first place (that is the search load) and, after consolidation, the data / events need to be indexed back into the destination index (the indexing load); hence this is another reason why you might just create a job and not run it at once.
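And this is, roughly, what “Create” versus “Create and start” maps to in the API; a hedged sketch with the Python client, reusing the pivot (and optional sync) dicts from above. The job ID and destination index name entity_transform_999 follow this tutorial:

```python
# "Create" vs "Create and start" via the API (reuses `pivot` / `sync` above).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.transform.put_transform(
    transform_id="entity_transform_999",
    body={
        "source": {"index": "kibana_sample_data_ecommerce"},
        "dest": {"index": "entity_transform_999"},
        "pivot": pivot,
        # "sync": sync,  # uncomment for continuous mode on live data
    },
)

# "Create" stops here; "Create and start" additionally kicks the job off:
es.transform.start_transform(transform_id="entity_transform_999")
```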

For our case, the demo dataset is very compact in size, so let’s just create and start it~ Once you see this status update, the transformation is done:

progress 100% — transformation done
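If you prefer not to keep the UI open, the same progress information can be polled from the transform stats API; a hedged sketch (the exact response fields may differ slightly between versions):

```python
# Poll the transform's stats to check whether it has finished.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
stats = es.transform.get_transform_stats(transform_id="entity_transform_999")
for t in stats["transforms"]:
    print(t["id"], t["state"], t["stats"]["documents_processed"])
```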

Now if you click on the “Discover” option, you should be able to see the following:

Cool~ We made it~ (Party time)

Wait… something looks weird here… we have a “date” field… then why can’t we plot the histogram part?????

A small discussion on the “Create Index Pattern” option

Earlier, when we created the job, we had a choice of whether or not to create the corresponding “index pattern” for the destination index. At that time, we did vote “YES”, and this is why… the histogram is MISSING~

Yep, Elasticsearch did create an index pattern for us; however, the index pattern created has a flaw: it did not let us pinpoint which field should act as the default timestamp for visualization plotting!!! Simply put, the created index pattern has NO SUCH information~ Hence we can’t create graphs…

PS. if you are expecting the consolidated data index to be available for visualization plotting as well… DO NOT check the “Create index pattern” option; create the index pattern manually instead~

Workaround

Alright, we voted “YES” earlier, so how can we get things working again now??? It is actually rather simple: navigate to “Management”, then under the “Kibana” section choose “Index Patterns”, find the corresponding index pattern (if you are following the tutorial, the designated index pattern would be named entity_transform_999) and DELETE it~ YES, delete it, then recreate the index pattern, this time picking which field is the default timestamp field, and done~
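If you would rather script this fix, the same delete-and-recreate can be done through Kibana’s saved objects API; a hedged sketch where the Kibana URL, the pattern id placeholder and the names are assumptions based on this tutorial:

```python
# Recreate the index pattern with a default time field via Kibana's saved
# objects API. URL, auth-free setup and ids are illustrative assumptions.
import requests

KIBANA = "http://localhost:5601"
HEADERS = {"kbn-xsrf": "true", "Content-Type": "application/json"}

# 1. Delete the flawed index pattern (look up its id in the saved objects UI):
# requests.delete(f"{KIBANA}/api/saved_objects/index-pattern/<pattern_id>", headers=HEADERS)

# 2. Recreate it, this time declaring "order_date" as the default timestamp field.
resp = requests.post(
    f"{KIBANA}/api/saved_objects/index-pattern",
    headers=HEADERS,
    json={
        "attributes": {
            "title": "entity_transform_999",
            "timeFieldName": "order_date",
        }
    },
)
print(resp.status_code, resp.json())
```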

Delete the index pattern
During the recreation of the index pattern, remember to pick the “date” field this time

Now, once the above is done, navigate again to the “Discover” app; you should see the following~

Cool~ You can also start plotting charts based on this index / index pattern now, like the following one built with Lens:

Closing

We have just gone through an interesting journey on how to create entity-centric data from event-centric logs (a data index). You have just equipped yourself with one more secret weapon up your sleeve when it comes to data analysis~ Remember, the great thing is that we did not even need to write a single line of program code to achieve this goal, which is amazing~

PS. if you are more into a video tutorial, do check out the link here: https://www.youtube.com/watch?v=WwPuWYNelpY

Happy data-engineering :)


devops terminal

a java / golang / flutter developer, a big data scientist, a father :)