Use case: coronavirus (COVID-19) study and visualisation with Elasticsearch

Lately a new coronavirus disease, named COVID-19, has been spreading rapidly across Asia and is starting to expand towards Europe and the Americas. In this blog, we are going to explore how Elasticsearch can help us investigate and visualise the virus trends.

PS. scripts / code are available at https://github.com/quoeamaster/bigdata_blogs/tree/master/coronavirus_COVID_19_usecase

First of all, the data source

Various sources of virus data are available on the web, e.g. statistics from government health departments, research data from universities and even hospitals' press releases. As is often the case, the data from these sources comes in different and sometimes unstructured formats (e.g. csv, xml or raw text); hence the first thing we should do is study the data sets and decide which one to use.

The data source I picked comes from https://www.pharmaceutical-technology.com/features/coronavirus-outbreak-the-countries-affected/, which is in csv format. The reason is simple: a csv is easier to parse than xml or raw text. Plus, the data set contains useful information including the number of people infected and the lat-lon location, which will be handy for coordinate-map plotting later on.

A preview of the data file is as follows:
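(The header wording below is an assumption reconstructed from the rows shown later in this post; the only thing we rely on is that the header line starts with "Lat".)

Lat,Lon,City,Country,Infected,Death
13.7542529,100.493087,Bangkok,Thailand,10,0
36.6248089,-121.1177379,"San Benito, CA",US,2,0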

Since the data file is already nicely structured, all we need to do now is focus on how to ingest and parse its contents correctly. We will use the simple and easy combination of Filebeat + ingest pipelines to do the trick.

Filebeat configuration

Download Filebeat (if necessary) from https://www.elastic.co/downloads/beats/filebeat.

Remember that Filebeat is written in Go, so please download the correct executable for your target OS (e.g. if you are using 64-bit Windows, download the "Windows MSI 64 bit (Beta)" version). Follow the instructions on how to unpack / install Filebeat.

Now configure Filebeat to ingest the data from our csv file above:
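A minimal sketch of such a filebeat.yml (the csv path and the Elasticsearch host below are placeholders; adjust them to your environment):

# filebeat.yml (sketch)
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /path/to/coronavirus_data.csv   # placeholder; point this to the downloaded csv
    exclude_lines: ['^Lat']             # skip the header line

# required when using a non-default index name (see below)
setup.ilm.enabled: false
setup.template.name: "coronovirus"
setup.template.pattern: "coronovirus*"

output.elasticsearch:
  hosts: ["localhost:9200"]             # placeholder; your Elasticsearch host
  index: "coronovirus"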

The "input" comes from a log file (our csv); we exclude the first line of the csv since it is just a header (exclude_lines: ['^Lat']). The "output" is an Elasticsearch host, and we use "coronovirus" as the index name.

A few config changes are important when using a non-default index name (the default is "filebeat-%{[agent.version]}-%{+yyyy.MM.dd}"). The first is to disable ILM (index lifecycle management) by adding "setup.ilm.enabled: false", otherwise the "index" setting would be ignored. The next two settings are "setup.template.name" and "setup.template.pattern", so that the corresponding index template gets applied to our new index name.

Before we start ingesting the data file, let's run the test config and test output commands. To test the config file's correctness =>

filebeat test config -c {config_file_name}

if all is GOOD, "Config OK" would be displayed; otherwise an error would be shown, giving you a clue as to what is mis-configured.

To test the output (e.g. Elasticsearch connectivity) =>

filebeat test output -c {config_file_name}

if all is GOOD, a line displaying “talk to server… OK” would be shown.

Ingest and Parse

To ingest, simply run =>

filebeat -c {config_file_name} -e

the "-e" option makes Filebeat write its logs to standard error, which is useful for debugging.

If all goes well, we should now be able to query the new index from Kibana. Try running a GET =>

GET coronovirus/_search

and you should get back 96 hits (or more, depending on the data set).

2 things to pay attention to:

  • there is a "message" field containing exactly the content of the csv (one line per document); this is not a suitable format for searching or aggregations, hence we need to parse this field
  • there are many other fields that we are not interested in, such as "log", "host" and "agent"; it would be better to remove them and save some space

Let’s write an ingest pipeline to solve the above issues:
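A sketch of such a pipeline is shown below; the field names produced by the grok pattern (lat, lon, city, country, infected, death) and the direct integer conversion are my assumptions based on the csv columns:

PUT _ingest/pipeline/coronovirus_parser
{
  "description": "parse the coronavirus csv lines (sketch)",
  "processors": [
    {
      "remove": {
        "field": ["log", "host", "agent"],
        "ignore_missing": true
      }
    },
    {
      "gsub": {
        "field": "message",
        "pattern": "\"",
        "replacement": "'"
      }
    },
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{NUMBER:lat},%{NUMBER:lon},(?<city>'[^']*'|[^,]*),(?<country>[^,]*),%{NUMBER:infected:int},%{NUMBER:death:int}"
        ]
      }
    }
  ]
}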

The "remove" processor removes the unimportant fields from the document. The "gsub" processor helps to replace the DOUBLE-QUOTE characters within the message field. Let's take a look at some sample document contents to see why we need to replace the DOUBLE-QUOTE:

a normal document's message looks like =>

13.7542529,100.493087,Bangkok,Thailand,10,0

while a document with a DOUBLE-QUOTED city name looks like =>

36.6248089,-121.1177379,"San Benito, CA",US,2,0

if we replace the DOUBLE-QUOTEs around the city part with SINGLE-QUOTEs, the above value becomes =>

36.6248089,-121.1177379,'San Benito, CA',US,2,0

and hence there is less trouble parsing the parts.

Finally, we use the "grok" processor to parse out the corresponding parts. We can also use the Grok Debugger in Kibana to test out the grok expressions.

Before deploying, remember to test the pipeline with some sample data:
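For example, using the _simulate API in the Kibana Dev Tools console (the two sample messages below are the documents we looked at above):

POST _ingest/pipeline/coronovirus_parser/_simulate
{
  "docs": [
    { "_source": { "message": "13.7542529,100.493087,Bangkok,Thailand,10,0" } },
    { "_source": { "message": "36.6248089,-121.1177379,\"San Benito, CA\",US,2,0" } }
  ]
}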

To update / parse the existing documents, run the following =>

POST coronovirus/_update_by_query?pipeline=coronovirus_parser

Exception!

Unexpectedly, some exceptions surfaced (even though our simulation worked earlier); let's take a deeper look into them:

{
  "index": "coronovirus",
  "type": "_doc",
  "id": "86L5UHABRsd5XQzeAMTv",
  "cause": {
    "type": "number_format_exception",
    "reason": "For input string: \"54.406\""
  },
  "status": 400
}

Hmm… seems like a data casting problem; the simplest way to investigate what is wrong with this particular document is to do a GET with the corresponding document id:

GET coronovirus/_doc/86L5UHABRsd5XQzeAMTv

"message" : "31.1517252,112.8783222,Hubei,China,54.406,1.457"

Ah ha~ the "infected" and "death" fields are not integers. This is probably a data quality issue in the source file; it does not make any sense for the number of infections to be a floating-point number. Hence we can modify the parser pipeline to convert these fields to "float" first, and then run a script processor to cast them back to "integer".
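One way to implement this fix (a sketch, assuming the pipeline from above): change the grok conversions from :int to :float, so the pattern becomes

%{NUMBER:lat},%{NUMBER:lon},(?<city>'[^']*'|[^,]*),(?<country>[^,]*),%{NUMBER:infected:float},%{NUMBER:death:float}

and append a script processor as the last step of coronovirus_parser to cast the values back to integers:

{
  "script": {
    "lang": "painless",
    "source": "ctx.infected = ((Number) ctx.infected).intValue(); ctx.death = ((Number) ctx.death).intValue();"
  }
}

Re-create the pipeline with these changes and run the _update_by_query again.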

Great! All the data is parsed correctly now. Next is to plot some charts to investigate the trend.

Visualisation

Let's try to plot a coordinate map (world map) chart based on the lat-lon values. Wait... something is wrong: in order to plot a coordinate map, we need a SPECIAL datatype, geo_point. Let's update our index to add such a field, PLUS update our parser pipeline to populate this geo_point field with the correct lat-lon values:
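Adding the new field to the existing index mapping could look like this (I use "location" as the geo_point field name; the name itself is an assumption):

PUT coronovirus/_mapping
{
  "properties": {
    "location": { "type": "geo_point" }
  }
}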

Next, update the parser pipeline:

PS: the way to set a geo_point is to use a “set” processor instead of a “script” processor.
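A sketch of that change: append a "set" processor like the one below to the processors of coronovirus_parser (a geo_point field accepts a "lat,lon" formatted string), re-create the pipeline, and then re-run the update:

{
  "set": {
    "field": "location",
    "value": "{{lat}},{{lon}}"
  }
}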

POST coronovirus/_update_by_query?pipeline=coronovirus_parser

Now we are ready to plot a coordinate map chart!

Perfect! Let's hover over one of the points on the map and check the values associated with it…

Hmm, the number of infected patients at that location should surely be more than "1".

Let's check the "metric" part of the visualisation. Instead of "count", we should use a "sum" metric based on the "infected" field; so let's change it and re-run the visualisation:

Eventually, to investigate the trend of the virus, we would need to keep updating the data (which Filebeat handles through ingestion) and the trend analysis (done via a Kibana dashboard). A sample dashboard looks as follows:

Summary

Today we have illustrated a common use case covering:

  • getting source data ingested (Filebeat)
  • parsing the source data (ingest pipeline)
  • cleansing the source data (ingest pipeline)
  • visualising the data in dashboards for investigation / trend analysis (Kibana)

It is quite incredible that one software stack (the Elastic Stack) can handle all of the above without installing additional tools or software!

Lastly, I hope the virus threat ends soon, and that calm and love stay with us as we walk through the bitterness of this period.

Further reads:

A follow-up blog on integrating a COVID-19 tracker API with the Elastic Stack is available here.
