Data masking with filebeat
Sometimes we need to mask out confidential information before using it for analysis. In this blog, we take on the challenge of doing that with filebeat (instead of logstash or an elasticsearch ingest node).
the filebeat way
First of all, we need to prepare a dataset. For simplicity, it looks like the following:
The dataset is in CSV format, with the first field “client_id”, the second field “transaction_date” and the third field “amount”. Assume we would like to mask out a portion of the first field, “client_id”.
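Something along these lines would do; the first row matches the values used in the walkthrough below, and the remaining rows are made up purely for illustration:

"001","2020-11-12",1234.567
"002","2020-11-13",78.9
"003","2020-11-14",23456.789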
If you have been using filebeat for some time, you have probably noticed that filebeat turns each line of input into a field named “message” in its own JSON document. This format is not ideal for analytics and searching, since we still need to break that line of text into individual fields (in this case, client_id, transaction_date and amount). In the past, we could do nothing about it within filebeat itself and would need to set up a pipeline on elasticsearch ingest node(s) to pre-process this “message” field.
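For instance, without any further processing, the first line of our dataset ends up wrapped in a document roughly like the following (filebeat also adds metadata such as host and agent fields, trimmed here for brevity):

{
  "@timestamp": "2020-11-12T10:00:00.000Z",
  "message": "\"001\",\"2020-11-12\",1234.567"
}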
Fortunately, “processors” were introduced in recent versions of filebeat. In one sentence, processors are basic operations applied to the line of data, such as splitting the line into fields, converting data types and running a script.
The following is the config file with data masking:
To make things easy, the “input” (i.e. the source of the CSV data) comes from standard input and the “output” (i.e. the destination of the processed data) prints directly to standard output.
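In case it helps to see the whole file in one piece, a minimal filebeat.yml along these lines should reproduce the setup described here. Note that the target_prefix: "" setting is my assumption, so that the dissected fields land at the top level of the document and match the field names used by the script later on:

# read the CSV lines from standard input
filebeat.inputs:
  - type: stdin

processors:
  # 1) break the "message" line into clientID, trxDate and amt
  - dissect:
      tokenizer: "\"%{clientID}\",\"%{trxDate}\",%{amt}"
      field: "message"
      target_prefix: ""   # assumption: place the fields at the top level
  # 2) convert amt from string to float
  - convert:
      fields:
        - { from: "amt", to: "amt", type: "float" }
  # 3) mask the clientID
  - script:
      lang: javascript
      source: |
        function process(event) {
          event.Put("clientID", "xx" + event.Get("clientID").substring(2, 3));
        }

# print the processed documents to standard output
output.console:
  pretty: true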
We have set up 3 processors: dissect, convert and script. “dissect” is the processor that breaks the line of data into fields.
tokenizer: "\"%{clientID}\",\"%{trxDate}\",%{amt}"
Inside the “tokenizer” setting, each portion of data to be extracted into a field is surrounded by %{field_name}; hence for the 1st line of data, clientID would yield “001”, trxDate would yield “2020-11-12” and amt would yield “1234.567”.
The “field” setting tells filebeat which field contains the data to dissect; in our case that is the “message” field holding the raw line.
Next is the “convert” processor, which converts a field’s value into another designated type. In our example, we would like to convert the “amt” field into a floating point number.
{ from: "amt", to: "amt", type: "float" }
Finally, the “script” processor runs a script to further process the data or field(s). Yep, it looks scary to master, but it is powerful~
Inside the “lang” setting, we declare which language to employ; “javascript” is the lucky one. The “source” setting is where we declare the data-masking logic; bear in mind that a “process” function MUST be available.
function process(event) {
  // replace clientID with a masked value: "xx" plus its 3rd character
  event.Put("clientID", "xx" + event.Get("clientID").substring(2, 3));
}
“event” is the in-memory content of this line of data (or you can say “this” document); hence when we call event.Put(…) we are trying to add or update a field within the content. In our case, we are just replacing the existing clientID field with a masked value.
The masking logic here is simple: mask the first 2 characters of the clientID and append the last character. For example, if the clientID is 001, we would expect “xx1” as the masked value. To do so, we call event.Get(“clientID”) to retrieve the original clientID’s value and then access the last / third character in it by calling substring(2, 3).
And these are the sample results:
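For the first line of data, the processed document should look roughly like this (metadata fields trimmed for brevity; the masked clientID is the point of interest):

{
  "@timestamp": "2020-11-12T10:00:00.000Z",
  "message": "\"001\",\"2020-11-12\",1234.567",
  "clientID": "xx1",
  "trxDate": "2020-11-12",
  "amt": 1234.567
}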
Simple, isn’t it? As long as you know how to write the javascript function correctly, you are all set~
performance concerns…
Filebeat is usually installed on the target system (e.g. on a workstation or a server); hence when we employ such processors, we are using the target system’s resources. Depending on how expensive the processors are, there might be some noticeable impact on the target system, though this is RARE.
On the other hand, if we are employing logstash, then the dedicated server or machine hosting logstash would be the one chewing up the resources, leaving the targeted workstation(s) or server(s) unaffected.
Likewise, if we are employing elasticsearch ingest node(s), then the corresponding ingest node(s) would be the ones working hard, again without affecting the targeted workstation(s) or server(s).
Based on these trade-offs, choose wisely which approach suits your use case best~