Elasticsearch: guard your Mappings in the “Painless” way
Elasticsearch is a great tool for gathering logs and metrics from various sources. It provides lots of default behaviours so that we get a good experience out of the box. However, there are cases where the defaults are not optimal (especially in a Production environment); hence today we dive into ways to avoid “mappings pollution”.
what is a “mapping” and why is there a pollution?
Like any datastore, Elasticsearch needs a schema-equivalent setting that describes how a data field is treated (e.g. stored and processed / analysed); this setting is known as a “Mapping” in Elasticsearch.
Unlike most datastores, Elasticsearch falls back to default handling when no schema is specified. For example, when we create a document and ingest it into a brand new index, Elasticsearch tries to guess the correct field mappings for us (this is called dynamic mapping). Yep~ It is a guess! Hence in most cases the document is ingested without a problem, and automatically too. Cheers~ But wait…
- the guessed mapping might not be optimised (e.g. any integer field is mapped to the data type “long”, a 64-bit value; however, if your integer field only ranges from 0 to 5… a data type of “byte” or “short”, at 8 or 16 bits, is more than enough)
- there are fields that are meaningless to us; if we do not exclude them before ingestion, each one still gets its own entry in the mapping -> a mapping pollution
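To make the first point a bit more concrete, here is a minimal Python sketch that picks the narrowest Elasticsearch integer type for a known value range and builds an explicit mapping body from it. The helper names and the “rating” field are made up for illustration; only the type names and their ranges come from Elasticsearch.

```python
# Value ranges of Elasticsearch's signed integer field types.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(lowest, highest):
    """Return the narrowest integer type that can hold [lowest, highest]."""
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("range does not fit into a 64-bit long")


def explicit_mapping(field_ranges):
    """Build a mappings body from {field_name: (lowest, highest)}."""
    return {
        "mappings": {
            "properties": {
                field: {"type": smallest_integer_type(lowest, highest)}
                for field, (lowest, highest) in field_ranges.items()
            }
        }
    }


# Hypothetical field: a rating that only ever takes the values 0..5.
body = explicit_mapping({"rating": (0, 5)})
```

Here “rating” ends up as a “byte” instead of the guessed “long”. The resulting dictionary has the shape of a create-index request body, so it could be sent when the index is created, before any document gets a chance to trigger the guess.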
how — painless script processor approach
We can create an ingest pipeline with a script processor that removes the target fields before the document is finally indexed.
Use case: remove fields whose names are over 15 characters long.
Maybe our data comes with some metadata stored under a randomly generated UUID (over 15 characters long) and somehow we NEVER need this metadata. Hence, to avoid a mapping pollution, we need to exclude such field(s) at the very beginning. Here is an example of checking the length of the field names within a document:
We can clearly see the results of the 2 test documents: the 1st one contains a field “very_very_long_field”, which yields “true”, whilst the 2nd one contains just an “age” field and yields “false”.
The trick here is the “ctx.keySet()” method. It returns a Set containing all the field names available in the document. After obtaining the Set, we can iterate over it and apply our matching logic.
PS. one tricky thing is… this Set also contains metadata fields such as “_index” and “_id”, hence when applying field matching logic, be aware of such fields too.
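The check described above can be sketched in plain Python, outside Elasticsearch. This is only a stand-in for the Painless script, not the script itself: a dict plays the role of “ctx”, “doc.keys()” plays the role of “ctx.keySet()”, and the function name and the two test documents are assumptions made for illustration.

```python
MAX_NAME_LENGTH = 15  # threshold taken from the use case above


def has_long_field(doc, skip_metadata=True):
    """Return True if any field name in `doc` is longer than MAX_NAME_LENGTH."""
    for key in doc.keys():  # plays the role of ctx.keySet()
        if skip_metadata and key.startswith("_"):
            continue  # ignore metadata fields such as _index and _id
        if len(key) > MAX_NAME_LENGTH:
            return True
    return False


# Two hypothetical documents, including the metadata the Set would contain.
docs = [
    {"_index": "test", "_id": "1", "very_very_long_field": "x"},
    {"_index": "test", "_id": "2", "age": 13},
]
results = [has_long_field(doc) for doc in docs]
```

With a threshold of 15 the metadata names are too short to match anyway, but skipping everything that starts with “_” keeps the check safe if the threshold is lowered later.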
The next example illustrates how to remove the corresponding fields from our document context:
The magic is ‘ctx.remove(“fieldname”)’. Pretty straightforward, isn’t it? Also note that we have applied a more precise rule in our field matching logic, ‘!key.startsWith(“_”) && key.length()>10’, so that the meta fields (e.g. _index) are not considered.
An ArrayList is also introduced to store the target field names. You might ask: why don’t we remove the field directly from the document context during the loop? The reason is that if we try to do so, an exception is thrown complaining about a concurrent modification of the document context. Hence we need to delay the removal, and this ArrayList keeps track of those field names.
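The “collect first, remove later” idea looks like this in a small Python sketch. Again this is a stand-in running on a plain dict rather than the Painless script; the function name is an assumption, and the list “to_remove” takes the role of the ArrayList.

```python
def remove_long_fields(doc, max_length=10):
    """Remove non-metadata fields whose names exceed `max_length`, in place.

    Returns the list of removed field names.
    """
    to_remove = []  # same role as the ArrayList in the script
    for key in doc:
        # Same rule as '!key.startsWith("_") && key.length()>10'.
        if not key.startswith("_") and len(key) > max_length:
            to_remove.append(key)  # only remember the name for now
    for key in to_remove:  # the loop over `doc` has finished, so this is safe
        del doc[key]
    return to_remove


doc = {"_index": "test", "very_very_long_field": "x", "age": 13}
removed = remove_long_fields(doc)
```

Python behaves much like the script here: deleting a key inside the first loop raises a RuntimeError (“dictionary changed size during iteration”), which is a close cousin of the concurrent modification exception mentioned above.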
Finally, there is another situation in which our documents might involve multiple levels / a hierarchy. The following example illustrates how to determine whether a field is a “leaf” field or a “branch” field:
A pretty long one… In order to check whether the field is a “leaf” (a normal field) or a “branch” (another level of fields, e.g. an object), we need to check the type of the field’s value: ‘value instanceof java.util.Map’.
PS. The “instanceof” operator verifies whether the given value is an instance of a particular Java class or interface.
Next, we need to iterate over the Set of inner-object fields to apply our matching rules. The same technique, an ArrayList, is used to keep track of the target field names for removal at a later stage.
Finally, we remove the fields through ‘ctx.remove(“fieldname”)’. But this time, we also need to check whether the field is a leaf or a branched one. A branched field comes in the format “outer-object-name.inner-field-name”. We need to extract the “outer-object-name” first and get access to its context before deleting the “inner-field-name” -> ‘ctx[field.substring(0, idx)].remove(field.substring(idx+1))’
Take an example: outer.very_very_long_field
- idx (index where the “.” separator is) = 5
- field.substring(0, idx) = “outer”
- field.substring(idx+1) = “very_very_long_field”
- hence… ctx[field.substring(0, idx)].remove(field.substring(idx+1)) = ctx[“outer”].remove(“very_very_long_field”)
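The leaf / branch handling walked through above can be sketched in Python as well. It is a simplified stand-in under a few assumptions: only one level of nesting is handled, the name of a branch itself is never checked, top-level field names contain no “.”, and the function names are invented for illustration. “isinstance(value, dict)” takes the place of ‘value instanceof java.util.Map’.

```python
def find_long_fields(doc, max_length=10):
    """Collect names of long fields; inner fields are recorded as 'outer.inner'."""
    targets = []
    for key, value in doc.items():
        if key.startswith("_"):
            continue  # skip metadata fields
        if isinstance(value, dict):  # a "branch": check its inner fields
            for inner_key in value:
                if len(inner_key) > max_length:
                    targets.append(key + "." + inner_key)
        elif len(key) > max_length:  # a "leaf" at the top level
            targets.append(key)
    return targets


def remove_fields(doc, targets):
    """Remove the collected fields from `doc` in place and return it."""
    for field in targets:
        idx = field.find(".")  # position of the separator, -1 for a leaf
        if idx == -1:
            doc.pop(field, None)
        else:
            # Same split as ctx[field.substring(0, idx)].remove(field.substring(idx+1)).
            doc[field[:idx]].pop(field[idx + 1:], None)
    return doc


doc = {
    "_id": "1",
    "age": 3,
    "very_very_long_field": 1,
    "outer": {"very_very_long_field": 2, "name": "a"},
}
targets = find_long_fields(doc)
remove_fields(doc, targets)
```

Collecting the names first and removing them afterwards keeps both loops free of the concurrent modification problem, for the top-level document and for the inner object alike.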
Well done~ This is the “painless” script approach to avoid mapping pollution. Hope you are still alive by now :)))))
how — dynamic settings on indices approach
Sometimes we do not mind those meaningless fields staying in our documents; however~ we do NOT want them to be searchable or aggregatable. We simply let those meaningless fields act as dummies: you can see them (they are available under the ‘_source’ field) but can never apply any operations on them. If that is the case… we can alter the dynamic settings on indices.
We have a very simple mapping definition here: ‘“dynamic”: “false”’ is applied at the root level, meaning that if unknown fields are introduced later on, they are simply ignored (not added to the mapping, and neither searchable nor aggregatable). However, the values of these fields are still kept inside “_source”, the dummies~
That explains why for this document:
{ "age": 45, "name": "Edward Elijah", "address": { "post_code": "344013" }}
We cannot search on the field “age”: it is not part of the mapping, so “age” is a dummy field.
On the “address” level, we set ‘“dynamic”: “true”’, meaning that unknown fields are indexed and added to our mappings (a mapping pollution). Hence we can search on the field “address.post_code” this time.
Finally we declared ‘“dynamic”: “strict”’ at the “work” level; this means any document with an unknown field under “work” is rejected with an exception immediately~ This is the strictest way to avoid mapping pollution, but the result seems brutal as well…
reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.13/dynamic.html#dynamic-parameters
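To see how the three “dynamic” values play together, here is a small Python sketch that walks a document against a mapping and reports what would happen to each field. The mapping below is an assumption reconstructed from the description (only “name” is declared up front, as a “text” field), and the function only imitates the documented behaviour for “true”, “false” and “strict”; it does not talk to Elasticsearch.

```python
# Assumed mapping: root is "false", "address" is "true", "work" is "strict".
MAPPING = {
    "dynamic": "false",
    "properties": {
        "name": {"type": "text"},
        "address": {"dynamic": "true", "properties": {}},
        "work": {"dynamic": "strict", "properties": {}},
    },
}

OUTCOME_FOR_UNKNOWN = {"true": "added", "false": "ignored", "strict": "rejected"}


def classify_fields(doc, mapping, inherited="true", prefix=""):
    """Map each field path to 'mapped', 'added', 'ignored' or 'rejected'."""
    # An object without its own setting inherits `dynamic` from its parent.
    dynamic = str(mapping.get("dynamic", inherited)).lower()
    properties = mapping.get("properties", {})
    outcome = {}
    for key, value in doc.items():
        path = prefix + key
        known = properties.get(key)
        if isinstance(value, dict) and (known is not None or dynamic == "true"):
            # Descend into the object, known or newly added.
            outcome.update(classify_fields(value, known or {}, dynamic, path + "."))
        elif known is not None:
            outcome[path] = "mapped"
        else:
            outcome[path] = OUTCOME_FOR_UNKNOWN[dynamic]
    return outcome


doc = {"age": 45, "name": "Edward Elijah", "address": {"post_code": "344013"}}
report = classify_fields(doc, MAPPING)
```

For the document above, “age” comes out as “ignored”, “name” as “mapped” and “address.post_code” as “added”. A document such as { "work": { "title": "x" } } would report “work.title” as “rejected”; in Elasticsearch itself that means the whole document fails to index, not just that one field.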
how — remove fields through pre-processing approach
The last approach discussed here is to remove meaningless fields right before passing the documents to Elasticsearch… and how? Well… write our own programs and pre-process the documents~ :)))))
This is indeed a VALID approach but might not be suitable for everyone, since some programming knowledge is required. On the other hand, it can be much more flexible to pre-process the documents before passing them to Elasticsearch, as we have full control over how the documents are modified (thanks to the programming language’s capabilities).
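As a rough idea of what such a program could look like, the Python sketch below keeps only an allowlist of fields and serialises the cleaned documents into the newline-delimited body that the bulk endpoint expects. The allowlist, the “people” index name and the function names are all hypothetical, and actually sending the body (with whichever HTTP client you prefer) is left out on purpose.

```python
import json

ALLOWED_FIELDS = {"name", "age", "address"}  # hypothetical allowlist


def clean(doc, allowed=ALLOWED_FIELDS):
    """Return a copy of `doc` that keeps only the allowed top-level fields."""
    return {key: value for key, value in doc.items() if key in allowed}


def to_bulk_body(docs, index_name):
    """Serialise cleaned documents as an NDJSON body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(clean(doc)))  # source line
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


raw_docs = [
    {
        "name": "Edward Elijah",
        "age": 45,
        "3f2b9c1e-aaaa-bbbb-cccc-1234567890ab": {"noise": 1},  # unwanted metadata
    }
]
body = to_bulk_body(raw_docs, "people")
```

An allowlist is used here instead of a name-length rule because, once we are in our own program, it is usually easier to state which fields we want than to describe the ones we do not.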
Closings
Nice~ We have explored a bit of how to prevent mapping pollution through 3 approaches:
- the heavy duty “Painless” script approach
- the “dynamic” setting approach at the index level
- pre-processing your documents with a program
Again, there are no right or wrong choices here; instead, experiment a bit and pick the best approach for your team. All the best~