How I made document contents email-address aware and searchable in Elasticsearch
It all started one morning when a colleague messaged me a question: “Morning… I am trying to make parts of an email address searchable in Elasticsearch. Do you have any idea?”
My instant reply was “Sure…? I thought this was surely possible, isn’t it?” After a few screenshots were sent over, I realized it wasn’t… Well, time to get my hands dirty!!
I started by testing the scenario on an index without any special settings or mappings.
Pretty weird… The test showed that searching with the full email address works, whilst a partial email address (e.g. “jojo” or “donna”) does not return a single document.
To understand what caused the problem, let’s run the _analyze API on the email address values:
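Here is a minimal sketch of what such a request body could look like, built as a Python dict so it can be printed as JSON and pasted under `POST _analyze`. The sample address is reconstructed from the tokens quoted in this post, so treat it as an assumption.

```python
import json

# Body for the _analyze API. "standard" is the analyzer used when a text
# field has no analyzer configured.
analyze_request = {
    "analyzer": "standard",
    "text": "jojo.star.crusade@uni.gov",  # sample value (assumed)
}

# Print the JSON so it can be pasted into a console of your choice.
print(json.dumps(analyze_request, indent=2))
```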
Eureka~ Clearly the default analyzer for Elasticsearch text fields (named “standard”) does NOT treat a “.” between letters as a token delimiter:
- this is what we expected -> [ jojo, star, crusade, uni, gov ] BUT…
- this is reality -> [ jojo.star.crusade, uni.gov ]
This explains why partial address keywords like “jojo” do not match: there is simply no token “jojo” in the index.
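To make the behaviour concrete, here is a rough local stand-in for the tokenization seen above. It is NOT the real standard tokenizer (which follows Unicode word-boundary rules). It is only a regex approximation, which I am assuming is good enough for plain ASCII addresses like the one in this post.

```python
import re

# Approximation: a "." between two alphanumeric runs does not split the
# token; any other non-alphanumeric character (such as "@") does.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")

def approx_standard_tokens(text):
    """Return lowercased tokens, roughly like the 'standard' analyzer."""
    return [t.lower() for t in TOKEN.findall(text)]

tokens = approx_standard_tokens("jojo.star.crusade@uni.gov")
print(tokens)            # ['jojo.star.crusade', 'uni.gov']
print("jojo" in tokens)  # False, so a search for "jojo" finds nothing
```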
Let’s solve the mystery then…
Attempt 01: replacing the “.” with “-” using an ingest pipeline
Since the “.” is the root of the problem, we can simply replace this character with something else, such as “-”.
An ingest pipeline is created to do the data patching. The only processor required is “gsub”, for which we provide the pattern “\\.”. The backslash is required to escape the “.”, because in regular expressions a bare “.” means any character while “\.” means a literal full stop; it is doubled because the backslash itself has to be escaped inside a JSON string.
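A minimal sketch of such a pipeline definition follows, again as a Python dict that prints the JSON body for `PUT _ingest/pipeline/<your-pipeline-name>`. The description text is made up for illustration. The last line checks the same regex locally with Python's `re` module, which is only a stand-in for what the ingest node does.

```python
import json
import re

PATTERN = r"\."  # literal dot; a bare "." would match any character

pipeline = {
    "description": "replace dots in instructors with dashes",  # hypothetical
    "processors": [
        {"gsub": {"field": "instructors", "pattern": PATTERN, "replacement": "-"}}
    ],
}

# json.dumps escapes the backslash, which is why the JSON body shows "\\."
print(json.dumps(pipeline, indent=2))

# Local check of the regex itself (not a call to Elasticsearch)
patched = re.sub(PATTERN, "-", "jojo.star.crusade@uni.gov")
print(patched)  # jojo-star-crusade@uni-gov
```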
Don’t forget to test the pipeline before any deployment~ By running the “_simulate” API with sample document(s), we can evaluate whether the logic is implemented correctly. Nicely, the resulting document’s “instructors” field is now “-” delimited, so we get “jojo-star-crusade@uni-gov” instead of “jojo.star.crusade@uni.gov”.
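For reference, a simulate request could be shaped roughly like this (body for `POST _ingest/pipeline/_simulate`). The inline pipeline mirrors the gsub processor described above, and the sample document is an assumption based on the values quoted in this post.

```python
import json

simulate_request = {
    # Inline pipeline definition, so nothing has to be stored first
    "pipeline": {
        "processors": [
            {"gsub": {"field": "instructors", "pattern": r"\.", "replacement": "-"}}
        ]
    },
    # Sample documents to run through the pipeline (assumed test data)
    "docs": [
        {"_source": {"instructors": "jojo.star.crusade@uni.gov"}},
    ],
}

print(json.dumps(simulate_request, indent=2))
```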
Let’s search on partial email address keywords:
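I am assuming a plain `match` query here, since the post does not say which query type was used. A small helper that builds the body for `GET <your-index>/_search` might look like this:

```python
import json

def partial_email_query(keyword):
    """Build a match query against the "instructors" field."""
    return {"query": {"match": {"instructors": keyword}}}

# Search with only one part of the address
print(json.dumps(partial_email_query("jojo"), indent=2))
```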
Cool~ we did it~~~ Oh wait….
“Wow, amazing! It works now… but I am expecting the email address to be donna.karen@uni.gov instead of donna-karen… something” my colleague replied promptly.
Yep, this approach works for sure. However, it is quite intrusive, since an ingest pipeline actually PATCHES the “_source” content of your original document, which might not be acceptable to everyone.
Let’s try another approach then…
Attempt 02: build a custom analyzer
If a direct patch on the _source is not allowed, then let’s modify the text-analysis chain instead.
I know it is super lengthy… We first need to create the target index with the supplied “settings” and “mappings”. The custom analyzer is defined under the “settings” section (inside “analysis”).
A text analyzer is composed of 3 parts:
- character filters (a preprocessor on the raw text, before any tokens exist),
- a tokenizer (the engine that splits the text into tokens) and
- token filters (a postprocessor applied after tokens are created).
To fix the “.” issue, we can apply a character filter named “mapping” (oh wow… a confusing name, isn’t it?).
The “mapping” character filter takes a list of keys and their replacements and checks them against the contents of the field; wherever a key matches, that part of the text is replaced with the supplied value. In our use case, simply:
“. => -”
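The three stages can be imitated locally to see why this rule helps. This is only a sketch under assumptions: the tokenizer is the same rough regex stand-in for “standard” (not the real one), and a lowercase token filter is assumed.

```python
import re

def mapping_char_filter(text, rules):
    """Character filter: replace every key in the raw text with its value."""
    for key, replacement in rules.items():
        text = text.replace(key, replacement)
    return text

def tokenize(text):
    """Tokenizer: rough approximation of 'standard' for ASCII text."""
    return re.findall(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*", text)

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def analyze(text):
    # char filter -> tokenizer -> token filter, in that order
    filtered = mapping_char_filter(text, {".": "-"})
    return lowercase_filter(tokenize(filtered))

print(analyze("Jojo.Star.Crusade@uni.gov"))
# ['jojo', 'star', 'crusade', 'uni', 'gov']
```

Because the “.” is already gone by the time the tokenizer runs, the “-” and “@” act as ordinary delimiters and every part becomes its own token.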
Remember to assemble the analyzer, of course. Our custom analyzer is named “ana_email_fixer” in the example. Finally, don’t forget to assign this custom analyzer to the target field within the “mappings” section, in our case the “instructors” field.
Testing the analyzer can be done through the “_analyze” API as usual, and we can clearly see that the tokens created now include the partial email address values such as “jojo” or “donna”.
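Put together, the index creation body could look roughly like the sketch below (for `PUT <your-index>`). The analyzer name and field name come from this post; the character filter name `dot_to_dash`, the “standard” tokenizer and the “lowercase” filter are my assumptions for illustration.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                # hypothetical name for the "mapping" character filter
                "dot_to_dash": {"type": "mapping", "mappings": [". => -"]}
            },
            "analyzer": {
                "ana_email_fixer": {
                    "type": "custom",
                    "char_filter": ["dot_to_dash"],
                    "tokenizer": "standard",  # assumed
                    "filter": ["lowercase"],  # assumed
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # the step that is easy to forget: attach the analyzer to the field
            "instructors": {"type": "text", "analyzer": "ana_email_fixer"}
        }
    },
}

print(json.dumps(index_body, indent=2))
```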
let’s test it out~
Woot~ I am pretty confident there will be no more “questions” from my colleague~
So the good thing about using analyzers to solve this mystery is that we didn’t patch the contents of the documents at all to get the desired search results.
The downside is that the configuration is slightly more complicated and error prone, especially when we forget to assign the custom analyzer to the required field (no joking here, many of us have missed this point ;)))
Bonus: can Elasticsearch tokenize the full email address as one token instead of pieces???
Sometimes we might want the full email address as 1 token instead of pieces of it:
- expect -> donna.karen@uni.gov instead of [ donna, karen, uni, gov ]
There is actually a special tokenizer for this, known as “uax_url_email”. The setup would be as follows:
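A sketch of what that setup might look like is below. It uses the two analyzers and the “comments” / “comments.email” fields named in this post; the multi-field layout, the `dot_to_dash` filter name, and the tokenizer and filter choices inside “ana_email_fixer” are assumptions on my side.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "dot_to_dash": {"type": "mapping", "mappings": [". => -"]}
            },
            "analyzer": {
                # keeps a whole email address as a single token
                "email_extractor": {
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase"],  # assumed
                },
                # splits an address into its parts
                "ana_email_fixer": {
                    "type": "custom",
                    "char_filter": ["dot_to_dash"],
                    "tokenizer": "standard",  # assumed
                    "filter": ["lowercase"],  # assumed
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "comments": {
                "type": "text",
                "analyzer": "email_extractor",
                # sub-field, addressed as "comments.email" in queries
                "fields": {
                    "email": {"type": "text", "analyzer": "ana_email_fixer"}
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```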
Here, we added a new analyzer, email_extractor, using the “uax_url_email” tokenizer; the field “comments” uses this custom analyzer. Now run a test with the _analyze API as usual and you will see that the 2 email addresses are each extracted as 1 single token,
and both tokens are of type “<EMAIL>”.
Time to run some searches:
Interestingly, we now have 2 fields, each with its own custom analyzer attached:
- comments -> email_extractor
- comments.email -> ana_email_fixer
For searches on the field “comments.email”, we can find emails based on partial address keywords such as “donna” or “jojo”. The “comments” field, however, only returns results if the full and exact email address is provided.
You might ask: what is the point of having this “comments” field then? Think about it: if we only had comments.email, which supports partial address search, we would end up getting back a lot of partially matched documents.
In the execution above, even if I provide a full email address for the search against the “comments.email” field… still all documents are returned. Looking into the “highlight” section of the response, you can clearly see that the 1st document has the exact match! Hence the highest score. But then looking at the remaining 2 documents…
- the 2nd document is returned because there are matched words like [ donna, uni, gov ]
- the 3rd document is returned solely because of the words [ uni, gov ]
These documents are simply not relevant at all~ They are returned just because they partially match the parts of the email address. That is why the “comments” field, with an exact email address acting as 1 token, is essential to reduce the noise.
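The noise is easy to reproduce offline. The sketch below uses three made-up comments (hypothetical data, not the documents from my test) and crude stand-ins for the two analyzers: one splits addresses into parts, the other keeps whitespace-separated words whole so an address stays 1 token.

```python
import re

# Hypothetical documents for illustration only
docs = {
    1: "please contact donna.karen@uni.gov",
    2: "forwarded to donna.lee@uni.gov",
    3: "cc jojo.star.crusade@uni.gov",
}
query = "donna.karen@uni.gov"

def part_tokens(text):
    """Stand-in for ana_email_fixer: every part of an address is a token."""
    return set(re.findall(r"[a-z0-9]+", text.lower().replace(".", "-")))

def whole_tokens(text):
    """Stand-in for email_extractor: an address stays one token."""
    return set(text.lower().split())

results = {}
for doc_id, text in docs.items():
    shared = sorted(part_tokens(query) & part_tokens(text))  # comments.email
    exact = query in whole_tokens(text)                      # comments
    results[doc_id] = (shared, exact)
    print(doc_id, shared, exact)
```

All three documents share at least [ gov, uni ] with the query on the parts-based field, while only the 1st one matches on the whole-address field.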
We have gone through a mysterious case of how Elasticsearch handles email addresses during a search. The ways to solve it are the following:
- ingest pipeline: replaces the problematic “.” with “-” in the document itself; simple, but intrusive
- custom analyzer: preprocesses the text correctly before tokenization happens; non-intrusive, but slightly more complex
Happy data mining :)