Using Elastic Cloud as your data repository and making your first data ingestion through Filebeat

https://cloud.elastic.co/home

Source code is available at https://github.com/quoeamaster/bigdata_blogs/tree/master/es_intro_elastic_cloud_n_1st_ingestion

If you want to know how to use Watcher / Alerting features on Elastic Cloud, check the following blog posts too:

creating notifications through Alerting / Watcher X-Pack plugin on Elastic Cloud — Part 1

Ever used software that could help ingest loads of data without breaking a sweat? Ever used software that could search your data in a variety of ways (like matching keywords, regular expressions, suggestions on keywords and more)? Ever used software that could provide quick analytics? Ever used software that could convert the data analytics into visualisations and charts?

Yep, Elasticsearch is the name of the software. Hey, you are already using it? Alright, now for my next set of questions… :)

How easy is it to set up your own Elasticsearch clusters from scratch? How do you handle cluster health monitoring? How do you manage cluster upgrades? How often do you create snapshots (backups), and of course, how do you housekeep that long list of snapshots? How easy would it be to add node attributes to your nodes? The Hot-Warm architecture technique requires tagging nodes (this is also needed if you want to deploy ILM, Index Lifecycle Management; tagging nodes is inevitable).

If your answer to all of the above is “piece of cake”, then you could skip the whole blog~ On the other hand, if you are not so sure how comfortable you or your team would be solving the above by yourselves, do READ on.

Put simply, Elastic Cloud is the cloud-hosted version of your Elasticsearch cluster. Now you might ask:

I know cloud hosting for Elasticsearch has been around for a long time; it is provided by Amazon, correct?

The answer is NO! Be aware that Elastic Cloud is the cloud service provided by Elastic, the company behind Elasticsearch, whilst Amazon Elasticsearch Service is a separate service provided by AWS; there is no direct connection between the 2 companies.

Hmm… so what are the differences between them?

There are quite a few differences between the 2; however, 1 major difference is that Elastic Cloud has the X-Pack plugins available out of the box, whilst the Elasticsearch service on AWS doesn’t provide X-Pack (but it does have Open Distro for Elasticsearch as an alternative). Also, you might have more confidence in Elastic Cloud since the company behind the software maintains the platform, possibly with more support on tuning and customisation.

Assuming you are interested in hosting your cluster on Elastic Cloud, let’s click on this link: https://cloud.elastic.co/home and get started!

After the simple registration (creating an account), you should be brought to the create-deployment page:

“create your first deployment” page

Clicking the “create deployment” button brings you to this page:

deployment wizard

Key in the name of your deployment; it is suggested to avoid spaces and use underscores or hyphens to join the words of your deployment’s name.

Point 1 is the cloud platform on which your Elasticsearch cluster will be hosted. For the moment, GCP from Google, Azure from Microsoft and AWS from Amazon are available.

Hmm… would a different platform affect my fees?

We will take a look at the costing a bit later. Remember that changing the hosting platform yields different options for Point 2.

Point 2 is the region available for hosting; pick the one nearest to your country or, if you have no preference, us-central would be a valid choice.

Step 4 is about the cluster’s configuration:

By default, the latest Elasticsearch version is chosen; however, if you do have a reason to use another version (maybe some deprecated APIs are still in use), simply click on the “Edit” button.

But do note that the available versions are quite limited. If you really need something much older like 5.2.x, you would need to contact Elastic directly and see if they could provide any help.

Step 5 is the use-case of your cluster

choices on your major use-case

By default “I/O Optimized” is chosen, which is suitable for most generic use cases including rapid ingestion and search of data; however, if your use case leans towards heavy ingestion or heavy computation in queries, there are other flavours available like “Compute Optimized” and “Memory Optimized”.

So just now, 1 of my questions was how easy it would be to introduce the Hot-Warm architecture to your cluster… clearly, you can see an option “Hot-Warm Architecture”~ I will choose this one to further illustrate how it facilitates the use-case :)

pricing varies based on a couple of factors

The 1st thing is… if you pick different options for your deployment, the final pricing changes accordingly: the Hot-Warm architecture option on GCP charges $0.7065 / hr whilst the I/O Optimized option charges only $0.5617 / hr.

2nd thing… choosing a different hosting platform also affects the pricing. On AWS, Hot-Warm architecture charges $0.5973 / hr and I/O Optimized charges only $0.3789 / hr~

3rd… on the same hosting platform, the region you choose for your cluster also affects the charges. For example, on AWS, hosting in US (East) for I/O Optimized charges $0.3789 / hr whilst hosting in Asia Pacific (Sydney) charges $0.4547 / hr!

If budget is an important issue, do some research on the above combinations and pick the right hosting platform + an acceptable region for hosting :)
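To get a feel for what those hourly rates mean for a budget, here is a minimal Python sketch that turns them into a rough monthly figure. The rates are the ones quoted above (they will certainly have changed since); the 30-day, always-on month and the helper names `monthly_cost` and `cheapest` are my own assumptions for illustration.

```python
# Hourly rates quoted in this post (USD); treat them as a snapshot, not current prices.
HOURLY_RATES = {
    "GCP / Hot-Warm": 0.7065,
    "GCP / I/O Optimized": 0.5617,
    "AWS / Hot-Warm": 0.5973,
    "AWS US (East) / I/O Optimized": 0.3789,
    "AWS Asia Pacific (Sydney) / I/O Optimized": 0.4547,
}

HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month with the cluster running non-stop


def monthly_cost(hourly_rate: float) -> float:
    """Rough monthly cost in USD for a deployment that is never stopped."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)


def cheapest(rates: dict) -> str:
    """Name of the option with the lowest hourly rate."""
    return min(rates, key=rates.get)


# Print the options from cheapest to most expensive.
for name, rate in sorted(HOURLY_RATES.items(), key=lambda item: item[1]):
    print(f"{name:45s} ${rate:.4f}/hr  ~${monthly_cost(rate):.2f}/month")

print("Cheapest option:", cheapest(HOURLY_RATES))
```

Even this small table shows that the gap between the cheapest and the priciest combination adds up to a few hundred dollars a month.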

Now the last mile~ If you are all good, just hit the “create deployment” button! If you want to further customize the nodes’ memory, disk resources etc., hit “customize deployments” instead. I will hit “customize deployments” to show you more of the Hot-Warm architecture settings that you could modify.

Hot node

Now you can see the 1st node is a highio model, which makes sense for Hot nodes handling heavy ingestion. Point 1 indicates this node’s roles / responsibilities: clearly this node has “data”, “coordinating” and “master eligible” capabilities. Point 2 indicates how many zones are used for hot nodes; by default “2” is chosen, hence you would end up with 2 zones, each containing 1 such Hot node. On the right-hand side is a summary panel showing the specification of the deployment; note that some of the nodes are FREE, including a Kibana instance and an APM server!

Warm node

Similarly, a highstorage model, which is well suited for warm nodes (storing lots of older data), is provided. Again 2 zones are chosen for resilience, hence you would end up with 2 warm nodes, each located in a separate zone.

There are still a couple of other nodes / models available for customization: the “machine learning” node, which is disabled by default; a dedicated coordinating node, which is also disabled by default; a dedicated master node (enabled if you have more than 6 nodes in the cluster, so for now… disabled since there are only 4 nodes); another node for Kibana; and the last node for APM. Do remember, customizing would change the hourly charge rate again :)

Now click on the “Configure index management” button at the very bottom.

You should now be brought to this page:

ILM

Index Lifecycle Management (ILM) was introduced in Elasticsearch 6.x as the successor to Curator. The secret to making sure the latest data is ingested by the Hot nodes lies in the nodes’ tags: the “data: hot” tag is added to the highio nodes whilst the “data: warm” tag is added to the highstorage nodes, so that search, analytics and storage of older data can be handled by the warm nodes. Of course, tagging alone won’t make the magic happen; hence an ILM policy is created for your deployment automatically to make things work, no sweat at all and right at your fingertips :)
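To make the tagging idea a bit more concrete, below is a rough Python sketch of what a hot-to-warm ILM policy body could look like. This is not the policy Elastic Cloud generates for you; the thresholds (`30d`, `50gb`, `7d`) and the function name `build_hot_warm_policy` are assumptions purely for illustration. Only the “data: hot” / “data: warm” attribute values come from the deployment described above.

```python
import json


def build_hot_warm_policy(rollover_max_age="30d", rollover_max_size="50gb", warm_after="7d"):
    """Build an ILM policy body: roll over on the hot nodes, then relocate to warm nodes."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index once the current one is too old or too big.
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                "warm": {
                    # With rollover in use, this age is counted from the rollover time.
                    "min_age": warm_after,
                    "actions": {
                        # Require nodes tagged "data: warm", i.e. the highstorage nodes.
                        "allocate": {"require": {"data": "warm"}}
                    },
                },
            }
        }
    }


# New indices are pinned to the hot nodes with an index setting like this one,
# usually placed in an index template.
HOT_INDEX_SETTINGS = {"index.routing.allocation.require.data": "hot"}

print(json.dumps(build_hot_warm_policy(), indent=2))
```

The printed JSON is the kind of body you would send with a `PUT _ilm/policy/<your_policy_name>` request in Kibana's dev tools if you ever wanted to manage such a policy by hand.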

If you prefer the legacy way using Curator:

curator way

Simply add back the index patterns and specify when the data should be moved from highio (hot nodes) to highstorage (warm nodes). However, be aware that Curator is deprecated and would be gone in some future major release of Elasticsearch.

Alright, now hit “create deployment” and wait…. till everything is done.

deploying…

Alright, pay full attention to this! Copy the password of the “elastic” user as it will not be shown AGAIN!!!!

Hey, I forgot to copy and save the password just now! What can I do?

The password itself cannot be displayed again, but you should be able to reset it to a new one from the deployment’s security settings in the Cloud console. Failing that, the last resort is to delete the current deployment and re-create it… so remember to save the password this time :)

important information!!!

Once your deployment is ready, click on the left-hand side panel and locate your deployment’s name (in my example, the name is “tbd”); you should see a page similar to the one above. Point 2 shows the endpoints of your Elasticsearch, Kibana and APM instances; again, note them down somewhere. Point 3 is super important as well: usually when we ingest data into our cloud cluster, we do not specify the Elasticsearch node’s address; instead we use this cloud-id, so remember to save it as well.

PS. The cloud-id’s format is {{deployment_name}}:{{some_base64_alphanumeric_string}}. Also, the cloud-id above has been “handled with care”, hence don’t try to use it to hack my cluster :))))

Great~~~~ We have just finished the 1st part of this blog…. Now, on to how to ingest data into the cluster!

The data I will ingest into the cluster comes from another blog I wrote earlier, “create vivid presentations with kibana canvas”.
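If you are curious what hides inside that base64 string, here is a small Python sketch that unpacks a cloud-id. It assumes the commonly seen layout in which the decoded part reads `host$elasticsearch_id$kibana_id`; the sample value is made up on the spot (not a real deployment) and the function name `decode_cloud_id` is hypothetical.

```python
import base64


def decode_cloud_id(cloud_id: str) -> dict:
    """Split a cloud-id into its deployment name and the endpoints it encodes."""
    # Everything before the last ":" is the deployment name, the rest is base64.
    name, _, encoded = cloud_id.rpartition(":")
    decoded = base64.b64decode(encoded).decode("utf-8")
    parts = decoded.split("$")
    if len(parts) < 3:
        raise ValueError("unexpected cloud-id layout: " + decoded)
    host, es_id, kibana_id = parts[:3]
    return {
        "deployment_name": name,
        "elasticsearch_url": f"https://{es_id}.{host}",
        "kibana_url": f"https://{kibana_id}.{host}",
    }


# A made-up cloud-id, built here only so the example is self-contained.
fake_payload = "us-central1.gcp.cloud.es.io$es123$kb456"
fake_cloud_id = "tbd:" + base64.b64encode(fake_payload.encode("utf-8")).decode("ascii")

for key, value in decode_cloud_id(fake_cloud_id).items():
    print(f"{key}: {value}")
```

This also explains why a single cloud-id is enough for Filebeat: the endpoint addresses travel inside it.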

The data file (JSON entries) looks like this (feel free to use your own datasets instead):

You will need a Filebeat configuration file; below is a sample of what it should look like:

PS. Remember I told you the cloud-id and the “elastic” user’s password are VERY IMPORTANT? :)))

Note that setup.ilm.enabled is set to false, since you are probably not expecting the ingested data to end up under an index named “filebeat-yyyy-mm-dd”, right? In order to use a customized index name, ILM needs to be disabled, which in turn requires the setup.template.xxx settings. cloud.id and cloud.auth are the most crucial settings for our Filebeat to connect to Elastic Cloud.

Hmm… we are connecting to the cloud… then why do we still need to add the output.elasticsearch settings???

Good question! Yep, we are targeting our cloud cluster, hence we need the cloud.id setting; however, how do we make the other settings on our Elasticsearch index explicit? Simply put, we need to add back the output.elasticsearch settings whenever we modify the index’s defaults (in this case, the index name is “imdb_movie” instead of filebeat-yyyy-mm-dd).

Finally, execute the ingestion by running this command:
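The author's actual configuration file lives in the repository linked at the top. As a rough sketch of how the settings discussed above fit together, here is a small Python helper that renders such a file from your own cloud-id and password. The helper name `render_filebeat_config`, the data file path and the choice of the `log` input with `json.keys_under_root` are my assumptions; only cloud.id, cloud.auth, output.elasticsearch, setup.ilm.enabled, the setup.template settings and the “imdb_movie” index come from this post.

```python
from textwrap import dedent


def render_filebeat_config(cloud_id, password, data_path, index="imdb_movie"):
    """Render a minimal Filebeat config that ships JSON lines to Elastic Cloud."""
    # Note: values are inserted as-is, so a password containing a double quote
    # would need escaping by hand.
    return dedent(f"""\
        filebeat.inputs:
          - type: log
            paths:
              - {data_path}
            # assumption: one JSON document per line; put its fields at the top level
            json.keys_under_root: true

        # the two settings that connect Filebeat to the cloud cluster
        cloud.id: "{cloud_id}"
        cloud.auth: "elastic:{password}"

        # only needed because we override the default index name
        output.elasticsearch:
          index: "{index}"

        # a custom index name requires ILM off plus a matching template name/pattern
        setup.ilm.enabled: false
        setup.template.name: "{index}"
        setup.template.pattern: "{index}*"
        """)


# Placeholder values; replace them with your own cloud-id, password and file path.
print(render_filebeat_config("tbd:c29tZV9iYXNlNjQ=", "my-password", "/path/to/movies.json"))
```

Saving the printed text as filebeat_to_es_cloud.yml gives you a starting point that you can compare against the file in the repository.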

./filebeat -c filebeat_to_es_cloud.yml

To validate that everything is ingested correctly, log in to your cloud’s Kibana instance (what? you forgot where your Kibana is? scroll upwards and check which page on your cloud portal lists these endpoints).

Again… you are supposed to use the “elastic” user and its corresponding password to log in to Kibana (told you to save this information earlier~); go to “Dev Tools” and run the following query:

GET imdb_movie/_search

You should see something similar to this:

6 records of movies
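If you would rather check from a script than from Kibana, the same search can be sent straight to the Elasticsearch endpoint with basic authentication. The Python sketch below only builds the request; the endpoint URL and password are placeholders, and the function name `build_search_request` is hypothetical.

```python
import base64
import urllib.request


def build_search_request(endpoint, password, index="imdb_movie", user="elastic"):
    """Build a GET <index>/_search request with a basic-auth header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    url = f"{endpoint.rstrip('/')}/{index}/_search"
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Placeholder endpoint and password; use the Elasticsearch endpoint of your own deployment.
request = build_search_request("https://my-deployment.example.com:9243", "my-password")
print(request.get_method(), request.full_url)

# To actually run the query (needs network access and real credentials):
# import json
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
#     print(len(hits), "documents returned")
```

The request is kept separate from the sending step so you can inspect the URL and header before any credentials leave your machine.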

Congratulations~ You have just created your Elastic Cloud cluster (with the Hot-Warm architecture option) and ingested your very first set of data into the cloud cluster!

In the next blog, I will introduce 1 cool feature from X-Pack: Alerting / Watcher.

X-Pack? I don’t have a license yet, can I still use it?

Great question again~ Once you have created your Elastic Cloud cluster, you already have the right to use X-Pack (which also tells you… you already have a license); you can also enjoy a 14-day free trial after registration. So hopefully you have enough time to evaluate the Cloud environment plus the X-Pack features etc.

PS. By the way, there is another blog post explaining how to extend the trial period.

Happy searching~ :)

a java / golang / flutter developer, a big data scientist, a father :)