Introducing GLastic, a data exporter and importer for Elasticsearch


Foreword

A common use case of Elasticsearch is to act as a data pipeline: ingesting data from various sources and applying analytics to that data. But there are still cases where the ingested data needs to be integrated with external systems for further analysis, for example feeding data into Python machine learning libraries, or drawing dashboards with Tableau. As luck would have it, certain systems provide Elasticsearch connectors for integration, while all the others require some workaround.

Instead of writing a connector program yourself (in whatever language), GLastic provides an alternative way to perform data export and import on an Elasticsearch cluster.

Export data from indices or data-streams

Data export is pretty straightforward; just run the following command:

glastic -c {{config_file.json}} export

All the rules involved in the export are defined in the configuration file (JSON format). The following is a brief introduction to the minimal configuration for an export operation:

  • batch_size: the size of each query page (max is 10000); defaults to 10000 if not provided.
  • indices: the set of indices to export; the default is [], which simply does nothing.
  • filter_query: the query filter to apply; the default is “”, which is equivalent to a “match_all” query.
  • target_folder: the target folder for storing the exported data; the default is ./, the current folder.

Also, don’t forget to add the connectivity configuration:

  • es_host: the host name of the Elasticsearch node for connection.
  • es_username: username for authentication (if required).
  • es_password: password for authentication (if required).

The bare minimal JSON file:
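Something along these lines should do (a sketch based on the keys described above; the host, credentials and folder values are placeholders to replace with your own, and filter_query is left empty to match everything):

{
  "es_host": "https://localhost:9200",
  "es_username": "elastic",
  "es_password": "changeme",
  "batch_size": 10000,
  "indices": [ "school_subject" ],
  "filter_query": "",
  "target_folder": "./export_data/"
}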

Import data as indices or data-streams

Data import shares a similar command-line syntax:

glastic -c {{config_file.json}} import

As usual, we need to add the connectivity configs (es_host, es_username and es_password); only es_host is MANDATORY.

If data is imported as indices, the following is a minimal configuration for reference:

  • batch_size: the number of documents to import per batch; defaults to 10000, but it could be any number as long as the host machine has enough memory to cache that many documents before pushing them to Elasticsearch.
  • source_folder: where the data files are located; the default is ./, the current folder.

In addition to the above, each target index requires its own set of metadata under the key “create_target_indices”. The metadata fields are as follows:

  • target_index: the index / data-stream name to import into.
  • source_index: the index / data-stream name the data originated from.
  • source_file: the data file name; it is appended to the source_folder’s value.
  • data_stream: set the value to false when importing as a regular index.

For example, suppose the data index “school_subject” was originally exported and is now re-imported into another cluster under a target index named “saint_john_college_subjects”; the configuration would then be the following:
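A sketch of what that could look like, assuming create_target_indices holds a list of metadata objects and that the exported data file is named school_subject.json (the exact file naming and schema are documented in the project’s README):

{
  "es_host": "https://localhost:9200",
  "es_username": "elastic",
  "es_password": "changeme",
  "batch_size": 10000,
  "source_folder": "./export_data/",
  "create_target_indices": [
    {
      "target_index": "saint_john_college_subjects",
      "source_index": "school_subject",
      "source_file": "school_subject.json",
      "data_stream": false
    }
  ]
}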

Deploy GLastic as a microservice with Docker

GLastic is available as a Docker image; simply pull the image with the following:

docker pull quemaster/glastic

Important folders inside the image are listed here:

  • /usr/bin/glastic: the binary executable lives under /usr/bin, hence it is accessible everywhere within the container.
  • /glastic/elasticconnector: the source code and unit-test config files; a good place to test your own configuration files.

To mount a local folder (storing your config files and the to-be-exported data files) into the Docker container, simply run:
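Something like the following (a sketch only: the host folder name and the in-container mount point are arbitrary choices, and it assumes the image drops you into a shell; override the entrypoint if it does not):

# mount ./glastic-work from the host into the container
docker run --rm -it -v "$(pwd)/glastic-work:/glastic/work" quemaster/glastic

# then, from a shell inside the container, run against the mounted config
glastic -c /glastic/work/config_file.json export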

Gotchas~

Hey… is this a golang gotcha series? No~ But there are a few concepts that might cause confusion, so they are worth going through :))))

Q. Why don’t we just create a snapshot as a way of integration?

A. Snapshots are the official way for Elasticsearch to back up data, but the snapshot format is a proprietary binary that is hard to integrate directly with other systems. Hence we need another way to export the data in a system-friendly format (e.g. plain text, JSON, XML); this is one of the reasons GLastic was created.

Q. Elasticsearch provides both the scroll and search_after APIs for iterating through data indices; isn’t that enough?

A. First, the scroll API is no longer recommended for deep pagination (https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html); hence you are left with the search_after API.

Second, search_after requires manual tracking of the sort values returned with each page; if the target dataset is huge… this manual tracking gets tiring. Hence it is great to have another tool to handle it: GLastic tracks the pagination and sort values automatically for you.
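To give a feel for that bookkeeping, here is a rough sketch of manual search_after pagination in query DSL (Kibana Dev Tools style; the index name and sort field are just examples, and in practice you also need a unique tiebreaker field or a point-in-time):

# first page: query with an explicit sort
GET school_subject/_search
{
  "size": 1000,
  "query": { "match_all": {} },
  "sort": [ { "@timestamp": "asc" } ]
}

# take the "sort" values of the last hit in the response (e.g. [1637452800000])
# and feed them into the next request; repeat until no more hits are returned
GET school_subject/_search
{
  "size": 1000,
  "query": { "match_all": {} },
  "sort": [ { "@timestamp": "asc" } ],
  "search_after": [ 1637452800000 ]
}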

Q. Could Logstash export data too?

A. Definitely; with the elasticsearch input plugin you can typically add a query to filter which subset of data gets exported. The limitation, however, is that the page size must stay within 10,000 documents, or an illegal_argument_exception is thrown because the result window is capped at 10,000. Hence you would eventually need to switch to the search_after API anyway. GLastic provides the “filter_query” config for subset selection and handles the pagination for you automatically.

So that is it! That is all you need to know to export and import data with GLastic~

PS. If you are more interested in the theory behind GLastic, its limitations, etc., please read the GitLab project’s README.
