07 Nov Using the Amazon Elasticsearch Service with Hive
Amazon launched the Amazon Elasticsearch Service less than a month ago to enable their clients to spin up scalable Elasticsearch clusters directly from the AWS Management Console and forget about about managing these clusters by themselves. While you can spin up and use an Elasticsearch cluster in several minutes, this ease of use comes with a small disadvantage: as opposed to a classic Elasticsearch setup, the Elasticsearch service only exposes the publicly accessible client gateway, making it impossible for Hadoop applications to connect to the nodes behind this gateway using discovery mechanisms.
Hive and Elasticsearch
To connect to the ElasticSearch service from any popular Hadoop applications (Hive, Pig, Spark etc.) you need to use the Elasticsearch Hadoop connector. This can be imported into your Java/Scala application using build tools such as Maven and sbt respectively. To use the connector in Hive though, you need to download the standalone jar package available on the Elasticsearch website.
Note that support for the Amazon version (and other cloud based versions) of Elasticsearch was only added in Hadoop 2.2.0 beta 1. Older versions will not work with the restrictive environment of the cloud and you may experience one of the following errors when trying to connect to the Elasticsearch cluster, depending on your configuration:
- Cannot find node with id
- Cluster state volatile; cannot find node backing shards – please check whether your cluster is stable
- Client-only routing specified but no client nodes with HTTP-enabled available
- timeout
From this point on, I am assuming you are running a relatively fresh version of Hive on an EMR cluster and that you have already spun up an Elasticsearch service cluster with the correct access policy to allow EMR clusters to connect to it.
Once you downloaded the standalone JAR, make it available to your Hive cluster by storing it at an S3 location. For example, if I am hosting the file in the ana
bucket with key emr/elasticsearch-hadoop-2.2.0-beta1.jar;
I would call:
1 |
ADD JAR s3://ana/emr/elasticsearch-hadoop-2.2.0-beta1.jar; |
Now create a new external table with the schema that you want to push to ElasticSearch as follows:
1 2 3 4 5 6 7 8 9 10 11 |
CREATE EXTERNAL TABLE elasticSearchData ( id STRING, name STRING, age INT, interests ARRAY<STRING>) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES( 'es.mapping.id' = 'uid', 'es.resource' = 'ana/people', 'es.nodes' = 'http://your-amazon-endpoint.es.amazonaws.com:80', 'es.nodes.wan.only' = 'true', 'es.index.auto.create' = 'true'); |
By setting es.nodes.wan.only
to true
, the connector will disable discovery and its typical peer-to-peer connections and use only the nodes indicates in es.nodes
. For Amazon this would be the publicly accessible gateway. Don’t forget to specify port 80 to override the default 9200. You can see an explanation of all other configuration options in the official documentation.
Disadvantages of the approach
Since all connections to and from Elasticsearch will be made through this gateway, I believe this solution disables the inherent parallelism of Hive and might be slower than connecting to clusters which have all the Elasticsearch nodes exposed. However, this is the only solution that works for now.
I did a test of uploading 200,000 documents from a 2 x c3.xlarge EMR cluster to a t2.micro Elasticsearch cluster and it took around 80 seconds. I am satisfied with the performance for now. I am planning to do a test with about 1 million documents in the following day and I will update this article.
No Comments