Friday, December 13, 2013

[elasticsearch][hadoop] note: timeout while indexing 1gb document

InquiringMind <brian.from.fl@gmail.com>    May 16, 2013, 11:32 PM

Reply-To: elasticsearch@googlegroups.com

To: elasticsearch@googlegroups.com

By "we can say", you must mean you and your tapeworm. No one else is included in your conclusion.

 

By "document" you really mean "input data stream". In strict terms, an ElasticSearch "document" is a MySQL "row". You will never succeed in loading a 1 GB row into MySQL. But from your posts, I am guessing that MySQL has a tool that slurps one huge 1 GB input stream into the multiple rows it represents and loads them optimally. OK, ElasticSearch doesn't come with such a tool, but it comes with wonderful APIs that let you dream up and implement all manner of input streams. There are many third-party tools for pulling in data from many sources (rivers, they call them), and I wrote my own converters with proper bulk-load coding to push bulk data into ElasticSearch.

 

I can easily and successfully load a 3.1 GB "document" into ElasticSearch. Even on my laptop with decent CPU power but low-end disk performance, I can load this 3.1 GB monster in just under 3 hours. The MacBook fans sound like a (quiet) jet engine, but the system is still surprisingly responsive during its efforts. And there are no memory issues, no exceptions thrown, and no other problems at all. And note that this exact same 3.1 GB input "document" was loaded into MySQL in 8 hours on a production server with a proper disk array; ElasticSearch did the same job on my laptop and single slow disk in less than half the time.

 

And that 3.1 GB document is a gzip'd CSV file. Of course, I needed my Java skills to take the gunzip'd output (using gunzip -c to decompress to stdout, not to disk. Yay!), then convert that (probably about 7 or 8 GB by now) uncompressed CSV stream into the desired JSON stream, and then use the excellent examples as a model for my bulk loader that properly loaded that huge document into ElasticSearch.
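A rough sketch of that pipeline, assuming a simple two-column CSV and reusing the postBulk helper sketched earlier; the index name, type, column names, and batch size are all placeholders, not the real schema from the thread.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class CsvToBulk {
    public static void main(String[] args) throws Exception {
        // Stream-decompress the gzip'd CSV without writing the uncompressed file
        // to disk, mirroring "gunzip -c". args[0] is the .csv.gz path (placeholder).
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])), "UTF-8"));

        StringBuilder batch = new StringBuilder();
        int rows = 0;
        String line;
        while ((line = in.readLine()) != null) {
            // Naive CSV split; real data with quoted fields needs a proper CSV parser.
            String[] cols = line.split(",");
            // "col0" / "col1" are placeholder field names for the real schema.
            batch.append("{\"index\":{\"_index\":\"mydata\",\"_type\":\"row\"}}\n");
            batch.append("{\"col0\":\"").append(escape(cols[0]))
                 .append("\",\"col1\":\"").append(escape(cols.length > 1 ? cols[1] : ""))
                 .append("\"}\n");
            if (++rows % 5000 == 0) {          // flush in bulk-sized batches
                BulkPost.postBulk(batch.toString());
                batch.setLength(0);
            }
        }
        if (batch.length() > 0) {
            BulkPost.postBulk(batch.toString());
        }
        in.close();
    }

    // Minimal JSON string escaping for backslashes and quotes.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}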

 


 


At Infochimps we recently indexed over 2.5 billion documents for a total indexed size of 4 TB. This would not have been possible without ElasticSearch and the Hadoop bulk loader we wrote, wonderdog. I'll go into the technical details in a later post, but for now here's how you can get started with ElasticSearch and Hadoop.
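wonderdog itself wires Elasticsearch into Hadoop's input/output formats; the sketch below is not its API, just the general pattern of a map-only job whose mappers batch documents and push them through the _bulk helper sketched earlier. The index name, type, and batch size are placeholder assumptions.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: each mapper buffers its input records as bulk actions and
// flushes them to the cluster via the BulkPost helper above.
public class EsBulkMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    private final StringBuilder batch = new StringBuilder();
    private int buffered = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumes each input line is already a JSON document.
        batch.append("{\"index\":{\"_index\":\"mydata\",\"_type\":\"doc\"}}\n");
        batch.append(value.toString()).append('\n');
        if (++buffered >= 5000) {
            flush();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        flush();  // push whatever is left when the mapper finishes
    }

    private void flush() throws IOException {
        if (buffered == 0) return;
        try {
            BulkPost.postBulk(batch.toString());
        } catch (Exception e) {
            throw new IOException(e);
        }
        batch.setLength(0);
        buffered = 0;
    }
}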

 

[thedatachef] @About Trust, we plan on a blog post with just those technical details soon. For now: the 4 TB was the size after indexing, and the data was raw text. We used 16 m2.xlarge EC2 nodes for the Elasticsearch cluster and 5 m1.large Hadoop nodes. It took 2-5 minutes per input GB.

 

[thedatachef] @Michael Yes. Indexing speed varied from 2 minutes per input GB (at best) to 5 minutes per input GB (at worst). That is all given the setup explained in the previous comment.

 

[jasonInKorea] I have done the same thing that you did, and I saw wonderful speed. But I didn't use Hadoop storage.


 

 

And so we can all conclude that ElasticSearch will easily, smoothly, and gracefully load, process, and query documents that are many, many times larger than a relatively tiny 1 GB document!

 

Regards,

Brian



