Thursday, 7 February 2013

Analytics/Kraken/Query Service





= Research =



Projects that might make useful components in our query service.



== Highlights ==



=== Impala (Cloudera) ===



Open-source clone of [http://research.google.com/pubs/pub36632.html Google Dremel] that aims to be "mostly compatible" with HiveQL.



* Source: https://github.com/cloudera/impala

* Project: https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation



==== About ====

* [http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/ Blog post], with high-level overview.

* [https://ccp.cloudera.com/display/IMPALA10BETADOC/Introducing+Cloudera+Impala Introducing Impala]

* [https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions FAQ]

* [http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/impala-real-time-queries-in-hadoop-webinar-slides.html Screencast] (I haven't watched this)



==== Docs ====



* [https://ccp.cloudera.com/display/IMPALA10BETADOC/Installing+and+Using+Cloudera+Impala Usage Guide]

* [https://ccp.cloudera.com/display/IMPALA10BETADOC/Learning+Impala+Tutorial Tutorial]

* [https://ccp.cloudera.com/display/IMPALA10BETADOC/Language+Reference Query Language] -- even in beta, largely compatible with HiveQL.

* [https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+Security Security Features] -- using Kerberos would let internal applications query Impala directly, so we could initially skip building an HTTP gateway. Ultimately we'll still need one so that (at least) Limn and friends can query it.





== TODO ==



* [https://github.com/nathanmarz/elephantdb ElephantDB] (Nathan Marz) -- Distributed database specialized in exporting key/value data from Hadoop. Key/value only, so not ideal for analytics/slicing.

* [https://github.com/twitter/elephant-twin Elephant Twin] (Twitter) -- Framework for creating indexes in Hadoop.

* [http://opentsdb.net/ OpenTSDB] (StumbleUpon) -- Distributed, scalable time series database (TSDB) written on top of [http://hbase.org/ HBase]. Seems aimed at instrumentation-style data (a la RRD) rather than analytics.
