Developer Blog: Improving Hive join performance using Oozie

In our portal processing we use Apache Hive to provide SQL capabilities and Yahoo!'s Oozie as a workflow engine. In this post I explain how we use Oozie forks to improve Hive join performance, by parallelizing the join beyond what Hive provides natively.

Please note that this approach was adopted with Hive 0.5; Hive 0.7 brings significant improvements to joins.

For the purposes of this explanation, let's consider the following simple example, where a table of verbatim values is being processed into four tables in a star schema:

To generate the leaves of the star, we have three simple queries (each uses a simple UDF, rowSequence(), to produce incrementing IDs):

CREATE TABLE institution_code AS
SELECT rowSequence() AS id, institution_code
FROM verbatim_record
GROUP BY institution_code;

CREATE TABLE collection_code AS
SELECT rowSequence() AS id, collection_code
FROM verbatim_record
GROUP BY collection_code;

CREATE TABLE catalogue_number AS
SELECT rowSequence() AS id, catalogue_number
FROM verbatim_record
GROUP BY catalogue_number;

To build the core of the star the simple approach is to issue the following SQL:

CREATE TABLE parsed_content AS
SELECT v.id AS id, ic.id AS institution_code_id,
  cc.id AS collection_code_id, cn.id AS catalogue_number_id
FROM verbatim_record v
JOIN institution_code ic ON v.institution_code=ic.institution_code
JOIN collection_code cc ON v.collection_code=cc.collection_code
JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;

What is important to note is that the JOIN is across three different keys. This results in a query plan with three sequential MR jobs and a very large intermediate result set, which is ultimately passed through the final Reduce in the Hive plan.
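
To make the job count concrete, here is a minimal Python sketch of a simplified model of the planner: consecutive joins that share the same key on the driving table are assumed to be merged into one MR job, and every change of key starts a new one. The function name `count_join_jobs` is hypothetical, and the model is an illustration rather than Hive's actual planning logic.

```python
def count_join_jobs(join_keys):
    """Count MR jobs for a chain of joins under a simplified model.

    Assumption: consecutive joins on the same key are merged into one job;
    each change of join key starts a new job.
    """
    jobs = 0
    previous = None
    for key in join_keys:
        if key != previous:
            jobs += 1
            previous = key
    return jobs


# Join keys of the single three-way query: a different column each time.
naive_keys = ["v.institution_code", "v.collection_code", "v.catalogue_number"]
# Join keys when every table is joined on the record id.
same_key = ["v.id", "v.id", "v.id"]

print(count_join_jobs(naive_keys))  # 3
print(count_join_jobs(same_key))    # 1
```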

By using Oozie (see the bottom of this post for pseudo workflow config), we are able to produce three temporary join tables, in a parallel fork, and then do a single join to bring it all back together.

-- parallel join 1
CREATE TABLE t1 AS
SELECT v.id AS id, ic.id AS institution_code_id
FROM verbatim_record v JOIN institution_code ic ON v.institution_code=ic.institution_code;

-- parallel join 2
CREATE TABLE t2 AS
SELECT v.id AS id, cc.id AS collection_code_id
FROM verbatim_record v JOIN collection_code cc ON v.collection_code=cc.collection_code;

-- parallel join 3
CREATE TABLE t3 AS
SELECT v.id AS id, cn.id AS catalogue_number_id
FROM verbatim_record v JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;

-- final join, run once all three tables exist
CREATE TABLE parsed_content AS
SELECT v.id AS id, t1.institution_code_id,
  t2.collection_code_id, t3.catalogue_number_id
FROM verbatim_record v
JOIN t1 ON v.id=t1.id
JOIN t2 ON v.id=t2.id
JOIN t3 ON v.id=t3.id;

Because we have built the join tables in parallel, and the final query joins every table on the same key (the record id), Hive compiles it to a single MR job, and it runs much quicker.
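
As a sanity check that the staged approach yields the same rows as the single three-way join, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for Hive only to compare results; it says nothing about MR jobs or timings. The sample rows are invented, the generated `INTEGER PRIMARY KEY` stands in for rowSequence(), and the check assumes `verbatim_record.id` is unique.

```python
import sqlite3

# Invented sample rows: (id, institution_code, collection_code, catalogue_number)
SAMPLE = [
    (1, "NHM", "Birds", "A-1"),
    (2, "NHM", "Fish", "A-2"),
    (3, "KEW", "Birds", "A-1"),
]
LEAVES = ("institution_code", "collection_code", "catalogue_number")


def compare_join_strategies(rows):
    """Return (naive, staged): parsed_content built both ways, ordered by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE verbatim_record (id INTEGER, institution_code TEXT,"
        " collection_code TEXT, catalogue_number TEXT)"
    )
    conn.executemany("INSERT INTO verbatim_record VALUES (?, ?, ?, ?)", rows)

    # Leaf tables: one generated id per distinct value.
    for col in LEAVES:
        conn.execute(f"CREATE TABLE {col} (id INTEGER PRIMARY KEY, {col} TEXT)")
        conn.execute(
            f"INSERT INTO {col} ({col}) SELECT DISTINCT {col} FROM verbatim_record"
        )

    # Single three-way join across three different keys.
    naive = conn.execute(
        "SELECT v.id, ic.id, cc.id, cn.id FROM verbatim_record v"
        " JOIN institution_code ic ON v.institution_code = ic.institution_code"
        " JOIN collection_code cc ON v.collection_code = cc.collection_code"
        " JOIN catalogue_number cn ON v.catalogue_number = cn.catalogue_number"
        " ORDER BY v.id"
    ).fetchall()

    # Staged: one two-way join per leaf (the forked part), then a join on id.
    for name, col in zip(("t1", "t2", "t3"), LEAVES):
        conn.execute(
            f"CREATE TABLE {name} AS SELECT v.id AS id, d.id AS {col}_id"
            f" FROM verbatim_record v JOIN {col} d ON v.{col} = d.{col}"
        )
    staged = conn.execute(
        "SELECT v.id, t1.institution_code_id, t2.collection_code_id,"
        " t3.catalogue_number_id FROM verbatim_record v"
        " JOIN t1 ON v.id = t1.id JOIN t2 ON v.id = t2.id JOIN t3 ON v.id = t3.id"
        " ORDER BY v.id"
    ).fetchall()
    conn.close()
    return naive, staged


naive, staged = compare_join_strategies(SAMPLE)
print(naive == staged)
```

Both result sets contain one row per verbatim record, so the comparison holds regardless of which ids the leaf tables happen to assign.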

In reality our tables are far more complex, and we use a map-side JOIN for institution_code since it is small. On our small cluster, with the following table sizes, the time to compute these tables dropped from several hours to 40 minutes.

verbatim_record: 284 million
collection_code: 1.5 million
catalogue_number: 199 million
institution_code: 8 thousand
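
The idea behind a map-side join is that the small table fits in memory, so each mapper can look values up directly while streaming the large table, with no shuffle or reduce step. A minimal Python sketch of that idea follows; the function name and the sample values are hypothetical, and this is not how Hive implements it internally.

```python
def map_side_join(large_rows, small_table):
    """Inner-join streamed (record_id, code) rows against an in-memory lookup.

    large_rows: iterable of (record_id, institution_code) pairs
    small_table: dict mapping institution_code -> institution_code_id
    """
    for record_id, code in large_rows:
        dim_id = small_table.get(code)
        if dim_id is not None:  # inner join: drop rows without a match
            yield record_id, dim_id


# Invented sample data: the small dimension table and a few verbatim records.
institution_ids = {"NHM": 1, "KEW": 2}
records = [(1, "NHM"), (2, "UNKNOWN"), (3, "KEW")]

print(list(map_side_join(records, institution_ids)))  # [(1, 1), (3, 2)]
```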

All of this work can be found here.

Pseudo workflow config for this:
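
Oozie expresses this as a fork node that starts the three join actions concurrently, and a join node that waits for all of them before the final query runs. As a rough illustration of that control flow only (this is Python, not Oozie's XML format, and the script names and the `run_script` callable are hypothetical), a minimal sketch:

```python
from concurrent.futures import ThreadPoolExecutor


def run_fork_join(run_script, fork_scripts, join_script):
    """Run fork_scripts concurrently, wait for all, then run join_script.

    run_script is any callable that executes one script by name; a real
    setup might shell out to Hive, which is left out here on purpose.
    """
    with ThreadPoolExecutor(max_workers=len(fork_scripts)) as pool:
        futures = [pool.submit(run_script, script) for script in fork_scripts]
        for future in futures:
            future.result()  # re-raises if any forked script failed
    # Reached only after every forked script has finished successfully.
    return run_script(join_script)


# Stand-in runner that just records which scripts were "executed".
executed = []
run_fork_join(executed.append, ["t1.q", "t2.q", "t3.q"], "parsed_content.q")
print(executed[-1])  # parsed_content.q
```

The final script always runs last, while the order of the three forked scripts is not guaranteed.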

Friday 6 May 2011
