Share
In the end of 2013, JethroData was invited to participate in the first round of eBay Big Data Lab project. The project grant selected startups and researches access to anonymoized data sets on eBay large internal clusters. We used the fantastic opportunity to harden our alpha software, identify and solve performance and scalability bottlenecks and add basic Kerberos support, to name a few.
We participated in the first round, which ended on February and was a great success. To quote Yan Chelly, Big Data Lab and Innovation Senior Manager at eBay Marketplace “JethroData successfully deployed its product on eBay’s Hadoop cluster and demonstrated performance improvements ranging for 3-100 times faster query responses, depending on the nature of the query.”
In the time since the program ended, we have made some significant progress, including improving our compression by 30%, adding support for joins, subqueries and IN clause, implementing partitioning and local caching of hot files from HDFS, writing proper installer and documentation and of course, adding numerous performance and memory usage improvements.
See the full text of the approved PR below:
JethroData is proud to announce results of its participation in eBay’s Big Data Lab project.
JethroData is a provider of an index-based SQL engine for Hadoop. Its technology enables companies to significantly accelerate performance of querying data stored in Hadoop clusters. Over the last 3 months JethroData participated in eBay’s inaugural Big Data Lab project. As part of this project, JethroData deployed its product on eBay’s systems and test its effectiveness with anonymized eBay data, infrastructure and business use-cases. Jethro’s SQL-on-Hadoop engine was compared to eBay’s current tool of choice – Hive 0.12 – and has consistently shown improved performance ranging from 3x to over 100x.
JethroData is a unique SQL and Indexing engine for Hadoop. It works by automatically adding indexes to select datasets. Queries use those indexes to quickly identify the required rows and read only the data they need. In contrast, Hive performs a full-scan of the data, where the entire dataset is read, regardless of how small the actual requested result set might be. Jethro is therefore optimal for use cases such as interactive ad-hoc queries, live dashboards and rapid reports, where queries typically access only a portion of the data. With Jethro, eBay users can expect the performance typical of eBay’s warehouse even when they actually work against Hadoop.
The testing was done on eBay’s Hera cluster. This cluster is running the latest Hadoop distribution with Hive 0.12. For testing we used a table of 7 billion rows and 11 columns, representing items attributes that we joined from several tables in eBay. Both Hive version of the data, and Jethro’s index and column files were stored on the same Hera HDFS cluster, with Jethro’s storage taking only 30% of the storage used by Hive. Jethro was tested only on a single server, with limited network bandwidth to the Hadoop cluster (2x1Gb/s NICs), but was still able to provide impressive response time to most queries.
During the cycle JethroData and eBay ran many queries that reflect typical eBay use cases. Below are some examples:
Count with or without a WHERE criteria.
Count Queries | Results | Hive(sec) | Jethro(sec) |
SELECT count(*) FROM ebay1; |
6,894,631,129 | 171 | 0.024 |
SELECT count(item_id) FROM ebay1 WHERE prdct_aspct_nm='Part Type'; |
28,113,018 | 174 | 3.34 |
SELECT count(item_id) FROM ebay1 WHERE (prdct_aspct_nm='Part Type' OR prdct_aspct_nm='Large' OR prdct_aspct_nm='Shirts') AND (leaf_categ_id=33704 OR leaf_categ_id=14106 OR leaf_categ_id=3938 OR leaf_categ_id=82620); |
405,857 | 259 | 3.54 |
SELECT count(*) FROM ebay1 WHERE item_site_id=0 AND leaf_categ_id=115; |
128,232 | 188 | 16 |
Multi-level GROUP BY with WHERE criteria – top-n queries.
GROUP BY (Top-N) Queries | Hive(sec) | Jethro(sec) |
SELECT leaf_categ_name,aspct_vlu_nm ,count(item_id) FROM ebay1 WHERE prdct_aspct_nm='Part Type' GROUP BY leaf_categ_name,aspct_vlu_nm ORDER BY count(item_id) DESC LIMIT 10; |
188 | 39 |
SELECT leaf_categ_name,aspct_vlu_nm ,count(item_id) FROM ebay1 WHERE prdct_aspct_nm='Part Type' AND leaf_categ_id=33704 GROUP BY leaf_categ_name,aspct_vlu_nm ORDER BY count(item_id) DESC LIMIT 10; |
264 | 3.9 |
SELECT leaf_categ_name,aspct_vlu_nm ,count(item_id) FROM ebay1 WHERE prdct_aspct_nm='Part Type' AND leaf_categ_id=33704 AND leaf_categ_name='Automo’ GROUP BY leaf_categ_name,aspct_vlu_nm ORDER BY count(item_id) DESC LIMIT 10; |
258 | 2.56 |
During our testing we found we have a problem with performance of filtered dates which is now fixed in the product. The results are before the fix:
SELECT item_site_id,leaf_categ_id,count(item_id) FROM ebay1 WHERE auct_end_dt='2011/04/15' AND auct_start_dt='2011/03/16' GROUP BY item_site_id,leaf_categ_id LIMIT 10 |
201 | 30.03 |
Count DISTINCT with or without WHERE criteria
Count Distinct Query | Result | Hive(sec) | Jethro(sec) |
SELECT count(distinct item_id) FROM ebay1 WHERE leaf_categ_id =33704; |
398,869 | 189 | 33.8 |
SELECT count(distinct leaf_categ_id) FROM ebay1; | 38,042 | 181 | 0.065 |
SELECT count (distinct auct_end_dt) FROM ebay1 WHERE prdct_aspct_nm='Part Type'; |
365 | 172 | 7.94 |
SELECT count (distinct product_id) FROM ebay1; | 7,126,183 | 249 | 0.51 |
SELECT count (distinct aspct_vlu_nm) FROM ebay1; | 52.176.532 | 1351 | 0.49 |
SELECT count(distinct load_ts) FROM ebay1; | 1.386 | 187 | 0.054 |
SELECT count (distinct leaf_categ_name) FROM ebay1; | 45.496 | 198 | 0.033 |
The results indicate that Jethro’s is faster across the board when compared to Hive 0.12. Hive performance is relatively constant across all queries tested and was not affected by the actual result set. This is consistent with Hive’s design that is based on scanning all the records for every query. Jethro’s results were highly correlated to the size of the result set an – the more you drill down the faster it gets. Some queries that require only index access (e.g. count (distinct x)) typically returned in less than 1 second.
In conclusion, the test demonstrated Jethro’s technology ability to support interactive queries over data in Hadoop. It consistently performed better than the latest version of Hive and could potentially provide eBay users migrating to Hadoop a performance boost.
About JethroData
JethroData is Big Data software company developing a unique index-based SQL engine for Hadoop. It is based in Israel, with offices in NY. Jethro’s technology has been in development since 2011 and was recently released for beta testing. JethroData is backed by Pitango Venture Capital, the largest VC in Israel.
For more information visit us on www.jethrodata.com