Tips for Installing Jethro

Optimal Queries - Use as Many Filters as Possible

Jethro process queries by first evaluating the WHERE clause and determining the rows needed for the query. It then fetched column data only for those rows. The narrower the query, the faster it performs

Optimal Data Types

Use numeric formats (INT/BIGINT, FLOAT/DOUBLE) whenever possible - any string column that holds only numeric values should be converted. This is especially true for high cardinality columns.
Use TIMESTAMP for Date/Time columns. Jethro creates multiple date-related indexes for such columns to improve performance of date-range queries.

Partitions for Large FACT Tables

A TIMESTAMP column is typically best choice for partitions as it simplifies maintenance tasks like purging old data
Jethro recommends a total of 5-25 partitions although it comfortably supports hundreds of partitions
Jethro partition key can use range values. For example: PARTITION BY RANGE(ts_col) EVERY (INTERVAL ‘7’ DAY)

Cache Space

Jethro uses server-side caching for metadata and frequently used file fragments. The greater the space the more data it will be able to store. Note that the benefit of the cache will be realized over time as filling up the cache can take some time.
Cache space can be defined when an instance is created or updated later on by editing the local-conf.ini file.
Jethro automatically enables query-result cache. The query-result cache is stored in HDFS and does not require local disk space.

Consolidate Tables When Possible

While Jethro optimizes JOINs and automatically performs Star-Transformation, it is better to avoid them when not required.
Jethro's columnar format and effective compression minimize the storage impact of such denormalization.

Hardware considerations: more is better!

More CPU and RAM Improves query speed as Jethro takes advantage of multi-threading. It also improves concurrency as more user/queries can be served in parallel.
10Gb network connectivity to cluster will speed up HDFS access.
Local drives for caching – SSD is preferable.
Trial servers can start with as little as 64GB and 8 cores.

Use a Cluster of Jethro Servers

Multiple servers linearly increase Jethro’s capacity for concurrent users and queries.
When performing frequent incremental loads, it is recommended to run the JethroLoader on a different server.

Data sorting can improve performance

If a large number of queries filter by a specific column (that is not already a partition column) it could be beneficial to pre-sort the input data by such column before it’s loaded into Jethro.

Join Indexes

When attributes of large dimensions are often used as a filter it is recommended to define them as a JOIN INDEX on the fact table. There is no limit to the number of JOIN INDEXES that can be defined.

Jethro without Hadoop

Jethro is capable of using other storage systems besides Hadoop’s HDFS. These include a local filesystem, cloud storage (eg S3) or network storage (SAN/NAS).
When the dataset used with Jethro can fit in a local filesystem it is often the best solution as it avoids Hadoop overhead.

Load “Overwrite” for table update with no downtime

When a dimension changes and need to be reloaded you can use Jethro’s Load Overwrite feature. It loads the updated table and only when the process is completed the tables are swapped.

Use ALTER TABLE to add columns on the fly

Jethro, being a column oriented design, can dynamically add (or drop) columns without having to reload the table. The value NULL will be used for the new column over existing rows

Use Jethro’s “SHOW” SQL command to learn about Jethro internals

Top 11 Tips for Optimizing Jethro Performance