Spark

Crail can be used to increase performance or enhance flexibility in Apache Spark. We provide multiple plugins that integrate Crail with Spark; they are described in the sections below.

HDFS Adapter

The Crail HDFS adapter is provided with every Crail deployment. The HDFS adapter allows any HDFS path to be replaced with a path on Crail. However, for Spark to use it for input and output, the Crail jar files have to be added to the classpath entries in the Spark configuration spark-defaults.conf:

spark.driver.extraClassPath      $CRAIL_HOME/jars/*
spark.executor.extraClassPath    $CRAIL_HOME/jars/*

Data in Crail can be accessed by prepending the value of crail.namenode.address from crail-site.conf to any HDFS path. For example, crail://localhost:9060/test accesses /test in Crail. Note that Crail works independently of HDFS and does not interact with HDFS in any way. However, Crail does not completely replace HDFS, since it offers neither durability nor fault tolerance (cf. Introduction). A good fit for Crail is, for example, inter-job data that can be recomputed from the original data in HDFS.
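
As an illustration, the following Scala snippet is a minimal sketch of reading and writing through the Crail HDFS adapter from a spark-shell, assuming the classpath configuration above, a Crail namenode at localhost:9060, and hypothetical /input and /output paths:

// The crail:// scheme routes these paths through the Crail HDFS adapter;
// /input and /output are hypothetical example paths.
val lines = spark.read.textFile("crail://localhost:9060/input")
lines.write.text("crail://localhost:9060/output")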

Spark-IO

Crail-Spark-IO contains various I/O acceleration plugins for Spark tailored to high-performance network and storage hardware (RDMA, NVMe-over-Fabrics, etc.). Spark-IO is not provided with the default Crail deployment but can be obtained from its GitHub repository. Spark-IO currently contains two I/O plugins: a shuffle engine and a broadcast module. Both plugins inherit all the benefits of Crail, such as very high performance (throughput and latency) and multi-tiering (e.g., DRAM and flash).

Requirements

  • Spark >= 2.0

  • Java 8

  • Maven

  • Crail >= 1.0

Building

To build Crail-Spark-IO, execute the following steps (a command sketch follows the list):

  1. Obtain a copy of Crail-Spark-IO from GitHub

  2. Make sure your local maven repository contains Crail, if not build Crail from source

  3. Run: mvn -DskipTests install
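
A minimal command sketch of these steps is shown below; the repository URL placeholder and directory name are assumptions and may differ for your setup:

# Step 1: obtain the Crail-Spark-IO sources from GitHub
git clone <crail-spark-io repository URL>
cd crail-spark-io
# Step 2: if Crail is missing from the local maven repository,
#         run "mvn -DskipTests install" in the Crail source tree first
# Step 3: build and install Crail-Spark-IO
mvn -DskipTests install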

Configure Spark

To configure the Crail shuffle plugin, add the following lines to spark-defaults.conf:

spark.shuffle.manager           org.apache.spark.shuffle.crail.CrailShuffleManager

spark.driver.extraClassPath     $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
spark.executor.extraClassPath   $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
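
Once configured, the shuffle plugin is used transparently by every job that contains a shuffle stage. The following Scala snippet is a minimal sketch, assuming a spark-shell started with the configuration above; the wide transformation reduceByKey forces a shuffle and therefore exercises the CrailShuffleManager:

// Any wide transformation (reduceByKey, groupByKey, sortByKey, ...) triggers a shuffle,
// which with the configuration above is handled by the Crail shuffle engine.
val pairs = spark.sparkContext.parallelize(1 to 1000000).map(i => (i % 100, i))
val sums = pairs.reduceByKey(_ + _)
sums.collect().foreach(println)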

Unfortunately, since Spark version 2.0.0 broadcast is no longer an exchangeable plugin. To use the Crail broadcast plugin in Spark, it has to be manually added to Spark's BroadcastManager.scala.

Crail-TeraSort

SQL