Spark¶
Crail can be used to increase performance or enhance flexibility in Apache Spark. We provide multiple plugins to allow Crail to be used as:
HDFS Adapter: input and output
Spark-IO: shuffle data and broadcast store
HDFS Adapter¶
The Crail HDFS adapter is provided with every Crail deployment. The HDFS adpater allows to replace every HDFS path with a path on Crail. However for it to be used for input and output in Spark the jar file paths have to be added to the Spark configuration spark-defaults.conf:
spark.driver.extraClassPath $CRAIL_HOME/jars/*
spark.executor.extraClassPath $CRAIL_HOME/jars/*
Data in Crail can be accessed by prepending the value of crail.namenode.address
from crail-site.conf to any HDFS path. For example crail://localhost:9060/test
accesses /test
in Crail.
Note that Crail works independent of HDFS and does not interact with HDFS in
any way. However Crail does not completely replace HDFS since we do not offer
durability and fault tolerance cf. Introduction.
A good fit for Crail is for example inter-job data that can be recomputed
from the original data in HDFS.
Spark-IO¶
Crail-Spark-IO contains various I/O accleration plugins for Spark tailored to high-performance network and storage hardware (RDMA, NVMef, etc.). Spark-IO is not provided with the default Crail deployment but can be obtained here. Spark-IO currently contains two IO plugins: a shuffle engine and a broadcast module. Both plugins inherit all the benefits of Crail such as very high performance (throughput and latency) and multi-tiering (e.g., DRAM and flash).
Requirements¶
Spark >= 2.0
Java 8
Maven
Crail >= 1.0
Building¶
To build Crail execute the following steps:
Configure Spark¶
To configure the crail shuffle plugin add the following lines to spark-defaults.conf
spark.shuffle.manager org.apache.spark.shuffle.crail.CrailShuffleManager
spark.driver.extraClassPath $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
spark.executor.extraClassPath $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
Since Spark version 2.0.0, broadcast is no longer an exchangeable plugin, unfortunately. To use the Crail broadcast plugin in Spark it has to be manually added to Spark’s BroadcastManager.scala.