Spark FAQ

Is Spark a modified version of Hadoop?

No. Spark is a completely separate codebase optimized for low latency, although it can load data from any Hadoop input source (InputFormat). You can run Spark on the same cluster as Hadoop using Apache Mesos.

Does Spark require a modified version of Scala?

No. Spark requires neither changes to Scala nor any compiler plugins.

What happens when a cached dataset does not fit in memory?

Spark can either spill it to disk or recompute the partitions that don't fit in RAM each time they are requested. By default, it uses recomputation, but you can use the DiskSpillingCache to have it spill to disk:

System.setProperty("spark.cache.class", "spark.DiskSpillingCache")
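
As a rough sketch of where that call might sit in a program: the snippet below assumes the property is read when the SparkContext is constructed, so it is set first. That ordering is an assumption, not something stated above. The object name CacheConfig and the helper useDiskSpillingCache are hypothetical, and the SparkContext line is left as a comment so the snippet runs without Spark on the classpath.

```scala
object CacheConfig {
  // Property key and cache class name, both taken from the answer above.
  val CacheClassKey = "spark.cache.class"
  val DiskSpillingCacheClass = "spark.DiskSpillingCache"

  // Hypothetical helper: selects the disk-spilling cache and
  // returns the value that is now set for the property.
  def useDiskSpillingCache(): String = {
    System.setProperty(CacheClassKey, DiskSpillingCacheClass)
    System.getProperty(CacheClassKey)
  }

  def main(args: Array[String]): Unit = {
    val active = useDiskSpillingCache()
    println(CacheClassKey + " = " + active)
    // Assumption: the context would be created only after this point, e.g.
    // val sc = new spark.SparkContext("local[2]", "CacheDemo")
  }
}
```

Keeping the key and class name in one place avoids typos in the property string, which would otherwise fail silently because an unknown system property is simply ignored by the JVM.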

Do I need to install Mesos to use Spark?

You can run Spark locally (possibly on multiple cores) without Mesos by passing local[N] as the master URL, where N is the number of parallel threads you want. To run on a cluster, you do need to set up Mesos; follow the instructions on the wiki.
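
The local[N] master URL is just a string, so it can be built from a thread count. The sketch below is illustrative only: the object MasterUrl and its methods are hypothetical helpers, not part of Spark, and the SparkContext line is shown as a comment (assuming the master URL is the first constructor argument) so the snippet runs without Spark installed.

```scala
object MasterUrl {
  // Builds the local master URL described above: local[N],
  // where N is the number of parallel threads.
  def local(threads: Int): String = {
    require(threads > 0, "thread count must be positive")
    "local[" + threads + "]"
  }

  // Hypothetical convenience: one thread per core visible to the JVM.
  def localAllCores(): String =
    local(Runtime.getRuntime.availableProcessors)

  def main(args: Array[String]): Unit = {
    val master = local(4)
    println(master)
    // Assumption: the master URL is passed when creating the context, e.g.
    // val sc = new spark.SparkContext(master, "LocalDemo")
  }
}
```

Deriving N from the number of available cores is one reasonable default for local testing; a fixed small value such as 2 may be preferable when the machine is shared.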

I don't know Scala; how hard is it to pick it up to use Spark?

Scala is pretty easy to pick up, especially if you have experience in Java. Check out First Steps to Scala for a quick introduction, the Scala tutorial for Java programmers, or the free online version of the book Programming in Scala.

What Scala versions does Spark support?

Spark currently supports both Scala 2.8 and 2.9.

What license is Spark under?

Spark is open source under the liberal BSD license.

How can I contribute to Spark?

Contact the mailing list or send us a pull request on GitHub. We're glad to hear about your experience using Spark and to accept patches.

Where can I get more help?

Please post on the spark-user mailing list. We'll be glad to help!