What is Spark?
Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
To run programs faster, Spark provides primitives for in-memory cluster computing: a job can load data into memory and query it repeatedly, far faster than disk-based systems such as Hadoop MapReduce.
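As a minimal sketch of that in-memory workflow in Scala, Spark's native language (the master URL, file path, and log contents below are illustrative assumptions, not taken from this page):

```scala
import spark.SparkContext

// Connect to a cluster; "local" runs Spark in-process for testing.
val sc = new SparkContext("local", "LogQuery")

// The path is a hypothetical placeholder; any Hadoop-supported source works.
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.contains("ERROR"))

errors.cache()   // ask Spark to keep this dataset in cluster memory
errors.count()   // first action reads from disk and populates the cache
errors.filter(_.contains("timeout")).count()   // later queries hit memory
```

The point of `cache()` is that every query after the first operates on the in-memory dataset instead of rereading the input, which is what makes repeated interactive queries fast.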
What can it do?
Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can run up to 100x faster than Hadoop MapReduce. However, you can use Spark for general data processing too. Check out our example jobs.
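To illustrate why keeping data in memory helps iterative algorithms, here is a sketch of a gradient-descent loop over a cached dataset (`sc` is an existing SparkContext; `points.txt`, `parsePoint`, and the one-dimensional gradient formula are hypothetical placeholders for a real logistic-regression job):

```scala
// Load the training points once and cache them in cluster memory.
val points = sc.textFile("points.txt")
               .map(parsePoint)   // e.g. yields (x: Double, label: Double)
               .cache()

var w = 0.0
for (i <- 1 to 10) {
  // Each pass reads the cached points from memory; a chain of
  // MapReduce jobs would reread the input from disk every iteration.
  val gradient = points
    .map { case (x, y) => (1 / (1 + math.exp(-y * w * x)) - 1) * y * x }
    .reduce(_ + _)
  w -= gradient
}
```

Because the dataset is loaded and parsed only once, each subsequent iteration pays only the cost of the in-memory map and reduce, which is where the large speedups over Hadoop MapReduce come from.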
While Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.
Who uses it?
Spark was developed in the UC Berkeley AMPLab. Several groups of researchers at Berkeley use it to run large-scale applications such as spam filtering and traffic prediction. It also accelerates data analytics at Conviva, Quantifind, and other companies; in total, 14 companies have contributed to Spark! Spark is open source under a BSD license, so you can download it and try it out.