Apache Spark: Memory Management and Graceful Degradation
Dec 11, 2014

Many of the concepts of Apache Spark are straightforward and easy to understand, but a few are badly misunderstood. One of the most common misconceptions is the belief that "Spark is only relevant with datasets that can fit into memory, otherwise it will crash".
This mistake is easy to make, since Spark is often described as "Hadoop that uses RAM more efficiently", but it is a mistake nonetheless.
By default, Spark does its best to keep the datasets it handles in memory. But when a dataset is too large to fit into memory, these objects are automatically (or should I say auto-magically) spilled to disk. This is one of the main features of Spark, captured by the expression "graceful degradation", and it is well illustrated by two charts in Matei Zaharia's dissertation, An Architecture for Fast and General Data Processing on Large Clusters:
[Figure: the two charts from the dissertation illustrating graceful degradation]