What is the difference between cache and persist?

Question

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?

User · Answer

There is no difference. From RDD.scala:

    /** Persist this RDD with the default storage level (MEMORY_ONLY). */
    def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

    /** Persist this RDD with the default storage level (MEMORY_ONLY). */
    def cache(): this.type = persist()

User · Answer

For the impatient:

Same: without passing an argument, persist() and cache() are the same, with default settings: MEMORY_ONLY for an RDD, MEMORY_AND_DISK for a Dataset.

Difference: unlike cache(), persist() allows you to pass an argument inside the parentheses in order to specify the level:

    persist(MEMORY_ONLY)
    persist(MEMORY_ONLY_SER)
    persist(MEMORY_AND_DISK)
    persist(MEMORY_AND_DISK_SER)
    persist(DISK_ONLY)

Voilà!
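The two calls can be compared side by side in a short sketch. It assumes a local Spark session created just for illustration (in spark-shell, spark and sc already exist), and the names words, cached and onDisk are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration only; spark-shell already provides `spark` and `sc`.
val spark = SparkSession.builder().master("local[*]").appName("cache-vs-persist").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("Gnu", "Cat", "Rat", "Dog"), 2)

// cache() takes no argument: on an RDD it is persist() with MEMORY_ONLY.
val cached = words.map(_.toUpperCase).cache()

// persist(level) lets the caller choose the storage level explicitly.
val onDisk = words.map(_.length).persist(StorageLevel.DISK_ONLY)

// Both calls only mark the RDD; data is stored when the first action runs.
cached.count()
onDisk.count()

val cachedLevel = cached.getStorageLevel
val onDiskLevel = onDisk.getStorageLevel
println(s"cache(): $cachedLevel, persist(DISK_ONLY): $onDiskLevel")

// Release the stored partitions once they are no longer needed.
cached.unpersist()
onDisk.unpersist()
```

Passing the level is the only thing persist() adds; everything else about the two RDDs behaves the same.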

User · Answer

Spark provides five basic storage levels (replicated variants such as MEMORY_ONLY_2, and OFF_HEAP, also exist):

    MEMORY_ONLY
    MEMORY_ONLY_SER
    MEMORY_AND_DISK
    MEMORY_AND_DISK_SER
    DISK_ONLY

On an RDD, cache() will use MEMORY_ONLY. If you want to use something else, use persist(StorageLevel.<type>).

By default, persist() on an RDD will store the data in the JVM heap as unserialized objects.

User · Answer

With cache(), you use only the default storage level:

    MEMORY_ONLY for RDD
    MEMORY_AND_DISK for Dataset

With persist(), you can specify which storage level you want for both RDD and Dataset.

From the official docs:

"You can mark an RDD to be persisted using the persist() or cache() methods on it. ... each persisted RDD can be stored using a different storage level. The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory)."

Use persist() if you want to assign a storage level other than MEMORY_ONLY to the RDD, or other than MEMORY_AND_DISK to the Dataset.

See also the "Which Storage Level to Choose?" section of the official Spark RDD programming guide.
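The Dataset default can be checked the same way. This is a minimal sketch, assuming a local session and Spark 2.1 or later (where Dataset.storageLevel is available); ds and the level* names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("dataset-cache").getOrCreate()
import spark.implicits._

val ds = Seq(1, 2, 3, 4).toDS()

// A Dataset that was never marked for caching reports StorageLevel.NONE.
val levelBefore = ds.storageLevel

// cache() on a Dataset uses the Dataset default, MEMORY_AND_DISK.
ds.cache()
val levelAfterCache = ds.storageLevel

// To use another level, drop the cached entry first, then persist explicitly.
ds.unpersist()
ds.persist(StorageLevel.MEMORY_ONLY)
val levelAfterPersist = ds.storageLevel

println(s"before: $levelBefore, cache(): $levelAfterCache, persist(MEMORY_ONLY): $levelAfterPersist")
ds.unpersist()
```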

User · Answer

cache() and persist() are both used to improve the performance of Spark computations. They save intermediate results so that these can be reused in subsequent stages.

The only difference between cache() and persist() is that cache() saves the intermediate results at the default storage level only (in memory, for an RDD), while persist() lets us save them at any of 5 storage levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY.

User · Answer

The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.

But with persist() we can save the intermediate results at 5 storage levels:

    MEMORY_ONLY
    MEMORY_AND_DISK
    MEMORY_ONLY_SER
    MEMORY_AND_DISK_SER
    DISK_ONLY

    /** Persist this RDD with the default storage level (MEMORY_ONLY). */
    def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

    /** Persist this RDD with the default storage level (MEMORY_ONLY). */
    def cache(): this.type = persist()

Caching or persistence are optimization techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are thus kept in memory (the default) or in more solid storage like disk, and/or replicated. RDDs can be cached using the cache operation. They can also be persisted using the persist operation.

persist, cache

These functions can be used to adjust the storage level of an RDD. When freeing up memory, Spark will use the storage level identifier to decide which partitions should be kept. The parameterless variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY).

Warning: Once the storage level has been changed, it cannot be changed again!

Warning: Cache judiciously (see "(Why) do we need to call cache or persist on a RDD").

Just because you can cache an RDD in memory doesn't mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in doing so, recomputation can be faster than the price paid by the increased memory pressure.

It should go without saying that if you only read a dataset once there is no point in caching it; it will actually make your job slower. The size of cached datasets can be seen from the Spark Shell.

Listing Variants:

    def cache(): RDD[T]
    def persist(): RDD[T]
    def persist(newLevel: StorageLevel): RDD[T]

See the example below:

    val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
    c.getStorageLevel
    res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
    c.cache
    c.getStorageLevel
    res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)

Note: Due to the very small and purely syntactic difference between caching and persistence of RDDs, the two terms are often used interchangeably.

Caching can improve the performance of your application to a great extent.
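The warning about changing the storage level can be seen in a small sketch. It assumes a local session created for illustration; rdd, changeAttempt and finalLevel are hypothetical names. As far as I know, asking for a different level on an already-persisted RDD throws an UnsupportedOperationException, and calling unpersist() first resets the level so that a new one can be assigned.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import scala.util.Try

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("change-level").getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 100, 2)
rdd.persist(StorageLevel.MEMORY_ONLY)

// Requesting a different level while one is already assigned is rejected.
val changeAttempt = Try(rdd.persist(StorageLevel.DISK_ONLY))
val changeRejected = changeAttempt.isFailure

// unpersist() resets the level to NONE, after which a new level is accepted.
rdd.unpersist()
rdd.persist(StorageLevel.DISK_ONLY)
val finalLevel = rdd.getStorageLevel

println(s"change rejected: $changeRejected, final level: $finalLevel")
rdd.unpersist()
```

So "cannot be changed again" holds for a persisted RDD as it stands; switching levels means unpersisting and persisting afresh, which discards whatever was already stored.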
