Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

Question

Getting strange behavior when calling a function outside of a closure:

when the function is in an object, everything is working
when the function is in a class, I get:

    Task not serializable: java.io.NotSerializableException: testing

The problem is I need my code in a class and not an object. Any idea why this is happening? Is a Scala object serialized (by default)?

This is a working code example:

    object working extends App {
      val list = List(1, 2, 3)

      val rddList = Spark.ctx.parallelize(list)
      // calling function outside closure
      val after = rddList.map(someFunc(_))

      def someFunc(a: Int) = a + 1

      after.collect().map(println(_))
    }

This is the non-working example:

    object NOTworking extends App {
      new testing().doIT
    }

    // adding extends Serializable won't help
    class testing {
      val list = List(1, 2, 3)
      val rddList = Spark.ctx.parallelize(list)

      def doIT = {
        // again calling the function someFunc
        val after = rddList.map(someFunc(_))
        // this will crash (Spark is lazy)
        after.collect().map(println(_))
      }

      def someFunc(a: Int) = a + 1
    }

User · Answer

Complete talk fully explaining the problem, which proposes a great paradigm shifting way to avoid these serialization problems: https://github.com/samthebest/dump/blob/master/sams-scala-tutorial/serialization-exceptions-and-memory-leaks-no-ws.md

The top voted answer is basically suggesting throwing away an entire language feature - that is no longer using methods and only using functions. Indeed in functional programming methods in classes should be avoided, but turning them into functions isn't solving the design issue here (see above link).

As a quick fix in this particular situation you could just use the @transient annotation to tell it not to try to serialise the offending value (here, Spark.ctx is a custom class not Spark's one following OP's naming):

@transient
val rddList = Spark.ctx.parallelize(list)

You can also restructure code so that rddList lives somewhere else, but that is also nasty.
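The effect of @transient can be seen without Spark at all. Below is a minimal sketch using plain Java serialization; Resource, WithoutTransient, WithTransient and serializes are hypothetical names made up for illustration, and Resource merely stands in for a value that cannot be serialized. Note that the enclosing class is assumed to extend Serializable here, since @transient only has an effect when the object around the field gets serialized.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object TransientSketch {
  // Hypothetical stand-in for a value that cannot be serialized.
  class Resource

  // Returns true if plain Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  class WithoutTransient extends Serializable {
    val resource = new Resource
    def someFunc(a: Int): Int = a + 1
    // Calling an instance method makes the closure hold a reference to `this`.
    def closure: Int => Int = a => someFunc(a)
  }

  class WithTransient extends Serializable {
    // The field is skipped when the enclosing instance is serialized.
    @transient val resource = new Resource
    def someFunc(a: Int): Int = a + 1
    def closure: Int => Int = a => someFunc(a)
  }

  def main(args: Array[String]): Unit = {
    println(serializes(new WithoutTransient().closure))
    println(serializes(new WithTransient().closure))
  }
}
```

Keep in mind that a transient field comes back as null after deserialization, so this only helps when the closure never touches that field on the workers.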

The Future is Probably Spores

In the future, Scala will include "spores", which should give us fine-grained control over exactly what does and does not get pulled in by a closure. Furthermore, this should turn all mistakes of accidentally pulling in non-serializable types (or any unwanted values) into compile errors, rather than the horrible runtime exceptions and memory leaks we get now.

http://docs.scala-lang.org/sips/pending/spores.html

A tip on Kryo serialization

When using Kryo, make registration required; this means you get errors instead of memory leaks:

"Finally, I know that kryo has kryo.setRegistrationOptional(true) but I am having a very difficult time trying to figure out how to use it. When this option is turned on, kryo still seems to throw exceptions if I haven't registered classes."

Strategy for registering classes with kryo

Of course this only gives you type-level control not value-level control.
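In Spark, one way to make registration required is through configuration rather than by calling Kryo directly. The following is a rough sketch, assuming Spark is on the classpath; MyRecord and the application name are hypothetical placeholders, while spark.serializer, spark.kryo.registrationRequired and registerKryoClasses are standard Spark settings and API.

```scala
import org.apache.spark.SparkConf

// Hypothetical application class that ends up in serialized data.
case class MyRecord(id: Int, name: String)

object KryoConfSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("kryo-registration-sketch") // placeholder name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Make Kryo throw on any class that was not registered explicitly.
    .set("spark.kryo.registrationRequired", "true")
    // Register every class you expect Kryo to handle.
    .registerKryoClasses(Array[Class[_]](classOf[MyRecord]))
}
```

With this setting, an unregistered class should surface as an exception naming the class, which tells you what to add to the registration list.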

... more ideas to come.

User · Answer

FYI, in Spark 2.4 a lot of you will probably encounter this issue. Kryo serialization has gotten better, but in many cases you cannot use spark.kryo.unsafe=true or the naive Kryo serializer.

For a quick fix, try changing the following in your Spark configuration:

    spark.kryo.unsafe="false"

OR

    spark.serializer="org.apache.spark.serializer.JavaSerializer"

I modify custom RDD transformations that I encounter or personally write by using explicit broadcast variables and the new inbuilt twitter-chill API, converting them from rdd.map(row => ...) to rdd.mapPartitions(partition => { ... }) functions.

Example

Old (not-great) way:

    val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
    val outputRDD = rdd.map(row => {
      val value = sampleMap.get(row._1)
      value
    })

Alternative (better) way:

    import com.twitter.chill.MeatLocker

    val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
    val brdSerSampleMap = spark.sparkContext.broadcast(MeatLocker(sampleMap))

    rdd.mapPartitions(partition => {
      // unwrap the broadcast value once per partition
      val deSerSampleMap = brdSerSampleMap.value.get
      partition.map(row => {
        // look up in the deserialized copy, not the driver-side sampleMap
        val value = deSerSampleMap.get(row._1)
        value
      }).toIterator
    })

This new way only reads the broadcast variable once per partition, which is better. You will still need to use Java serialization if you do not register classes.

User · Answer

I had a similar experience.

The error was triggered when I initialized a variable on the driver (master), but then tried to use it on one of the workers. When that happens, Spark Streaming will try to serialize the object to send it over to the worker, and fail if the object is not serializable.

I solved the error by making the variable static.

Previous non-working code:

    private final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();

Working code:

    private static final PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();

Credits:

https://docs.microsoft.com/en-us/answers/questions/35812/sparkexception-job-aborted-due-to-stage-failure-ta.html (the answer of pradeepcheekatla-msft)
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
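Scala has no static keyword; the closest analogue to the fix above is to hold the value in an object, so that closures reach it statically instead of through a captured instance. Here is a small sketch of that idea with plain Java serialization; Normalizer, Shared, InstanceField and ObjectField are hypothetical names, and Normalizer only stands in for a non-serializable helper such as PhoneNumberUtil.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object StaticHolderSketch {
  // Hypothetical non-serializable helper.
  class Normalizer { def normalize(s: String): String = s.trim.toLowerCase }

  // Members of an object are reached statically, not via a captured instance.
  object Shared { val normalizer = new Normalizer }

  // Returns true if plain Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  class InstanceField extends Serializable {
    private val normalizer = new Normalizer
    // Reads this.normalizer, so the closure drags `this` (and the helper) along.
    def closure: String => String = s => normalizer.normalize(s)
  }

  class ObjectField extends Serializable {
    // Refers only to the object member, so nothing is captured.
    def closure: String => String = s => Shared.normalizer.normalize(s)
  }

  def main(args: Array[String]): Unit = {
    println(serializes(new InstanceField().closure))
    println(serializes(new ObjectField().closure))
  }
}
```

The trade-off is the same as with a Java static: each JVM initializes its own copy, so this suits stateless helpers rather than values that must carry driver-side state.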

User · Answer

RDDs extend the Serializable interface, so this is not what's causing your task to fail. Now this doesn't mean that you can serialize an RDD with Spark and avoid NotSerializableException.

Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, an RDD's elements are partitioned across the nodes of the cluster, but Spark abstracts this away from the user, letting the user interact with the RDD (collection) as if it were a local one.

Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others), your transformation code (closure) is:

1. serialized on the driver node,
2. shipped to the appropriate nodes in the cluster,
3. deserialized,
4. and finally executed on the nodes.

You can of course run this locally (as in your example), but all those phases (apart from shipping over the network) still occur. This lets you catch any bugs even before deploying to production.

What happens in your second case is that you are calling a method, defined in class testing, from inside the map function. Spark sees that, and since methods cannot be serialized on their own, Spark tries to serialize the whole testing class, so that the code will still work when executed in another JVM. You have two possibilities:

Either you make class testing serializable, so the whole class can be serialized by Spark:

    import org.apache.spark.{SparkContext, SparkConf}

    object Spark {
      val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
    }

    object NOTworking extends App {
      new Test().doIT
    }

    class Test extends java.io.Serializable {
      val rddList = Spark.ctx.parallelize(List(1, 2, 3))

      def doIT() = {
        val after = rddList.map(someFunc)
        after.collect().foreach(println)
      }

      def someFunc(a: Int) = a + 1
    }

or you make someFunc a function instead of a method (functions are objects in Scala), so that Spark will be able to serialize it:

    import org.apache.spark.{SparkContext, SparkConf}

    object Spark {
      val ctx = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
    }

    object NOTworking extends App {
      new Test().doIT
    }

    class Test {
      val rddList = Spark.ctx.parallelize(List(1, 2, 3))

      def doIT() = {
        val after = rddList.map(someFunc)
        after.collect().foreach(println)
      }

      val someFunc = (a: Int) => a + 1
    }

A similar, but not identical, problem with class serialization may be of interest to you; you can read about it in this Spark Summit 2013 presentation.

As a side note, you can rewrite rddList.map(someFunc(_)) to rddList.map(someFunc); they are exactly the same. Usually, the second is preferred as it's less verbose and cleaner to read.

EDIT (2015-03-15): SPARK-5307 introduced SerializationDebugger, and Spark 1.3.0 is the first version to use it. It adds the serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help the user find the object.

In OP's case, this is what gets printed to stdout:

    Serialization stack:
        - object not serializable (class: testing, value: testing@2dfe2f00)
        - field (class: testing$$anonfun$1, name: $outer, type: class testing)
        - object (class testing$$anonfun$1, <function1>)
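The method-versus-function difference described above can be reproduced without a cluster. This is only a sketch under assumptions: it uses plain Java serialization in place of Spark's closure serializer, it assumes Scala 2.12 or 2.13 (where function literals are serializable and only capture `this` when their body needs it), and Testing, someMethod, fromMethod, fromFunction and serializes are hypothetical names.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object MethodVsFunctionSketch {
  // Returns true if plain Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Deliberately NOT Serializable, like `testing` in the question.
  class Testing {
    def someMethod(a: Int): Int = a + 1
    val someFunc: Int => Int = (a: Int) => a + 1

    // Roughly what map(someMethod) passes along: a closure holding `this`.
    def fromMethod: Int => Int = a => someMethod(a)

    // Roughly what map(someFunc) passes along: the function object itself.
    def fromFunction: Int => Int = someFunc
  }

  def main(args: Array[String]): Unit = {
    val t = new Testing
    println(serializes(t.fromMethod))
    println(serializes(t.fromFunction))
  }
}
```

Under those assumptions, the first closure should fail to serialize because it drags the whole Testing instance along, while the second should succeed because the function value has no reference back to its enclosing instance.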

User · Answer

The following code:

    import org.apache.spark.sql.functions.udf

    def upper(name: String): String = {
      val uppper: String = name.toUpperCase()
      uppper
    }

    val toUpperName = udf { (EmpName: String) => upper(EmpName) }

    val emp_details = """[
      {"id": "1", "name": "James Butt", "country": "USA"},
      {"id": "2", "name": "Josephine Darakjy", "country": "USA"},
      {"id": "3", "name": "Art Venere", "country": "USA"},
      {"id": "4", "name": "Lenna Paprocki", "country": "USA"},
      {"id": "5", "name": "Donette Foller", "country": "USA"},
      {"id": "6", "name": "Leota Dilliard", "country": "USA"}
    ]"""

    val df_emp = spark.read.json(Seq(emp_details).toDS())
    val df_name = df_emp.select($"id", $"name")
    val df_upperName = df_name.withColumn("name", toUpperName($"name")).filter("id = 5")
    display(df_upperName)

will give this error:

    org.apache.spark.SparkException: Task not serializable
      at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)

Solution: put the method and the UDF in a serializable object.

    import java.io.Serializable

    object obj_upper extends Serializable {
      def upper(name: String): String = {
        val uppper: String = name.toUpperCase()
        uppper
      }
      val toUpperName = udf { (EmpName: String) => upper(EmpName) }
    }

    val df_upperName = df_name.withColumn("name", obj_upper.toUpperName($"name")).filter("id = 5")
    display(df_upperName)

User · Answer

Grega's answer is great in explaining why the original code does not work and two ways to fix the issue. However, this solution is not very flexible; consider the case where your closure includes a method call on a non-Serializable class that you have no control over. You can neither add the Serializable tag to this class nor change the underlying implementation to change the method into a function.

Nilesh presents a great workaround for this, but the solution can be made both more concise and general:

    def genMapper[A, B](f: A => B): A => B = {
      val locker = com.twitter.chill.MeatLocker(f)
      x => locker.get.apply(x)
    }

This function-serializer can then be used to automatically wrap closures and method calls:

    rdd map genMapper(someFunc)

This technique also has the benefit of not requiring the additional Shark dependencies in order to access KryoSerializationWrapper, since Twitter's Chill is already pulled in by core Spark.

User · Answer

I solved this problem using a different approach. You simply need to serialize the objects before passing them through the closure, and de-serialize afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)

Here's an example of how I did it:

    def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
                 (foo: Foo): Bar = {
      kryoWrapper.value.apply(foo)
    }
    val mapper = genMapper(KryoSerializationWrapper(new Blah(abc))) _
    rdd.flatMap(mapper).collectAsMap()

    // a class rather than an object, since it takes a constructor argument
    class Blah(abc: ABC) extends (Foo => Bar) {
      def apply(foo: Foo): Bar = ??? // this is the real function
    }

Feel free to make Blah as complicated as you want: class, companion object, nested classes, references to multiple 3rd party libs.

KryoSerializationWrapper refers to: https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/serialization/KryoSerializationWrapper.scala

User · Answer

I'm not entirely certain that this applies to Scala but, in Java, I solved the NotSerializableException by refactoring my code so that the closure did not access a non-serializable final field.
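The same refactoring does carry over to Scala: copy what the closure needs into a local val first, so the closure captures that value instead of `this`. A minimal sketch, assuming Scala 2.12 or later and using plain Java serialization as a stand-in for Spark; Settings, Job, accessesField, usesLocalCopy and serializes are hypothetical names.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object LocalCopySketch {
  // Hypothetical non-serializable dependency.
  class Settings(val offset: Int)

  // Returns true if plain Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  class Job extends Serializable {
    private val settings = new Settings(1)

    // Reads this.settings inside the closure, so `this` is captured.
    def accessesField: Int => Int = a => a + settings.offset

    // Copies the needed Int out first; the closure captures only that Int.
    def usesLocalCopy: Int => Int = {
      val offset = settings.offset
      a => a + offset
    }
  }

  def main(args: Array[String]): Unit = {
    val job = new Job
    println(serializes(job.accessesField))
    println(serializes(job.usesLocalCopy))
  }
}
```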

User · Answer

I faced a similar issue, and this is what I understand from Grega's answer:

    object NOTworking extends App {
      new testing().doIT
    }

    // adding extends Serializable won't help
    class testing {
      val list = List(1, 2, 3)
      val rddList = Spark.ctx.parallelize(list)

      def doIT = {
        // again calling the function someFunc
        val after = rddList.map(someFunc(_))
        // this will crash (Spark is lazy)
        after.collect().map(println(_))
      }

      def someFunc(a: Int) = a + 1
    }

Your doIT method is trying to serialize the someFunc(_) method, but as methods are not serializable, it tries to serialize class testing, which is again not serializable.

So to make your code work, you should define someFunc inside the doIT method. For example:

    def doIT = {
      // function definition
      def someFunc(a: Int) = a + 1

      val after = rddList.map(someFunc(_))
      after.collect().map(println(_))
    }

And if there are multiple functions coming into the picture, then all those functions should be available in the parent context.
