map: Returns a new RDD by applying a function to each element of the RDD. The function passed to map returns exactly one item per input element.
flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened. The function passed to flatMap can return a sequence of zero or more elements for each input element.
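For example:
sc.parallelize([3, 4, 5]).map(lambda x: list(range(1, x))).collect()
Output: [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
(The list() call is needed in Python 3, where range returns a lazy range object rather than a list.)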
sc.parallelize([3, 4, 5]).flatMap(lambda x: range(1, x)).collect()
Output (note that the result is flattened into a single list): [1, 2, 1, 2, 3, 1, 2, 3, 4]
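To see the difference without a Spark cluster, the two behaviours can be imitated on an ordinary Python list. This is only a rough sketch of the semantics, not how Spark implements them: local_map and local_flat_map are hypothetical helper names made up for this illustration, and they run eagerly on a list rather than lazily on a distributed RDD.

```python
from itertools import chain

def local_map(func, items):
    """Hypothetical stand-in for RDD.map: one output per input element."""
    return [func(x) for x in items]

def local_flat_map(func, items):
    """Hypothetical stand-in for RDD.flatMap: func returns an iterable
    for each element, and all of those iterables are concatenated."""
    return list(chain.from_iterable(func(x) for x in items))

data = [3, 4, 5]

# One (nested) list per input element
mapped = local_map(lambda x: list(range(1, x)), data)
print(mapped)      # [[1, 2], [1, 2, 3], [1, 2, 3, 4]]

# Same function, but the per-element results are merged into one list
flattened = local_flat_map(lambda x: range(1, x), data)
print(flattened)   # [1, 2, 1, 2, 3, 1, 2, 3, 4]

# Returning an empty list drops the element, so the output can be
# shorter (or longer) than the input
evens_twice = local_flat_map(lambda x: [x, x] if x % 2 == 0 else [], data)
print(evens_twice) # [4, 4]
```

The last call shows the "0 or more" case: 3 and 5 contribute nothing, while 4 contributes two items. That is something map cannot do, since it always yields exactly one item per input.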
Source: https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey/