[apache-spark] What is the difference between map and flatMap and a good use case for each?

map: It returns a new RDD by applying a function to each element of the RDD. The function passed to map returns exactly one item per input element.

flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.

Also, the function passed to flatMap can return 0 or more elements per input element, so it can drop elements or expand one element into several (see the sketch below).
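Because the function may return zero elements, flatMap can act like a combined map and filter. A minimal sketch, assuming a running SparkContext named sc as in the examples below (the even/odd example is illustrative, not from the original post):

sc.parallelize([1, 2, 3, 4]).flatMap(lambda x: [x * 10] if x % 2 == 0 else []).collect()  # empty list drops the element

Output: [20, 40]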

For example:

sc.parallelize([3,4,5]).map(lambda x: list(range(1,x))).collect()  # list() so each element becomes a list under Python 3

Output: [[1, 2], [1, 2, 3], [1, 2, 3, 4]]

sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()

Output: [1, 2, 1, 2, 3, 1, 2, 3, 4] (notice the output is flattened into a single list)
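As a more realistic use case for each (a sketch, assuming an RDD of text lines; the word-count example is illustrative, not from the original post): flatMap fits the one-to-many step of splitting each line into words (an empty line yields zero words), while map fits the strictly one-to-one step of pairing each word with a count of 1.

lines = sc.parallelize(["hello spark", "hello world", ""])
words = lines.flatMap(lambda line: line.split())  # one line -> 0 or more words
pairs = words.map(lambda word: (word, 1))  # one word -> exactly one pair
pairs.reduceByKey(lambda a, b: a + b).collect()

Output: [('hello', 2), ('spark', 1), ('world', 1)] (element order may vary)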

Source: https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey/