PySpark Transformations and Actions

RDD Transformation Examples

Some examples of RDD transformations are:

flatMap()
map()
reduceByKey()
filter()
sortByKey()
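
To make these transformations concrete, here is a plain-Python sketch of what each one computes on a small local list. It is not PySpark code and runs no cluster: the sample sentences and variable names are made up for illustration, and the corresponding PySpark call is given in each comment.

```python
# Hypothetical sample data, chosen only for illustration.
lines = ["spark makes rdds", "rdds are lazy", "spark is fast"]

# flatMap: each input may yield several outputs, which are flattened into one list.
# PySpark: rdd.flatMap(lambda line: line.split())
words = [word for line in lines for word in line.split()]

# map: exactly one output per input.
# PySpark: rdd.map(lambda word: (word, 1))
pairs = [(word, 1) for word in words]

# reduceByKey: merge the values of each key with a function (here, addition).
# PySpark: rdd.reduceByKey(lambda a, b: a + b)
counts = {}
for key, value in pairs:
    counts[key] = counts.get(key, 0) + value
reduced = list(counts.items())

# filter: keep only the elements for which the predicate is true.
# PySpark: rdd.filter(lambda pair: pair[1] > 1)
frequent = [(key, total) for key, total in reduced if total > 1]

# sortByKey: order (key, value) pairs by key, ascending by default.
# PySpark: rdd.sortByKey()
sorted_pairs = sorted(frequent, key=lambda pair: pair[0])

print(sorted_pairs)
```

In real PySpark each of these steps would return a new RDD rather than a list, and none of them would run until an action is called.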

RDD Action Examples

Some examples of RDD actions are:

count()
first()
max()
reduce()
take()
collect()
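
As a rough guide to what these actions return, the sketch below computes the same values on an ordinary Python list. It is a local stand-in with made-up data, not PySpark itself; the matching PySpark call is shown in each comment.

```python
from functools import reduce

# Hypothetical sample data, chosen only for illustration.
data = [3, 1, 4, 1, 5]

count = len(data)                         # PySpark: rdd.count()
first = data[0]                           # PySpark: rdd.first()
largest = max(data)                       # PySpark: rdd.max()
total = reduce(lambda a, b: a + b, data)  # PySpark: rdd.reduce(lambda a, b: a + b)
first_two = data[:2]                      # PySpark: rdd.take(2)
everything = list(data)                   # PySpark: rdd.collect()

print(count, first, largest, total, first_two, everything)
```

Unlike transformations, every one of these returns a plain value or list to the driver program instead of another RDD.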

Let us use some code to understand how an RDD is fault tolerant and immutable, how lazy evaluation works, and how its data is distributed.

Let's define an RDD. When data is converted to an RDD, it is not immediately loaded into memory. Spark first records a lineage graph that describes how the RDD is derived from its source. If a node fails, Spark uses this lineage graph to recompute the lost data partitions, and this is what makes an RDD fault tolerant. The partitions of an RDD are also distributed across multiple nodes in the cluster, which allows the records to be processed in parallel.

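from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the running SparkContext or start one
rdd = sc.parallelize([1, 2, 3, 4, 5])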

The immutability of an RDD can be seen in the next step: we do not change "rdd" itself. Instead we create a new RDD, "squared_rdd", because an RDD cannot be modified once it has been created.

Also, map() is a transformation, and because of lazy evaluation PySpark only records transformations in a DAG; nothing is computed until an RDD action is encountered.

squared_rdd = rdd.map(lambda x: x**2)

filtered_rdd = squared_rdd.filter(lambda x: x % 2 == 0)

Now that all the transformations we need are defined, we call an RDD action, collect(), and only at this point does the computation actually take place. Here it returns [4, 16].

result = filtered_rdd.collect()
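
The three ideas in this walkthrough (immutability, lazy evaluation, and recomputation from lineage) can be imitated with a small toy class in plain Python. This is only a sketch under stated assumptions: ToyRDD, its fields, and the call log are hypothetical names invented here, it runs on one machine, and it is not how Spark is implemented.

```python
class ToyRDD:
    """Hypothetical toy stand-in for an RDD: source data plus recorded steps."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # never modified after creation
        self._lineage = tuple(lineage)  # recorded (kind, function) steps

    def map(self, func):
        # Transformation: returns a NEW ToyRDD and only records the step.
        return ToyRDD(self._source, self._lineage + (("map", func),))

    def filter(self, func):
        # Transformation: returns a NEW ToyRDD and only records the step.
        return ToyRDD(self._source, self._lineage + (("filter", func),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        items = list(self._source)
        for kind, func in self._lineage:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items


calls = []  # log of every value the mapping function is applied to


def square(x):
    calls.append(x)
    return x ** 2


rdd = ToyRDD([1, 2, 3, 4, 5])
squared_rdd = rdd.map(square)
filtered_rdd = squared_rdd.filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)     # no work has been done yet
result = filtered_rdd.collect()      # the action triggers the computation
calls_after_action = len(calls)      # square() has now run once per element

original = rdd.collect()             # the first ToyRDD is unchanged
recomputed = filtered_rdd.collect()  # replaying the lineage rebuilds the result

print(calls_before_action, result, calls_after_action, original, recomputed)
```

Because each ToyRDD keeps its source and its list of steps rather than the computed values, the result can always be rebuilt by replaying those steps, which is the same idea Spark relies on when it recovers a lost partition from the lineage graph.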

