In this blog I try to cover the differences between RDD, DataFrame (DF) and Dataset (DS). Many of you may be a little confused about the three, but don't worry: after this blog everything should be clear.
With the Spark 2.0 release, Spark officially provides three data abstractions: RDD, DataFrame and Dataset.
So let's start the discussion.
Resilient Distributed Datasets (RDDs): an RDD is a fault-tolerant collection of elements that can be operated on in parallel.
With an RDD, we can run operations on data in parallel across the different nodes of a cluster, which helps improve performance.
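To make "operated on in parallel" a bit more concrete, here is a minimal sketch in Scala. The object name `RddParallelSketch`, the helper `sumOfSquares`, and the `local[2]` master are assumptions chosen for the demo; the RDD is built from a small in-memory list with `parallelize` purely for illustration. On a real cluster the partitions would be spread across nodes rather than local threads.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object RddParallelSketch {
  // Hypothetical helper: squares each element, then sums the results.
  def sumOfSquares(sc: SparkContext, numbers: Seq[Int]): Int = {
    // Split the collection into 2 partitions so the map can run in parallel.
    val rdd = sc.parallelize(numbers, numSlices = 2)
    // map runs per partition; reduce combines the partial results.
    // Note: reduce throws on an empty RDD, so numbers must be non-empty.
    rdd.map(n => n * n).reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local[2] runs Spark in-process with two worker threads.
    val spark = SparkSession.builder()
      .appName("rdd-parallel-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      println(sumOfSquares(spark.sparkContext, Seq(1, 2, 3, 4)))
    } finally {
      spark.stop()
    }
  }
}
```

For the input `Seq(1, 2, 3, 4)` the squares are 1, 4, 9 and 16, so the program prints 30.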
How can we create an RDD?
The SparkContext (sc) is used to create RDDs in Spark. It can create an RDD from:
- an external storage system such as HDFS, HBase, or any data source offering a Hadoop InputFormat.
- parallelizing an…
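
As a rough illustration of the external-storage option in the list above, the sketch below reads a text file into an RDD with `textFile`. A temporary local file stands in for real storage here, which is an assumption made so the example can run anywhere; the same call also accepts other URIs, such as an `hdfs://` path. The names `RddFromFileSketch` and `countLines` are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object RddFromFileSketch {
  // Hypothetical helper: loads a text file as an RDD (one element per line)
  // and counts its elements.
  def countLines(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  def main(args: Array[String]): Unit = {
    // Assumption: local[2] runs Spark in-process with two worker threads.
    val spark = SparkSession.builder()
      .appName("rdd-from-file-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      // Stand-in for external storage: a small temporary local file.
      val tmp = Files.createTempFile("rdd-sketch", ".txt")
      Files.write(tmp, "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8))
      println(countLines(spark.sparkContext, tmp.toUri.toString))
    } finally {
      spark.stop()
    }
  }
}
```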