Explain the concept of RDD (Resilient Distributed Dataset). Also, state how you can create RDDs in Apache Spark.
Answer / Sadhana Dubey
RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects that provides fault-tolerant parallel processing for large datasets in Apache Spark. It serves as the fundamental data structure for performing computations in Spark. RDDs can be created from various sources such as local files, HDFS files, or even other RDDs using Spark's API (Application Programming Interface). The common ways to create RDDs are the SparkContext methods textFile(path), wholeTextFiles(path), and parallelize(collection), which are available in Scala, Java (via JavaSparkContext), and Python alike, and by applying transformations such as map() or filter() to an existing RDD.
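A minimal Scala sketch of these creation paths, assuming a SparkSession running in local mode; the input path hdfs:///data/input.txt is hypothetical and only illustrates reading from HDFS:

```scala
import org.apache.spark.sql.SparkSession

object RddCreationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddCreationExample")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext // RDD creation methods live on SparkContext

    // 1. Parallelize an in-memory collection into an RDD
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Read a text file, one RDD element per line (hypothetical path)
    val lines = sc.textFile("hdfs:///data/input.txt")

    // 3. Derive a new RDD from an existing one via a transformation
    val squares = numbers.map(n => n * n)

    println(squares.collect().mkString(", "))
    spark.stop()
  }
}
```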
Can you use Spark to access and analyse data stored in Cassandra databases?
How does a Spark RDD work?
How are tasks created in Spark?
How is RDD in Apache Spark different from Distributed Storage Management?
Define "PageRank".
What is lazy evaluation and how is it useful?
Describe the distinct(), union(), intersection() and subtract() transformations in Apache Spark RDD.
Can we run Apache Spark without Hadoop?
Explain the distinct(), union(), intersection() and subtract() transformations in Spark.
How do I start a Spark cluster?
How do you integrate Spark and Hive?
What is the need for Spark DAG?