If certain data is reused again and again across different transformations, what can be done to improve performance?
Answer / Mahima Singh
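"Caching the DataFrame or Dataset with cache() can significantly improve performance, because Spark keeps the computed data and reuses it across multiple actions instead of recomputing it from the source each time. Caching is lazy: the data is stored only when the first action runs on it. For finer control, use the persist() method with an explicit storage level such as MEMORY_ONLY, MEMORY_ONLY_SER (available in the Scala and Java APIs), or MEMORY_AND_DISK, chosen according to the available memory and the cost of recomputing the data. Call unpersist() once the data is no longer needed so the storage is released."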
What is Spark parallelize?
Define the term ‘sparse vector.’
How do you process big data with Spark?
What is a map-side join?
Why do we need RDDs in Spark?
What is the RDD lineage graph? How does it enable fault tolerance in Spark?
How does YARN work with Spark?
What is a PipelinedRDD?
Briefly explain what an action is in Apache Spark. How is the final result generated using an action?
What is the distributed cache in Spark?
Explain Apache Spark Streaming. How is the processing of streaming data achieved in Apache Spark?
Is Spark better than Hadoop?