How can you achieve high availability in Apache Spark?
Define a worker node.
Name a few companies that use Apache Spark in production.
What is the difference between persist() and cache()?
Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
What does the Spark Engine do?
How does Spark use Akka?
How does Spark handle monitoring and logging in Standalone mode?
What is Hadoop serialization?
Explain a simple Map/Reduce problem.
Given a list of followers in the format: (123, 345), (234, 678), (345, 123), … where column one is the ID of the follower and column two is the ID of the followee, find all mutual following pairs (the pair 123, 345 in the example above). How would you use Map/Reduce to solve the problem when the list does not fit in memory?
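One possible approach to the mutual-followers question, sketched in plain Python rather than on a real Map/Reduce framework: the mapper keys every edge by its unordered pair of IDs, and the reducer keeps the pairs for which both directions were seen. The function names (`map_phase`, `shuffle`, `reduce_phase`, `mutual_pairs`) are hypothetical, and the in-memory `shuffle` only stands in for the grouping step that a cluster framework would perform on disk across machines.

```python
from collections import defaultdict


def map_phase(edges):
    """Emit (unordered pair, directed edge) for each follower -> followee edge."""
    for follower, followee in edges:
        key = (min(follower, followee), max(follower, followee))
        yield key, (follower, followee)


def shuffle(mapped):
    """Group values by key. A real framework does this step across the cluster."""
    groups = defaultdict(set)
    for key, value in mapped:
        groups[key].add(value)  # a set, so duplicate edges count only once
    return groups


def reduce_phase(groups):
    """Keep the pairs for which both directions of the edge were seen."""
    for key, directions in groups.items():
        if len(directions) == 2:
            yield key


def mutual_pairs(edges):
    return sorted(reduce_phase(shuffle(map_phase(edges))))


# Sample rows from the question.
edges = [(123, 345), (234, 678), (345, 123)]
print(mutual_pairs(edges))  # [(123, 345)]
```

Because each reducer only ever sees the values for one key (at most two distinct directions), the work per key stays small no matter how large the full list is, which is the property the question is asking about.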
How would you use Map/Reduce to split a very large graph into smaller pieces and parallelize the computation of edges when the data changes rapidly?
Write a Hive UDF that returns a sentiment score. For example, if good = 1, bad = -1, and average = 0, and a restaurant review states "Good food, bad service," your score might be 1 - 1 = 0.
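A minimal sketch of the scoring logic only, written in Python for illustration. It is not a Hive UDF as such (those are typically written as Java classes), and the names `SCORES` and `sentiment_score` are hypothetical. It assumes the three word weights given in the question and treats every other word as neutral.

```python
import re

# Word weights taken from the question; any other word scores 0 (assumption).
SCORES = {"good": 1, "bad": -1, "average": 0}


def sentiment_score(review):
    """Sum the weights of the known sentiment words in a review."""
    if review is None:
        return None  # pass a missing review through unchanged
    words = re.findall(r"[a-z]+", review.lower())
    return sum(SCORES.get(word, 0) for word in words)


print(sentiment_score("Good food, bad service"))  # 1 - 1 = 0
```

Lowercasing and splitting on non-letters keeps punctuation such as the comma after "food" from hiding a match. A production version would also need to decide how to handle negation ("not good"), which this sketch ignores.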
Explain how RDDs work with Scala in Spark.
Define HRegionServer in HBase.