In this article, we present common Spark interview questions along with sample answers.
Spark Interview Questions
- Why do you choose Spark?
Nowadays, most teams prefer to go with Spark. Before Spark, there were different data processing engines, and one of the most widely used was MapReduce. MapReduce is still relevant in some cases, but let's try to understand why Spark performs much better than MapReduce.
- What is the difference between MapReduce and Spark?
MapReduce is a disk-based data processing framework, while Spark is an in-memory data processing framework. To classify things further: Hadoop bundles storage with the MapReduce processing engine, while Spark is a separate processing framework. Hadoop itself is a distributed storage system that doesn't process your data but stores it (in HDFS); MapReduce does the processing.
But the problem with MapReduce is that when you launch an iterative job, it reads your data for every iteration from the hard disk (the HDFS file system), and after every step it writes the intermediate data back to the hard disk. In between these reads and writes it consumes network input-output, and it performs serialization and deserialization operations to convert your data. This way it uses a lot of i/o bandwidth and burns CPU on serialization and deserialization. That is what it means for MapReduce to be a disk-based computation engine.
Spark, by contrast, is an in-memory computational framework. While you are processing data in Spark, it keeps the intermediate data that feeds the next step in memory instead of keeping it on a hard disk. So it reads your data from the hard disk once, performs the operations of the first step, and writes the result into memory; it does not write the intermediate data to the hard disk, unlike MapReduce. The next step then takes its input from memory, and every subsequent step does the same as the job moves forward.
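The contrast can be sketched with a toy pipeline in plain Python (not Spark itself; `fake_disk`, `stage`, and the i/o counters are invented for illustration): the disk-based engine round-trips every intermediate result through storage, while the in-memory engine hands each step's output straight to the next.

```python
# Toy model (not Spark): contrast disk round-trips with in-memory chaining.
fake_disk = {}  # stands in for HDFS

def stage(data):
    return [x * 2 for x in data]

def disk_based(data, iterations):
    # MapReduce-style: write after every step, read before the next.
    fake_disk["step0"] = list(data)
    io_ops = 0
    for i in range(iterations):
        current = fake_disk[f"step{i}"]           # read from "disk"
        io_ops += 1
        fake_disk[f"step{i + 1}"] = stage(current)  # write back to "disk"
        io_ops += 1
    return fake_disk[f"step{iterations}"], io_ops

def in_memory(data, iterations):
    # Spark-style: one initial read, intermediates stay in memory.
    current = list(data)  # the single read from "disk"
    io_ops = 1
    for _ in range(iterations):
        current = stage(current)  # no disk round-trip between steps
    return current, io_ops

result_disk, disk_io = disk_based([1, 2, 3], 3)
result_mem, mem_io = in_memory([1, 2, 3], 3)
assert result_disk == result_mem == [8, 16, 24]  # same answer
assert mem_io < disk_io                          # far fewer i/o operations
```

The results are identical; the difference is only in how many times the intermediate data crosses the storage boundary.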
Another difference between MapReduce and Spark: to run SQL queries, to use machine learning libraries, or to do any streaming processing, you can't use MapReduce directly. To perform SQL-style operations, you have to use the Hive framework on top of MapReduce. To perform machine learning operations, you have to use the Mahout framework on top of MapReduce. To run streaming jobs with MapReduce, you are supposed to use the Storm framework. Each of these is an extra burden on top of MapReduce.
Spark, in the same context, has Spark SQL libraries built into it. Spark has streaming APIs, and Spark has GraphX. Spark itself can handle many more things than MapReduce, so Spark is the preferred choice over MapReduce. MapReduce is a batch processing framework, while Spark is also a real-time data processing framework. Spark provides integration with almost all NoSQL databases, such as MongoDB, Cassandra, InfluxDB, Elasticsearch, Neo4j, and HBase, as well as SQL databases such as MySQL, Oracle, DB2, etc.
- Why do we say that Spark is 100 times faster than the MapReduce execution framework?
Spark supports processing data in batch mode, and Spark also supports streaming data. Whenever you submit a job to Spark, the first point of contact for any program launched from your client system is the driver program. Inside the driver program there is a SparkContext, which is the first point of executing any program. This SparkContext interacts with your cluster manager, and as a cluster manager you could use Mesos, YARN, or the Spark standalone cluster manager.
Spark provides 3 kinds of cluster managers:
- Standalone
- Apache Mesos
- Hadoop YARN
The SparkContext will try to get resources from the cluster manager. After getting confirmation that the cluster manager has enough resources to execute this particular job, it launches an executor inside a worker node. An executor is nothing but CPU and memory (RAM), and as we know, Spark performs all its operations in memory. That is why Spark is an in-memory data processing framework.
Inside an executor, Spark executes tasks, in a distributed mode if you are using a Hadoop cluster. If you execute your job in standalone mode, it performs the same kind of operation: it creates the same kind of executor, just not in a distributed model.
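The submission flow described above can be sketched as a toy model in plain Python; `ClusterManager` and `request_executors` are invented names for illustration, not Spark's API.

```python
# Toy model of the submission flow (invented names, not Spark's API):
# the driver asks the cluster manager for resources, and an executor
# is launched only if the cluster has enough capacity.
class ClusterManager:
    def __init__(self, free_cores, free_memory_gb):
        self.free_cores = free_cores
        self.free_memory_gb = free_memory_gb

    def request_executors(self, cores, memory_gb):
        # Grant an executor only if the cluster has enough resources.
        if cores <= self.free_cores and memory_gb <= self.free_memory_gb:
            self.free_cores -= cores
            self.free_memory_gb -= memory_gb
            return {"cores": cores, "memory_gb": memory_gb}
        return None  # not enough resources: no executor launched

manager = ClusterManager(free_cores=8, free_memory_gb=16)
executor = manager.request_executors(cores=2, memory_gb=4)   # granted
denied = manager.request_executors(cores=100, memory_gb=4)   # refused
assert executor == {"cores": 2, "memory_gb": 4}
assert denied is None
```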
- What is the component that is needed to process our data into memory?
Spark's primary core abstraction is called an RDD, which is the abbreviation for Resilient Distributed Dataset. Resilient refers to Spark's ability to achieve fault tolerance in case of any job failure. To achieve this fault tolerance, Spark always creates a lineage graph. A lineage graph is nothing but meta-information about the process that Spark is going to execute. It is meta-information about operations, not about data.
Suppose we process a job in which we read our data from a hard disk and then perform a sequence of operations, such as a map, a reduce, and a write. For all these sequential operations, Spark internally creates a graph, called the lineage graph, which holds information about the processes.
So if we perform a read, a map, a reduce, and a write, the lineage graph will have a record of all of these operations, not of the data. Whenever there is a job failure, Spark re-executes the complete process from the beginning to the end. This is how it achieves fault tolerance, and it is how the lineage graph satisfies the resiliency property.
The word "distributed" in RDD has its own meaning: Spark can process your data in a distributed system as well as in a non-distributed system, and it gives you the capability to process even very large files on a Hadoop (HDFS) system. The third property is "dataset": you can process any dataset in Spark, whether it is unstructured, structured, or semi-structured. There is no restriction to a specific kind of dataset; you can process any data. RDD is the primary data structure of Spark.
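The lineage idea can be sketched in plain Python (a toy model, not Spark's internals; the `Lineage` class is invented for illustration): the graph records the source and the operations, so the whole process can be replayed from the beginning after a failure.

```python
# Toy lineage graph (not Spark's internals): record operations, replay on failure.
class Lineage:
    def __init__(self, source):
        self.source = source  # where to re-read the input from
        self.ops = []         # meta-information: the operations, not the data

    def map(self, fn):
        self.ops.append(("map", fn))
        return self

    def filter(self, fn):
        self.ops.append(("filter", fn))
        return self

    def compute(self):
        # Re-executes the whole recorded process from the beginning.
        data = list(self.source())
        for kind, fn in self.ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

lineage = (Lineage(source=lambda: [1, 2, 3, 4])
           .map(lambda x: x * 10)
           .filter(lambda x: x > 15))
first = lineage.compute()
recovered = lineage.compute()  # after a "failure", replay from the lineage
assert first == recovered == [20, 30, 40]
```

Note that nothing but the recipe is stored: recovery simply re-runs the recorded operations against the original source.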
- What is spark context?
Whenever a programmer creates an RDD, the SparkContext is the first point of contact with the Spark cluster; a new SparkContext is created for each application. The SparkContext is the entity that tells Spark how to access a cluster. It always communicates with the cluster manager to get your resources: it submits your job and executes it using the resources obtained from the cluster manager.
Inside this SparkContext, you can configure properties with the help of a Spark configuration (SparkConf). The Spark configuration is a key factor in creating an application. Whenever you create an application, properties such as the master URL (for example, local mode or a cluster master) and the application name are configured inside the Spark configuration.
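These properties can also be set in a `spark-defaults.conf` file; a minimal sketch follows (the property names are real Spark settings, while the values are only examples):

```
# spark-defaults.conf sketch: master URL, application name, executor memory
spark.master           local[2]
spark.app.name         InterviewDemo
spark.executor.memory  2g
```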
- What are partitions?
Whenever you execute a job, Spark creates an RDD for your data. But inside the RDD, the data is available in bits and pieces, and those bits and pieces are called partitions. A partition is a logical division of our data, whereas an RDD is the collection of those partitions. This idea is derived from MapReduce: inside MapReduce, a concept called the input split creates a logical division of your data from blocks. In the same way, inside Spark, you create a logical division of data with the help of partitions. So an RDD is a wrapper over your partitions.
An RDD is a collection of partitions. A partition can hold a small chunk of data; it does not have to be a full block of data such as 128MB or 64MB. Spark can create a partition, and an RDD on top of it, even with 1KB of data, and it supports scalability, speeding up processing with the help of in-memory computation. Input data, intermediate data, output data: whatever you process inside Spark is a partitioned RDD.
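The relationship can be sketched in plain Python (a toy model, not Spark's representation): an "RDD" here is just a collection of partitions, each holding a chunk of the data.

```python
# Toy model: an RDD as a collection of partitions (chunks of the data).
def make_partitions(data, num_partitions):
    size = -(-len(data) // num_partitions)  # ceiling division: items per chunk
    return [data[i:i + size] for i in range(0, len(data), size)]

rdd = make_partitions(list(range(10)), num_partitions=3)
assert rdd == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
assert sum(rdd, []) == list(range(10))  # the partitions together hold all the data
```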
- How does spark partition the data?
Spark uses the MapReduce input-format API to partition data. MapReduce performs an input split, which is a logical division of data out of your blocks. In the same way, through Spark's input format, we can create any number of partitions, just as we could create any number of input splits in MapReduce.
By default, the HDFS block size is the partition size in MapReduce, so by default, 128MB of data equals one partition (one logical division). But it is possible to change the partition size, just like the input split size we could change inside MapReduce.
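The default arithmetic can be sketched in plain Python (the 1GB file size is a made-up example):

```python
# Default: one partition per 128MB HDFS block.
block_size_mb = 128
file_size_mb = 1024  # a hypothetical 1GB input file

default_partitions = -(-file_size_mb // block_size_mb)  # ceiling division
assert default_partitions == 8

# Like MapReduce's input split, the logical division can be made finer:
requested_partitions = 32
partition_size_mb = file_size_mb / requested_partitions
assert partition_size_mb == 32.0  # smaller logical divisions, more parallelism
```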
- How does spark store data?
Spark is a processing engine. Because it is a processing engine, it has no storage of its own; it just has computation power and memory, which is RAM (primary memory). It can retrieve data from HDFS, from S3, and from other data sources.
- Is it mandatory to start Hadoop to run the Spark application?
The answer is NO. Spark supports a standalone mode as well as a cluster mode. You can configure and execute your Spark program on a Windows machine, on an individual Linux machine, or on a Mac machine, as well as on top of a cluster where you use a bunch of machines. As for storage, whichever machine you use, there is no separate storage in Spark. If you are using a standalone machine, it uses the local file system to store your data; you can load the data from the local system and process it, or you can process it from HDFS. But Hadoop or HDFS is not mandatory to run a Spark application.
- What are the basic components of the spark ecosystems?
Spark has a Spark Core in which you can process your data in batch mode. And apart from that:
- Spark SQL for SQL developers – If you want to process data as you would with Hive, which we were doing on MapReduce, you can process the same kind of data with much faster speed and less disk consumption using Spark SQL. Spark SQL supports interaction with Hive directly.
- Spark Streaming for streaming data – The second component is Spark Streaming. Whenever you need to process data in real time or near real time (suppose you get data from a sensor, a satellite, or web server logs), you would go with the Spark Streaming API. Spark can process your data in real time with the help of the discretized stream (DStream).
- MLlib for machine learning algorithms – Suppose you are trying to implement a machine learning module or build models using Spark. Spark has APIs for that as well. Spark supports almost all the common machine learning algorithms, such as clustering, classification, and regression.
- GraphX for graph computation – Suppose you are trying to build a system around a network, as LinkedIn and Facebook do. They create connectivity between one person and another by storing information in nodes and then mapping the distance between two persons. If you are trying to achieve the same kind of operation, you can do so with the help of the Spark GraphX libraries.
- What are SparkCore functionalities?
SparkCore is the base engine of the Apache Spark framework. It does memory management, it does fault tolerance, it does the scheduling and monitoring of your jobs, and it interacts with your storage system, which is a primary functionality of Spark. In terms of memory management, Spark distributes memory to the processes you execute inside Spark, stores some of the memory as cache memory, and preserves some memory for other processes as well. Whether your job succeeds or fails, it always communicates with your storage system, reading and writing data to and from memory. This is the core Spark functionality.
- How SparkSQL is different from SQL and HQL?
SparkSQL is a special component of the Spark core engine that supports SQL and HQL (Hive Query Language) without changing their syntax. It is possible to join a SQL table and an HQL table inside Spark. Internally, it optimizes your query with the help of a query optimizer, and it will also try to optimize the amount of data processed if you use the Spark query processor. That is the advantage of using Spark SQL on top of HQL.
- When do we use Spark streaming?
Spark Streaming is the real-time processing streaming API. Spark Streaming gathers data from different sources, such as web server log files, social media data, stock market data, or Hadoop-ecosystem tools like Flume and Kafka. Whenever a system generates data so frequently that you are supposed to process it within a second, or within a fraction of a second, you should go with Spark Streaming. Whenever you are getting data from a sensor or a satellite, Spark Streaming is a better choice than any other real-time processing framework.
- How spark streaming API work?
A programmer sets a time interval in the Spark configuration; that interval is the batch duration for a particular stream. In a single iteration, Spark collects the data that arrives within that window, and the data it receives within this time is separated into batches. The input stream is called a DStream (discretized stream). It goes into the Spark Streaming framework, and the framework breaks your data up into bits and pieces to form small batches. So a stream is nothing but a collection of small batches of data. The Spark Streaming API passes these batches of data to the core engine, which processes your data, and the core engine generates the final result in the form of streaming batches again: the output is again a batch of streaming data. It can accept both streaming data and batch data for processing.
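Micro-batching can be sketched in plain Python (a toy model; the `batch_interval` value and the event tuples are invented): events arriving over time are grouped into fixed-interval batches, and each small batch is then handed to the core engine.

```python
# Toy DStream: group (timestamp, value) events into fixed-interval micro-batches.
def micro_batches(events, batch_interval):
    batches = {}
    for timestamp, value in events:
        bucket = timestamp // batch_interval  # which batch window this event is in
        batches.setdefault(bucket, []).append(value)
    return [batches[k] for k in sorted(batches)]

# Events at various times, batched into 2-second windows.
events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
batches = micro_batches(events, batch_interval=2)
assert batches == [["a", "b"], ["c", "d"], ["e"]]

# Each small batch then goes to the core engine, e.g. a count per batch:
counts = [len(batch) for batch in batches]
assert counts == [2, 2, 1]
```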
- What is Spark MLlib?
In the case of MapReduce, we used Mahout as the machine learning library. But Mahout is not part of the Hadoop ecosystem itself; it is an external framework that we configured on top of MapReduce so that we could process our data with the MapReduce execution engine. Spark, on the other hand, itself provides very rich machine learning libraries, just as Mahout does. Inside Spark, you are not supposed to configure some external framework; you can directly execute your machine learning models in Spark. It provides all sorts of machine learning algorithms for classification, clustering, and regression problems: linear regression, multiple regression, polynomial regression, support vector machines, Naive Bayes, K-means, KNN, and other very rich libraries. So you don't need to look for any external libraries; you can run all of your machine learning jobs in Spark itself. That is the reason most data scientists use the MLlib libraries from Spark.
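The kind of model behind MLlib's linear regression can be sketched in plain Python, reduced to a single feature (this is the textbook least-squares formula, not MLlib's API; the sample points are made up):

```python
# Plain-Python sketch of simple linear regression: fit y = slope * x + intercept
# by ordinary least squares, the idea behind MLlib's LinearRegression.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])  # points on y = 2x
assert abs(slope - 2.0) < 1e-9
assert abs(intercept) < 1e-9
```

MLlib's value is that it runs this kind of fitting over partitioned data across a cluster; the math being fitted is the same.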
- What is a GraphX module?
GraphX is an advanced module inside Spark which provides lots of features and flexibility. GraphX is an API for manipulating graphs and collections. It unifies ETL (extract, transform, load) with other analyses, and it provides iterative graph computation. It is one of the fastest graph systems, providing fault tolerance and ease of use without requiring any special skills. Suppose you are trying to solve a networking problem or a routing problem: GraphX has very good features for solving networking and routing problems. Nowadays, everybody uses social media, and social media platforms use graph algorithms to find a route from one person to another, to make relationships between one connection and another, and to store that data; for this, people use GraphX. Movie websites such as Netflix also use graph algorithms extensively in their back end. So there are extensive use cases for GraphX, and you can achieve all these operations in easy steps if you explore the GraphX algorithms.
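The routing idea described above can be sketched with a plain-Python breadth-first search (a toy model with made-up names, not GraphX's API): find the shortest connection path between two people in a network.

```python
from collections import deque

# Plain-Python BFS sketch of the routing idea (not GraphX's API):
# find the shortest connection path between two people in a network.
def shortest_path(graph, start, goal):
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection between the two people

# A tiny social network (made-up names).
network = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}
assert shortest_path(network, "alice", "erin") == ["alice", "bob", "dave", "erin"]
```

GraphX applies this kind of traversal to graphs that are partitioned across a cluster, which is what makes it practical at social-network scale.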
- What is a Spark file system API?
The file system API is Spark's way of accessing different storage devices, such as HDFS, S3, or local file systems. Spark uses the FS API to read data from different storage engines. You can read data from NoSQL databases or SQL databases as well, and you can read data from sensors and satellites inside Spark.
- Why partitions are immutable?
Whenever you perform a transformation operation such as a map, a filter, or a read, it generates new partitions; in other words, it generates a new RDD, and an RDD is a bunch of partitions. Partitions use the HDFS API in the same way an input split does, so partitions are immutable: you can't change a partition, but you can generate new partitions out of the previous ones. The same holds for RDDs. RDDs are immutable; you can't manipulate an RDD, but you can generate a new RDD out of a previous RDD. Partitions are also aware of data locality, so Spark always tries to create a partition on the system where the data is available, rather than moving the data to another system.
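Immutability can be sketched in plain Python (a toy model, not Spark's RDD type): each transformation produces a new collection and leaves the original untouched.

```python
# Sketch of immutability: a transformation returns a new collection;
# the original is never modified.
original = [1, 2, 3, 4]
transformed = [x * 2 for x in original]       # like rdd.map(lambda x: x * 2)
filtered = [x for x in transformed if x > 4]  # another new "RDD" from the last one

assert original == [1, 2, 3, 4]    # the source is unchanged
assert transformed == [2, 4, 6, 8]
assert filtered == [6, 8]
```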