Saturday, 28 March 2015

Apache Spark components


Today we will discuss the Apache Spark components. It's really important to understand the components below:

1. SparkContext
2. Driver program
3. Executor
4. Task


What is Driver program?

The program that creates the SparkContext is the driver program.


What is SparkContext?

SparkContext is the connection to the cluster, and it is used to create RDDs and a few other things which we can discuss later.
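
For example, here is a minimal sketch of a driver program that creates a SparkContext (the application name and the local[2] master URL are just placeholder values for illustration; on a real cluster the master URL would point to the cluster):

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: the driver program is simply the code that creates the SparkContext.
object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark on the local machine with 2 cores; on a real cluster
    // this would be the cluster's master URL (for example spark://host:7077).
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Use the SparkContext to create an RDD and run a tiny computation.
    val rdd = sc.parallelize(1 to 100)
    println(rdd.reduce(_ + _))

    sc.stop()
  }
}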


What is Executor?

An executor is a process that runs on each slave node.

What is Task?

A task is the actual computation that runs inside the executors. Each task is a thread inside an executor.



When we run any application in Apache Spark, the computation is spread across multiple nodes (computers) to process the data, and finally the results are collected. Each node runs a process called an "executor", which is responsible for executing tasks. Each task is a thread inside an executor, and it is the one that actually does the computation.

An executor can run multiple tasks if it has more than one core. Each task is assigned to one core to make use of parallel computing.
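
As a rough sketch of this (the partition count of 8 and the 4 local cores below are just assumptions for illustration), the number of partitions of an RDD decides how many tasks a stage is split into, while the number of cores decides how many of those tasks run at the same time:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: local[4] gives Spark 4 cores, so at most 4 tasks run in parallel;
// the RDD below has 8 partitions, so this stage is split into 8 tasks overall.
object TasksAndCores {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TasksAndCores").setMaster("local[4]"))

    val data = sc.parallelize(1 to 1000000, 8)   // 8 partitions => 8 tasks per stage
    println(s"partitions = ${data.partitions.length}")
    println(s"even numbers = ${data.filter(_ % 2 == 0).count()}")

    sc.stop()
  }
}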

Apache Spark


Apache Spark is an open-source cluster computing framework, completely written in Scala. It internally uses the Akka framework for distributing tasks to multiple nodes. Here a node is nothing but a system consisting of RAM, hard disk and processor.

The basic building block of Apache Spark is the RDD (Resilient Distributed Dataset). It provides fault tolerance: an RDD carries metadata about how it was computed, which is used to re-compute failed tasks. I will talk about fault tolerance a little later when we start coding, because that will be the right context and time. :-)
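
As a small taste of this (just a sketch; we will cover RDDs properly later), each transformation remembers how its result can be re-computed from its parent RDD, and toDebugString prints that recorded lineage:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of RDD lineage: each transformation records how its result can be
// re-computed from its parent, which is what Spark uses to recover lost partitions.
object RddLineage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddLineage").setMaster("local[2]"))

    val numbers = sc.parallelize(1 to 10)    // parent RDD
    val doubled = numbers.map(_ * 2)         // remembers "map over numbers"
    val bigOnes = doubled.filter(_ > 10)     // remembers "filter over doubled"

    println(bigOnes.toDebugString)           // prints the recorded lineage
    println(bigOnes.collect().mkString(", "))

    sc.stop()
  }
}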

Download Apache Spark and follow the steps to install it.

Introduction

Cluster Computing


A cluster of computers used to compute tasks in a distributed manner is called cluster computing. Generally, building a single computer with a high-end configuration is costly and scaling its hardware beyond a point is not possible, so the most cost-effective way to get high performance is cluster computing.

Cluster computing is linking/grouping commodity computers together to make one big, powerful system to process large data. Just linking multiple systems is not going to provide the power to process large data; we also need a framework which can distribute the tasks to multiple systems for processing and finally collect the results.


There are many cluster computing frameworks available, and they follow two types of parallelism:

1.  Task parallelism
2.  Data parallelism


1. Task parallelism

Slicing the program into multiple parts and distributing them across systems is called task parallelism.

2. Data parallelism

In task parallelism, the code distributed to each machine is different; in data parallelism, the code is the same on every machine but the data is sliced into multiple parts.
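
As a toy sketch of data parallelism in plain Scala (no Spark yet, and the choice of 4 slices is just an assumption for illustration), the same function is applied to different slices of the data concurrently:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Toy sketch of data parallelism: the same code (summing squares) runs over
// different slices of the data at the same time, and the partial results are combined.
object DataParallelismDemo {
  def main(args: Array[String]): Unit = {
    val data = (1 to 1000000).toVector

    // Slice the data into 4 parts; every slice is processed by the same function.
    val slices = data.grouped(data.size / 4).toList
    val partials = slices.map(slice => Future(slice.map(x => x.toLong * x).sum))

    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(s"sum of squares = $total")
  }
}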

We are going to explore data parallelism with Apache Spark..!