There is always something new to learn in the field of big data. This time I took the Spark Fundamentals I course at Big Data University. Big Data University offers free courses across several learning tracks.
First, in order to learn Spark, you need to install it on your machine or VM. I had previously installed Apache Hadoop 2.6.0 on my VM and wanted to install Spark on top of it. Thanks to YARN, nothing needs to be pre-deployed to the nodes, and as it turned out, it was very easy to install and run Spark on YARN. You can also install Spark in standalone mode.
If you haven’t installed Hadoop yet, you can follow the link below to install Apache Hadoop on your VM. I used that blog post to install Hadoop 2.6.0 on my VM, and the process is quite simple.
What is Spark?
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
According to the Big Data University Spark Fundamentals course:
Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning fast speed and supports Java, Scala, and Python APIs for ease of development.
Spark combines SQL, streaming and complex analytics together seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3.
You can find more on Spark from:
Now, how to install Spark?
In order to install Apache Spark:
You can create a separate spark user, similar to the Hadoop user, or simply install as the root user.
tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz
mv spark-1.6.0-bin-hadoop2.6 spark
You can also set the environment variables.
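The exact variables depend on where Spark was unpacked. A minimal sketch, assuming the `spark` directory created by the `mv` step sits in your home directory (the `$HOME/spark` path is an assumption, not something taken from the original setup), could be added to `~/.bashrc`:

```shell
# Assumed location: the renamed "spark" directory lives in the home directory.
export SPARK_HOME="$HOME/spark"

# Put the Spark launcher scripts (spark-shell, spark-submit, ...) on the PATH.
export PATH="$PATH:$SPARK_HOME/bin"
```

After reloading the profile with `source ~/.bashrc`, `spark-shell` should be reachable from any directory.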
Spark supports the Java, Scala, and Python programming languages, so I set up Scala in order to play around with Spark.
How to install Scala?
You can download Scala version 2.11.7 from http://www.scala-lang.org/download/ to a local disk on your VM. Then use the commands below:
tar xvf scala-2.11.7.tar
mv scala-2.11.7 scala
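To confirm that the Scala installation is usable, something like the following can help. It is only a sketch and assumes the renamed `scala` directory sits in your home directory (`$HOME/scala` is a hypothetical path; adjust it to wherever you unpacked the archive):

```shell
# Assumed location: the renamed "scala" directory lives in the home directory.
export SCALA_HOME="$HOME/scala"
export PATH="$PATH:$SCALA_HOME/bin"

# Print the installed version if the launcher is reachable.
if command -v scala >/dev/null 2>&1; then
  scala -version
else
  echo "scala not found on PATH; check SCALA_HOME" >&2
fi
```

Note that the prebuilt Spark 1.6.0 packages are built against Scala 2.10 and ship their own Scala libraries, so `spark-shell` does not rely on this standalone installation.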
Now let’s check the setup with the Spark shell.
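Since this setup runs Spark on YARN, the shell has to know where the Hadoop client configuration lives. A sketch, assuming Hadoop is installed under `/usr/local/hadoop` (a hypothetical path) and Spark under `$HOME/spark` unless `SPARK_HOME` says otherwise; `start_spark_shell` is just an illustrative helper name:

```shell
# Hypothetical Hadoop location; Spark reads the YARN and HDFS client
# configuration files from this directory.
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop}"

# Helper that launches the interactive shell against YARN in client mode.
# Spark 1.6 accepts "yarn-client" as a master value.
start_spark_shell() {
  "${SPARK_HOME:-$HOME/spark}/bin/spark-shell" --master yarn-client "$@"
}
```

Calling `start_spark_shell` should open a Scala prompt with a ready-made SparkContext available as `sc`. To try Spark without YARN, replace `yarn-client` with `local[2]`.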
You can also check the Spark web UI:
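While a Spark shell is running, its application UI listens on port 4040 by default. A quick reachability check, assuming the shell runs on the same machine (the `localhost` host name and the use of `curl` are assumptions):

```shell
# Default address of the application UI for a driver on this machine.
SPARK_UI_URL="http://localhost:4040"

# Ask for the page quietly and report whether anything answered.
if curl -s -o /dev/null --max-time 5 "$SPARK_UI_URL"; then
  echo "Spark UI is reachable at $SPARK_UI_URL"
else
  echo "Nothing answered at $SPARK_UI_URL (is spark-shell running?)"
fi
```

If port 4040 is already taken, Spark tries the next ports (4041, 4042, and so on), so the URL may need adjusting.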
Great!!! You have successfully installed Apache Spark. Next, you can take the Big Data University course through the link above and learn more about Spark. The course covers Spark fundamentals and some of the related basics. After completing the course, you should be able to:
- Describe what Spark is all about and know why you would want to use Spark
- Use Resilient Distributed Datasets operations
- Use Scala, Java, or Python to create and run a Spark application
- Create applications using Spark SQL, MLlib, Spark Streaming, and GraphX
- Configure, monitor and tune Spark
In summary, it always feels great to learn new things and play around with them. In the big data world there is always something new to learn, and this time Spark is what I spent my free time on. I must say, the more you learn about these things, the more excitement and challenges they bring, and the deeper you want to dive.
After reading this article, I believe you are now able to install Spark 1.6.0 on top of Hadoop 2.6.0 and play around with it.