How to install spark 1.6.0 and play around with it.

There is always something new to learn in the field of BIG DATA. This time I took a course in Big Data University on Spark Fundamental I. You can go through the free courses in Big Data University in order to learn in different tracks.

http://bigdatauniversity.com/courses/spark-fundamentals/

Firstly, in order to learn spark we need to install it first in your machine or VM. I had previously install Apache Hadoop 2.6.0 on my VM and want to install spark on top of it. Thanks to YARN we don’t need to pre-deploy anything to nodes and as it turned out it was very easy to install and run spark on YARN. You can also install spark in standalone mode.

If you haven’t install Hadoop then You can go through below link in order to install Apache Hadoop on your VM, I had previously gone through below blog post in order to install Hadoop 2.6.0 on my VM. Which is quite simple.

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

What is Spark?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

– According to Big Data University – Spark foundation Course:

Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning fast speed and supports Java, Scala, and Python APIs for ease of development.

Spark combines SQL, streaming and complex analytics together seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3

You can find more on Spark from:

http://spark.apache.org/

Now, how to install spark?

In order to install Apache Spark:

You can create separate user for spark as spark user as similar to Hadoop user or just install as root user.

wget http://apache.mirrors.ionfish.org/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz

tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz

mv spark-1.6.0-bin-hadoop2.6 spark

Also you can set the environment variables

vi ~/.bashrc

export SPARK_HOME=/home/spark

Spark supports programming languages java, Scala, Python. So, I setup scala programming language in order to play around with spark.

How to Install scala ?

You can download the scala version 2.11.7 from the site: http://www.scala-lang.org/download/  into your local disk space on your VM. Then use below command:

tar xvf scala-2.11.7.tar

mv scala-2.11.7 scala

Now let’s check the setup of Spark – shell.

# $SPARK_HOME/bin/spark-shell

Spark

You can also check Spark GUI:

sparkgui

Great!!!  You have successfully install Apache Spark. After that, you can go through the course from Big Data University using above link at first and learn more about spark .Through this course you can learn Spark Fundamentals, some of basic related to spark. After completing the course, you should be able to:

  • Describe what Spark is all about know why you would want to use Spark
  • Use Resilient Distributed Datasets operations
  • Use Scala, Java, or Python to create and run a Spark application
  • Creating applications using Spark SQL, MLlib, Spark Streaming, and GraphX
  • Configure, monitor and tune Spark

In summary, it feels always great to learn new things and play around with it. In the Big DATA world, there is always something new to learn and this time Spark is what I got my free time to play around with. But I must say, the more you want to learn these things the more excitement and challenges it brings and want to deep dive more and more.

Through this article, I believe you are now able to install spark 1.6.0 on top of Hadoop 2.6.0. and play around with it.

References:

http://bigdatauniversity.com/courses/spark-fundamentals/

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

http://spark.apache.org/

https://sparkhub.databricks.com/

https://anilmaharjanonbi.wordpress.com/2015/07/01/big-data-apache-hadoop-hive-then-vertica/

https://www.linkedin.com/pulse/big-data-apache-hadoop-hive-vertica-anil-maharjan

Big Data, Apache Hadoop, Hive & then Vertica

It’s all about Big Data : Previously , in 2012 , I had some research related to Big Data and Apache Hadoop while I was working in Healthcare domain in my old company , I couldn’t spent much time in this sector though I wanted to . So, now back again, I started to dig down little bit more on this topic which is hot topic at that time and now becoming even hottest topic at this time. Big Data, every company now trying to incorporate and deal with these things.

Really speaking, the more we search related to Big Data the more it will be vast and big. Big Data as far as I know just a way of handling the massive amount of data both structural , semi- structural and unstructured data that is beyond the storage and processing capabilities of a single physical machine . Day by Day (V3) data velocity, data variety formats and data volume gets on increasing and from terabytes to petabytes and so on. If we really don’t think and invest enough time to handle these massive amount of growing data and the emerging Big Data Challenges then the future seems little dark outside.

Where Big Data comes there comes the term Hadoop. Hadoop: I think ‘Doug Cutting’ wouldn’t have thought that the tool which he’s naming after his kid’s soft toy will be such a huge elephant now that every top companies trying to ride on that big elephant. Every companies now trying to find out what’s this big elephant is about, what it is capable of, do we need it or how to incorporate it in our existing frameworks, these are some of the things which companies are thinking and investing there valuable time these days.

Hadoop is by no means an out-of-the-box solution. In order to build a truly information- driven enterprise, where decisions are based on data and not guess works, the companies would require a data management solution that not only offers robust data governance, but also is easily manageable and seamlessly integrates with existing enterprise infrastructure.

Hadoop as describer formally from Apache itself:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

It has mainly tow core components:

  1. Data processing framework (MapReduce): MapReduce is a programming model and an associated implementation for processing and generating large data sets with a paralleldistributed algorithm on a cluster.
  2. Distributed file system for data storage. (HDFS): HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

You can find more on these through different sites online.

I have gone through step by step installing Hadoop 2.7.0 in single node cluster on RedHat Linux VMware Machine, if you want to play along with Hadoop by trying installation and configuration then you can go through below link which have all details:

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/

Then after that I try to install HIVE: Hive, another Apache application that helps convert query language into MapReduce jobs, for instance.

http://tecadmin.net/install-apache-hive-on-centos-rhel/

Now, I am trying to connect Vertica Cluster node and Hadoop Cluster node to talk to each other, since I can definitely process the different source unstructured data directly and easily into Hadoop nodes and then for quick ad-hoc analytics & quick decision making, the same data we process can load into Vertica and do further analysis in near real time. I am trying to go through this document for now.

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

For installing Vertica in a single node you can go through the blog that I had previously written:

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

So far as Hadoop distribution is concerned, the three companies that really stand out in the competitions are: Cloudera, MapR and Hortonworks.

You can find more on these by going through their sites. One blog I found great to read related to these Hadoop distribution vendor companies by going through below link:

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

The more we dig down into these topics, the more vast the big data world it became. One who is trying to learn and read related to big data thing should surely need to dig down more related to some other topics too for sure, such as:

HBase, YARN, Cassandra, Pig , Zookeeper, Spark , Ambari ,sqoop  , flume and so on.

These are some of topics one should go through if you want to dig down more into the world of Big Data.

References:

https://hadoop.apache.org/

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

http://www.quora.com/What-is-Apache-Hadoop-1

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

http://www.smartdatacollective.com/mtariq/120791/hadoop-toolbox-when-use-what

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

Thanks,

Anil Maharjan

BI Engineer

The power of SIMBA into the world of BI on top of BIG DATA

After a long time gap, I would love to post this blog regarding the power of SIMBA  into the world of BI on top of BIG DATA .

It’s really interesting and exciting to see these kinds of new technology and tools on the way. Well ,I am talking about the SIMBA MDx Provider developed by the simba Technologies .

Yesterday, while I was researching regarding BIG DATA and reading some great blog post then I came to know about the SIMBA MDx Provider which seems to be a great and cool tool .

What actually the Simba’s MDx Provider :

Simba’s MDX Provider is an ODBO provider installed on the same machine as Excel. Simba also has a tool for building cube definitions, which we call schemas. These schemas are saved in XML. Simba’s schema maps MDX metadata constructs to Impala table structures. When an ODBO compliant tool such as Excel issues an MDX query, Simba’s MDX Provider maps the MDX query to HiveQL, sends the HiveQL to Impala, collects the results and returns them to the end user.

The most important technical concept is that there is no intermediate server or cube structure that caches data, all queries go direct from Simba’s MDX Provider to the Cloudera Impala server in real time.

You can find out more on this through the main blog post by samba itself through the link below:

http://blogs.simba.com/simba_technologies_ceo_co/2013/02/demo-microsoft-excel-pivottables-on-cloudera-impala-via-simba-mdx-provider.html

As the technology is in early development, it is not generally available for early testing.

Demo of Simba’s integration of doing MDX queries over Cloudera’s Impala for use with Microsoft Excel Pivot Tables .http://youtu.be/kZahPE9Puv0

Also there is a great PDF regarding the Simba Teradata case- study.

http://www.simba.com/docs/Teradata-Case-Study.pdf

Hope to see and use these kind of great tools in near future into our world of BI.

Reference :

http://blogs.simba.com/simba_technologies_ceo_co/2013/02/demo-microsoft-excel-pivottables-on-cloudera-impala-via-simba-mdx-provider.html

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

http://cwebbbi.wordpress.com/2013/02/25/mdx-on-cloudera-impala/

Thanks,

Anil Maharjan