Big Data, Apache Hadoop, Hive & then Vertica

It’s all about Big Data: Back in 2012, I did some research on Big Data and Apache Hadoop while working in the healthcare domain at my previous company. I couldn’t spend much time on it then, though I wanted to. So now I am back, digging a little deeper into a topic that was hot at that time and is even hotter today. Every company is now trying to incorporate Big Data and deal with it.

Honestly, the more we read about Big Data, the bigger and vaster it gets. As far as I know, Big Data is a way of handling massive amounts of structured, semi-structured and unstructured data that are beyond the storage and processing capabilities of a single physical machine. Day by day the three Vs (data velocity, data variety and data volume) keep increasing, from terabytes to petabytes and beyond. If we don’t invest enough time in handling this massive, growing amount of data and the emerging Big Data challenges, the future looks a little dark.

Where Big Data comes up, the term Hadoop follows. Hadoop: I think Doug Cutting would never have thought that the tool he named after his kid’s soft toy would become such a huge elephant that every top company is now trying to ride it. Every company is trying to find out what this big elephant is about, what it is capable of, whether they need it, and how to incorporate it into their existing frameworks. These are some of the things companies are thinking about and investing their valuable time in these days.

Hadoop is by no means an out-of-the-box solution. To build a truly information-driven enterprise, where decisions are based on data and not guesswork, companies need a data management solution that not only offers robust data governance but is also easily manageable and integrates seamlessly with existing enterprise infrastructure.

Hadoop as described formally by Apache itself:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

It has mainly two core components:

  1. Data processing framework (MapReduce): MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
  2. Distributed file system for data storage (HDFS): HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
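To make the MapReduce model a bit more concrete, here is a minimal sketch of the classic word-count example in plain Python. It runs on a single machine and does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own and only illustrate the three steps that a real cluster would spread across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key.

    On a real cluster the framework does this between map and reduce;
    here it is just an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big elephant", "hadoop rides the elephant"]
    print(word_count(sample))
```

The point of the model is that each call to `map_phase` and each call to `reduce_phase` is independent of the others, which is what lets a framework run them in parallel on different machines.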

You can find more on these through different sites online.

I have gone through installing Hadoop 2.7.0 step by step as a single-node cluster on a RedHat Linux VMware machine. If you want to play with Hadoop by trying the installation and configuration yourself, you can go through the link below, which has all the details:

After that I tried installing Hive, another Apache application that, for instance, helps convert queries written in its SQL-like query language into MapReduce jobs.
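As a rough illustration of what "converting a query into a MapReduce job" means, the sketch below mimics a simple GROUP BY query in plain Python. This is only a conceptual picture, not how Hive is implemented: the real query plans are more involved, and the table `visits`, its columns and the function names here are hypothetical ones I made up for the example.

```python
from collections import defaultdict

# Hypothetical rows of a table visits(patient_id, department).
VISITS = [
    (1, "cardiology"),
    (2, "oncology"),
    (3, "cardiology"),
]


def map_row(row):
    """Map: emit the GROUP BY column as the key and 1 as the value."""
    _patient_id, department = row
    return department, 1


def run_group_by_count(rows):
    """Mimic: SELECT department, COUNT(*) FROM visits GROUP BY department."""
    grouped = defaultdict(list)
    for key, value in map(map_row, rows):
        grouped[key].append(value)  # shuffle: collect values per key
    # Reduce: COUNT(*) becomes a sum over the 1s emitted for each key.
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    print(run_group_by_count(VISITS))
```

The GROUP BY column turns into the map output key and the aggregate function turns into the reduce step, so the person writing the query never has to write the map and reduce code by hand.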

Now I am trying to get a Vertica cluster node and a Hadoop cluster node to talk to each other. I can process unstructured data from different sources directly and easily on the Hadoop nodes, and then load the processed data into Vertica for quick ad-hoc analytics and quick decision making in near real time. I am going through this document for now.

For installing Vertica on a single node, you can go through the blog post I wrote previously:

As far as Hadoop distributions are concerned, the three companies that really stand out from the competition are Cloudera, MapR and Hortonworks.

You can find more on these on their sites. One blog post I found great to read about these Hadoop distribution vendors is at the link below:

The more we dig into these topics, the vaster the big data world becomes. Anyone trying to learn about big data should surely dig into some other topics too, such as:

HBase, YARN, Cassandra, Pig, ZooKeeper, Spark, Ambari, Sqoop, Flume and so on.

These are some of the topics to go through if you want to dig deeper into the world of Big Data.



Anil Maharjan

BI Engineer