Big Data, Apache Hadoop, Hive & then Vertica

It’s all about Big Data: Back in 2012, while I was working in the healthcare domain at my previous company, I did some research on Big Data and Apache Hadoop, but I couldn’t spend as much time on it as I wanted to. Now I am back and have started digging a little deeper into this topic, which was hot then and is even hotter now. Every company is now trying to incorporate and deal with Big Data.

Honestly, the more we read about Big Data, the vaster it gets. As far as I know, Big Data is just a way of handling massive amounts of structured, semi-structured, and unstructured data that are beyond the storage and processing capabilities of a single physical machine. Day by day, the three Vs (data velocity, data variety, and data volume) keep increasing, from terabytes to petabytes and beyond. If we don’t invest enough time in handling this massive, growing data and the emerging Big Data challenges, the future looks a little dark.

Where Big Data comes up, so does the term Hadoop. Hadoop: I think Doug Cutting wouldn’t have thought that the tool he named after his kid’s soft toy would become such a huge elephant that every top company is now trying to ride it. Companies are trying to find out what this big elephant is about, what it is capable of, whether they need it, and how to incorporate it into their existing frameworks. These are some of the things companies are investing their valuable time in these days.

Hadoop is by no means an out-of-the-box solution. To build a truly information-driven enterprise, where decisions are based on data and not guesswork, companies need a data management solution that not only offers robust data governance but is also easily manageable and integrates seamlessly with existing enterprise infrastructure.

Hadoop as described formally by Apache itself:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

It has mainly two core components:

  1. Data processing framework (MapReduce): MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
  2. Distributed file system for data storage (HDFS): HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
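
To make the MapReduce idea a bit more concrete, here is a minimal sketch of the classic word count in plain Python. It is only a single-machine imitation of the programming model, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are my own assumptions, and a real cluster would run the map and reduce steps in parallel across many nodes reading from HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file stored in HDFS.
    sample = ["big data big elephant", "Hadoop rides the big elephant"]
    print(word_count(sample))
```

The point of splitting the work this way is that each `map_phase` call only needs its own line and each `reduce_phase` call only needs the values for its own key, so in principle both steps can be spread over as many machines as you have.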

You can find more on these through different sites online.

I have gone through installing Hadoop 2.7.0 step by step as a single-node cluster on a Red Hat Linux VMware machine. If you want to play with Hadoop by trying the installation and configuration yourself, you can go through the link below, which has all the details:

After that, I tried installing Hive, another Apache application, which converts queries written in its SQL-like query language (HiveQL) into MapReduce jobs.
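
To get a feel for what "converting a query into a MapReduce job" means, here is a rough conceptual sketch in plain Python. Everything in it is an assumption for illustration: the `visits` table, its columns, and the query in the comment are made up, and this is not how Hive really plans or executes a query. It only shows how a simple GROUP BY count lines up with a map step and a reduce step.

```python
from collections import defaultdict

# Hypothetical query this sketch mimics (table and column names are invented):
#   SELECT department, COUNT(*) FROM visits GROUP BY department;

# Hypothetical rows standing in for the contents of a "visits" table.
visits = [
    {"patient": "p1", "department": "cardiology"},
    {"patient": "p2", "department": "radiology"},
    {"patient": "p3", "department": "cardiology"},
]


def map_row(row):
    """Map: the GROUP BY column becomes the key; COUNT(*) contributes 1 per row."""
    return row["department"], 1


def run_group_by_count(rows):
    """Apply the map step, group values by key, then sum each group (reduce)."""
    grouped = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        grouped[key].append(value)  # shuffle: collect values per key
    return {key: sum(values) for key, values in grouped.items()}  # reduce


if __name__ == "__main__":
    print(run_group_by_count(visits))
```

The nice part of a tool like Hive is that you only write the query in the comment; generating the equivalent of the map and reduce steps is the tool's job.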

Now I am trying to get a Vertica cluster node and a Hadoop cluster node to talk to each other. Unstructured data from different sources can be processed directly and easily on the Hadoop nodes, and the processed data can then be loaded into Vertica for quick ad hoc analytics and quick decision making in near real time. I am going through this document for now.

For installing Vertica on a single node, you can go through the blog post I wrote previously:

As far as Hadoop distributions are concerned, the three companies that really stand out from the competition are Cloudera, MapR, and Hortonworks.

You can find more on these on their sites. I also found one blog post about these Hadoop distribution vendors that is great to read, at the link below:

The more we dig into these topics, the vaster the big data world becomes. Anyone trying to learn about big data should also dig into some related topics, such as:

HBase, YARN, Cassandra, Pig, ZooKeeper, Spark, Ambari, Sqoop, Flume, and so on.

These are some of the topics to go through if you want to dig deeper into the world of Big Data.



Anil Maharjan

BI Engineer

The power of SIMBA into the world of BI on top of BIG DATA

After a long gap, I would love to post this blog about the power of SIMBA in the world of BI on top of BIG DATA.

It’s really interesting and exciting to see these kinds of new technologies and tools on the way. I am talking about the Simba MDX Provider developed by Simba Technologies.

Yesterday, while I was researching Big Data and reading some great blog posts, I came across the Simba MDX Provider, which seems to be a great and cool tool.

What Simba’s MDX Provider actually is:

Simba’s MDX Provider is an ODBO provider installed on the same machine as Excel. Simba also has a tool for building cube definitions, which we call schemas. These schemas are saved in XML. Simba’s schema maps MDX metadata constructs to Impala table structures. When an ODBO compliant tool such as Excel issues an MDX query, Simba’s MDX Provider maps the MDX query to HiveQL, sends the HiveQL to Impala, collects the results and returns them to the end user.

The most important technical concept is that there is no intermediate server or cube structure that caches data; all queries go directly from Simba’s MDX Provider to the Cloudera Impala server in real time.

You can find out more on this in the main blog post by Simba itself through the link below:

As the technology is in early development, it is not generally available for early testing.

Demo of Simba’s integration for running MDX queries over Cloudera’s Impala for use with Microsoft Excel PivotTables.

Also, there is a great PDF on the Simba Teradata case study.

I hope to see and use these kinds of great tools in our world of BI in the near future.



Anil Maharjan

Hadoop Summit 2012 - Big Data in Focus!!!

Hello all,

This is not a blog post describing anything particularly helpful; I just want to let all data and technology lovers know that Hadoop Summit 2012 is going to be held soon.

I am looking forward to hearing more about Hadoop Summit 2012, which is going to be held tomorrow, June 13th-14th, at the San Jose Convention Center. I wish to watch the live coverage of the summit and to see more tweets about it and its upcoming focus areas.

For more:

Also, I have only a little knowledge of ‘Microsoft Big Data and Apache Hadoop’ but want to learn more. So if anyone knows some good links or other resources on this technology, please share them on this blog. That would be really appreciated 🙂 ..!!!

Also, I would really like to know if any universities offer Master’s courses that include Big Data and Apache Hadoop.


Anil Maharjan