Happy and excited to be a Program committee member for DATA and BI Summit 2018.

Thanks to the DATA and BI Summit 2018 Program Committee team for giving me the opportunity to work as a volunteer Program Committee member for the upcoming DATA and BI Summit 2018, in the Visualize track, which focuses on Building and Designing Dashboards and Reports.

I selected the Visualize track (Focus: Building and Designing Dashboards and Reports). In my view, the top 10 topics that should be covered in this track at DATA & BI Summit are:

  1. How to get stories and insights from data using Power BI.
  2. Power BI Desktop and its features.
  3. R in Power BI.
  4. The different types of visualization (bar charts, trends, scatter plots, geospatial maps, drill up, drill down, sub-reports) that we can create with a simple drag-and-drop approach, plus custom visuals.
  5. Power BI service, dashboards, reports, data sources, APIs.
  6. Workspaces and PowerApps: how end users can benefit from different workspaces and PowerApps, and how these help in decision making.
  7. Roles and privileges in different workspaces, apps, reports and dashboards.
  8. Scheduling, data source refresh, publishing to the web, mail subscriptions, and subscribing different teams and end users to different reports and dashboards.
  9. Q&A, Cortana, the Power BI mobile app, APIs, gateways.
  10. Phone layout, Power Query, data modeling, Get Apps (Microsoft app store), real-time data analysis and analytics.

You can find out more about applying for the Data and BI Summit Program Committee at the link below. The deadline is 12th Jan 2018.

https://www.pbiusergroup.com/engage/volunteeropportunities/volunteer-opportunity-details?VolunteerOpportunityKey=3791f992-2f6d-4eb7-b558-3ad62022dc74

Also, if you want to speak at this conference, you can find out more about the call for proposals for DATA and BI Summit at the link below. The deadline is 12th Jan 2018.

https://www.pbiusergroup.com/engage/volunteeropportunities/volunteer-opportunity-details?VolunteerOpportunityKey=c2bb2d9e-ddab-4be1-9fcd-7204a4e6226d

So, what about DATA and BI Summit 2018?

Data & BI Summit is hosted by The Power BI User Group (PUG). With over 38,000 members, PUG has established itself as the go-to for Data Professionals and Business Analysts who are eager to collaborate and deepen their expertise in Microsoft business intelligence tools.

DATA and BI Summit will be held on 24-26 April 2018 at The Convention Centre Dublin, Dublin, Ireland.

What to expect at the Data & BI Summit:

  • Exceptional, quality content: Learn how to bring your company through the digital transformation by gaining new understandings of your data and deepening your knowledge of the Microsoft Business Intelligence tools. Products will include: Power BI, PowerApps, Flow, SQL Server, Excel, Azure, D365 and more!
  • Answers to your questions: Network with the Microsoft Power BI team, dig-in onsite to find immediate answers with industry experts, Data MVPs, and User Group Leaders while taking advantage of the opportunity to engage in interactive sessions, workshops and labs.
  • Network with your peers: Enjoy countless opportunities to create lasting relationships by connecting and networking with user group peers, partners and Microsoft team members.
  • Stretch your skill set: Advance your career by learning the latest updates and how they can help you and your business.

You can find out more about DATA and BI Summit 2018 at the link below.

https://www.databisummit.com/home

Thanks,

Anil Maharjan

Senior BI Engineer | Nepal Power BI User Group Leader

SQL Saturday #692 Conference and Nepal Power BI User Group Meetup a big success.

Last Saturday, 23rd Dec 2017, we successfully conducted the SQL Saturday #692 conference and Nepal Power BI User Group Meetup in Kathmandu, Nepal.

The SQLSaturday #692 conference and Nepal Power BI User Group Meetup is a free event for Microsoft Data Platform professionals and those wanting to learn about SQL Server, Business Intelligence, Power BI and Analytics.

I got the chance to organize and speak at this great event again. It was fun: we shared knowledge with other SQL and Power BI geeks and had the great opportunity to host speakers from around the world. Thank you to all the speakers, MVPs and members for being part of #SqlSaturday692 #SqlSatNepal #NepalPUG. We had great speakers and MVPs from around the world, such as Shree Prasad Khanal, Anil Maharjan, Gogula Aryalingam, Guy Glantser, Jonathan Stewart, Deependra Bajracharya, Virendra, and Dibya Tara Shakya.

I gave a presentation on ‘PowerBI: Data Visualization SQL Saturday with R’. You can find my slides at the link below: http://www.sqlsaturday.com/692/Sessions/Details.aspx?sid=69982

It was also a great opportunity for me to tell the community more about the Nepal Power BI User Group and how one can engage with and join the NepalPUG group. Currently there are only a few members in Nepal PUG, and I hope it will grow soon. http://www.pbiusergroup.com/kathmandu

The event was a big success: we had around 100 participants, and the interaction between participants and speakers was really overwhelming.

We will conduct more Data Platform, SQL and Power BI related events soon. One difficulty we face here is that there are no Microsoft office premises in Nepal. It would be great if Microsoft could open a small office here in Nepal too, so that we could hold these kinds of Microsoft-related conferences within a Microsoft office. That would help the Microsoft community grow further and bring more community members together in one place.

Thanks,

Anil Maharjan

Senior BI Engineer | Nepal Power BI User Group Leader

Speaking at SQLSaturday Nepal SQLSaturday#692 and Nepal Power BI User Group Meetup.

Firstly, I am happy and excited that I will be speaking at SQLSaturday #692 and the Nepal Power BI User Group Meetup on Dec 23 2017. This is my fourth time speaking at these international events, and I’m really excited to be speaking this time too.

I will be speaking on ‘PowerBI: Data Visualization SQL Saturday with R’. Here is my abstract:

This session is mainly about learning more about Power BI and R visualization charts. In this session you can learn how to make simple, quick visualizations in Power BI Desktop using real SQL Saturday data, publish them to the Power BI cloud service, and also publish those reports publicly on the web. The session also helps to tell the story of SQL Saturday using Power BI and R visualization charts.

Overall, using Power BI visualizations, one can find out which SQL Saturday was conducted in which state or country, in which year and on which day, along with the total sessions conducted. One can also find SQL Saturday trends year on year, and the month in which most SQL Saturdays are conducted, as per SQL Saturday data history. So this session will surely help you learn about Power BI, its capabilities, and SQL Saturday stories.

So, what is SQLSaturday?

SQLSaturday is a free training event for Microsoft Data Platform professionals and those wanting to learn about SQL Server, Business Intelligence, Power BI and Analytics. Please register soon as seating is limited, and let friends and colleagues know about the event.

This event will be held on Dec 23 2017 at Hotel Yellow Pagoda, Kantipath, Kathmandu, Nepal.

Please do register and be a part of this great event.

http://www.sqlsaturday.com/692/EventHome.aspx

One can join our local SQL Server User Group ‘Himalayan SQL Server User Group’.

http://www.sqlpassnepal.org/

So, what is PUG?

PUG offers online and in-person communities where you can share best practices, take part in exclusive training opportunities, and connect with other passionate Power BI users from various professions and industries. Get involved in your local user group today, and gain a better understanding of data that will enable you to excel in your role.

One can join our local Power BI User Group ‘Nepal Power BI User Group’

http://www.pbiusergroup.com/kathmandu

Hope to see you there, and don’t forget to say hello to me at the event.

Thanks,

Anil Maharjan

Senior BI Engineer

http://np.linkedin.com/in/maharjananil

WELCOME TO NEPAL POWER BI USER GROUP

I am happy and excited to announce that the Nepal Power BI User Group has finally been set up 🙂
Welcome, all! Feel free to join this user group.

You can sign up and join the Nepal Power BI User Group using the link below:

http://www.pbiusergroup.com/kathmandu

We are a group of Power BI users and enthusiasts in the Kathmandu, Nepal area, looking to connect with others to have interesting discussions and exchange ideas. We meet quarterly along with the Himalayan SQL Server User Group to go over the latest updates to Power BI & SQL, help new users get started, and explore specific topics in detail. All are welcome, from beginners to experts.

Power BI User Groups (PUG): PUG offers online and in-person communities where you can share best practices, take part in exclusive training opportunities, and connect with other passionate Power BI users from various professions and industries. Get involved in your local user group today, and gain a better understanding of data that will enable you to excel in your role.

Connect with Power BI Users in the PUG Exchange where you can instantly share what you’re working on in Power BI and in your local user groups.

To know more about PUG (Power BI User Group), you can check out the link below:

www.pbiusergroup.com/home.

To know more about the Himalayan SQL Server User Group, you can follow the link below:

http://www.sqlpassnepal.org/

Feel free to join Nepal Power BI User group.

Thanks,

Anil Maharjan

Installing SQL Server 2016 Developer Edition and trying out Telco Customer Churn with R services.

During SQL Server Geeks Annual Summit 2016 (#SSGAS2016), I was really impressed by Wee Hong Tok’s session on ‘SQL Server R services’, where I got to know more about SQL Server 2016 and R Services, and by the Telco Customer Churn demo he presented in that session.

Another session that really impressed me was Jen Stirrup’s ‘Delivering Practical Analytics and Results with Cortana Analytics’. I was also impressed by Andreas Wolter, Amit Bansal, the other speakers and the SQL Server Geeks community for their friendliness and hospitality.

Since I also work in the telco sector, the first thing I wanted to try was what Wee Hong Tok had shown in his demo. Thanks also to Jen Stirrup, who introduced me to Cortana Analytics. I don’t know much about SQL Server 2016 with R Services or about Cortana Analytics yet, and now I want to learn more about both.

You can find out about Cortana Analytics and Telco Customer Churn at the link below.

https://gallery.cortanaintelligence.com/Experiment/Telco-Customer-Churn-5

So, if you also want to try this out, first download SQL Server 2016 Developer Edition or the 180-day SQL Server 2016 Evaluation version.

https://www.microsoft.com/en-us/cloud-platform/sql-server-editions-developers

https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

For step-by-step installation, you can follow the link below:

http://www.sqlcoffee.com/SQLServer2016_0001.htm

Before installing SQL Server 2016, you need to install the Java Development Kit. http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Also, don’t forget to select the R Services (In-Database) option during installation. Setup also offers a separate R Server (Standalone) option; choose R Services (In-Database).

sql

You can find out more about R Services (In-Database) and R Server (Standalone) at the links below.

https://msdn.microsoft.com/en-us/library/mt696069.aspx

https://msdn.microsoft.com/en-us/library/mt695941.aspx

Once installation is complete, to test Telco Customer Churn, get the backup file for Telco Customer Churn from the GitHub link below:

https://github.com/Microsoft/sql-server-samples/tree/master/samples/features/r-services/Telco%20Customer%20Churn/SQL%20Server

Get the telcoedw2.bak file and restore it into a SQL Server 2016 database. Then read the Read.md file, which contains:

Instructions

Restore the database provided (telcoedw2.bak)

Run the code in TelcoChurn-Main.sql

Description

TelcoChurn-Main.sql – Use this T-SQL script to try out the telco customer churn example.

TelcoChurn-Operationalize.sql – T-SQL scripts to create the stored procedures used in this example.

While going through the above scripts and trying out Customer Churn, I got a few errors, so I am sharing my findings here. If anyone gets the same errors, this post might be helpful.

After installing SQL Server 2016 Developer Edition, I got certain errors while running the scripts TelcoChurn-Main.sql and TelcoChurn-Operationalize.sql. First, after some googling, I found out that we need to change MEMORY_LIMIT_PERCENT in order to run the above SQL scripts. So, just add MEMORY_LIMIT_PERCENT=50 to the Rlauncher.config file, which can be found in the location below:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn

Change as below:

RHOME=C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES
MPI_HOME=C:\Program Files\Microsoft MPI
INSTANCE_NAME=MSSQLSERVER
TRACE_LEVEL=1
JOB_CLEANUP_ON_EXIT=1
USER_POOL_SIZE=0
WORKING_DIRECTORY=C:\PROGRA~1\MICROS~3\MSSQL1~1.MSS\MSSQL\EXTENS~1
MEMORY_LIMIT_PERCENT=50
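For anyone who prefers to script this change rather than edit the file by hand, here is a minimal Python sketch. It assumes Rlauncher.config is plain text with one KEY=VALUE setting per line; the function name set_config_value is hypothetical, and the sample text is a shortened copy of the settings shown above. Back up the real file before changing it.

```python
def set_config_value(config_text, key, value):
    """Return config_text with key set to value (replaced if present, appended if not)."""
    new_line = f"{key}={value}"
    lines = config_text.splitlines()
    for index, line in enumerate(lines):
        name = line.split("=", 1)[0].strip()
        if name.upper() == key.upper():
            lines[index] = new_line  # the key already exists: overwrite its value
            break
    else:
        lines.append(new_line)  # the key was not found: add it at the end
    return "\n".join(lines) + "\n"


# Shortened sample of the settings (assumption: one setting per line).
sample = "INSTANCE_NAME=MSSQLSERVER\nTRACE_LEVEL=1\nUSER_POOL_SIZE=0\n"
updated = set_config_value(sample, "MEMORY_LIMIT_PERCENT", 50)
print(updated)
```

To apply it to the real file, you would read the file's text, pass it through set_config_value, and write the result back with an account that has write access to the Binn folder.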

Before running TelcoChurn-Operationalize.sql, we first need to install the required R packages. I have uploaded a SQL file that does this to my fork of the sql-server-samples repository. The file enables sp_execute_external_script so that R scripts can run in SQL Server 2016, and installs the R packages required to run TelcoChurn-Main.sql and TelcoChurn-Operationalize.sql successfully. We need to install those R packages before running the scripts in order to avoid the errors. https://github.com/maharjananil/sql-server-samples/blob/master/samples/features/r-services/Telco%20Customer%20Churn/SQL%20Server/Enabling%20R%20scripts%20to%20run%20and%20Installing%20required%20R%20Packages.sql

After that, all the scripts ran successfully. Now you can learn more about R Services and the R scripts and algorithms used, and then test with your own telco data, which I am planning to try out for sure.

There is always something new to learn in the world of data: so-called Big Data, and now Data Science/Analytics and Machine Learning.

Thanks,

Anil Maharjan

Speaking at SQL Server Geeks Annual Summit 2016, Asia’s Only Data & Analytics Conference.

Firstly, I am happy and excited that I will be speaking at SQL Server Geeks Annual Summit 2016, Asia’s Only Data & Analytics Conference. I have been selected as a speaker from Nepal and will be the only speaker representing Nepal.

http://www.sqlservergeeks.com/ssgas-2016-anil-maharjan/

About the Conference:

SQLServerGeeks Annual Summit 2016 (SSGAS 2016) is Asia’s Only SQL Conference focusing on Microsoft Data. Scheduled from Aug 11-13 (Pre-Con on Aug 10) at NIMHANS, Bangalore, the summit will see 120+ sessions being delivered by 50+ speakers across 3 days. Joseph Sirosh (CVP, Data Group) will key-note the conference. Complete MS Data Platform stack is being covered at the summit. Speakers include Group PMs, Senior PMs, PMs, Premiere Field Engineers, Escalation Engineers & Data Architects from MTC. SQL CAT, SQL TIGER & Global Black Belt Team from Microsoft will deliver top-notch content and spend quality time with attendees at the convention center.

More details are here: http://www.sqlservergeeks.com/summit2016

I am Speaking

Why one should attend this conference:

  • To get real-world training from industry experts
  • Know the latest trends in Data & Analytics world
  • Special focus on Analytics, Cloud & Big Data
  • To network & connect with the MVPs, MCMs
  • To learn from SQL product team, Redmond
  • Direct access to product team members
  • Benefit from new delivery formats like Open-Talks & Chalk-Talks*
  • Expert level demo-oriented sessions
  • Five parallel full-day classroom training

Please do register and be a part of this great event and follow #SSGAS2016 on Twitter for more news.

Hope to see you there, and don’t forget to say hello to me at the event.

Thanks,

Anil Maharjan

Senior BI Engineer

https://www.linkedin.com/in/maharjananil

SQL Server 2016 Discovery Day – Data Visualization using R and Power BI.

Last week, on July 9th 2016, we successfully conducted the SQL Server 2016 Discovery Day – Release Event in Kathmandu, Nepal. SQL Server Launch 2016 Event and Discovery Day is a free, one-day event where individuals come together, learn about SQL Server 2016 and solve a pre-determined problem.

https://www.eventbrite.com/e/sql-server-2016-discovery-dayrelease-event-kathmandu-nepal-tickets-25888946536?aff=efbnreg#

I got the chance to speak at this great event again, and it was fun to share knowledge with other SQL geeks. I gave a presentation on

SQL Server 2016 Discovery Day – Data Visualization using R and Power BI.

You can find my slides using below link:

We also had a small solution development competition at the event, where we used the free Power BI Desktop tool to create some visualizations and tell the story behind the data.

Below are some quick visualizations that I created at the event using real PASS SQL Saturday data.

Steps for making data visualizations using Power BI Desktop.

Step 1: First, download and install the Microsoft Power BI Desktop tool, which is free, from the link below. For R-related charts, and to play around with R code, we also need to download and install R and the RStudio IDE. Below are the links for Power BI Desktop, R, and the RStudio IDE.

https://powerbi.microsoft.com/en-us/desktop/?gated=0&number=1

https://www.r-project.org

https://cran.r-project.org/src/base/R-3/

https://www.rstudio.com/products/rstudio/download/

Step 2: Get the data related to SQL PASS, SQL Saturday from the link below:

https://drive.google.com/file/d/0BzlPwGX6UtxUNnlfZ01KczF0NHc/view

This zip file contains different .sql files with the data included within the scripts.

Step 3: Load those scripts into a SQL Server database, then use the ‘Get Data’ option in Power BI Desktop to load the data into Power BI Desktop and build visualizations.

blog1

Alternative: If you haven’t installed SQL Server, you can use a single file such as ‘dbo.SQLSatSessions.Table.sql’ and then use Excel to shape the data into the format you need.

Step 4: Load only the data part of the file ‘dbo.SQLSatSessions.Table.sql’ into Excel as below.

blog2

Then use Text to Columns in the Data tab, as below, to split the data into proper columns.

blog3

Once your data is in the format below, you can start the analysis using Power BI Desktop.

blog4
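If you would rather skip the manual Excel step, the same reshaping can be sketched in Python. This is only a sketch under assumptions: it supposes each data row in the .sql file is a single-line statement of the form INSERT ... VALUES (...), containing only simple literals (numbers, NULL, and N'...' strings). The sample statements and the function names are made up for illustration and are not taken from the actual file.

```python
import csv
import io
import re

# Captures everything between "VALUES (" and the final ")" on a line.
VALUES_RE = re.compile(r"\bVALUES\s*\((.*)\)\s*;?\s*$", re.IGNORECASE)


def split_values(body):
    """Split a VALUES list on commas that are outside quoted strings."""
    fields, current, in_string = [], [], False
    i = 0
    while i < len(body):
        ch = body[i]
        if in_string:
            if ch == "'" and body[i + 1:i + 2] == "'":
                current.append("'")  # '' is an escaped quote inside a string
                i += 1
            elif ch == "'":
                in_string = False
            else:
                current.append(ch)
        elif ch == "'":
            in_string = True
            if current and current[-1] == "N":
                current.pop()  # drop the N prefix of N'...' literals
        elif ch == ",":
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current).strip())
    return fields


def rows_from_script(script):
    """Return one list of field values per INSERT line found in the script."""
    rows = []
    for line in script.splitlines():
        match = VALUES_RE.search(line.strip())
        if match:
            rows.append(split_values(match.group(1)))
    return rows


def to_csv(rows, header):
    """Write the rows as CSV text that Excel or Power BI can load."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample in the assumed format (the real file has more columns).
sample_script = """\
INSERT [dbo].[SQLSatSessions] ([SQLSATURDAY], [EventDate]) VALUES (482, N'2016-03-26')
INSERT [dbo].[SQLSatSessions] ([SQLSATURDAY], [EventDate]) VALUES (692, N'2017-12-23')
"""
rows = rows_from_script(sample_script)
print(to_csv(rows, ["SQLSATURDAY", "EventDate"]))
```

The resulting CSV text can be saved to a file and loaded with ‘Get Data’ in the same way as the Excel file.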

Step 5: Use either Step 3 or Step 4 to load the data into Power BI. We have used Step 4 here, so use Power BI to load the data from the Excel source; you will then see the data columns on the right-hand side, as below.

blog5

Step 6: Now start visualizing using the free Power BI Desktop tool. Here we use the SQL Saturday session details data to prepare a line chart, tree map, filled map, table and R script visual, which show the different visualization details below.

blog6

blog61

One can learn how to create these different charts from the link below:

https://powerbi.microsoft.com/en-us/documentation/powerbi-service-visualizations-for-reports/

Step 7: As a sample, here is how to create a simple line chart using just the drag-and-drop feature of Power BI Desktop.

Go to the right side of Power BI Desktop and select the EventDate and SQLSATURDAY fields. Drop EventDate into Axis and SQLSATURDAY into the Values section, then change the SQLSATURDAY value to Count by clicking on the SQLSATURDAY field in the Values section.

blog7

It is that simple, so you can try the other charts using the same drag-and-drop feature.
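The Count aggregation in Step 7 is simply a count of rows per date. As a rough sketch of the same arithmetic outside Power BI, the Python below counts session rows per year and per month. The two event dates for #482 and #692 come from this blog; every other row, including the event numbers marked as invented and the number of rows per event, is sample data made up for illustration, and the function name count_rows is hypothetical.

```python
from collections import Counter
from datetime import date

# Each tuple is one session row: (SQLSATURDAY number, EventDate).
session_rows = [
    (482, date(2016, 3, 26)),
    (482, date(2016, 3, 26)),
    (692, date(2017, 12, 23)),
    (9001, date(2017, 9, 16)),  # invented event number and date
    (9002, date(2017, 9, 23)),  # invented event number and date
    (9003, date(2017, 9, 30)),  # invented event number and date
]


def count_rows(rows, period):
    """Count session rows per period, where period is 'year' or 'month'."""
    return Counter(getattr(event_date, period) for _number, event_date in rows)


per_year = count_rows(session_rows, "year")
per_month = count_rows(session_rows, "month")

print(sorted(per_year.items()))   # the year-on-year trend
print(per_month.most_common(1))   # the busiest month in this sample
```

With real data, per_year gives the values plotted on the line chart, and per_month is the kind of count behind a "busiest month" finding.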

For the R script visual, you need to know some R first in order to create R visualization charts in Power BI. One can learn R from https://www.r-project.org/

Step 8: Publish these reports to the Power BI cloud service by clicking the Publish button in Power BI Desktop. You can now also publish these reports publicly on the web: once you have published your report to the Power BI service, go to Reports -> Your Report -> File -> Publish to web. After that, anyone can view these reports publicly on the web.

Also, the URL that you get through Publish to web can be embedded in your website.

Summary:

In summary, these Power BI Desktop visualizations tell us which SQL Saturday was conducted in which state and country, in which year and on which day, along with the total sessions conducted. They also show SQL Saturday trends year on year, and that most SQL Saturdays are conducted in September, as per SQL Saturday data history.

They also show the state-wise distribution of SQL Saturdays, which shows that most SQL Saturdays happen in North America.

Speaking at SQLSaturday Nepal SQLSaturday #482 .

Firstly, I am happy and excited that I will be speaking at SQLSaturday #482 on March 26 2016. This is my second time at this international SQLSaturday event, and I’m really excited to be speaking this time.

I will be speaking on ‘Using Power Query to tell your story from your Facebook data’. Here is my abstract:

The session is mainly for anyone trying to extract the story behind their Facebook data using Power Query. With Power Query you can easily extract your Facebook data and analyze your own story from it.

Talking about Power Query: Microsoft Power Query for Excel is an Excel add-in that enhances the self-service Business Intelligence experience in Excel by simplifying data discovery, access and collaboration.

Power Query can connect to data across a wide variety of sources, and Facebook is one of those data sources.

This session helps you learn about Power Query, Power View and Power BI, and mainly helps you do self-service BI on your own Facebook data with the help of Power Query, Power View and MS Excel 2013.

So, what is SQLSaturday?

SQLSaturday is a training event for SQL Server professionals and those wanting to learn about SQL Server. Admittance to this event is free, all costs are covered by donations and sponsorship. Please register soon as seating is limited, and let friends and colleagues know about the event.

This event will be held on Mar 26 2016 at Hotel Himalaya, Sahid Shukra Marg, Kathmandu, Central Region, 44600, Nepal.

Please do register and be a part of this great event.

http://www.sqlsaturday.com/482/EventHome.aspx

Hope to see you there, and don’t forget to say hello to me at the event 🙂

Thanks,

Anil Maharjan

BI Engineer

http://np.linkedin.com/in/maharjananil

How to install spark 1.6.0 and play around with it.

There is always something new to learn in the field of Big Data. This time I took the Spark Fundamentals I course at Big Data University. You can go through the free courses at Big Data University to learn in different tracks.

http://bigdatauniversity.com/courses/spark-fundamentals/

Firstly, in order to learn Spark, we need to install it on a machine or VM. I had previously installed Apache Hadoop 2.6.0 on my VM and wanted to install Spark on top of it. Thanks to YARN, we don’t need to pre-deploy anything to the nodes, and as it turned out, it was very easy to install and run Spark on YARN. You can also install Spark in standalone mode.

If you haven’t installed Hadoop, you can go through the link below to install Apache Hadoop on your VM. I previously followed this blog post to install Hadoop 2.6.0 on my VM, which was quite simple.

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

What is Spark?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

According to the Big Data University Spark Fundamentals course:

Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning fast speed and supports Java, Scala, and Python APIs for ease of development.

Spark combines SQL, streaming and complex analytics together seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3.

You can find more on Spark from:

http://spark.apache.org/

Now, how to install Spark?

In order to install Apache Spark:

You can create a separate user for Spark (a spark user, similar to the Hadoop user) or just install as the root user.

wget http://apache.mirrors.ionfish.org/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz

tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz

mv spark-1.6.0-bin-hadoop2.6 spark

You can also set the environment variables:

vi ~/.bashrc

export SPARK_HOME=/home/spark

Spark supports the programming languages Java, Scala and Python. I set up Scala in order to play around with Spark.

How to install Scala?

You can download Scala version 2.11.7 from http://www.scala-lang.org/download/ to local disk on your VM. Then use the commands below:

tar xvf scala-2.11.7.tar

mv scala-2.11.7 scala

Now let’s check the Spark setup by starting the Spark shell.

# $SPARK_HOME/bin/spark-shell

Spark

You can also check Spark GUI:

sparkgui

Great! You have successfully installed Apache Spark. After that, you can go through the Big Data University course using the link above and learn more about Spark. Through this course you can learn Spark fundamentals and some of the basics related to Spark. After completing the course, you should be able to:

  • Describe what Spark is all about and know why you would want to use Spark
  • Use Resilient Distributed Datasets operations
  • Use Scala, Java, or Python to create and run a Spark application
  • Create applications using Spark SQL, MLlib, Spark Streaming, and GraphX
  • Configure, monitor and tune Spark

In summary, it always feels great to learn new things and play around with them. In the Big Data world there is always something new to learn, and this time Spark is what I played around with in my free time. I must say, the more you learn about these things, the more excitement and challenges they bring, and the more you want to dive deeper.

Through this article, I believe you are now able to install Spark 1.6.0 on top of Hadoop 2.6.0 and play around with it.

References:

http://bigdatauniversity.com/courses/spark-fundamentals/

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

http://spark.apache.org/

https://sparkhub.databricks.com/

https://anilmaharjanonbi.wordpress.com/2015/07/01/big-data-apache-hadoop-hive-then-vertica/

https://www.linkedin.com/pulse/big-data-apache-hadoop-hive-vertica-anil-maharjan

Big Data, Apache Hadoop, Hive & then Vertica

It’s all about Big Data. Back in 2012, I did some research on Big Data and Apache Hadoop while I was working in the healthcare domain at my old company. I couldn’t spend much time on it then, though I wanted to. So now, back again, I have started to dig a little deeper into this topic, which was a hot topic then and is an even hotter topic now. Every company is now trying to incorporate and deal with Big Data.

Honestly, the more we search about Big Data, the vaster it gets. As far as I know, Big Data is just a way of handling massive amounts of structured, semi-structured and unstructured data that are beyond the storage and processing capabilities of a single physical machine. Day by day, the three Vs (data velocity, data variety and data volume) keep increasing, from terabytes to petabytes and beyond. If we don’t think about and invest enough time in handling this massive, growing data and the emerging Big Data challenges, the future looks a little dark.

Wherever Big Data comes up, so does the term Hadoop. I think Doug Cutting wouldn’t have thought that the tool he named after his kid’s soft toy would become such a huge elephant that every top company is now trying to ride it. Every company is now trying to find out what this big elephant is about, what it is capable of, whether they need it, and how to incorporate it into their existing frameworks. These are some of the things companies are thinking about and investing their valuable time in these days.

Hadoop is by no means an out-of-the-box solution. In order to build a truly information-driven enterprise, where decisions are based on data and not guesswork, companies require a data management solution that not only offers robust data governance, but is also easily manageable and integrates seamlessly with existing enterprise infrastructure.

Hadoop, as described formally by Apache itself:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

It has two main core components:

  1. Data processing framework (MapReduce): MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
  2. Distributed file system for data storage (HDFS): HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
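To make the MapReduce idea concrete, here is a tiny single-machine Python sketch of the classic word-count example. It only imitates the map, shuffle and reduce phases in memory; it is not Hadoop code, and the function names are hypothetical ones chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in order over a list of documents."""
    pairs = (pair for document in documents for pair in map_phase(document))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big elephant"])
print(counts)
```

On a real cluster, the map and reduce functions run in parallel on many machines and the framework performs the shuffle over the network; the sketch keeps everything in one process only to show the data flow.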

You can find more on these through different sites online.

I went through a step-by-step installation of Hadoop 2.7.0 as a single-node cluster on a Red Hat Linux VMware machine. If you want to play with Hadoop by trying the installation and configuration yourself, you can go through the link below, which has all the details:

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/

After that, I tried to install Hive, another Apache application, which helps convert a query language into MapReduce jobs.

http://tecadmin.net/install-apache-hive-on-centos-rhel/

Now I am trying to get a Vertica cluster node and a Hadoop cluster node to talk to each other. I can process unstructured data from different sources directly and easily on the Hadoop nodes, and then, for quick ad-hoc analytics and quick decision making, load the processed data into Vertica and do further analysis in near real time. I am going through this document for now.

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

For installing Vertica on a single node, you can go through the blog post that I wrote previously:

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

As far as Hadoop distributions are concerned, the three companies that really stand out from the competition are Cloudera, MapR and Hortonworks.

You can find out more by going through their sites. One blog post I found great to read about these Hadoop distribution vendors is at the link below:

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

The more we dig into these topics, the vaster the Big Data world becomes. Anyone trying to learn and read about Big Data should surely dig into some other topics too, such as:

HBase, YARN, Cassandra, Pig, ZooKeeper, Spark, Ambari, Sqoop, Flume and so on.

These are some of the topics you should go through if you want to dig deeper into the world of Big Data.

References:

https://hadoop.apache.org/

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

http://www.quora.com/What-is-Apache-Hadoop-1

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

http://www.smartdatacollective.com/mtariq/120791/hadoop-toolbox-when-use-what

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

Thanks,

Anil Maharjan

BI Engineer