Author: Anil

Anil http://maharjananil.com.np Highly motivated Business Intelligence Developer/Programmer with leadership and teamwork skills and the ability to accomplish tasks under minimal direction and supervision. - Experience building enterprise solutions with SQL Server, SSAS, SSRS, SSIS - Excellent development and problem-solving techniques in SSAS, SSRS, SSIS - Good T-SQL and database programming skills - Skilled at creating accurate and professional reports with Reporting Services - Solid knowledge of object-oriented programming and design (C, C++, Java) - Some experience in web development using ASP.NET 3.0 and C# - Strong desire to learn new things Specialties Designing and implementing BI solutions: Microsoft Business Intelligence, SQL Server 2005/2008, Reporting Services (SSRS), Analysis Services (SSAS), Integration Services (SSIS), MDX

Installing SQL Server 2016 Developer Edition and trying out Telco Customer Churn with R services.

During SQL Server Geeks Annual Summit 2016 (#SSGAS2016), I was really impressed by Wee Hong Tok's session on ‘SQL Server R services’, where I got to know more about SQL Server 2016 and R Services, and by the Telco Customer Churn demo he presented in that session.

I was also really impressed by Jen Stirrup's session on ‘Delivering Practical Analytics and Results with Cortana Analytics’, and by Andreas Wolter, Amit Bansal, the other speakers and the SQL Server Geeks community for their friendliness and hospitality.

Since I also work in the telco sector, the first thing I wanted to try out was what Wee Hong Tok had shown in his demo. Thanks also to Jen Stirrup, who introduced me to Cortana Analytics. I don't know much yet about SQL Server 2016 with R Services or about Cortana Analytics, and now I want to learn more about both.

You can find more about Cortana Analytics and Telco Customer Churn at the link below.

https://gallery.cortanaintelligence.com/Experiment/Telco-Customer-Churn-5

So, if you also want to try this out, first download SQL Server 2016 Developer Edition or the 180-day SQL Server 2016 Evaluation version.

https://www.microsoft.com/en-us/cloud-platform/sql-server-editions-developers

https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

For step-by-step installation you can follow the link below:

http://www.sqlcoffee.com/SQLServer2016_0001.htm

Before installing SQL Server 2016, you need to install the Java Development Kit (JDK). http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Also, don't forget to select the R Services (In-Database) feature during installation. There is also a separate R Server (Standalone) option; choose R Services (In-Database) for this example.

You can find more on R Services (In-Database) and R Server (Standalone) at the links below.

https://msdn.microsoft.com/en-us/library/mt696069.aspx

https://msdn.microsoft.com/en-us/library/mt695941.aspx

Once installation is complete, go to the GitHub link below to get the backup file for the Telco Customer Churn sample:

https://github.com/Microsoft/sql-server-samples/tree/master/samples/features/r-services/Telco%20Customer%20Churn/SQL%20Server

Get the telcoedw2.bak file and restore it in SQL Server 2016, then read the README.md file, which contains:

Instructions

Restore the database provided (telcoedw2.bak)

Run the code in TelcoChurn-Main.sql

Description

TelcoChurn-Main.sql – Use this T-SQL script to try out the telco customer churn example.

TelcoChurn-Operationalize.sql – T-SQL scripts to create the stored procedures used in this example.

While going through the above scripts and trying out Customer Churn I got a few errors, so I am sharing my findings here; if anyone gets the same errors, this post might be helpful.

After installing SQL Server 2016 Developer Edition, I got errors while running the scripts TelcoChurn-Main.sql and TelcoChurn-Operationalize.sql. First, after some googling, I found out that we need to change MEMORY_LIMIT_PERCENT in order to run the above SQL scripts. So, just add MEMORY_LIMIT_PERCENT=50 to the Rlauncher.config file, which can be found in the location below:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn

Change it as below (MEMORY_LIMIT_PERCENT=50 is the added setting):

RHOME=C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES
MPI_HOME=C:\Program Files\Microsoft MPI
INSTANCE_NAME=MSSQLSERVER
TRACE_LEVEL=1
JOB_CLEANUP_ON_EXIT=1
USER_POOL_SIZE=0
WORKING_DIRECTORY=C:\PROGRA~1\MICROS~3\MSSQL1~1.MSS\MSSQL\EXTENS~1
MEMORY_LIMIT_PERCENT=50

Before running TelcoChurn-Operationalize.sql we first need to install the required R packages. I have uploaded a SQL file to my fork of the sql-server-samples repository that does this. The file enables sp_execute_external_script, so that R scripts can run in SQL Server 2016, and installs the R packages required to run TelcoChurn-Main.sql and TelcoChurn-Operationalize.sql successfully. Run it before the sample scripts in order to avoid the errors. https://github.com/maharjananil/sql-server-samples/blob/master/samples/features/r-services/Telco%20Customer%20Churn/SQL%20Server/Enabling%20R%20scripts%20to%20run%20and%20Installing%20required%20R%20Packages.sql

After that all the scripts ran successfully. Now you can learn more about R Services, the R scripts and the algorithms used, and then test with your own telco data, which I am planning to try out for sure.

There is always something new to learn in the world of data: so-called Big Data, and now Data Science/Analytics and Machine Learning.

Thanks,

Anil Maharjan

Speaking at SQL Server Geeks Annual Summit 2016, Asia’s Only Data & Analytics Conference.

Firstly, I am happy and excited that I will be speaking at SQL Server Geeks Annual Summit 2016, Asia’s Only Data & Analytics Conference. I have been selected as a speaker from Nepal and will be the only speaker representing Nepal.

http://www.sqlservergeeks.com/ssgas-2016-anil-maharjan/

About the Conference:

SQLServerGeeks Annual Summit 2016 (SSGAS 2016) is Asia’s Only SQL Conference focusing on Microsoft Data. Scheduled from Aug 11-13 (Pre-Con on Aug 10) at NIMHANS, Bangalore, the summit will see 120+ sessions being delivered by 50+ speakers across 3 days. Joseph Sirosh (CVP, Data Group) will key-note the conference. Complete MS Data Platform stack is being covered at the summit. Speakers include Group PMs, Senior PMs, PMs, Premiere Field Engineers, Escalation Engineers & Data Architects from MTC. SQL CAT, SQL TIGER & Global Black Belt Team from Microsoft will deliver top-notch content and spend quality time with attendees at the convention center.

More details are here: http://www.sqlservergeeks.com/summit2016

Why one should attend this conference:

To get real-world training from industry experts
Know the latest trends in Data & Analytics world
Special focus on Analytics, Cloud & Big Data
To network & connect with the MVPs, MCMs
To learn from SQL product team, Redmond
Direct access to product team members
Benefit from new delivery formats like Open-Talks & Chalk-Talks*
Expert level demo-oriented sessions
Five parallel full-day classroom training

Please do register and be a part of this great event and follow #SSGAS2016 on Twitter for more news.

Hope to see you there, and don’t forget to say hello to me at the event.

Thanks,

Anil Maharjan

Senior BI Engineer

https://www.linkedin.com/in/maharjananil

SQL Server 2016 Discovery Day – Data Visualization using R and Power BI.


Last week we successfully conducted the SQL Server 2016 Discovery Day – Release Event in Kathmandu, Nepal, on July 9th, 2016. The SQL Server 2016 Launch Event and Discovery Day is a free, one-day event where individuals come together, learn about SQL Server 2016 and solve a pre-determined problem.

https://www.eventbrite.com/e/sql-server-2016-discovery-dayrelease-event-kathmandu-nepal-tickets-25888946536?aff=efbnreg#

I again got a chance to speak at this great event, and it was fun to share knowledge with other SQL geeks. I gave a presentation on:

SQL Server 2016 Discovery Day – Data Visualization using R and Power BI.

You can find my slides at the link below:

SQL Server 2016 Discovery Day – Data Visualization using R and Power BI. from Anil Maharjan

Also, we had a small solution development competition at the event, where we used the free Power BI Desktop tool to create some visualizations and tell the story behind the data.

Below are some quick visualizations that I created at the event using real PASS SQL Saturday data.

Steps for making data visualizations using Power BI Desktop.

Step 1: First, download and install the Microsoft Power BI Desktop tool, which is free, from the link below. For R-related charts and to play around with R code, we also need to download and install R and the RStudio IDE. Below are the links for Power BI Desktop, R, and the RStudio IDE.

https://powerbi.microsoft.com/en-us/desktop/?gated=0&number=1

https://www.r-project.org

https://cran.r-project.org/src/base/R-3/

https://www.rstudio.com/products/rstudio/download/

Step 2: Get the data related to SQL PASS, SQL Saturday from the link below:

https://drive.google.com/file/d/0BzlPwGX6UtxUNnlfZ01KczF0NHc/view

This zip file contains several .sql files with the data included in the scripts.

Step 3: Load those scripts into a SQL Server database and then use the ‘Get Data’ button in Power BI Desktop to load the data and build the visualizations.

Alternative: If you haven’t installed SQL Server, you can use a single file such as ‘dbo.SQLSatSessions.Table.sql’ and use Excel to shape the data into the required format.

Step 4: Load only the data part of the file ‘dbo.SQLSatSessions.Table.sql’ into Excel as below.

Then use Text to Columns in the Data tab, as below, to split the data into proper columns.

Once your data is in the format below, you can start the analysis using Power BI Desktop.

Step 5: Load the data prepared in Step 3 or Step 4 into Power BI. We used Step 4 here, so use Power BI to load the data from the Excel source; you will then see the data columns on the right-hand side as below.

Step 6: Now start visualizing with the free Power BI Desktop tool. Here we use the SQL Saturday session details to prepare different visuals (line chart, tree map, filled map, table and R script visual), as shown below.

You can learn how to create these different charts by going through the link below:

https://powerbi.microsoft.com/en-us/documentation/powerbi-service-visualizations-for-reports/

Step 7: As a sample, here is how to create a simple line chart using the drag-and-drop feature of Power BI Desktop.

Go to the right side of Power BI Desktop, select the EventDate and SQLSATURDAY fields, drop EventDate into Axis and SQLSATURDAY into Values, and change the SQLSATURDAY aggregation to Count by clicking on the SQLSATURDAY field in the Values section.

It is that simple, so you can try the other charts with the same drag-and-drop approach.

For the R script visual you need to know some R first in order to create R visualizations in Power BI. You can learn R from https://www.r-project.org/

Step 8: Publish these reports to the Power BI cloud service by clicking the Publish button in Power BI Desktop. You can now also publish these reports publicly on the web: once your report is published in the Power BI service, go to Reports -> Your Report -> File -> Publish to web. After that anyone can view these reports publicly on the web.

Also, the URL that you get through Publish to web can be embedded in your website.

Summary:

In summary, these Power BI Desktop visualizations tell us which SQL Saturday was conducted in which state and country, in which year and on which day, along with the total number of sessions conducted. They also show the SQL Saturday trend year on year, and that most SQL Saturdays are held in September, according to the SQL Saturday data history.

They also show the distribution of SQL Saturdays by state, which shows that most SQL Saturdays happen in North America.

Speaking at SQLSaturday Nepal SQLSaturday #482 .

Firstly, I am happy and excited that I will be speaking at SQLSaturday #482 on March 26, 2016. This is my second time at this international SQLSaturday event, and I’m really excited that I will be speaking this time.

I will be speaking on ‘Using power query to tell your story form your Facebook data’. Here is my abstract:

The session is mainly for anyone who is trying to extract the story behind their Facebook data by using Power Query. With Power Query you can easily extract your Facebook data and analyze your own story from it.

Talking about Power Query: Microsoft Power Query for Excel is an Excel add-in that enhances the self-service Business Intelligence experience in Excel by simplifying data discovery, access and collaboration.

Power Query can connect to data across a wide variety of sources, and Facebook is one of them.

This session helps you learn about Power Query, Power View, and Power BI, and mainly helps you do self-service BI with your own Facebook data using Power Query, Power View and MS Excel 2013.

So, What is SQLSaturday?

SQLSaturday is a training event for SQL Server professionals and those wanting to learn about SQL Server. Admittance to this event is free, all costs are covered by donations and sponsorship. Please register soon as seating is limited, and let friends and colleagues know about the event.

This event will be held on Mar 26 2016 at Hotel Himalaya, Sahid Shukra Marg, Kathmandu, Central Region, 44600, Nepal.

Please do register and be a part of this great event.

http://www.sqlsaturday.com/482/EventHome.aspx

Hope to see you there and don’t forget to say hello to me in the event 🙂

Thanks,

Anil Maharjan

BI Engineer

http://np.linkedin.com/in/maharjananil

How to install spark 1.6.0 and play around with it.

There is always something new to learn in the field of Big Data. This time I took the Spark Fundamentals I course at Big Data University. You can go through the free courses at Big Data University to learn in different tracks.

http://bigdatauniversity.com/courses/spark-fundamentals/

First, in order to learn Spark we need to install it on a machine or VM. I had previously installed Apache Hadoop 2.6.0 on my VM and wanted to install Spark on top of it. Thanks to YARN we don’t need to pre-deploy anything to the nodes, and as it turned out it was very easy to install and run Spark on YARN. You can also install Spark in standalone mode.

If you haven’t installed Hadoop, you can go through the link below to install Apache Hadoop on your VM. I previously followed this blog post to install Hadoop 2.6.0 on my VM, which was quite simple.

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

What is Spark?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

According to the Big Data University Spark Fundamentals course:

Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning fast speed and supports Java, Scala, and Python APIs for ease of development.

Spark combines SQL, streaming and complex analytics together seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3.

You can find more on Spark from:

http://spark.apache.org/

Now, how to install Spark?

In order to install Apache Spark:

You can create a separate user for Spark, similar to the Hadoop user, or just install as the root user.

wget http://apache.mirrors.ionfish.org/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz

tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz

mv spark-1.6.0-bin-hadoop2.6 spark

You can also set the environment variable in ~/.bashrc (the path below assumes the spark directory created above sits under /home):

vi ~/.bashrc

export SPARK_HOME=/home/spark
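If you would rather not edit ~/.bashrc by hand, the same setting can be scripted. The sketch below is only an illustration: the helper name add_spark_env is hypothetical, the extra PATH line is my own addition (it is not part of the steps above), and it assumes Spark sits in /home/spark as above. It appends the lines only once, so it is safe to run again.

```shell
#!/usr/bin/env bash
# Sketch: append Spark environment settings to a profile file only once.
# add_spark_env is a hypothetical helper name, not part of Spark.

add_spark_env() {
    local spark_dir="$1"                  # e.g. /home/spark
    local profile="${2:-$HOME/.bashrc}"   # profile file to update

    # Do nothing if SPARK_HOME is already exported in the profile.
    if grep -q '^export SPARK_HOME=' "$profile" 2>/dev/null; then
        echo "SPARK_HOME already set in $profile"
        return 0
    fi

    {
        echo "export SPARK_HOME=$spark_dir"
        # Single quotes keep the variables unexpanded in the profile.
        echo 'export PATH=$PATH:$SPARK_HOME/bin'
    } >> "$profile"
    echo "Added SPARK_HOME to $profile"
}

# Usage (then reload the profile):
#   add_spark_env /home/spark
#   source ~/.bashrc
```

With $SPARK_HOME/bin on the PATH, spark-shell can be started from any directory.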

Spark supports the programming languages Java, Scala and Python. So, I set up the Scala programming language in order to play around with Spark.

How to install Scala?

You can download Scala version 2.11.7 from http://www.scala-lang.org/download/ to the local disk of your VM. Then use the commands below:

tar xvf scala-2.11.7.tar

mv scala-2.11.7 scala

Now let’s check the Spark setup by starting the Spark shell.

# $SPARK_HOME/bin/spark-shell
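Besides opening the shell, a quick way to confirm the install is to run one of the examples bundled with the Spark distribution. This is a minimal sketch, assuming SPARK_HOME is set as above; the function name spark_smoke_test is hypothetical. The bin/run-example launcher and the SparkPi example ship with Spark, and if everything works the output should include a line starting with "Pi is roughly".

```shell
#!/usr/bin/env bash
# Sketch: verify a Spark install by running the bundled SparkPi example.
# spark_smoke_test is a hypothetical helper name.

spark_smoke_test() {
    local spark_home="${1:-$SPARK_HOME}"

    # Fail early with a hint if the launcher script is missing.
    if [ -z "$spark_home" ] || [ ! -x "$spark_home/bin/run-example" ]; then
        echo "run-example not found under '${spark_home}'; check SPARK_HOME" >&2
        return 1
    fi

    # 10 is the number of partitions SparkPi uses for its estimate.
    "$spark_home/bin/run-example" SparkPi 10
}

# Usage:
#   spark_smoke_test              # uses $SPARK_HOME
#   spark_smoke_test /home/spark  # or pass the directory explicitly
```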

You can also check the Spark web UI (by default on port 4040 of the machine where the shell is running):

Great!!! You have successfully installed Apache Spark. After that, you can go through the Big Data University course linked above to learn more about Spark. Through this course you can learn Spark fundamentals and some of the basics related to Spark. After completing the course, you should be able to:

Describe what Spark is all about and know why you would want to use Spark
Use Resilient Distributed Datasets operations
Use Scala, Java, or Python to create and run a Spark application
Create applications using Spark SQL, MLlib, Spark Streaming, and GraphX
Configure, monitor and tune Spark

In summary, it always feels great to learn new things and play around with them. In the Big Data world there is always something new to learn, and this time Spark is what I spent my free time playing around with. I must say, the more you learn these things, the more excitement and challenges they bring, and the deeper you want to dive.

Through this article, I believe you are now able to install Spark 1.6.0 on top of Hadoop 2.6.0 and play around with it.

References:

http://bigdatauniversity.com/courses/spark-fundamentals/

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

http://spark.apache.org/

https://sparkhub.databricks.com/

https://anilmaharjanonbi.wordpress.com/2015/07/01/big-data-apache-hadoop-hive-then-vertica/

https://www.linkedin.com/pulse/big-data-apache-hadoop-hive-vertica-anil-maharjan

Big Data, Apache Hadoop, Hive & then Vertica

It’s all about Big Data: Back in 2012, I did some research on Big Data and Apache Hadoop while I was working in the healthcare domain at my old company. I couldn’t spend much time on it, though I wanted to. Now I am back again, digging a little deeper into this topic, which was a hot topic then and is becoming even hotter now. Every company is now trying to incorporate and deal with Big Data.

Honestly, the more we search about Big Data, the vaster it becomes. Big Data, as far as I know, is a way of handling massive amounts of structured, semi-structured and unstructured data that are beyond the storage and processing capabilities of a single physical machine. Day by day the three Vs (data velocity, data variety and data volume) keep increasing, from terabytes to petabytes and beyond. If we don’t invest enough time in handling this massive and growing amount of data and the emerging Big Data challenges, the future looks a little dark.

Where Big Data comes up, the term Hadoop follows. Hadoop: I think Doug Cutting wouldn’t have thought that the tool he named after his kid’s soft toy would become such a huge elephant that every top company is now trying to ride it. Every company is now trying to find out what this big elephant is about, what it is capable of, whether they need it and how to incorporate it into their existing frameworks. These are some of the things companies are thinking about and investing their valuable time in these days.

Hadoop is by no means an out-of-the-box solution. In order to build a truly information-driven enterprise, where decisions are based on data and not guesswork, companies require a data management solution that not only offers robust data governance, but is also easily manageable and integrates seamlessly with existing enterprise infrastructure.

Hadoop as described formally by Apache itself:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

It has mainly two core components:

Data processing framework (MapReduce): MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Distributed file system for data storage (HDFS): HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
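To get a feel for the MapReduce programming model, a classic single-machine analogy is a word count built from a shell pipeline. This is only an analogy, not Hadoop code: one stage emits a record per word (map), sort brings identical keys together (shuffle), and uniq -c aggregates each group (reduce). The function name word_count is just for illustration.

```shell
#!/usr/bin/env bash
# Sketch: word count as a map -> shuffle -> reduce pipeline on one machine.
# This only mimics the MapReduce model; it does not use Hadoop.

word_count() {
    tr -s '[:space:]' '\n' |      # map: emit one word per line
        grep -v '^$' |            # drop empty records
        sort |                    # shuffle: group identical keys together
        uniq -c |                 # reduce: count the records in each group
        awk '{print $2 "\t" $1}'  # format as word<TAB>count
}

# Usage:
#   printf 'big data big elephant\n' | word_count
```

On a real cluster, Hadoop runs the map and reduce stages in parallel across many machines and reads the input from HDFS instead of a pipe.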

You can find more on these through different sites online.

I have gone through installing Hadoop 2.7.0 step by step as a single-node cluster on a Red Hat Linux VMware machine. If you want to play with Hadoop by trying the installation and configuration, you can go through the link below, which has all the details:

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/

After that I tried to install Hive, another Apache application, which converts SQL-like queries into MapReduce jobs.

http://tecadmin.net/install-apache-hive-on-centos-rhel/

Now I am trying to get a Vertica cluster node and a Hadoop cluster node to talk to each other. I can process unstructured data from different sources directly and easily on the Hadoop nodes, and then load the processed data into Vertica for quick ad-hoc analytics, quick decision making and further analysis in near real time. I am going through this document for now.

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

For installing Vertica in a single node you can go through the blog that I had previously written:

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

As far as Hadoop distributions are concerned, the three companies that really stand out from the competition are Cloudera, MapR and Hortonworks.

You can find more on these by going through their sites. One blog post about these Hadoop distribution vendors that I found great to read is at the link below:

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

The more we dig into these topics, the vaster the Big Data world becomes. Anyone who is learning about Big Data should also dig into some other topics, such as:

HBase, YARN, Cassandra, Pig, ZooKeeper, Spark, Ambari, Sqoop, Flume and so on.

These are some of the topics to go through if you want to dig deeper into the world of Big Data.

References:

https://hadoop.apache.org/

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

http://www.quora.com/What-is-Apache-Hadoop-1

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

http://www.smartdatacollective.com/mtariq/120791/hadoop-toolbox-when-use-what

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

Thanks,

Anil Maharjan

BI Engineer

Developing Data Products using R language and Shiny App.

Learning new things and playing around with new tools and apps is what we always want, and it helps us grow our careers and knowledge.

In my free time this month, I took one of the courses from Coursera. https://www.coursera.org/

Coursera is an education platform that partners with top universities and organizations worldwide to offer online courses that anyone can take for free; you can also earn a certificate.

Besides the certificate, the knowledge we gain here is a step towards success in our careers.

This time I took the course ‘Developing Data Products’, which is provided by Johns Hopkins University.

About the Course

A data product is the production output from a statistical analysis. Data products automate complex analysis tasks or use technology to expand the utility of a data informed model, algorithm or inference. This course covers the basics of creating data products using Shiny, R packages, and interactive graphics. The course will focus on the statistical fundamentals of creating a data product that can be used to tell a story about data to a mass audience.

You can find more about this course through:

https://www.coursera.org/course/devdataprod

For this, you need to install RStudio, the R language, the Shiny package and other related packages. You can also learn more about R, RStudio and Shiny from other courses available on Coursera.

R-Studio:

You can download RStudio from:

http://www.rstudio.com/products/rstudio/download/

R: the current version is R 3.1.3. You can download it from CRAN:

http://cran.r-project.org/bin/windows/base/

GitHub:

Also, you should open a GitHub account, which is useful for sharing code and docs, sharing ideas, getting involved in projects, and browsing interesting projects that solve all kinds of interesting problems.

https://github.com/

I have opened my GitHub account at:

https://github.com/maharjananil

Shiny & Shiny App:

Then you can open a shinyapps.io account. You can log in directly using your GitHub account.

https://www.shinyapps.io/

I have published one app after taking this course:

https://anilmaharjan.shinyapps.io/FacebookAnalysis/

This Shiny app is mainly used to analyze Facebook site visits vs. other sites: how much time a person spends on Facebook or on other sites per hour, day and month. A detailed analysis is possible if we have exact site visit data. This is just sample data, which is not accurate; we can compare trends in site statistics if we have exact data.

I am thinking of a Shiny app project along the lines of: let’s drill into your Facebook data to extract your story: http://www.slideshare.net/anil_maharjan/using-power-query-to-tell-your-story-form-your-facebook-data

R Markdown / RPubs

You can also prepare documents and presentations and publish them directly to the web from R. This is where RPubs comes in: R Markdown documents that you have written in RStudio can be published directly to RPubs and shared with everyone.

http://rpubs.com/

Summary:

Through this course I was able to learn these new tools, languages and data products through assessments, quizzes and projects. It felt like being back in my engineering classes again. I must say that the course lecturers are great and have great knowledge in their respective fields.

This course showed me how we can analyze data directly and tell the story behind it using different data products, by creating a Shiny app and publishing it on the web, where anyone can go through the app easily and understand the story behind the data by themselves. Here we don’t need to load the data into a database, create data models or use reporting tools to play around with the data and make decisions. We can play with the raw data directly in R, and with a Shiny app and different R charts we can tell the story behind the data quickly and effectively.

While taking this course I ran into some issues, which I have shared in the forum discussions.

https://class.coursera.org/devdataprod-012/forum/thread?thread_id=50

https://class.coursera.org/devdataprod-012/forum/thread?thread_id=93

Reference:

http://shiny.rstudio.com/tutorial/lesson1/

http://www.r-project.org/

Regards,

Anil Maharjan

BI Engineer

http://np.linkedin.com/in/maharjananil

How to Install Vertica in a single node.

Install Vertica in a single node.

This article is mainly related to:

Setup of Vertica in a single node.
Vertica Console Management
Tableau
DBeaver

In my free time, I wanted to try installing Vertica and find out what it is all about: what we should know while installing Vertica, and what issues, difficulties, requirements and process we need to follow in order to set up Vertica and play around with it.

The HP Vertica Analytic Database is based on a massively parallel processing (MPP), shared-nothing architecture, in which the query processing workload is divided among all nodes of the Vertica database.

If you want to try Vertica and play around with this columnar database, follow the steps below.

Step 1:

First, you need a machine with a Linux OS installed. For a Vertica cluster, the usual minimum is 3 nodes, which means three Linux OS instances running on different nodes. But in my case I wanted to install on only one node and try it out. I have Windows 7 installed, with Red Hat Linux running in a virtual machine. So make sure you have at least one machine with Linux installed.

You can use the link below to find the minimum requirements and server configuration: http://my.vertica.com/docs/5.1.6/HTML/index.htm#18671.htm

General Platform Recommendations

ext4 is recommended over ext3 for performance reasons.
Use 2GB of swap space regardless of the amount of installed RAM.
Place the database /catalog directory on the same drive as the OS.

Step 2:

Download all the required software related to Vertica from the site https://my.vertica.com/downloads/

To download, you can sign up for the Community Edition. You can learn all about it from the video below:

http://www.vertica.com/files/myVerticaVideo/myVertica_Audio_Video_Combined_121009J.html

Here, I have downloaded the below versions:

Step 3:

After that, move the setup files into a directory on the Red Hat Linux virtual machine. I put the setup files into the directory /data/Software and then opened a terminal in the VM.

Run the below command as root user:

rpm -Uvh /data/Software/vertica-7.0.2-1.x86_64.RHEL5.rpm

After that, it will ask you to run /opt/vertica/sbin/install_vertica to complete the installation.

Step 4:

Run the script on the master node:

# /opt/vertica/sbin/install_vertica -s host_list -r rpm_package -u dba_username

Here I used only one node, so below is my command.

/opt/vertica/sbin/install_vertica -s localhost -r /data/Software/vertica-7.0.2-1.x86_64.RHEL5.rpm -u dbadmin

Note that if you want to install Vertica on multiple nodes, you can include the additional nodes in the host list.

The options are:

-s host_list: the nodes that will be part of the cluster, as a comma-separated list of hostnames or IP addresses; do not include space characters in the list
-r location of the installation rpm, for example “vertica_6.0.x.x86_64.RHEL5.rpm”
-u dbadmin user name
-p dbadmin password
-P root password
-L location of the license
-d where data will be located
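As a small illustration of the host list rule above (comma-separated, no spaces), here is a minimal Python sketch that assembles the install command. The helper name build_install_command is hypothetical; only the -s, -r and -u options and the paths come from this post.

```python
import shlex

def build_install_command(hosts, rpm_path, dba_user="dbadmin"):
    """Return the install_vertica command line as a list of arguments.

    Hypothetical helper for illustration; it only uses the -s, -r and -u
    options described in this post.
    """
    cleaned = [h.strip() for h in hosts if h.strip()]
    if not cleaned:
        raise ValueError("at least one host is required")
    for host in cleaned:
        if " " in host or "," in host:
            raise ValueError(f"invalid host name: {host!r}")
    return [
        "/opt/vertica/sbin/install_vertica",
        "-s", ",".join(cleaned),  # comma-separated, no spaces
        "-r", rpm_path,
        "-u", dba_user,
    ]

# Single-node case, matching the command used in this post.
cmd = build_install_command(
    ["localhost"], "/data/Software/vertica-7.0.2-1.x86_64.RHEL5.rpm"
)
print(shlex.join(cmd))
```

For a three-node cluster you would pass three host names instead of just localhost; the command itself still has to be run as root.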

Note: If you omit the -u parameter, the database administrator account name defaults to dbadmin. This is the account that is used to run admintools.

You can find more about installing Vertica on three nodes, or a complete cluster installation, at the link below:

http://www.aodba.com/main_articles_single.php?art=83&page=vertica

Step 5:

At this point you may get some warnings and errors. Here is what I got:

>> Validating node and cluster prerequisites…

Failures during local (OS) configuration for verify-127.0.0.1.xml:

HINT (S0305): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0305
    TZ is unset for dbadmin. Consider updating .profile or .bashrc
HINT (S0041): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0041
    Could not find the following tools normally provided by the mcelog package: mcelog
HINT (S0040): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0040
    Could not find the following tools normally provided by the pstack or gstack package: pstack/gstack
WARN (N0010): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=N0010
    Linux iptables (firewall) has some non-trivial rules in tables: filter
FAIL (S0150): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0150
    These disks do not have 'deadline' or 'noop' IO scheduling: '/dev/sda1' ('sda') = 'cfq', '/dev/sda3' ('sda') = 'cfq'
FAIL (S0020): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0020
    Readahead size of sda (/dev/sda1,/dev/sda3) is too low for typical systems: 256 < 2048
FAIL (S0030): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0030
    ntpd process is not running: ['ntpd', 'ntp']
FAIL (S0081): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0081
    SELinux appears to be enabled and not in permissive mode.
FAIL (S0310): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0310
    Transparent hugepages is set to 'always'. Must be 'never' or 'madvise'.

These messages are all about cluster prerequisites. Look up each error at the link below:

https://community.vertica.com/vertica/topics/

Search the community forum for each FAIL (XXXX) code and you will find the solution for that error.
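If the installer prints many of these messages, it can help to collect the codes before searching the forum. The following is a rough Python sketch, not part of Vertica, that groups the HINT, WARN and FAIL codes from saved installer output. The sample text is a shortened copy of the output above, and the function name group_checks is made up for this example.

```python
import re
from collections import defaultdict

# Shortened sample copied from the installer output shown in this post.
SAMPLE = """\
HINT (S0305): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0305
WARN (N0010): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=N0010
FAIL (S0150): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0150
FAIL (S0020): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0020
"""

# Matches lines such as "FAIL (S0150): https://..."
CHECK_LINE = re.compile(r"^\s*(HINT|WARN|FAIL)\s+\((\w+)\):\s+(\S+)")

def group_checks(text):
    """Group (code, url) pairs by severity level."""
    found = defaultdict(list)
    for line in text.splitlines():
        match = CHECK_LINE.match(line)
        if match:
            level, code, url = match.groups()
            found[level].append((code, url))
    return dict(found)

checks = group_checks(SAMPLE)
# Print the FAIL codes first, since those are the ones to look up and fix.
for code, url in checks.get("FAIL", []):
    print(code, url)
```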

Step 6:

Next, run admintools as the dbadmin user:

Then create the database:

Then choose the host where the database will reside:

Then define the catalog and data paths:

Since I installed Vertica on a single node, K-safety does not apply. With at least three nodes, the database can be K-safe.
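To make the relation between node count and K-safety concrete, here is a tiny Python sketch. It assumes the commonly documented rule that a K-safety level of K needs at least 2K + 1 nodes and that K is capped at 2; please check the Vertica documentation for your version before relying on it. The function name max_k_safety is hypothetical.

```python
def max_k_safety(node_count):
    """Largest K (capped at 2) such that node_count >= 2*K + 1.

    Assumption for illustration: K-safety level K needs at least 2K + 1 nodes.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return min((node_count - 1) // 2, 2)

for nodes in (1, 3, 5):
    print(nodes, "node(s) -> K =", max_k_safety(nodes))
```

Under that assumption, a single node gives K = 0, which matches the single-node setup in this post.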

After that, create the Vertica database, named Vertica_DB, on the single node.

The database is then created.

Just click OK and you will see the database configuration:

Finally, you can exit admintools:

Step 7:

Vertica Management Console:

After successfully installing the Vertica database, you can install the Vertica Management Console.

The new HP Vertica Management Console is an enterprise database management tool that provides a unified view of your HP Vertica database and lets you monitor multiple clusters from a single point of access.

You can find more at the link below:

https://my.vertica.com/docs/5.1.6/HTML/index.htm#16773.htm

Run the command as the root user:

rpm -Uvh vertica-console-7.0.2-1.x86_64.RHEL5.rpm

During the installation it displays a URL such as https://localhost.localdomain:5450/webui. Once the installation succeeds, open that URL in a browser and accept the license.

What you can do with Management Console

Create a database cluster on hosts that do not have HP Vertica installed
Create, import, and monitor multiple HP Vertica databases on one or more clusters from a single point of control
Create MC users and grant them access to MC and MC-managed databases
Manage user information and monitor their activity on MC
Configure database parameters and user settings dynamically
Access a single message box of alerts for all managed databases
Export all database messages or log/query details to a file
View license usage and conformance
Diagnose and resolve MC-related issues through a browser
Access a quick link to recent databases and clusters
View dynamic metrics about your database cluster

These features are very helpful for DBAs and developers, who can easily monitor multiple HP Vertica databases on one or more clusters from a single point of control.

The detailed installation and login steps are shown in my SlideShare deck at the link below:

https://www.slideshare.net/anil_maharjan/how-to-setup-vertica-in-a-single-node

Tableau:

Tableau can help anyone see and understand their data. Connect to almost any database, drag and drop to create visualizations, and share with a click.

To learn the story behind the data stored in your Vertica DB, you need a reporting tool that helps you make quick decisions and get value from your data.

This is where Tableau comes in. I must say it is one of the best reporting tools.

You can download the software from the Tableau portal and play around with databases, worksheets, Excel files, and many other data files.

I downloaded both the Desktop and Public versions of Tableau. The Desktop version works as a 15-day trial.

To connect to Vertica DB from Tableau, you need to install the Vertica client on your PC (in my case, the Windows 7 machine). Download vertica-client-7.0.2-1.64 from https://my.vertica.com/

You can also install the VMart example database along with Vertica. In my case I got an error, so I posted it to the Vertica community. Anyone can join this community to share ideas and issues.

https://community.vertica.com/vertica/topics/cannot-create-vmart-example-db

After that, you can open Tableau Desktop and connect to Vertica DB to explore your data and find the story behind it.

You can also install Tableau Public on your PC and do some analysis using different worksheets. I did an analysis of the ‘Average Percentage of Internet Users across the World’ and published it to the free public server that Tableau provides for regular users.

https://public.tableausoftware.com/views/AveragePercentageofInternetUsersaccrostheWorld/AverageofIndividualsUsingtheInternet?:showVizHome=no#1

DBeaver:

DBeaver is a free and open source (GPL) universal database tool for developers and database administrators.

Usability is the main goal of this project, program UI is carefully designed and implemented.
It is freeware.
It is multiplatform.
It is based on opensource framework and allows writing of various extensions (plugins).
It supports any database having a JDBC driver.
It may handle any external datasource which may or may not have a JDBC driver.
There is a set of plugins for certain databases (MySQL and Oracle in version 1.x) and different database management utilities (e.g. ERD

You can find more at this link:

http://dbeaver.jkiss.org/about/

I found this tool very easy to use for connecting to Vertica DB and running queries for analysis.

Summary:

After reading this article, you should understand how to install the Vertica Analytic Database on a single node, how to use the Vertica Management Console to monitor multiple clusters from a single point of access, and how to get the story behind your data in Vertica DB using the Tableau reporting tool.

You have also seen how to run queries for more detailed analysis of a Vertica database using the DBeaver tool.

The full step-by-step article can be downloaded from the link below:

https://www.slideshare.net/anil_maharjan/how-to-setup-vertica-in-a-single-node

Thanks,

Anil Maharjan

BI Engineer

http://np.linkedin.com/in/maharjananil

Using Power Query to tell your story from your Facebook Data.

I spent most of this weekend extracting the story behind my Facebook data using Power Query. Power Query can connect to a wide variety of data sources, and Facebook is one of them. With Power Query you can easily extract your Facebook data and analyze your own story from it.

Talking about Power Query: Microsoft Power Query for Excel is an Excel add-in that enhances the self-service Business Intelligence experience in Excel by simplifying data discovery, access and collaboration.

Microsoft Power Query for Excel, is a new add-in that provides a seamless experience for data discovery, data transformation and enrichment for Information Workers, BI professionals and other Excel users. With Power Query you can:

Identify the data you care about from the sources you work with (e.g. relational databases, Excel, text and XML files, OData feeds, web pages, Hadoop HDFS, Facebook Data etc.).
Discover relevant data from inside(*) and outside your organization using the search capabilities within Excel.
Combine data from multiple, disparate data sources and shape it in order to prepare the data for further analysis in tools like Excel and Power Pivot, or visualization in tools like Power View and Power Map.

You can download it from the link below:

http://www.microsoft.com/en-us/download/details.aspx?id=39379

You can find more about Power View, Power Map, Power BI and Q&A from the official Microsoft Power BI site. http://www.microsoft.com/en-us/powerBI/default.aspx

By now I assume you have installed the Power Query add-in in Excel 2013 Professional edition.

So, let’s drill into your Facebook data to extract your story!

First, open an Excel sheet and go to the Power Query ribbon tab. Then go to the ‘From Other Sources’ section, where you will see the ‘From Facebook’ option in the dropdown list.

Because there are so many slides, I have uploaded the detailed step-by-step guide to my SlideShare account. You can download it or read through it to learn how to use Power Query, how to extract your Facebook data, and how to analyze it with Power View to tell the story. Here is the SlideShare:

Slideshare link: http://www.slideshare.net/anil_maharjan/using-power-query-to-tell-your-story-form-your-facebook-data

There I used Power View to analyze the Facebook data and tell a story from it.

Here I have done three Power Query analyses. I think you will find them easy once you have gone through the slides at the link above, where I have listed the detailed step-by-step method for each analysis. If you already have some knowledge of Power View, they should be quite easy.

Power Query Analysis 1 using Power View:

Let’s see when you added your friends.
Through this analysis you can find out the total number of friends you added on Facebook, by year and month.
You can find the first and the most recent friends you connected with on Facebook.
You can see when you connected with each friend.

Happy to see the results. 🙂

Now suppose you want to know the first and most recent friends you connected with on Facebook, along with when they were added. You can get this by using the period field in Tile By, the story field in rows, and the friend-added filter field in columns. This shows the first friends that you added on Facebook. Happy to see the result. 🙂

To find the most recently added friends on Facebook, just click the last period.
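The same questions (how many friends per year, who was first, who was most recent) can also be sketched in a few lines of Python. This is only an illustration with made-up sample data; the names and dates below are hypothetical and are not taken from the Power Query worksheet.

```python
from collections import Counter
from datetime import date

# Hypothetical sample: (friend name, date the friend was added).
friends = [
    ("Friend A", date(2009, 7, 14)),
    ("Friend B", date(2010, 1, 3)),
    ("Friend C", date(2009, 11, 25)),
    ("Friend D", date(2013, 5, 9)),
]

# Total friends added per year.
per_year = Counter(added.year for _, added in friends)

# Sort by date to find the first and the most recently added friend.
by_date = sorted(friends, key=lambda item: item[1])
first_friend, latest_friend = by_date[0], by_date[-1]

print(dict(sorted(per_year.items())))
print("first:", first_friend[0], "latest:", latest_friend[0])
```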

In the chart I can clearly see that I added a lot of friends in 2009 and 2010. That is when I completed my computer engineering course and was spending most of my time on Facebook. It is quite amazing how it reminds me of my college life. Power Query along with Power View really can tell us our own story. Happy to find out my past.

Power Query and Power View, along with Power BI, are such cool tools. You can visualize exactly the things you want to see.

Power Query Analysis 2 using Power View:

Now, let’s see how many of your friends are male and how many are female.

After going through the slides you will know how to get your friend list from Facebook using Power Query and how to analyze it with Power View. You can follow the detailed steps in the slides to do this analysis.

Power Query Analysis 3 using Power View:

Let’s see your Facebook activity from past to present.

This will help you track how much time you have spent on Facebook and how frequently you post or update your feed.

The same kind of analysis could help Facebook track the usage trends of its users:

How and when users spend their time on Facebook.
Why the usage trend of Facebook users is decreasing or increasing.
What the reason is behind a decrease or increase in users’ feed data.

For the Facebook activity analysis, you can use the same feed information worksheet. By creating a new Power View sheet with a line chart, you can start analyzing your Facebook activity from different perspectives.
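The line chart is essentially a count of feed items per period (YYYYMM). As a minimal sketch of that aggregation outside Excel, assuming the feed timestamps are available as ISO-formatted strings (the sample values below are invented for illustration):

```python
from collections import Counter
from datetime import datetime

# Hypothetical feed timestamps; real ones would come from the feed worksheet.
feed_times = [
    "2009-07-02T10:15:00", "2009-07-19T21:40:00", "2009-08-01T08:05:00",
    "2011-09-04T07:30:00", "2011-09-04T09:12:00", "2011-09-04T12:45:00",
    "2014-05-20T18:00:00",
]

def period_of(timestamp):
    """Convert an ISO timestamp into a YYYYMM period string."""
    return datetime.fromisoformat(timestamp).strftime("%Y%m")

# Number of feed items in each period, and the busiest period overall.
posts_per_period = Counter(period_of(t) for t in feed_times)
busiest_period, busiest_count = posts_per_period.most_common(1)[0]

for period in sorted(posts_per_period):
    print(period, posts_per_period[period])
print("busiest:", busiest_period, busiest_count)
```

Plotting these counts in period order gives the same kind of trend line as the Power View chart.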

From the graph we can clearly see a huge spike from 200907 to around 201003. Again, this is when I completed my computer engineering course and was spending most of my time on Facebook.

The following year there is another spike, in 201109. My birthday is on 4 September, so that month has many more feed items with birthday wishes.

Now, let’s look at the current year’s Facebook activity. Using a similar Power View sheet and the same drag-and-drop features, scroll to the last month, 201405, to see the trend of your current Facebook activity.

The line trend shows that, in my case, Facebook activity has been steadily decreasing from past years to the present. This is because I now spend less time on Facebook than in previous years, when I used to spend most of my time there.

Reference:

http://www.microsoft.com/en-us/powerBI/default.aspx

http://www.slideshare.net/murrayfife/using-microsoft-power-query-to-analyze-facebook-data

http://office.microsoft.com/en-001/excel-help/introduction-to-microsoft-power-query-for-excel-HA104003940.aspx

SUMMARY:

Power Query along with Power View really can tell you your own story, using your own Facebook data. Happy to find out my past.

It is quite amazing how it reminds me of my college life.

Power Query and Power View, along with Power BI, are such cool tools. You can visualize exactly the things you want to see.

Also, I think that if Facebook provided our friends’ location data, it would be easy to find friends and family around the world. Power Map would then be useful for locating friends we are connected to on Facebook in the virtual world but whose whereabouts we do not know. One could imagine a Facebook friends-and-family location tracking app that shows exactly where our friends and family are, based on the current location information in Facebook data.

Thanks,

Anil Maharjan

How to transpose rows to columns in Oracle.

Sometimes we need to transpose table data, that is, turn row values into columns. If so, this blog post should be helpful.

Let’s say we want to find the trend in a customer’s expenses (per account number or mobile number) over a few months, in order to understand his or her behavior. Suppose we have the following data in the table ‘Pivot_Table’:

MobileNO    TotalExpenses    Date
1           3000             20140101
2           5000             20140101
3           7000             20140101
1           5000             20140201
3           9000             20140201
2           3000             20140201
1           1111             20140301
2           2222             20140301
3           3333             20140301

Now, to analyze the behavior per MobileNO over those months, we need output such as:

MobileNO    JanExpenses    FebExpenses    MarchExpenses
1           3000           5000           1111
2           5000           3000           2222
3           7000           9000           3333

This can easily be done with the PIVOT operator, which is available from Oracle 11g onward:

http://www.oracle-base.com/articles/11g/pivot-and-unpivot-operators-11gr1.php

But if your Oracle version is older than 11g, you can use the DECODE function to get the same result.

DECODE is an Oracle function that provides if-then-else logic in SQL. It is not available in MySQL or SQL Server. The syntax for DECODE is:

SELECT DECODE ( "column_name", "search_value_1", "result_1", 
["search_value_n", "result_n"],
{"default_result"} );

"search_value" is the value to search for, and "result" is the value that is displayed.
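To make the if-then-else behaviour concrete, here is a small Python function that mimics how DECODE picks its result. It is only an illustration of the semantics described above, not Oracle code, and the function itself is written for this example.

```python
def decode(value, *args):
    """Mimic DECODE(value, search_1, result_1, ..., [default]).

    Returns the result paired with the first matching search value,
    otherwise the default (None when no default is given).
    """
    pairs, default = args, None
    if len(args) % 2 == 1:  # odd count: the last argument is the default
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if value == search:
            return result
    return default

print(decode(20140201, 20140101, "Jan", 20140201, "Feb", "other"))  # Feb
```

When nothing matches and no default is given, the function returns None, in the same way that DECODE returns NULL.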

Here is my Query:

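-- One row per MobileNO, one column per month.
-- DATE is a reserved word in Oracle, so a column with that name must be
-- referenced as a quoted identifier exactly as it was created.
SELECT /*+ parallel(t, 4) */
    MobileNO,
    MAX(DECODE("Date", 20140101, TotalExpenses)) AS JanExpenses,
    MAX(DECODE("Date", 20140201, TotalExpenses)) AS FebExpenses,
    MAX(DECODE("Date", 20140301, TotalExpenses)) AS MarchExpenses
FROM
    Pivot_Table t
GROUP BY MobileNO
ORDER BY MobileNO;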
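If you do not have an Oracle instance at hand, the same idea can be tried with Python’s built-in sqlite3 module. Note the assumptions in this sketch: SQLite has no DECODE, so a CASE expression is used in its place, and the date column is renamed to ExpenseDate (a name made up for this example) to keep the identifier simple. The data is the sample table from this post.

```python
import sqlite3

# Sample rows from the post: (MobileNO, TotalExpenses, date as YYYYMMDD).
rows = [
    (1, 3000, 20140101), (2, 5000, 20140101), (3, 7000, 20140101),
    (1, 5000, 20140201), (3, 9000, 20140201), (2, 3000, 20140201),
    (1, 1111, 20140301), (2, 2222, 20140301), (3, 3333, 20140301),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Pivot_Table "
    "(MobileNO INTEGER, TotalExpenses INTEGER, ExpenseDate INTEGER)"
)
conn.executemany("INSERT INTO Pivot_Table VALUES (?, ?, ?)", rows)

# CASE plays the role of DECODE: pick TotalExpenses only for the given month,
# then MAX collapses the rows of each MobileNO into a single value.
query = """
SELECT MobileNO,
       MAX(CASE WHEN ExpenseDate = 20140101 THEN TotalExpenses END) AS JanExpenses,
       MAX(CASE WHEN ExpenseDate = 20140201 THEN TotalExpenses END) AS FebExpenses,
       MAX(CASE WHEN ExpenseDate = 20140301 THEN TotalExpenses END) AS MarchExpenses
  FROM Pivot_Table
 GROUP BY MobileNO
 ORDER BY MobileNO
"""
pivoted = conn.execute(query).fetchall()
for row in pivoted:
    print(row)
conn.close()
```

Each printed tuple corresponds to one row of the expected output table, for example (1, 3000, 5000, 1111) for MobileNO 1.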

Hope this post will be helpful.

Thanks,

Anil Maharjan