Using Microsoft Azure Stream Analytics and Power BI: Real-time Telco fraud detection

This post is based on the official Microsoft Azure documentation. In my free time I wanted to try out Azure Stream Analytics and learn more about it, and I found two great articles that helped me understand it in detail.

Below are those articles:

1) https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-real-time-fraud-detection

2) https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-power-bi-dashboard

From these articles one can learn about Azure Stream Analytics and how a Power BI dashboard can be used for real-time data analysis, visualizing fraudulent phone calls as they are detected by a Stream Analytics job. Since I am currently working in the telco sector, this analysis helps me a lot.

This tutorial provides an end-to-end illustration of how to use Azure Stream Analytics. You learn how to:

  • Bring streaming events into an instance of Azure Event Hubs. In this tutorial, you’ll use an app that we provide that simulates a stream of mobile-phone metadata records.
  • Write SQL-like Stream Analytics queries to transform data, aggregating information or looking for patterns. You will see how to use a query to examine the incoming stream and look for calls that might be fraudulent.
  • Send the results to an output sink (storage) that you can analyze for additional insights. In this case, you’ll send the suspicious call data to Azure Blob storage.
  • Also send the results to a second output sink (Power BI), where you can analyze them for additional insights and build a real-time telco fraud detection dashboard.

In this tutorial, we use the example of real-time fraud detection based on phone-call data. But the technique we illustrate is also suited for other types of fraud detection, such as credit card fraud or identity theft.

Scenario: Telecommunications and SIM fraud detection in real time

A telecommunications company has a large volume of data for incoming calls. The company wants to detect fraudulent calls in real time so that they can notify customers or shut down service for a specific number. One type of SIM fraud involves multiple calls from the same identity around the same time but in geographically different locations. To detect this type of fraud, the company needs to examine incoming phone records and look for specific patterns, in this case calls made around the same time in different countries. Any phone records that fall into this category are written to storage for subsequent analysis. The same results are also sent to a Power BI output sink, where one can analyze them for additional insights and detect fraud in real time from live phone-call data.
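To make the pattern concrete, here is a minimal Python sketch of the check described above. It is not the Stream Analytics query from the Microsoft tutorial, just an offline illustration of the same idea. The record layout, the sample identities and the 5-second window are assumptions made for this example.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical call records: (caller identity, country, call start time).
calls = [
    ("imsi-001", "NP", datetime(2018, 1, 1, 10, 0, 0)),
    ("imsi-001", "IE", datetime(2018, 1, 1, 10, 0, 3)),
    ("imsi-002", "NP", datetime(2018, 1, 1, 10, 0, 1)),
    ("imsi-002", "NP", datetime(2018, 1, 1, 10, 0, 2)),
    ("imsi-001", "US", datetime(2018, 1, 1, 11, 0, 0)),
]

def find_suspicious(records, window=timedelta(seconds=5)):
    """Return pairs of calls made by the same identity, in different
    countries, that start within `window` of each other."""
    suspicious = []
    # Sorting by time guarantees that b never starts before a.
    for a, b in combinations(sorted(records, key=lambda r: r[2]), 2):
        same_caller = a[0] == b[0]
        different_country = a[1] != b[1]
        close_in_time = (b[2] - a[2]) <= window
        if same_caller and different_country and close_in_time:
            suspicious.append((a, b))
    return suspicious

for first, second in find_suspicious(calls):
    print(first[0], first[1], "->", second[1])
```

A streaming job does the same comparison continuously over a sliding time window instead of over a fixed list, which is why a window length has to be chosen up front.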

I followed articles 1 and 2 above step by step, so please go through them for the details. Below are some snapshots from the steps I followed:

Now the Stream Analytics job starts looking for fraudulent calls in the incoming stream. The job also creates the dataset and table in Power BI and starts sending data about the fraudulent calls to them.

Once the Azure Stream Analytics job is created, with its output going to a Power BI sink table named ‘Telco_Fraud_Demo’ inside the ‘Data and BI Summit 2018’ workspace in the Power BI service, one can create a real-time visualization dashboard in Power BI.

To do this, log in to the Power BI service and, under your workspace (in my case ‘Data and BI Summit 2018’), create a dashboard and click +Add tile. In the REAL-TIME DATA section you will see Custom Streaming Data; once you select it, you will see the same Power BI sink table that we created as the output of the Azure Stream Analytics job.

By selecting different visualization types, we can visualize the streaming dataset, analyze it for additional insights, and detect telco fraud in real time.

Also, in the near future we should be able to connect directly to the dataset as a streaming dataset, as per the Power BI feature notice.

Thanks,

Anil Maharjan

Senior BI Engineer | Nepal Power BI User Group Leader

Happy and excited to be a Program committee member for DATA and BI Summit 2018.

I am happy and excited to be a Program Committee member for DATA and BI Summit 2018, in the Visualize track focusing on Building and Designing Dashboards and Reports.

Thanks to the DATA and BI Summit 2018 Program Committee team for giving me the opportunity to work as a volunteer Program Committee member for the upcoming summit.

I selected the Visualize track (Focus: Building and Designing Dashboards and Reports). The top 10 topics that I think should be covered in this track at DATA & BI Summit are:

  1. How to get stories and insights from data using Power BI.
  2. Power BI Desktop and its features.
  3. R in Power BI.
  4. Different types of visualization (bar charts, trends, scatter plots, geospatial, maps, drill up, drill down, sub-reports) that we can create with a drag-and-drop approach, plus custom visuals.
  5. Power BI service, dashboards, reports, data sources, APIs.
  6. Workspaces and PowerApps: how end users can benefit from different workspaces and apps, and how these help in decision making.
  7. Roles and privileges in different workspaces, apps, reports and dashboards.
  8. Scheduling, data source refresh, publishing to the web, mail subscriptions, and subscribing different teams and end users to different reports and dashboards.
  9. Q&A, Cortana, the Power BI mobile app, APIs, gateways.
  10. Phone layout, Power Query, data modeling, Get Apps, Microsoft App store, real-time data analysis and analytics.

You can find more about applying for the Data and BI Summit Program Committee at the link below. The deadline is 12th Jan 2018.

https://www.pbiusergroup.com/engage/volunteeropportunities/volunteer-opportunity-details?VolunteerOpportunityKey=3791f992-2f6d-4eb7-b558-3ad62022dc74

Also, if you want to speak at this conference, you can find more about the call for proposals for DATA and BI Summit at the link below. The deadline is 12th Jan 2018.

https://www.pbiusergroup.com/engage/volunteeropportunities/volunteer-opportunity-details?VolunteerOpportunityKey=c2bb2d9e-ddab-4be1-9fcd-7204a4e6226d

So what about DATA and BI Summit 2018?

Data & BI Summit is hosted by The Power BI User Group (PUG). With over 38,000 members, PUG has established itself as the go-to for Data Professionals and Business Analysts who are eager to collaborate and deepen their expertise in Microsoft business intelligence tools.

DATA and BI Summit will be held on 24-26 April 2018 at The Convention Centre Dublin, Dublin, Ireland.

What to expect at the Data & BI Summit:

  • Exceptional, quality content: Learn how to bring your company through the digital transformation by gaining new understandings of your data and deepening your knowledge of the Microsoft Business Intelligence tools. Products will include: Power BI, PowerApps, Flow, SQL Server, Excel, Azure, D365 and more!
  • Answers to your questions: Network with the Microsoft Power BI team, dig-in onsite to find immediate answers with industry experts, Data MVPs, and User Group Leaders while taking advantage of the opportunity to engage in interactive sessions, workshops and labs.
  • Network with your peers: Enjoy countless opportunities to create lasting relationships by connecting and networking with user group peers, partners and Microsoft team members.
  • Stretch your skill set: Advance your career by learning the latest updates and how they can help you and your business.

You can find more on DATA and BI Summit 2018 at the link below.

https://www.databisummit.com/home

Thanks,

Anil Maharjan

Senior BI Engineer | Nepal Power BI User Group Leader

How to install Spark 1.6.0 and play around with it

There is always something new to learn in the field of BIG DATA. This time I took the Spark Fundamentals I course at Big Data University. You can go through the free courses at Big Data University to learn along different tracks.

http://bigdatauniversity.com/courses/spark-fundamentals/

First, in order to learn Spark we need to install it on a machine or VM. I had previously installed Apache Hadoop 2.6.0 on my VM and wanted to install Spark on top of it. Thanks to YARN we don’t need to pre-deploy anything to the nodes, and as it turned out it was very easy to install and run Spark on YARN. You can also install Spark in standalone mode.

If you haven’t installed Hadoop yet, you can go through the link below to install Apache Hadoop on your VM. I previously followed this blog post to install Hadoop 2.6.0 on my VM, which was quite simple.

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

What is Spark?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

According to the Big Data University Spark Fundamentals course:

Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide, Spark is the alternative. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning fast speed and supports Java, Scala, and Python APIs for ease of development.

Spark combines SQL, streaming and complex analytics together seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3.

You can find more on Spark from:

http://spark.apache.org/

Now, how to install Spark?

In order to install Apache Spark:

You can create a separate spark user, similar to the hadoop user, or just install as the root user.

wget http://apache.mirrors.ionfish.org/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz

tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz

mv spark-1.6.0-bin-hadoop2.6 spark

You can also set the environment variable by adding the export line below to ~/.bashrc:

vi ~/.bashrc

export SPARK_HOME=/home/spark

Spark supports the programming languages Java, Scala and Python. I set up Scala in order to play around with Spark.

How to install Scala?

You can download Scala version 2.11.7 from http://www.scala-lang.org/download/ to the local disk of your VM. Then use the commands below:

tar xvf scala-2.11.7.tar

mv scala-2.11.7 scala

Now let’s check the setup by launching the Spark shell.

# $SPARK_HOME/bin/spark-shell

[Screenshot: Spark shell]

You can also check the Spark GUI:

[Screenshot: Spark GUI]

Great! You have successfully installed Apache Spark. Next, you can go through the Big Data University course using the link above and learn more about Spark. The course covers Spark fundamentals and some of the basics related to Spark. After completing the course, you should be able to:

  • Describe what Spark is all about and know why you would want to use Spark
  • Use Resilient Distributed Datasets operations
  • Use Scala, Java, or Python to create and run a Spark application
  • Create applications using Spark SQL, MLlib, Spark Streaming, and GraphX
  • Configure, monitor and tune Spark

In summary, it always feels great to learn new things and play around with them. In the Big Data world there is always something new to learn, and this time Spark is what I spent my free time playing around with. I must say, the more you learn about these things, the more excitement and challenges they bring, and the more you want to dive deeper.

Through this article, I believe you are now able to install Spark 1.6.0 on top of Hadoop 2.6.0 and play around with it.

References:

http://bigdatauniversity.com/courses/spark-fundamentals/

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#

http://spark.apache.org/

https://sparkhub.databricks.com/

https://anilmaharjanonbi.wordpress.com/2015/07/01/big-data-apache-hadoop-hive-then-vertica/

https://www.linkedin.com/pulse/big-data-apache-hadoop-hive-vertica-anil-maharjan

Big Data, Apache Hadoop, Hive & then Vertica

It’s all about Big Data. Back in 2012, I did some research on Big Data and Apache Hadoop while working in the healthcare domain at my old company, but I couldn’t spend as much time on it as I wanted to. So now I am back, digging a little deeper into this topic, which was hot then and is even hotter now. Every company is now trying to incorporate and deal with Big Data.

Honestly, the more we read about Big Data, the vaster it gets. As far as I know, Big Data is just a way of handling massive amounts of structured, semi-structured and unstructured data that are beyond the storage and processing capabilities of a single physical machine. Day by day, the three Vs (data velocity, data variety and data volume) keep increasing, from terabytes to petabytes and beyond. If we don’t invest enough time in handling this massive, growing data and the emerging Big Data challenges, the future looks a little dark.

Where Big Data comes up, the term Hadoop follows. I think Doug Cutting wouldn’t have thought that the tool he named after his kid’s soft toy would become such a huge elephant that every top company is now trying to ride it. Every company is trying to find out what this big elephant is about, what it is capable of, whether they need it, and how to incorporate it into their existing frameworks. These are some of the things companies are thinking about and investing their valuable time in these days.

Hadoop is by no means an out-of-the-box solution. In order to build a truly information-driven enterprise, where decisions are based on data and not guesswork, companies require a data management solution that not only offers robust data governance, but is also easily manageable and integrates seamlessly with existing enterprise infrastructure.

Hadoop, as formally described by Apache itself:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

It has two main core components:

  1. Data processing framework (MapReduce): MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
  2. Distributed file system for data storage (HDFS): HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
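To get a feel for the MapReduce programming model, here is a minimal word-count sketch in plain Python. It runs on one machine and does not use Hadoop at all; the function names and sample lines are made up for illustration. On a real cluster the framework runs the map and reduce steps in parallel across nodes and handles the shuffle between them.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "hadoop handles big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The point of the model is that map works on one record at a time and reduce works on one key at a time, so both steps can be spread over many machines without the programmer writing any distribution logic.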

You can find more on these through different sites online.

I went step by step through installing Hadoop 2.7.0 as a single-node cluster on a RedHat Linux VMware machine. If you want to play along with Hadoop by trying the installation and configuration yourself, you can go through the link below, which has all the details:

http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/

After that I installed Hive, another Apache application, which helps convert a query language into MapReduce jobs.

http://tecadmin.net/install-apache-hive-on-centos-rhel/

Now I am trying to get a Vertica cluster node and a Hadoop cluster node to talk to each other. This way I can process unstructured data from different sources directly and easily on the Hadoop nodes, then load the processed data into Vertica for quick ad-hoc analytics and decision making in near real time. I am going through this document for now:

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

For installing Vertica on a single node, you can go through the blog post that I wrote previously:

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

As far as Hadoop distributions are concerned, the three companies that really stand out from the competition are Cloudera, MapR and Hortonworks.

You can find more on these by going through their sites. One blog post about these Hadoop distribution vendors that I found great to read is at the link below:

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

The more we dig into these topics, the vaster the Big Data world becomes. Anyone trying to learn about Big Data will surely need to dig into some other topics too, such as:

HBase, YARN, Cassandra, Pig, ZooKeeper, Spark, Ambari, Sqoop, Flume and so on.

These are some of the topics to go through if you want to dig deeper into the world of Big Data.

References:

https://hadoop.apache.org/

http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

http://www.quora.com/What-is-Apache-Hadoop-1

https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_HadoopIntegration.pdf

http://www.smartdatacollective.com/mtariq/120791/hadoop-toolbox-when-use-what

https://anilmaharjanonbi.wordpress.com/2014/11/07/how-to-install-vertica-in-a-single-node/

Thanks,

Anil Maharjan

BI Engineer

The power of SIMBA into the world of BI on top of BIG DATA

After a long gap, I would love to post this blog about the power of SIMBA in the world of BI on top of BIG DATA.

It’s really interesting and exciting to see these kinds of new technologies and tools on the way. I am talking about the Simba MDX Provider developed by Simba Technologies.

Yesterday, while I was researching BIG DATA and reading some great blog posts, I came to know about the Simba MDX Provider, which seems to be a great and cool tool.

What actually is Simba’s MDX Provider?

Simba’s MDX Provider is an ODBO provider installed on the same machine as Excel. Simba also has a tool for building cube definitions, which we call schemas. These schemas are saved in XML. Simba’s schema maps MDX metadata constructs to Impala table structures. When an ODBO compliant tool such as Excel issues an MDX query, Simba’s MDX Provider maps the MDX query to HiveQL, sends the HiveQL to Impala, collects the results and returns them to the end user.

The most important technical concept is that there is no intermediate server or cube structure that caches data, all queries go direct from Simba’s MDX Provider to the Cloudera Impala server in real time.
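To picture what such a schema mapping does, here is a toy Python sketch. It is not Simba’s schema format or translation logic, which I have not seen. The schema layout, the table and column names and the `to_sql` function are all hypothetical, and the sketch only covers the simplest case of one dimension and one measure.

```python
# Hypothetical cube schema: maps cube-level names to table columns.
schema = {
    "table": "sales",
    "dimensions": {"Country": "country_code", "Year": "sale_year"},
    "measures": {"Revenue": ("SUM", "amount")},
}

def to_sql(schema, dimension, measure):
    """Translate a (dimension, measure) request into a GROUP BY query string."""
    column = schema["dimensions"][dimension]
    func, source = schema["measures"][measure]
    return (
        f"SELECT {column}, {func}({source}) AS {measure.lower()} "
        f"FROM {schema['table']} GROUP BY {column}"
    )

print(to_sql(schema, "Country", "Revenue"))
```

Because the query is generated on demand and sent straight to the database, there is no cached cube in between, which matches the "no intermediate server" point above.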

You can find out more on this in the main blog post by Simba itself, through the link below:

http://blogs.simba.com/simba_technologies_ceo_co/2013/02/demo-microsoft-excel-pivottables-on-cloudera-impala-via-simba-mdx-provider.html

As the technology is in early development, it is not generally available for early testing.

Demo of Simba’s integration, running MDX queries over Cloudera’s Impala for use with Microsoft Excel PivotTables: http://youtu.be/kZahPE9Puv0

There is also a great PDF on the Simba Teradata case study:

http://www.simba.com/docs/Teradata-Case-Study.pdf

I hope to see and use these kinds of great tools in our world of BI in the near future.

References:

http://blogs.simba.com/simba_technologies_ceo_co/2013/02/demo-microsoft-excel-pivottables-on-cloudera-impala-via-simba-mdx-provider.html

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

http://cwebbbi.wordpress.com/2013/02/25/mdx-on-cloudera-impala/

Thanks,

Anil Maharjan