Data Analytics Process, Cloud Solutions, and Power BI Solutions Training Course


Overview of On-Premise and Cloud-Based Data Storage and Analysis Solutions

Understanding Big Data

  • Big Data criteria
  • Big Data structure
  • Working with Big Data

Cloud Solutions

  • Azure SQL Database
  • Azure Data Warehouse
  • Azure Data Factory
  • Azure Databricks
  • Power BI

Working with Databases

  • Data warehouse design
  • Dimensional modelling
  • Implementation and deployment

Data Models – A Comparison

  • SSAS Tabular Data Models
  • SSAS Multidimensional Models
  • Power BI Models

Data Cleansing

  • Strategies and tools

Report Models

  • Building Power BI tabular models
  • Understanding DAX

Power BI Reports

  • Designing Power BI reports

Power BI Architecture

  • Workspace generation
  • Licensing
  • Permissions


  • Administering Azure solutions
  • Administering the Power BI Service


  • Maintaining a secure Azure architecture
  • Azure SQL Database/Data Warehouse, Data Factory and Databricks
  • Data Masking and Privacy Issues

Big Data Analytics in Health Training Course


21 hours (usually 3 days including breaks)


  • An understanding of machine learning and data mining concepts
  • Advanced programming experience (Python, Java, Scala)
  • Proficiency in data and ETL processes


Big data analytics involves the process of examining large amounts of varied data sets in order to uncover correlations, hidden patterns, and other useful insights.

The health industry has massive amounts of complex, heterogeneous medical and clinical data. Applying big data analytics to health data has great potential to yield insights that improve healthcare delivery. However, the sheer size of these datasets poses major challenges for analysis and for practical application in a clinical environment.

In this instructor-led, live training (remote), participants will learn how to perform big data analytics in health as they step through a series of hands-on live-lab exercises.

By the end of this training, participants will be able to:

  • Install and configure big data analytics tools such as Hadoop MapReduce and Spark
  • Understand the characteristics of medical data
  • Apply big data techniques to deal with medical data
  • Study big data systems and algorithms in the context of health applications


  • Developers
  • Data Scientists

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice.


  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction to Big Data Analytics in Health

Overview of Big Data Analytics Technologies

  • Apache Hadoop MapReduce
  • Apache Spark

Installing and Configuring Apache Hadoop MapReduce

Installing and Configuring Apache Spark

Using Predictive Modeling for Health Data

Using Apache Hadoop MapReduce for Health Data

Performing Phenotyping & Clustering on Health Data

  • Classification Evaluation Metrics
  • Classification Ensemble Methods

Using Apache Spark for Health Data

Working with Medical Ontology

Using Graph Analysis on Health Data

Dimensionality Reduction on Health Data

Working with Patient Similarity Metrics


Summary and Conclusion

Sqoop and Flume for Big Data Training Course


7 hours (usually 1 day including breaks)


  • Experience with SQL


  • Software Engineers


Apache Sqoop is a command line interface for moving data between relational databases and Hadoop. Apache Flume is a distributed service for collecting, aggregating, and moving large amounts of log and event data. Using Sqoop and Flume, users can transfer data between systems and import big data into storage architectures such as Hadoop.

This instructor-led, live training (online or onsite) is aimed at software engineers who wish to use Sqoop and Flume for transferring data between systems.

By the end of this training, participants will be able to:

  • Ingest big data with Sqoop and Flume.
  • Ingest data from multiple data sources.
  • Move data from relational databases to HDFS and Hive.
  • Export data from HDFS to a relational database.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline


Sqoop and Flume Overview

  • What is Sqoop?
  • What is Flume?
  • Sqoop and Flume features

Preparing the Development Environment

  • Installing and configuring Apache Sqoop
  • Installing and configuring Apache Flume

Apache Flume

  • Creating an agent
  • Using spool sources, file channels, and logger sinks
  • Working with events
  • Accessing data sources

Apache Sqoop

  • Importing MySQL to HDFS and Hive
  • Using Sqoop jobs
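
As a rough illustration of the "Importing MySQL to HDFS and Hive" topic, the sketch below assembles a `sqoop import` command line in Python without running it. The host, database, table, user, and paths are hypothetical placeholders. The flags shown (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`, `--hive-import`) are standard Sqoop 1 import options, but they should be checked against the installed version.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, hive=False, mappers=1):
    """Return the argv list for a Sqoop 1 import (nothing is executed)."""
    argv = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory for the imported files
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]
    if hive:
        argv.append("--hive-import")       # also load the data into a Hive table
    return argv


# Hypothetical connection details, for illustration only.
cmd = build_sqoop_import(
    "jdbc:mysql://db.example.com/sales", "etl_user",
    "/user/etl/.mysql-password", "orders", "/data/raw/orders", hive=True)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where Sqoop is installed. Building it as a list rather than a single string avoids shell quoting problems.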

Data Ingestion Pipelines

  • Building pipelines
  • Fetching data
  • Ingesting data to HDFS

Summary and Conclusion

Machine Learning and Big Data Training Course


7 hours (usually 1 day including breaks)


  • An understanding of database concepts
  • Experience with software application development


  • Developers


This instructor-led, live training (online or onsite) is aimed at technical persons who wish to learn how to implement a machine learning strategy while maximizing the use of big data.

By the end of this training, participants will:

  • Understand the evolution and trends for machine learning.
  • Know how machine learning is being used across different industries.
  • Become familiar with the tools, skills and services available to implement machine learning within an organization.
  • Understand how machine learning can be used to enhance data mining and analysis.
  • Learn what a data middle backend is, and how it is being used by businesses.
  • Understand the role that big data and intelligent applications are playing across industries.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline


History, Evolution and Trends for Machine Learning

The Role of Big Data in Machine Learning

Infrastructure for Managing Big Data

Using Historical and Real-time Data to Predict Behavior

Case Study: Machine Learning Across Industries

Evaluating Existing Applications and Capabilities

Upskilling for Machine Learning

Tools for Implementing Machine Learning

Cloud vs On-Premise Services

Understanding the Data Middle Backend

Overview of Data Mining and Analysis

Combining Machine Learning with Data Mining

Case Study: Deploying Intelligent Applications to Deliver Personalized Experiences to Users

Summary and Conclusion

Big Data and its Management Process Training Course


14 hours (usually 2 days including breaks)


There are no specific requirements needed to attend this course.


Objective: This training course aims to help attendees understand why Big Data is changing our lives and how it is altering the way businesses see us as consumers. Businesses that use big data find that it unleashes a wealth of information and insights, which translate into higher profits, reduced costs, and less risk. The downside, however, is the frustration that sometimes arises from putting too much emphasis on individual technologies and not enough focus on the pillars of big data management.

During this course, attendees will learn how to manage big data using its three pillars of data integration, data governance, and data security in order to turn big data into real business value. Exercises based on a customer management case study will help attendees better understand the underlying processes.

Course Outline


  • Introducing Big Data : Evolutions over the years
  • The Characteristics of Big Data
  • Identifying Different Sources of Big Data
  • How Is Big Data Used in Business?

The challenges of Big Data

  • Identifying the Challenges of Big Data : Current and Emerging Challenges
  • Why Are Businesses Struggling with Big Data?
  • State of Big Data Projects
  • Understanding the layers of big data architecture
  • The Big Data Management – Introduction
  • Defining capabilities of big data management
  • Overcoming obstacles with big data management

Building Blocks of an Efficient Big Data Management

  • The Big Data Laboratory versus Big Data Factory
  • Understanding the Three Pillars of Data Management
  • Data Integration
  • Data Governance
  • Data Security
  • Understanding functions of Big Data Management Processes
  • Competencies of the Big Data Team

Implementing Big Data Management

  • Implementing the Big Data Management
  • Identifying Big Data Tools
  • Leveraging the Right Tools
  • What Are Commercial Tools Built atop Open Source Projects?
  • How to Combine Integration, Governance and Security

Conclusion – Tips for Succeeding with Big Data Management

  • Use Cases to provide Business Value
  • Identifying Data Quality Issues Early
  • Aligning Your Vocabulary
  • Centralizing and Automating your Data Management
  • Leveraging Data Lakes
  • Collaborative Methods for Data Governance
  • Using a 360-Degree View on your Data and Relationships
  • How to Work with Vendors to Accelerate Your Deployments

Big Data – Data Science Training Course


14 hours (usually 2 days including breaks)


Delegates should have an awareness of, and some experience with, storage tools, and an awareness of handling large data sets.


This classroom-based training session will explore Big Data. Delegates will undertake computer-based examples and case study exercises with relevant big data tools.

Course Outline

  1. Big data fundamentals
    • Big Data and its role in the corporate world
    • The phases of development of a Big Data strategy within a corporation
    • Explain the rationale underlying a holistic approach to Big Data
    • Components needed in a Big Data Platform
    • Big data storage solution
    • Limits of Traditional Technologies
    • Overview of database types
    • The four dimensions of Big Data
  2. Big data impact on business
    • Business importance of Big Data
    • Challenges of extracting useful data
    • Integrating Big data with traditional data
  3. Big data storage technologies
    • Overview of big data technologies
      • Data storage models
      • Hadoop
      • Hive
      • Cassandra
      • MongoDB
    • Choosing the right big data technology
  4. Processing big data
    • Connecting and extracting data from database
    • Transforming and preparing data for processing
    • Using Hadoop MapReduce for processing distributed data
    • Monitoring and executing Hadoop MapReduce jobs
    • Hadoop distributed file system building blocks
    • MapReduce and YARN
    • Handling streaming data with Spark
  5. Big data analysis tools and technologies
    • Programming Hadoop with Pig Latin language
    • Querying big data with Hive
    • Mining data with Mahout
    • Visualizing and reporting tools
  6. Big data in business
    • Managing and establishing Big Data needs
    • Business importance of Big Data
    • Selecting the right big data tools for the problem

Data Warehousing Concepts

  • What is a Data Warehouse?
  • Difference between OLTP and Data Warehousing
  • Data Acquisition
  • Data Extraction
  • Data Transformation
  • Data Loading
  • Data Marts
  • Dependent vs Independent Data Marts
  • Database design

ETL Testing Concepts

  • Introduction
  • Software development life cycle
  • Testing methodologies
  • ETL testing workflow process
  • ETL testing responsibilities in DataStage

Big Data Fundamentals

  • Big Data and its role in the corporate world
  • The phases of development of a Big Data strategy within a corporation
  • Explain the rationale underlying a holistic approach to Big Data
  • Components needed in a Big Data Platform
  • Big data storage solution
  • Limits of Traditional Technologies
  • Overview of database types

NoSQL Databases


MapReduce

Apache Spark

Big Data Business Intelligence for Criminal Intelligence Analysis Training Course


35 hours (usually 5 days including breaks)


  • Knowledge of law enforcement processes and data systems
  • Basic understanding of SQL/Oracle or relational database
  • Basic understanding of statistics (at Spreadsheet level)


Advances in technologies and the increasing amount of information are transforming how law enforcement is conducted. The challenges that Big Data poses are nearly as daunting as Big Data’s promise. Storing data efficiently is one of these challenges; effectively analyzing it is another.

In this instructor-led, live training, participants will learn the mindset with which to approach Big Data technologies, assess their impact on existing processes and policies, and implement these technologies for the purpose of identifying criminal activity and preventing crime. Case studies from law enforcement organizations around the world will be examined to gain insights on their adoption approaches, challenges and results.

By the end of this training, participants will be able to:

  • Combine Big Data technology with traditional data gathering processes to piece together a story during an investigation
  • Implement industrial big data storage and processing solutions for data analysis
  • Prepare a proposal for the adoption of the most adequate tools and processes for enabling a data-driven approach to criminal investigation


  • Law Enforcement specialists with a technical background

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

Day 01
Overview of Big Data Business Intelligence for Criminal Intelligence Analysis

  • Case Studies from Law Enforcement – Predictive Policing
  • Big Data adoption rate in Law Enforcement Agencies and how they are aligning their future operation around Big Data Predictive Analytics
  • Emerging technology solutions such as gunshot sensors, surveillance video and social media
  • Using Big Data technology to mitigate information overload
  • Interfacing Big Data with Legacy data
  • Basic understanding of enabling technologies in predictive analytics
  • Data Integration & Dashboard visualization
  • Fraud management
  • Business Rules and Fraud detection
  • Threat detection and profiling
  • Cost benefit analysis for Big Data implementation

Introduction to Big Data

  • Main characteristics of Big Data — Volume, Variety, Velocity and Veracity.
  • MPP (Massively Parallel Processing) architecture
  • Data Warehouses – static schema, slowly evolving dataset
  • MPP Databases: Greenplum, Exadata, Teradata, Netezza, Vertica etc.
  • Hadoop Based Solutions – no conditions on structure of dataset.
  • Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS
  • Apache Spark for stream processing
  • Batch- suited for analytical/non-interactive
  • Volume : CEP streaming data
  • Typical choices – CEP products (e.g. Infostreams, Apama, MarkLogic etc)
  • Less production ready – Storm/S4
  • NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database

NoSQL solutions

  • KV Store – Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store – Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
  • KV Store (Hierarchical) – GT.M, Caché
  • KV Store (Ordered) – TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache – Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
  • Tuple Store – Gigaspaces, Coord, Apache River
  • Object Database – ZopeDB, db4o, Shoal
  • Document Store – CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
  • Wide Columnar Store – BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to Data Cleaning issues in Big Data

  • RDBMS – static structure/schema, does not promote agile, exploratory environment.
  • NoSQL – semi structured, enough structure to store data without exact schema before storing data
  • Data cleaning issues


  • When to select Hadoop?
  • STRUCTURED – Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
  • SEMI STRUCTURED data – difficult to carry out using traditional solutions (DW/DB)
  • Warehousing data = HUGE effort and static even after implementation
  • For variety & volume of data, crunched on commodity hardware – HADOOP
  • Commodity H/W needed to create a Hadoop Cluster

Introduction to Map Reduce /HDFS

  • MapReduce – distribute computing over multiple servers
  • HDFS – make data available locally for the computing process (with redundancy)
  • Data – can be unstructured/schema-less (unlike RDBMS)
  • Developer responsibility to make sense of data
  • Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
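
To make the MapReduce idea above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using the classic word-count example. It is plain Python rather than Hadoop's Java API, and it runs on one machine, so it only illustrates the programming model, not the distribution of work across servers. The function names and input lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


lines = ["big data big insight", "data beats opinion"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map calls run on the nodes that hold each HDFS block, and the shuffle moves data over the network. Only the two user-written functions (map and reduce) change from job to job.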

Day 02
Big Data Ecosystem — Building Big Data ETL (Extract, Transform, Load) — Which Big Data Tools to use and when?

  • Hadoop vs. Other NoSQL solutions
  • For interactive, random access to data
  • HBase (column oriented database) on top of Hadoop
  • Random access to data but restrictions imposed (max 1 PB)
  • Not good for ad-hoc analytics, good for logging, counting, time-series
  • Sqoop – Import from databases to Hive or HDFS (JDBC/ODBC access)
  • Flume – Stream data (e.g. log data) into HDFS

Big Data Management System

  • Moving parts, compute nodes start/fail: ZooKeeper – for configuration/coordination/naming services
  • Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
  • Deploy, configure, cluster management, upgrade etc (sys admin): Ambari
  • In Cloud : Whirr

Predictive Analytics — Fundamental Techniques and Machine Learning based Business Intelligence

  • Introduction to Machine Learning
  • Learning classification techniques
  • Bayesian Prediction — preparing a training file
  • Support Vector Machine
  • KNN p-Tree Algebra & vertical mining
  • Neural Networks
  • Big Data large variable problem — Random forest (RF)
  • Big Data Automation problem – Multi-model ensemble RF
  • Automation through Soft10-M
  • Text analytic tool-Treeminer
  • Agile learning
  • Agent based learning
  • Distributed learning
  • Introduction to open source tools for predictive analytics: R, Python, RapidMiner, Mahout

Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis

  • Technology and the investigative process
  • Insight analytic
  • Visualization analytics
  • Structured predictive analytics
  • Unstructured predictive analytics
  • Threat/fraudster/vendor profiling
  • Recommendation Engine
  • Pattern detection
  • Rule/Scenario discovery – failure, fraud, optimization
  • Root cause discovery
  • Sentiment analysis
  • CRM analytics
  • Network analytics
  • Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
  • Technology assisted review
  • Fraud analytics
  • Real Time Analytic

Day 03
Real Time and Scalable Analytics Over Hadoop

  • Why common analytic algorithms fail in Hadoop/HDFS
  • Apache Hama – for Bulk Synchronous distributed computing
  • Apache Spark – for cluster computing and real-time analytics
  • CMU GraphLab2 – graph-based asynchronous approach to distributed computing
  • KNN p-Tree – algebra-based approach from Treeminer for reduced hardware cost of operation

Tools for eDiscovery and Forensics

  • eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
  • Predictive coding and Technology Assisted Review (TAR)
  • Live demo of vMiner for understanding how TAR enables faster discovery
  • Faster indexing through HDFS – Velocity of data
  • NLP (Natural Language processing) – open source products and techniques
  • eDiscovery in foreign languages — technology for foreign language processing

Big Data BI for Cyber Security – Getting a 360-degree view, speedy data collection and threat identification

  • Understanding the basics of security analytics — attack surface, security misconfiguration, host defenses
  • Network infrastructure / Large datapipe / Response ETL for real time analytic
  • Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from Meta data

Gathering disparate data for Criminal Intelligence Analysis

  • Using IoT (Internet of Things) as sensors for capturing data
  • Using Satellite Imagery for Domestic Surveillance
  • Using surveillance and image data for criminal identification
  • Other data gathering technologies — drones, body cameras, GPS tagging systems and thermal imaging technology
  • Combining automated data retrieval with data obtained from informants, interrogation, and research
  • Forecasting criminal activity

Day 04
Fraud prevention BI from Big Data in Fraud Analytics

  • Basic classification of Fraud Analytics — rules-based vs predictive analytics
  • Supervised vs unsupervised Machine learning for Fraud pattern detection
  • Business to business fraud, medical claims fraud, insurance fraud, tax evasion and money laundering

Social Media Analytics — Intelligence gathering and analysis

  • How Social Media is used by criminals to organize, recruit and plan
  • Big Data ETL API for extracting social media data
  • Text, image, meta data and video
  • Sentiment analysis from social media feed
  • Contextual and non-contextual filtering of social media feed
  • Social Media Dashboard to integrate diverse social media
  • Automated profiling of social media profile
  • Live demo of each analytic will be given through Treeminer Tool

Big Data Analytics in image processing and video feeds

  • Image Storage techniques in Big Data — Storage solution for data exceeding petabytes
  • LTFS (Linear Tape File System) and LTO (Linear Tape Open)
  • GPFS-LTFS (General Parallel File System –  Linear Tape File System) — layered storage solution for Big image data
  • Fundamentals of image analytics
  • Object recognition
  • Image segmentation
  • Motion tracking
  • 3-D image reconstruction

Biometrics, DNA and Next Generation Identification Programs

  • Beyond fingerprinting and facial recognition
  • Speech recognition, keystroke dynamics (analyzing a user's typing pattern) and CODIS (Combined DNA Index System)
  • Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples

Big Data Dashboard for quick accessibility and display of diverse data

  • Integration of existing application platform with Big Data Dashboard
  • Big Data management
  • Case Study of Big Data Dashboard: Tableau and Pentaho
  • Use Big Data app to push location based services in Govt.
  • Tracking system and management

Day 05
How to justify Big Data BI implementation within an organization:

  • Defining the ROI (Return on Investment) for implementing Big Data
  • Case studies for saving Analyst Time in collection and preparation of Data – increasing productivity
  • Revenue gain from lower database licensing cost
  • Revenue gain from location based services
  • Cost savings from fraud prevention
  • An integrated spreadsheet approach for calculating approximate expenses vs. Revenue gain/savings from Big Data implementation.

Step by Step procedure for replacing a legacy data system with a Big Data System

  • Big Data Migration Roadmap
  • What critical information is needed before architecting a Big Data system?
  • What are the different ways for calculating Volume, Velocity, Variety and Veracity of data
  • How to estimate data growth
  • Case studies
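
One simple way to approach the "How to estimate data growth" item is a compound-growth projection. The sketch below assumes a constant annual growth rate, which is a simplification, and the starting volume (100 TB) and rate (40% per year) are made-up figures for illustration only.

```python
def projected_volume(current_tb, annual_growth_rate, years):
    """Compound growth: V_n = V_0 * (1 + r) ** n."""
    return current_tb * (1 + annual_growth_rate) ** years


# Hypothetical inputs: 100 TB today, growing 40% per year.
for year in range(4):
    print(year, round(projected_volume(100.0, 0.40, year), 1))
```

A projection like this feeds directly into storage sizing. In practice the growth rate would be estimated from historical volumes rather than assumed.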

Review of Big Data Vendors and review of their products.

  • Accenture
  • APTEAN (Formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (Formerly 10Gen)
  • Mu Sigma
  • NetApp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • QlikTech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (Part of EMC)

Q/A session

Big Data Analytics for Telecom Regulators Training Course


14 hours (usually 2 days including breaks)


There are no specific requirements needed to attend this course.


To meet regulatory compliance requirements, CSPs (communication service providers) can tap into Big Data analytics, which helps them not only to comply but also, within the scope of the same project, to increase customer satisfaction and thus reduce churn. Because compliance is tied to the quality of service specified in a contract, any initiative toward meeting compliance also improves a CSP's competitive edge. It is therefore important that regulators be able to advise CSPs on a set of Big Data analytics practices that benefit both regulators and CSPs.

The course consists of 8 modules (4 on day 1, and 4 on day 2)

Course Outline

1. Module 1: Case studies of how telecom regulators have used Big Data analytics to enforce compliance:

  • TRAI (Telecom Regulatory Authority of India)
  • Turkish telecom regulator: Telekomünikasyon Kurumu
  • FCC (Federal Communications Commission)
  • BTRC (Bangladesh Telecommunication Regulatory Commission)

2. Module 2: Reviewing millions of contracts between CSPs and their users using unstructured Big Data analytics

  • Elements of NLP (Natural Language Processing)
  • Extracting SLAs (service level agreements) from millions of contracts
  • Some of the known open source and licensed tools for contract analysis (eBrevia, IBM Watson, Kira)
  • Automatic discovery of contracts and conflicts from unstructured data analysis

3. Module 3: Extracting structured information from unstructured customer contracts and mapping it to Quality of Service obtained from IPDR data and crowdsourced app data. Metrics for compliance. Automatic detection of compliance violations.

4. Module 4: Using an app-based approach to collect compliance and QoS data. In this approach, the regulatory authority releases a free mobile app and distributes it among users; the app collects data on QoS, spam, etc. and reports it back in analytic dashboard form:

  • Intelligent spam detection engine (for SMS only) to assist the subscriber in reporting
  • Crowdsourcing of data about offending messages and calls to speed up detection of unregistered telemarketers
  • Updates about action taken on complaints within the App
  • Automatic reporting of voice call quality (call drops, one-way connections) for users who have the regulatory app installed
  • Automatic reporting of Data Speed

5. Module 5: Processing regulatory app data for automatic alarm generation (alarms are generated and sent to stakeholders automatically by email/SMS):
Implementation of dashboard and alarm service

  • Microsoft Azure based dashboard and SNS alarm service
  • AWS Lambda Service based Dashboard and alarming
  • AWS/Microsoft Analytic suite to crunch the data for Alarm generation
  • Alarm generation rules

6. Module 6: Using IPDR data for QoS and compliance – IPDR Big Data analytics:

  • Metered billing by service and subscriber usage
  • Network capacity analysis and planning
  • Edge resource management
  • Network inventory and asset management
  • Service-level objective (SLO) monitoring for business services
  • Quality of experience (QOE) monitoring
  • Call Drops
  • Service optimization and product development analytics

7. Module 7: Customer service experience and the Big Data approach to CSP CRM:

  • Compliance on Refund policies
  • Subscription fees
  • Meeting SLA and Subscription discount
  • Automatic detection of not meeting SLAs
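
As a small sketch of what "automatic detection of not meeting SLAs" could look like, the function below compares measured quality-of-service values against contracted limits and lists the breaches. The plan name, field names, thresholds, and measurements are all hypothetical. A production system would read them from contract data and IPDR or app-collected records rather than from inline dictionaries.

```python
def sla_violations(sla, measurements):
    """Return (subscriber, metric) pairs where a measurement breaches the SLA."""
    violations = []
    for record in measurements:
        limits = sla[record["plan"]]
        # Download speed must stay at or above the contracted minimum.
        if record["download_mbps"] < limits["min_download_mbps"]:
            violations.append((record["subscriber"], "download_mbps"))
        # Call drop rate must stay at or below the contracted maximum.
        if record["call_drop_rate"] > limits["max_call_drop_rate"]:
            violations.append((record["subscriber"], "call_drop_rate"))
    return violations


# Hypothetical SLA terms and measurements, for illustration only.
sla = {"basic": {"min_download_mbps": 10.0, "max_call_drop_rate": 0.02}}
measurements = [
    {"subscriber": "S1", "plan": "basic", "download_mbps": 12.0, "call_drop_rate": 0.01},
    {"subscriber": "S2", "plan": "basic", "download_mbps": 6.5, "call_drop_rate": 0.05},
]
violations = sla_violations(sla, measurements)
print(violations)
```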

8. Module 8: Big Data ETL for integrating different QoS data sources and combining them into a single dashboard for alarm-based analytics:

  • Using a PaaS cloud such as AWS Lambda or Microsoft Azure
  • Using a Hybrid cloud approach

Vespa: Serving Large-Scale Data in Real-Time Training Course


14 hours (usually 2 days including breaks)


  • An understanding of big data concepts
  • An understanding of big data systems such as Hadoop and Storm
  • Experience working with the command line


Vespa is an open-source big data processing and serving engine created by Yahoo. It is used to respond to user queries, make recommendations, and provide personalized content and advertisements in real-time.

This instructor-led, live training introduces the challenges of serving large-scale data and walks participants through the creation of an application that can compute responses to user requests, over large datasets in real-time.

By the end of this training, participants will be able to:

  • Use Vespa to quickly compute data (store, search, rank, organize) at serving time while a user waits
  • Implement Vespa into existing applications involving feature search, recommendations, and personalization
  • Integrate and deploy Vespa with existing big data systems such as Hadoop and Storm.


  • Developers

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

To request a customized course outline for this training, please contact us.

Data Science for Big Data Analytics Training Course


35 hours (usually 5 days including breaks)


Big data refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.

Course Outline

Introduction to Data Science for Big Data Analytics

  • Data Science Overview
  • Big Data Overview
  • Data Structures
  • Drivers and complexities of Big Data
  • Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and problems
    • Association Pattern Mining
    • Data Clustering
    • Outlier Detection
    • Data Classification

Introduction to Data Analytics lifecycle

  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Presentation/Communication of results
  • Operationalization
  • Exercise: Case study

From this point most of the training time (80%) will be spent on examples and exercises in R and related big data technology.

Getting started with R

  • Installing R and RStudio
  • Features of R language
  • Objects in R
  • Data in R
  • Data manipulation
  • Big data issues
  • Exercises

Getting started with Hadoop

  • Installing Hadoop
  • Understanding Hadoop modes
  • HDFS
  • MapReduce architecture
  • Hadoop related projects overview
  • Writing programs in Hadoop MapReduce
  • Exercises

Integrating R and Hadoop with RHadoop

  • Components of RHadoop
  • Installing RHadoop and connecting with Hadoop
  • The architecture of RHadoop
  • Hadoop streaming with R
  • Data analytics problem solving with RHadoop
  • Exercises

Pre-processing and preparing data

  • Data preparation steps
  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling, feature subset selection
  • Dimensionality reduction
  • Discretization and binning
  • Exercises and Case study

Exploratory data analytic methods in R

  • Descriptive statistics
  • Exploratory data analysis
  • Visualization – preliminary steps
  • Visualizing single variable
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and Case study

Data Visualizations

  • Basic visualizations in R
  • Packages for data visualization: ggplot2, lattice, plotly
  • Formatting plots in R
  • Advanced graphs
  • Exercises

Regression (Estimating future values)

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods, ridge regression, the lasso
  • Generalizations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalized additive models
  • Regression with RHadoop
  • Exercises and Case study
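
The course works through regression in R. As a language-neutral illustration of the linear regression model description, the following Python sketch fits a one-variable ordinary least squares line using the closed-form formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The data points are invented so that the expected fit is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy: co-variation of x and y; Sxx: variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x, with no noise
intercept, slope = fit_line(xs, ys)

# Residuals are the basis of most regression diagnostics.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(intercept, slope, residuals)
```

With noisy real data the residuals would not be zero, and plotting them against the fitted values is a common first diagnostic.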


Classification

  • The classification related problems
  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbors
  • Decision trees algorithm
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and Case study

Assessing model performance and selection

  • Bias, Variance and model complexity
  • Accuracy vs Interpretability
  • Evaluating classifiers
  • Measures of model/algorithm performance
  • Hold-out method of validation
  • Cross-validation
  • Tuning machine learning algorithms with caret package
  • Visualizing model performance with Profit, ROC and Lift curves
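
To show how cross-validation differs from a single hold-out split, the sketch below generates the train/test index sets for k-fold cross-validation in plain Python. It is an illustration of the idea only. The course itself uses R's caret package, and this version does not shuffle the data first, which a real workflow normally would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Every sample lands in exactly one test fold, so averaging a metric over the k folds uses all the data for evaluation while never testing on a point the model was trained on.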

Ensemble Methods

  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and Case study

Support vector machines for classification and regression

  • Maximal Margin classifiers
    • Support vector classifiers
    • Support vector machines
    • SVMs for classification problems
    • SVMs for regression problems
  • Exercises and Case study

Identifying unknown groupings within a data set

  • Feature Selection for Clustering
  • Representative based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probability-based algorithms: EM (expectation-maximization)
  • Density based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and Case study
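
As an illustration of the representative-based family listed above, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The points and the starting centroids are invented for the example. Passing the initial centroids in explicitly is a simplification, since real implementations usually pick them randomly or with a seeding scheme such as k-means++.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid (squared Euclidean).
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Two visually separate groups; starting centroids chosen by hand.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centers = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels, centers)
```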

Discovering connections with Link Analysis

  • Link analysis concepts
  • Metrics for analyzing networks
  • The PageRank algorithm
  • Hyperlink-Induced Topic Search
  • Link Prediction
  • Exercises and Case study
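
The PageRank topic above can be sketched as a short power iteration. This is a simplified teaching version and not a reference implementation. It assumes every link target also appears as a key in the input dictionary, uses the conventional damping factor of 0.85, and runs a fixed number of iterations instead of testing for convergence. The four-node graph is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict {node: [outgoing neighbours]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the "teleport" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

Node C collects links from three other nodes, so it ends up with the highest score, while D, which nothing links to, keeps only the teleport share.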

Association Pattern Mining

  • Frequent Pattern Mining Model
  • Scalability issues in frequent pattern mining
  • Brute Force algorithms
  • Apriori algorithm
  • The FP growth approach
  • Evaluation of Candidate Rules
  • Applications of Association Rules
  • Validation and Testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and Case study
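
To illustrate the Apriori approach listed above, the following sketch performs a level-wise search for frequent itemsets. Candidates of size k are built from frequent itemsets of size k-1 and pruned if any of their subsets is infrequent. It is a compact teaching version in Python and not the R or Hadoop implementation used in the course. The shopping baskets and the 0.5 support threshold are made up.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    result = {}
    size = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        size += 1
        # Join step: merge pairs of frequent itemsets into larger candidates.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        frequent = set(current)
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
            and support(c) >= min_support
        ]
    return result


baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
itemsets = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in itemsets.items():
    print(sorted(itemset), sup)
```

The prune step is what separates Apriori from brute force: the three-item candidate is discarded without counting because one of its pairs (milk, butter) is already known to be infrequent.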

Constructing recommendation engines

  • Understanding recommender systems
  • Data mining techniques used in recommender systems
  • Recommender systems with recommenderlab package
  • Evaluating the recommender systems
  • Recommendations with RHadoop
  • Exercise: Building recommendation engine

Text analysis

  • Text analysis steps
  • Collecting raw text
  • Bag of words
  • Term Frequency–Inverse Document Frequency (TF-IDF)
  • Determining Sentiments
  • Exercises and Case study
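
The bag-of-words and TF-IDF steps above can be sketched in a few lines of Python. This version uses the basic weighting tf * log(N / df) with whitespace tokenization. Libraries often apply smoothing or normalization on top of this, so their numbers will differ. The three sample documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # the bag-of-words representation
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["big data tools", "big data analytics", "text analytics"]
scores = tf_idf(docs)
for doc, weights in zip(docs, scores):
    print(doc, "->", max(weights, key=weights.get))
```

Terms that occur in many documents (such as "big" and "data" here) receive a low weight, while a term unique to one document stands out as its most characteristic word.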