Pentaho Open Source BI Suite Community Edition (CE) Training Course

Introduction to Pentaho Open Source BI Suite Community Edition (CE)

Overview of CE Features and Architecture

  • Pentaho Community Edition vs. Enterprise Edition
  • Pentaho CE Tools

Installing and Configuring Pentaho CE

Using the Pentaho CE Business Analytics User Console

Creating Reports with the Pentaho CE Business Analytics Report Designer

Performing Data Integration in Pentaho CE

Working with Databases in Pentaho CE

  • Relational Databases
  • NoSQL Sources
  • Analytic Databases

Working with the Analysis View in Pentaho CE

  • Predictive Analytics

Working with Big Data in Pentaho CE

  • Graphical Designer for Big Data

Making the Most of the Pentaho CE Community Online Forums

Deploying or Embedding Your Pentaho CE Project

  • Licensing

Troubleshooting

Data Analytics With R Training Course

Day One: Language Basics

  • Course Introduction
  • About Data Science
    • Data Science Definition
    • Process of Doing Data Science
  • Introducing R Language
  • Variables and Types
  • Control Structures (Loops / Conditionals)
  • R Scalars, Vectors, and Matrices
    • Defining R Vectors
    • Matrices
  • String and Text Manipulation
    • Character data type
    • File IO
  • Lists
  • Functions
    • Introducing Functions
    • Closures
    • lapply/sapply functions
  • DataFrames
  • Labs for all sections

Day Two: Intermediate R Programming

  • DataFrames and File I/O
  • Reading data from files
  • Data Preparation
  • Built-in Datasets
  • Visualization
    • Graphics Package
    • plot() / barplot() / hist() / boxplot() / scatter plot
    • Heat Map
    • ggplot2 package (qplot(), ggplot())
  • Exploration With dplyr
  • Labs for all sections

Day Three: Advanced Programming With R

  • Statistical Modeling With R
    • Statistical Functions
    • Dealing With NA
    • Distributions (Binomial, Poisson, Normal)
  • Regression
    • Introducing Linear Regressions
  • Recommendations
  • Text Processing (tm package / Wordclouds)
  • Clustering
    • Introduction to Clustering
    • KMeans
  • Classification
    • Introduction to Classification
    • Naive Bayes
    • Decision Trees
    • Training using caret package
    • Evaluating Algorithms
  • R and Big Data
    • Connecting R to databases
    • Big Data Ecosystem
  • Labs for all sections

Databricks Training Course

Introduction

  • Overview of Databricks and Apache Spark
  • Understanding the Databricks architecture

Getting Started

  • Setting up the Environment
  • Setting up and configuring Databricks
  • Navigating the Databricks user interface
  • Creating a Databricks workspace

Working with Data in Databricks

  • Connecting to an Apache Spark data source
  • Understanding the basics of columns and data types
  • Managing the file system in notebooks

Managing Jobs and Clusters

  • Creating and configuring clusters
  • Creating jobs using notebooks
  • Running jobs
  • Viewing jobs and job details

Using Delta Lake in Databricks

  • Loading data into Delta Lake
  • Managing data in Delta Lake

Securing Databricks

  • Managing Databricks security
  • Managing backup and recovery

Troubleshooting

Big Data Analytics in Health Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • An understanding of machine learning and data mining concepts
  • Advanced programming experience (Python, Java, Scala)
  • Proficiency in data and ETL processes

Overview

Big data analytics is the process of examining large, varied data sets to uncover correlations, hidden patterns, and other useful insights.

The health industry holds massive amounts of complex, heterogeneous medical and clinical data. Applying big data analytics to this data has great potential to yield insights that improve healthcare delivery. However, the sheer size of these datasets poses major challenges for analysis and for practical application in a clinical environment.

In this instructor-led, live training (remote), participants will learn how to perform big data analytics in health as they step through a series of hands-on live-lab exercises.

By the end of this training, participants will be able to:

  • Install and configure big data analytics tools such as Hadoop MapReduce and Spark
  • Understand the characteristics of medical data
  • Apply big data techniques to deal with medical data
  • Study big data systems and algorithms in the context of health applications

Audience

  • Developers
  • Data Scientists

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice.

Note

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction to Big Data Analytics in Health

Overview of Big Data Analytics Technologies

  • Apache Hadoop MapReduce
  • Apache Spark

Installing and Configuring Apache Hadoop MapReduce

Installing and Configuring Apache Spark

Using Predictive Modeling for Health Data

Using Apache Hadoop MapReduce for Health Data

Performing Phenotyping & Clustering on Health Data

  • Classification Evaluation Metrics
  • Classification Ensemble Methods

Using Apache Spark for Health Data

Working with Medical Ontology

Using Graph Analysis on Health Data

Dimensionality Reduction on Health Data

Working with Patient Similarity Metrics

Troubleshooting

Summary and Conclusion

Data Science for Big Data Analytics Training Course

Duration

35 hours (usually 5 days including breaks)

Overview

Big data refers to data sets so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include data capture, storage, analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.

Course Outline

Introduction to Data Science for Big Data Analytics

  • Data Science Overview
  • Big Data Overview
  • Data Structures
  • Drivers and complexities of Big Data
  • Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and problems
    • Association Pattern Mining
    • Data Clustering
    • Outlier Detection
    • Data Classification

Introduction to Data Analytics lifecycle

  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Presentation/Communication of results
  • Operationalization
  • Exercise: Case study

From this point on, most of the training time (80%) will be spent on examples and exercises in R and related big data technologies.

Getting started with R

  • Installing R and RStudio
  • Features of R language
  • Objects in R
  • Data in R
  • Data manipulation
  • Big data issues
  • Exercises

Getting started with Hadoop

  • Installing Hadoop
  • Understanding Hadoop modes
  • HDFS
  • MapReduce architecture
  • Hadoop related projects overview
  • Writing programs in Hadoop MapReduce
  • Exercises

Integrating R and Hadoop with RHadoop

  • Components of RHadoop
  • Installing RHadoop and connecting with Hadoop
  • The architecture of RHadoop
  • Hadoop streaming with R
  • Data analytics problem solving with RHadoop
  • Exercises

Pre-processing and preparing data

  • Data preparation steps
  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling, feature subset selection, dimensionality reduction
  • Discretization and binning
  • Exercises and Case study

Exploratory data analytic methods in R

  • Descriptive statistics
  • Exploratory data analysis
  • Visualization – preliminary steps
  • Visualizing single variable
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and Case study

Data Visualizations

  • Basic visualizations in R
  • Packages for data visualization: ggplot2, lattice, plotly
  • Formatting plots in R
  • Advanced graphs
  • Exercises

Regression (Estimating future values)

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods, ridge regression, the lasso
  • Generalizations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalized additive models
  • Regression with RHadoop
  • Exercises and Case study

Classification

  • Classification-related problems
  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbors
  • Decision tree algorithms
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and Case study

Assessing model performance and selection

  • Bias, Variance and model complexity
  • Accuracy vs Interpretability
  • Evaluating classifiers
  • Measures of model/algorithm performance
  • Hold-out method of validation
  • Cross-validation
  • Tuning machine learning algorithms with caret package
  • Visualizing model performance with Profit, ROC, and Lift curves
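
As an illustration of the hold-out and cross-validation topics above, the following is a minimal Python sketch of k-fold index splitting. It is not part of the course material, which uses R and the caret package; the function names `k_fold_indices` and `cross_validation_splits` are hypothetical and chosen only for this example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        # Training set is every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so every observation is used for validation once. In practice the indices would usually be shuffled first; that step is omitted here to keep the sketch deterministic.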

Ensemble Methods

  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and Case study

Support vector machines for classification and regression

  • Maximal Margin classifiers
    • Support vector classifiers
    • Support vector machines
    • SVMs for classification problems
    • SVMs for regression problems
  • Exercises and Case study

Identifying unknown groupings within a data set

  • Feature Selection for Clustering
  • Representative based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probabilistic model-based algorithms: EM
  • Density based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and Case study
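
To make the k-means item in this module concrete, here is a minimal Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It is an illustration only, not the course's R or RHadoop material; the function name `k_means` and the choice to pass initial centroids explicitly are assumptions made for this example.

```python
def k_means(points, centroids, iterations=10):
    """Run Lloyd's algorithm on tuples of numbers from given start centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters
```

The result depends on the starting centroids, which is one reason cluster validation appears as a separate topic in this module.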

Discovering connections with Link Analysis

  • Link analysis concepts
  • Metrics for analyzing networks
  • The PageRank algorithm
  • Hyperlink-Induced Topic Search
  • Link Prediction
  • Exercises and Case study
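
The PageRank item in this module can be sketched in a few lines of Python using power iteration. This is an illustrative sketch under stated assumptions, not the course's own implementation: the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores; `links` maps each node to the nodes it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's damped rank among its outgoing links.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Because dangling nodes redistribute their rank, the scores continue to sum to 1 after every iteration.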

Association Pattern Mining

  • Frequent Pattern Mining Model
  • Scalability issues in frequent pattern mining
  • Brute Force algorithms
  • Apriori algorithm
  • The FP-growth approach
  • Evaluation of Candidate Rules
  • Applications of Association Rules
  • Validation and Testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and Case study
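
The Apriori and rule-evaluation items above can be illustrated with a short Python sketch. It computes support and confidence and finds frequent itemsets level by level, with Apriori's join and prune steps. The function names and the example data are hypothetical; the course itself covers these topics with R and Hadoop.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base if base else 0.0


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    result, k = {}, 1
    while current:
        counted = {c: support(transactions, c) for c in current}
        frequent = {c: s for c, s in counted.items() if s >= min_support}
        result.update(frequent)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    return result
```

The prune step is what separates Apriori from the brute-force approach listed earlier: a candidate with any infrequent subset is discarded without scanning the transactions.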

Constructing recommendation engines

  • Understanding recommender systems
  • Data mining techniques used in recommender systems
  • Recommender systems with recommenderlab package
  • Evaluating the recommender systems
  • Recommendations with RHadoop
  • Exercise: Building recommendation engine

Text analysis

  • Text analysis steps
  • Collecting raw text
  • Bag of words
  • Term Frequency–Inverse Document Frequency (TF-IDF)
  • Determining Sentiments
  • Exercises and Case study
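
As a rough illustration of the bag-of-words and TF-IDF items above, the following Python sketch weights each term by its relative frequency in a document multiplied by the natural log of the inverse document frequency. This is one of several common TF-IDF conventions. The function name `tf_idf` and the whitespace tokenization are assumptions made for this example, not the course's R workflow.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    # Bag of words: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document receives a weight of zero, which is how TF-IDF suppresses uninformative words without an explicit stop-word list.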