Moving Data from MySQL to Hadoop with Sqoop Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • An understanding of big data concepts (HDFS, Hive, etc.)
  • An understanding of relational databases (MySQL, etc.)
  • Experience with the Linux command line

Overview

Sqoop is an open source software tool for transferring data between Hadoop and relational databases or mainframes. It can be used to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS). The data can then be transformed in Hadoop MapReduce and exported back into an RDBMS.
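The import, transform, and export cycle described above maps directly onto Sqoop's command-line tools. As a minimal sketch (the connection string, credentials, table names, and HDFS paths below are hypothetical placeholders, not values from this course):

```shell
# Import a MySQL table into HDFS (hypothetical connection details)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.pw \
  --table orders \
  --target-dir /data/orders

# After transforming the data in Hadoop, export the results back to MySQL
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.pw \
  --table orders_summary \
  --export-dir /data/orders_summary
```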

In this instructor-led, live training, participants will learn how to use Sqoop to import data from a traditional relational database into Hadoop storage such as HDFS or Hive, and vice versa.

By the end of this training, participants will be able to:

  • Install and configure Sqoop
  • Import data from MySQL to HDFS and Hive
  • Export data from HDFS and Hive to MySQL

Audience

  • System administrators
  • Data engineers

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Note

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Moving data from legacy data stores to Hadoop

Installing and Configuring Sqoop

Overview of Sqoop Features and Architecture

Importing Data from MySQL to HDFS

Importing Data from MySQL to Hive

Transforming Data in Hadoop

Exporting Data from HDFS to MySQL

Exporting Data from Hive to MySQL

Importing Incrementally with Sqoop Jobs

Troubleshooting

Summary and Conclusion

Snorkel: Rapidly Process Training Data Training Course

Duration

7 hours (usually 1 day including breaks)

Requirements

  • An understanding of machine learning

Overview

Snorkel is a system for rapidly creating, modeling, and managing training data. It focuses on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

In this instructor-led, live training, participants will learn techniques for extracting value from unstructured data such as text, tables, figures, and images through modeling of training data with Snorkel.

By the end of this training, participants will be able to:

  • Programmatically create and label massive training sets
  • Train high-quality end models by first modeling noisy training sets
  • Use Snorkel to implement weak supervision techniques and apply data programming to weakly-supervised machine learning systems
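The weak supervision idea above can be illustrated in a few lines of plain Python: several noisy "labeling functions" each vote on an example, and their votes are combined into a single training label. This is a pure-Python sketch of the concept only; the actual Snorkel library provides decorators and a learned label model rather than the hypothetical majority vote shown here.

```python
# Weak supervision sketch: labeling functions vote, votes are combined.
# Pure Python illustration of the idea Snorkel generalizes; the real
# Snorkel API (labeling_function, LabelModel) differs.

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    """Messages with links are likely spam."""
    return SPAM if "http://" in text else ABSTAIN

def lf_short_message(text):
    """Very short messages are likely legitimate."""
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_money_words(text):
    """Prize-related vocabulary suggests spam."""
    return SPAM if "winner" in text.lower() or "free" in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_short_message, lf_money_words]

def majority_vote(text):
    """Combine noisy labeling-function votes into one training label.
    Snorkel instead learns each function's accuracy with a label model."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

labels = [majority_vote(t) for t in [
    "Congratulations, you are a WINNER! Claim at http://spam.example",
    "see you soon",
]]
print(labels)  # → [1, 0]: first message labeled SPAM, second HAM
```

The resulting (noisy) labels can then be used to train an end model, which is the workflow the course develops with Snorkel itself.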

Audience

  • Developers
  • Data scientists

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

To request a customized course outline for this training, please contact us.

IBM Cloud Pak for Data Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • Experience with data processing and AI concepts

Audience

  • Data Scientists
  • Business Analysts
  • Data Engineers
  • Developers
  • System Administrators

Overview

IBM Cloud Pak for Data is a multi-cloud software platform for collecting, organizing and analyzing data for use in AI.

This instructor-led, live training (online or onsite) is aimed at data scientists who wish to use IBM Cloud Pak to prepare data for use in AI solutions.

By the end of this training, participants will be able to:

  • Install and configure Cloud Pak for Data.
  • Unify the collection, organization and analysis of data.
  • Integrate Cloud Pak for Data with a variety of services to solve common business problems.
  • Implement workflows for collaborating with team members on the development of an AI solution.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

Overview of Cloud Pak for Data Features and Architecture

  • Red Hat OpenShift Container Platform
  • Containers, Kubernetes, and Helm
  • Red Hat OpenShift security

Setting up Cloud Pak for Data

  • Pre-installation tasks
  • Installation
  • Post-installation tasks

Setting up Workflows

  • Setting up roles and permissions for collaboration
  • Creating a workflow
  • Searching and requesting data

Collecting Data

  • Connecting to a data source
  • Adding data to a project

Organizing Data

  • Working with catalogs
  • Curating catalog data
  • Governing data to comply with regulations
  • Automating the discovery process

Preparing Data

  • Transforming data
  • Refining data
  • Virtualizing data

Analyzing Data

  • Analyzing data using notebooks
  • Analyzing data using other tools
  • Analyzing data automatically using AutoAI

Implementing an AI Solution

  • Building a machine learning model
  • Deploying the model
  • Validating the model
  • Monitoring the model

Integrating Cloud Pak for Data with Other Services

  • Finding services in a catalog
  • Finding services outside a catalog
  • Integrating IBM Cloud Pak for Data with other applications

Administering Cloud Pak for Data

  • Managing an IBM Cloud Pak for Data cluster
  • Managing an IBM Cloud Pak for Data web client
  • Uninstalling Cloud Pak for Data

Troubleshooting

Summary and Conclusion

Introduction to Data Science and AI using Python Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

None

Overview

This is a 5-day introduction to Data Science and Artificial Intelligence (AI).

The course is delivered with examples and exercises using Python.

Course Outline

Introduction to Data Science/AI

  • Knowledge acquisition through data
  • Knowledge representation
  • Value creation
  • Data Science overview
  • AI ecosystem and new approach to analytics
  • Key technologies

Data Science workflow

  • CRISP-DM (Cross-Industry Standard Process for Data Mining)
  • Data preparation
  • Model planning
  • Model building
  • Communication
  • Deployment

Data Science technologies

  • Languages used for prototyping
  • Big Data technologies
  • End-to-end solutions to common problems
  • Introduction to Python language
  • Integrating Python with Spark

AI in Business

  • AI ecosystem
  • Ethics of AI
  • How to drive AI in business

Data sources

  • Types of data
  • SQL vs NoSQL
  • Data Storage
  • Data preparation

Data Analysis – Statistical approach

  • Probability
  • Statistics
  • Statistical modeling
  • Applications in business using Python
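The statistical summaries listed above can be computed with nothing more than Python's standard library. A small illustration using the `statistics` module (the sales figures are made-up example data):

```python
# Basic descriptive statistics with Python's stdlib `statistics` module.
import statistics

daily_sales = [120, 135, 128, 150, 110, 142, 138]  # hypothetical data

mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)  # sample standard deviation

# z-score: how many standard deviations a value sits from the mean
z = (150 - mean) / stdev

print(round(mean, 2), round(stdev, 2), round(z, 2))
```

In practice the course would build on this with libraries such as NumPy and pandas, but the underlying statistical reasoning is the same.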

Machine learning in business

  • Supervised vs unsupervised
  • Forecasting problems
  • Classification problems
  • Clustering problems
  • Anomaly detection
  • Recommendation engines
  • Association pattern mining
  • Solving ML problems with Python language
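To make the supervised-classification idea above concrete, here is a stdlib-only sketch of a 1-nearest-neighbour classifier: a new point receives the label of the closest training example. Course exercises would typically use scikit-learn; the data below is invented for illustration.

```python
# Minimal supervised classification: 1-nearest-neighbour in pure Python.
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train, point):
    """Label a point with the class of its nearest training example."""
    nearest = min(train, key=lambda ex: euclidean(ex[0], point))
    return nearest[1]

# (features, label) pairs: hypothetical two-feature examples
train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
         ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]

print(predict_1nn(train, (1.1, 0.9)))  # → low
print(predict_1nn(train, (8.5, 9.2)))  # → high
```

The same fit-then-predict pattern carries over to the forecasting, clustering, and recommendation problems listed above, with the distance function and model swapped out.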

Deep learning

  • Problems where traditional ML algorithms fail
  • Solving complicated problems with Deep Learning
  • Introduction to TensorFlow

Natural Language Processing

Data visualization

  • Visual reporting of outcomes from modeling
  • Common pitfalls in visualization
  • Data visualization with Python

From Data to Decision – communication

  • Making an impact: data-driven storytelling
  • Influence effectiveness
  • Managing Data Science projects