Duration
14 hours (usually 2 days including breaks)
Requirements
- An understanding of big data concepts (HDFS, Hive, etc.)
- An understanding of relational databases (MySQL, etc.)
- Experience with the Linux command line
Overview
Sqoop is an open-source software tool for transferring data between Hadoop and relational databases or mainframes. It can be used to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS). The data can then be transformed in Hadoop MapReduce and exported back into an RDBMS.
In this instructor-led, live training, participants will learn how to use Sqoop to import data from a traditional relational database into Hadoop storage such as HDFS or Hive, and vice versa.
By the end of this training, participants will be able to:
- Install and configure Sqoop
- Import data from MySQL to HDFS and Hive
- Export data from HDFS and Hive to MySQL
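As a taste of the hands-on exercises, the sketch below assembles a typical `sqoop import` command line from Python. The JDBC URL, credentials, table name, and target directory are hypothetical placeholders, not values from the course.

```python
# Sketch: assemble a basic MySQL-to-HDFS "sqoop import" command.
# All connection details below are hypothetical placeholders.
def build_sqoop_import(jdbc_url, user, table, target_dir):
    """Return the argv list for a minimal MySQL-to-HDFS import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC connection string
        "--username", user,          # database user
        "--table", table,            # source table to import
        "--target-dir", target_dir,  # destination directory in HDFS
        "--num-mappers", "4",        # parallel map tasks
    ]

cmd = build_sqoop_import(
    "jdbc:mysql://dbhost:3306/sales", "etl_user",
    "orders", "/user/etl/orders")
print(" ".join(cmd))
```

In the course's lab environment the same command is run directly from the Linux shell; building it programmatically as above is only a convenient way to show the moving parts.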
Audience
- System administrators
- Data engineers
Format of the Course
- Part lecture, part discussion, exercises and heavy hands-on practice
Note
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
- Moving data from legacy data stores to Hadoop
Installing and Configuring Sqoop
Overview of Sqoop Features and Architecture
Importing Data from MySQL to HDFS
Importing Data from MySQL to Hive
Transforming Data in Hadoop
Exporting Data from HDFS to MySQL
Exporting Data from Hive to MySQL
Importing Incrementally with Sqoop Jobs
Troubleshooting
Summary and Conclusion
Duration
7 hours (usually 1 day including breaks)
Requirements
- An understanding of machine learning
Overview
Snorkel is a system for rapidly creating, modeling, and managing training data. It focuses on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.
In this instructor-led, live training, participants will learn techniques for extracting value from unstructured data such as text, tables, figures, and images through modeling of training data with Snorkel.
By the end of this training, participants will be able to:
- Programmatically create and label massive training sets
- Train high-quality end models by first modeling noisy training sets
- Use Snorkel to implement weak supervision techniques and apply data programming to weakly-supervised machine learning systems
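The data-programming idea behind these objectives can be sketched without the Snorkel library itself: several noisy, hand-written labeling functions each vote on an example, and the votes are combined into a single training label. The rules and documents below are invented for illustration, and the majority vote stands in for the learned label model that Snorkel actually uses.

```python
# Sketch of weak supervision / data programming, mimicking
# Snorkel-style labeling functions in plain Python.
# Labels: 1 = SPAM, 0 = HAM, -1 = ABSTAIN (no opinion).
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):      # hypothetical rule
    return SPAM if "http://" in text else ABSTAIN

def lf_contains_offer(text):     # hypothetical rule
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_short_greeting(text):     # hypothetical rule
    return HAM if len(text) < 20 else ABSTAIN

LFS = [lf_contains_link, lf_contains_offer, lf_short_greeting]

def majority_label(text):
    """Combine the noisy votes; Snorkel learns LF accuracies instead."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

docs = ["Claim your FREE OFFER at http://spam.example",
        "hi, lunch today?"]
labels = [majority_label(d) for d in docs]
print(labels)  # → [1, 0]
```

Snorkel's contribution is to replace the naive majority vote with a generative model that estimates each labeling function's accuracy and correlations, producing probabilistic labels for training an end model.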
Audience
- Developers
- Data scientists
Format of the Course
- Part lecture, part discussion, exercises and heavy hands-on practice
Course Outline
To request a customized course outline for this training, please contact us.
Duration
14 hours (usually 2 days including breaks)
Requirements
- Experience with data processing and AI concepts
Audience
- Data Scientists
- Business Analysts
- Data Engineers
- Developers
- System Administrators
Overview
IBM Cloud Pak for Data is a multi-cloud software platform for collecting, organizing and analyzing data for use in AI.
This instructor-led, live training (online or onsite) is aimed at data scientists who wish to use IBM Cloud Pak to prepare data for use in AI solutions.
By the end of this training, participants will be able to:
- Install and configure Cloud Pak for Data.
- Unify the collection, organization and analysis of data.
- Integrate Cloud Pak for Data with a variety of services to solve common business problems.
- Implement workflows for collaborating with team members on the development of an AI solution.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
Overview of Cloud Pak for Data Features and Architecture
- Red Hat OpenShift Container Platform
- Containers, Kubernetes, and Helm
- Red Hat OpenShift security
Setting up Cloud Pak for Data
- Pre-installation tasks
- Installation
- Post-installation tasks
Setting up Workflows
- Setting up roles and permissions for collaboration
- Creating a workflow
- Searching and requesting data
Collecting Data
- Connecting to a data source
- Adding data to a project
Organizing Data
- Working with catalogs
- Curating catalog data
- Governing data to comply with regulations
- Automating the discovery process
Preparing Data
- Transforming data
- Refining data
- Virtualizing data
Analyzing Data
- Analyzing data using notebooks
- Analyzing data using other tools
- Analyzing data automatically using AutoAI
Implementing an AI Solution
- Building a machine learning model
- Deploying the model
- Validating the model
- Monitoring the model
Integrating Cloud Pak for Data with Other Services
- Finding services in a catalog
- Finding services outside a catalog
- Integrating IBM Cloud Pak for Data with other applications
Administering Cloud Pak for Data
- Managing an IBM Cloud Pak for Data cluster
- Managing an IBM Cloud Pak for Data web client
- Uninstalling Cloud Pak for Data
Troubleshooting
Summary and Conclusion
Duration
35 hours (usually 5 days including breaks)
Requirements
None
Overview
This is a 5-day introduction to Data Science and Artificial Intelligence (AI).
The course is delivered with examples and exercises using Python.
Course Outline
Introduction to Data Science/AI
- Knowledge acquisition through data
- Knowledge representation
- Value creation
- Data Science overview
- AI ecosystem and new approach to analytics
- Key technologies
Data Science workflow
- CRISP-DM
- Data preparation
- Model planning
- Model building
- Communication
- Deployment
Data Science technologies
- Languages used for prototyping
- Big Data technologies
- End-to-end solutions to common problems
- Introduction to Python language
- Integrating Python with Spark
AI in Business
- AI ecosystem
- Ethics of AI
- How to drive AI in business
Data sources
- Types of data
- SQL vs NoSQL
- Data Storage
- Data preparation
Data Analysis – Statistical approach
- Probability
- Statistics
- Statistical modeling
- Applications in business using Python
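A flavor of the statistics exercises above, using only the Python standard library. The daily revenue figures are made up for illustration, and the normal model is an assumption, not a claim about any real data.

```python
# Sketch: descriptive statistics and a probability estimate
# with the Python standard library. The daily revenue figures
# below are invented illustration data.
from statistics import mean, stdev, NormalDist

revenue = [120, 135, 128, 150, 141, 132, 138]

mu = mean(revenue)
sigma = stdev(revenue)  # sample standard deviation

# Modeling revenue as approximately normal, estimate the
# probability that a day's revenue exceeds 150.
p_over_150 = 1 - NormalDist(mu, sigma).cdf(150)

print(round(mu, 2), round(sigma, 2), round(p_over_150, 3))
```

In the course the same ideas are applied to business datasets with richer tooling, but the workflow is identical: summarize the sample, fit a distributional model, then turn a business question into a probability statement.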
Machine learning in business
- Supervised vs unsupervised
- Forecasting problems
- Classification problems
- Clustering problems
- Anomaly detection
- Recommendation engines
- Association pattern mining
- Solving ML problems with Python language
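As a minimal taste of a supervised classification exercise, the sketch below uses a nearest-centroid rule in plain Python. The two-feature customer data (monthly spend, visits per month) and the "loyal"/"churn" labels are invented toy data; the course itself works with fuller Python ML tooling.

```python
# Sketch: supervised classification with a nearest-centroid rule.
# Toy data: (monthly_spend, visits_per_month) -> "loyal" or "churn".
from math import dist

train = [((200, 8), "loyal"), ((220, 9), "loyal"),
         ((40, 1), "churn"),  ((55, 2), "churn")]

def centroid(points):
    """Mean point of a list of 2-D feature vectors."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

centroids = {}
for label in {"loyal", "churn"}:
    pts = [x for x, y in train if y == label]
    centroids[label] = centroid(pts)

def predict(x):
    """Assign the class whose centroid is nearest to x."""
    return min(centroids, key=lambda lab: dist(x, centroids[lab]))

print(predict((210, 7)), predict((50, 1)))  # → loyal churn
```

The same supervised pattern — fit a model on labeled examples, then predict labels for new ones — carries over directly to the forecasting, anomaly-detection, and recommendation problems listed above, only with different models.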
Deep learning
- Problems where traditional ML algorithms fail
- Solving complicated problems with Deep Learning
- Introduction to TensorFlow
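A classic example of a problem where a single linear model fails is XOR, which is not linearly separable, yet a two-layer network computes it easily. The network below is hand-wired rather than trained, so it is only a conceptual sketch of what frameworks like TensorFlow automate by learning the weights from data.

```python
# Sketch: XOR is not linearly separable, so no single linear
# threshold unit can compute it, but a two-layer network can.
# Weights are set by hand here; deep learning frameworks such
# as TensorFlow learn such weights automatically from data.
def step(z):
    """Hard threshold activation."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit computing OR
    h_and = step(x1 + x2 - 1.5)      # hidden unit computing AND
    return step(h_or - h_and - 0.5)  # OR AND NOT AND == XOR

print([xor_net(a, b) for a in (0, 1) for b in (0, 1)])  # → [0, 1, 1, 0]
```

The point carried into the TensorFlow sessions is that stacking nonlinear units creates representations (here, OR and AND) that make previously inseparable problems separable.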
Natural Language processing
Data visualization
- Visual reporting outcomes from modeling
- Common pitfalls in visualization
- Data visualization with Python
From Data to Decision – communication
- Making an impact: data-driven storytelling
- Influencing effectively
- Managing Data Science projects