A Practical Introduction to Data Science Training Course

Introduction

  • The Data Science Process
  • Roles and responsibilities of a Data Scientist

Preparing the Development Environment

  • Libraries, frameworks, languages and tools
  • Local development
  • Collaborative web-based development

Data Collection

  • Different Types of Data
    • Structured
      • Local databases
      • Database connectors
      • Common formats: XLSX, XML, JSON, CSV, …
    • Unstructured
      • Clicks, sensors, smartphones
      • APIs
      • Internet of Things (IoT)
      • Documents, pictures, videos, sounds
  • Case study: Collecting large amounts of unstructured data continuously

Data Storage

  • Relational databases
  • Non-relational databases
  • Hadoop: Distributed File System (HDFS)
  • Spark: Resilient Distributed Dataset (RDD)
  • Cloud storage

Data Preparation

  • Ingestion, selection, cleansing, and transformation
  • Ensuring data quality – correctness, meaningfulness, and security
  • Exception reports

Languages used for Preparation, Processing and Analysis

  • R language
    • Introduction to R
    • Data manipulation, calculation and graphical display
  • Python
    • Introduction to Python
    • Manipulating, processing, cleaning, and crunching data

Data Analytics

  • Exploratory analysis
    • Basic statistics
    • Draft visualizations
    • Understanding the data
  • Causality
  • Features and transformations
  • Machine Learning
    • Supervised vs. unsupervised
    • When to use what model
  • Natural Language Processing (NLP)

Data Visualization

  • Best Practices
  • Selecting the right chart for the right data
  • Color palettes
  • Taking it to the next level
    • Dashboards
    • Interactive Visualizations
  • Storytelling with data

Which data storage to choose – from flat files, through SQL, NoSQL to massive distributed systems Training Course

Duration

7 hours (usually 1 day including breaks)

Requirements

No technical background is required, but following the examples requires some familiarity with database concepts (e.g. SQL).

Overview

This course helps customers choose the right data storage for their needs. It covers most modern approaches.

Course Outline

  1. File Document Storage (Cloud Storage)
    1. Features (OCR, Scalability, Search, etc.)
    2. Open Source examples (e.g. Nextcloud)
    3. Some commercial examples
  2. Flat file storage
    1. XML databases
    2. CSV databases
  3. Relational databases
    1. Normalization
    2. Dependencies and Constraints
    3. Scalability – replication, clusters
    4. Open Source and commercial software (MySQL, PostgreSQL, DM7, Oracle, etc.)
  4. NoSQL Storage
    1. Document Oriented Databases (MongoDB, CouchDB, etc.)
    2. Column Orientation (Cassandra, Scylla, etc.)
    3. Search Orientation (Elasticsearch…)
  5. NewSQL
    1. CAP Theorem
    2. Open Source software (SequoiaDB, etc.)
  6. Search Engines
    1. Features (text processing, relevancy, etc.)
    2. Open Source examples
    3. Scalability, High Availability, Load Balancing, etc.
  7. Traditional Data Warehouses
    1. Business Intelligence, OLTP and Data Warehouse
    2. Open Source and commercial solutions
  8. MapReduce and Distributed Parallel Processing
    1. Hadoop-like (Hive, HDFS, Impala)
  9. Distributed filesystem
    1. Overview of Open Source solutions (Ceph, etc.)
  10. In-memory Databases
    1. Open Source solutions (e.g. Apache Ignite)
  11. Others
    1. Hypertable (Google Bigtable)
    2. BigQuery
    3. AWS solutions (S3, etc.)
  12. Beyond the present – future trends