Big Data & Database Systems Fundamentals Training Course

Duration

14 hours (usually 2 days including breaks)

Overview

The course is part of the Data Scientist skill set (Domain: Data and Technology).

Course Outline

Data Warehousing Concepts

  • What is a Data Warehouse?
  • Differences between OLTP and Data Warehousing
  • Data Acquisition
  • Data Extraction
  • Data Transformation
  • Data Loading
  • Data Marts
  • Dependent vs Independent Data Marts
  • Database Design

ETL Testing Concepts

  • Introduction
  • Software development life cycle
  • Testing methodologies
  • ETL Testing Work Flow Process
  • ETL Testing Responsibilities in DataStage

Big Data Fundamentals

  • Big Data and its role in the corporate world
  • The phases of development of a Big Data strategy within a corporation
  • The rationale underlying a holistic approach to Big Data
  • Components needed in a Big Data Platform
  • Big Data storage solutions
  • Limits of Traditional Technologies
  • Overview of database types

NoSQL Databases

Hadoop

Map Reduce

Apache Spark

Big Data Storage Solution – NoSQL Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

Good understanding of traditional technologies for data storage (MySQL, Oracle, SQL Server, etc.)

Overview

When traditional storage technologies cannot handle the amount of data you need to store, there are hundreds of alternatives. This course guides participants through the alternatives for storing and analyzing Big Data and their pros and cons.

This course is mostly focused on discussion and presentation of solutions, though hands-on exercises are available on demand.

Course Outline

Limits of Traditional Technologies

  • SQL databases
  • Redundancy: replicas and clusters
  • Constraints
  • Speed

Overview of database types

  • Object Databases
  • Document Store
  • Cloud Databases
  • Wide Column Store
  • Multidimensional Databases
  • Multivalue Databases
  • Streaming and Time Series Databases
  • Multimodel Databases
  • Graph Databases
  • Key-Value Stores
  • XML Databases
  • Distributed file systems

Popular NoSQL Databases

  • MongoDB
  • Cassandra
  • Apache Hadoop
  • Apache Spark
  • other solutions

NewSQL

  • Overview of available solutions
  • Performance
  • Inconsistencies

Document Storage/Search Optimized

  • Solr/Lucene/Elasticsearch
  • other solutions

Programming with Big Data in R Training Course

Duration

21 hours (usually 3 days including breaks)

Overview

Big Data is a term that refers to solutions designed for storing and processing large data sets. Developed initially by Google, these Big Data solutions have evolved and inspired other similar projects, many of which are available as open source. R is a popular programming language in the financial industry.

Course Outline

Introduction to Programming Big Data with R (pbdR)

  • Setting up your environment to use pbdR
  • Scope and tools available in pbdR
  • Packages commonly used with Big Data alongside pbdR

Message Passing Interface (MPI)

  • Using pbdR MPI
  • Parallel processing
  • Point-to-point communication
  • Send Matrices
  • Summing Matrices
  • Collective communication
  • Summing Matrices with Reduce
  • Scatter / Gather
  • Other MPI communications

Distributed Matrices

  • Creating a distributed diagonal matrix
  • SVD of a distributed matrix
  • Building a distributed matrix in parallel

Statistics Applications

  • Monte Carlo Integration
  • Reading Datasets
  • Reading on all processes
  • Broadcasting from one process
  • Reading partitioned data
  • Distributed Regression
  • Distributed Bootstrap

Big Data Architect Training Course

Duration

35 hours (usually 5 days including breaks)

Overview

Day 1 – provides a high-level overview of essential Big Data topic areas. The module is divided into a series of sections, each of which is accompanied by a hands-on exercise.

Day 2 – explores a range of topics that relate analysis practices and tools for Big Data environments. It does not get into implementation or programming details, but instead keeps coverage at a conceptual level, focusing on topics that enable participants to develop a comprehensive understanding of the common analysis functions and features offered by Big Data solutions.

Day 3 – provides an overview of the fundamental and essential topic areas relating to Big Data solution platform architecture. It covers Big Data mechanisms required for the development of a Big Data solution platform and architectural options for assembling a data processing platform. Common scenarios are also presented to provide a basic understanding of how a Big Data solution platform is generally used. 

Day 4 – builds upon Day 3 by exploring advanced topics relating to Big Data solution platform architecture. In particular, different architectural layers that make up the Big Data solution platform are introduced and discussed, including data sources, data ingress, data storage, data processing and security.

Day 5 – covers a number of exercises and problems designed to test the delegates' ability to apply knowledge of the topics covered in Days 3 and 4.

Course Outline

Day 1 – Fundamental Big Data

  • Understanding Big Data
  • Fundamental Terminology & Concepts
  • Big Data Business & Technology Drivers
  • Traditional Enterprise Technologies Related to Big Data
  • Characteristics of Data in Big Data Environments
  • Dataset Types in Big Data Environments
  • Fundamental Analysis and Analytics
  • Machine Learning Types
  • Business Intelligence & Big Data
  • Data Visualization & Big Data
  • Big Data Adoption & Planning Considerations

Day 2 – Big Data Analysis & Technology Concepts

  • Big Data Analysis Lifecycle (from business case evaluation to data analysis and visualization)
  • A/B Testing, Correlation
  • Regression, Heat Maps
  • Time Series Analysis
  • Network Analysis
  • Spatial Data Analysis
  • Classification, Clustering
  • Outlier Detection
  • Filtering (including collaborative filtering & content-based filtering)
  • Natural Language Processing
  • Sentiment Analysis, Text Analytics
  • File Systems & Distributed File Systems, NoSQL
  • Distributed & Parallel Data Processing,
  • Processing Workloads, Clusters
  • Cloud Computing & Big Data
  • Foundational Big Data Technology Mechanisms

Day 3 – Fundamental Big Data Architecture

  • New Big Data Mechanisms, including …
    • Security Engine
    • Cluster Manager 
    • Data Governance Manager
    • Visualization Engine
    • Productivity Portal
  • Data Processing Architectural Models, including …
    • Shared-Everything and Shared-Nothing Architectures
  • Enterprise Data Warehouse and Big Data Integration Approaches, including …
    • Series
    • Parallel
    • Big Data Appliance
    • Data Virtualization
  • Architectural Big Data Environments, including …
    • ETL 
    • Analytics Engine
    • Application Enrichment
  • Cloud Computing & Big Data Architectural Considerations, including …
    • how Cloud Delivery and Deployment Models can be used to host and process Big Data Solutions

Day 4 – Advanced Big Data Architecture

  • Big Data Solution Architectural Layers including …
    • Data Sources,
    • Data Ingress and Storage,
    • Event Stream Processing and Complex Event Processing,
    • Egress,
    • Visualization and Utilization,
    • Big Data Architecture and Security,
    • Maintenance and Governance
  • Big Data Solution Design Patterns, including …
    • Patterns pertaining to Data Ingress,
    • Data Wrangling,
    • Data Storage,
    • Data Processing,
    • Data Analysis,
    • Data Egress,
    • Data Visualization
  • Big Data Architectural Compound Patterns

Day 5 – Big Data Architecture Lab

  • Incorporates a set of detailed exercises that require delegates to solve various inter-related problems, with the goal of fostering a comprehensive understanding of how different data architecture technologies, mechanisms and techniques can be applied to solve problems in Big Data environments.

From Data to Decision with Big Data and Predictive Analytics Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

Understanding of traditional data management and analysis methods like SQL, data warehouses, business intelligence, OLAP, etc. Understanding of basic statistics and probability (mean, variance, probability, conditional probability, etc.).

Overview

Audience

If you try to make sense of the data you have access to, or want to analyse unstructured data available on the net (like Twitter, LinkedIn, etc.), this course is for you.

It is mostly aimed at decision makers and people who need to choose what data is worth collecting and what is worth analyzing.

It is not aimed at people configuring the solution, though those people will also benefit from the big picture.

Delivery Mode

During the course delegates will be presented with working examples of mostly open source technologies.

Short lectures will be followed by presentations and simple exercises for the participants.

Content and Software used

All software used is updated each time the course is run, so we use the newest versions available.

The course covers the process of obtaining, formatting, processing and analysing data, and explains how to automate the decision-making process with machine learning.

Course Outline

Quick Overview

  • Data Sources
  • Mining Data
  • Recommender systems
  • Target Marketing

Datatypes

  • Structured vs unstructured
  • Static vs streamed
  • Attitudinal, behavioural and demographic data
  • Data-driven vs user-driven analytics
  • Data validity
  • Volume, velocity and variety of data

Models

  • Building models
  • Statistical Models
  • Machine learning

Data Classification

  • Clustering
  • k-groups, k-means, k-nearest neighbours
  • Ant colonies, birds flocking

Predictive Models

  • Decision trees
  • Support vector machine
  • Naive Bayes classification
  • Neural networks
  • Markov Model
  • Regression
  • Ensemble methods

ROI

  • Benefit/Cost ratio
  • Cost of software
  • Cost of development
  • Potential benefits

Building Models

  • Data Preparation (MapReduce)
  • Data cleansing
  • Choosing methods
  • Developing the model
  • Testing the model
  • Model evaluation
  • Model deployment and integration (a minimal workflow sketch follows this list)
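
The sketch below walks through this workflow end to end in Python. It is illustrative only: the course does not prescribe a particular library, so pandas and scikit-learn are assumed, and the file name and column names are made up.

```python
# Minimal model-building workflow sketch (hypothetical CSV and column names).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data preparation and cleansing: load the data and drop incomplete rows.
df = pd.read_csv("customers.csv").dropna()

X = df[["age", "income", "visits"]]   # predictor columns (hypothetical)
y = df["churned"]                     # target column (hypothetical)

# Hold out a test set so the model can be evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Choosing a method and developing the model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Testing and evaluating the model before deployment.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```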

Overview of Open Source and commercial software

  • Selection of R packages
  • Python libraries
  • Hadoop and Mahout
  • Selected Apache projects related to Big Data and Analytics
  • Selected commercial solutions
  • Integration with existing software and data sources

Big Data Business Intelligence for Govt. Agencies Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

  • Basic knowledge of business operations and data systems in Govt. in their domain
  • Basic understanding of SQL/Oracle or relational database
  • Basic understanding of Statistics (at Spreadsheet level) 

Overview

Advances in technologies and the increasing amount of information are transforming how business is conducted in many industries, including government. Government data generation and digital archiving rates are on the rise due to the rapid growth of mobile devices and applications, smart sensors and devices, cloud computing solutions, and citizen-facing portals. As digital information expands and becomes more complex, information management, processing, storage, security, and disposition become more complex as well. New capture, search, discovery, and analysis tools are helping organizations gain insights from their unstructured data. The government market is at a tipping point, realizing that information is a strategic asset, and government needs to protect, leverage, and analyze both structured and unstructured information to better serve and meet mission requirements. As government leaders strive to evolve data-driven organizations to successfully accomplish mission, they are laying the groundwork to correlate dependencies across events, people, processes, and information.

High-value government solutions will be created from a mashup of the most disruptive technologies:

  • Mobile devices and applications
  • Cloud services
  • Social business technologies and networking
  • Big Data and analytics

IDC predicts that by 2020, the IT industry will reach $5 trillion, approximately $1.7 trillion larger than today, and that 80% of the industry’s growth will be driven by these 3rd Platform technologies. In the long term, these technologies will be key tools for dealing with the complexity of increased digital information. Big Data is one of the intelligent industry solutions and allows government to make better decisions by taking action based on patterns revealed by analyzing large volumes of data — related and unrelated, structured and unstructured.

But accomplishing these feats takes far more than simply accumulating massive quantities of data. “Making sense of these volumes of Big Data requires cutting-edge tools and technologies that can analyze and extract useful knowledge from vast and diverse streams of information,” Tom Kalil and Fen Zhao of the White House Office of Science and Technology Policy wrote in a post on the OSTP Blog.

The White House took a step toward helping agencies find these technologies when it established the National Big Data Research and Development Initiative in 2012. The initiative included more than $200 million to make the most of the explosion of Big Data and the tools needed to analyze it.

The challenges that Big Data poses are nearly as daunting as its promise is encouraging. Storing data efficiently is one of these challenges. As always, budgets are tight, so agencies must minimize the per-megabyte price of storage and keep the data within easy access so that users can get it when they want it and how they need it. Backing up massive quantities of data heightens the challenge.

Analyzing the data effectively is another major challenge. Many agencies employ commercial tools that enable them to sift through the mountains of data, spotting trends that can help them operate more efficiently. (A recent study by MeriTalk found that federal IT executives think Big Data could help agencies save more than $500 billion while also fulfilling mission objectives.)

Custom-developed Big Data tools also are allowing agencies to address the need to analyze their data. For example, the Oak Ridge National Laboratory’s Computational Data Analytics Group has made its Piranha data analytics system available to other agencies. The system has helped medical researchers find a link that can alert doctors to aortic aneurysms before they strike. It’s also used for more mundane tasks, such as sifting through résumés to connect job candidates with hiring managers.

Course Outline

Each session is 2 hours

Day-1: Session-1: Business Overview of Why Big Data Business Intelligence is Needed in Govt.

  • Case Studies from NIH, DoE
  • Big Data adoption rate in Govt. Agencies and how they are aligning their future operations around Big Data Predictive Analytics
  • Broad Scale Application Area in DoD, NSA, IRS, USDA etc.
  • Interfacing Big Data with Legacy data
  • Basic understanding of enabling technologies in predictive analytics
  • Data Integration & Dashboard visualization
  • Fraud management
  • Business Rule / Fraud detection generation
  • Threat detection and profiling
  • Cost benefit analysis for Big Data implementation

Day-1: Session-2: Introduction to Big Data-1

  • Main characteristics of Big Data – volume, variety, velocity and veracity. MPP architecture for volume.
  • Data Warehouses – static schema, slowly evolving dataset
  • MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica etc.
  • Hadoop Based Solutions – no conditions on structure of dataset.
  • Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS
  • Batch – suited for analytical/non-interactive workloads
  • Velocity: CEP streaming data
  • Typical choices – CEP products (e.g. Infostreams, Apama, MarkLogic etc)
  • Less production ready – Storm/S4
  • NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database

Day-1: Session-3: Introduction to Big Data-2

NoSQL solutions

  • KV Store – Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store – Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
  • KV Store (Hierarchical) – GT.M, Caché
  • KV Store (Ordered) – TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache – Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
  • Tuple Store – Gigaspaces, Coord, Apache River
  • Object Database – ZopeDB, db4o, Shoal
  • Document Store – CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
  • Wide Columnar Store – BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to Data Cleaning issues in Big Data

  • RDBMS – static structure/schema, doesn’t promote agile, exploratory environment.
  • NoSQL – semi structured, enough structure to store data without exact schema before storing data
  • Data cleaning issues

Day-1: Session-4: Introduction to Big Data-3: Hadoop

  • When to select Hadoop?
  • STRUCTURED – Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
  • SEMI STRUCTURED data – tough to do with traditional solutions (DW/DB)
  • Warehousing data = HUGE effort and static even after implementation
  • For variety & volume of data, crunched on commodity hardware – HADOOP
  • Commodity H/W needed to create a Hadoop Cluster

Introduction to Map Reduce /HDFS

  • MapReduce – distribute computing over multiple servers
  • HDFS – make data available locally for the computing process (with redundancy)
  • Data – can be unstructured/schema-less (unlike RDBMS)
  • Developer responsibility to make sense of data
  • Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS (a word-count sketch in Python follows this list)
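
To make the map and reduce phases concrete, the sketch below simulates a word count locally in Python. This is illustrative only: the outline above refers to programming MapReduce in Java, and on a real cluster the same logic would be submitted to Hadoop (for example via Hadoop Streaming) after loading the input into HDFS.

```python
# Local simulation of the MapReduce word-count pattern (illustrative only).
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group intermediate pairs by key; reduce: sum the counts.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    lines = ["big data needs distributed processing",
             "hadoop distributes processing over many servers"]
    for word, count in reduce_phase(map_phase(lines)):
        print(word, count)
```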

Day-2: Session-1: Big Data Ecosystem – Building Big Data ETL: which Big Data tools to use and when?

  • Hadoop vs. Other NoSQL solutions
  • For interactive, random access to data
  • HBase (column-oriented database) on top of Hadoop
  • Random access to data but restrictions imposed (max 1 PB)
  • Not good for ad-hoc analytics, good for logging, counting, time-series
  • Sqoop – Import from databases to Hive or HDFS (JDBC/ODBC access)
  • Flume – Stream data (e.g. log data) into HDFS

Day-2: Session-2: Big Data Management System

  • Moving parts, compute nodes start/fail: ZooKeeper – for configuration/coordination/naming services
  • Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
  • Deploy, configure, cluster management, upgrade etc (sys admin): Ambari
  • In Cloud: Whirr

Day-2: Session-3: Predictive Analytics in Business Intelligence-1: Fundamental Techniques & Machine Learning based BI

  • Introduction to Machine learning
  • Learning classification techniques
  • Bayesian Prediction – preparing a training file
  • Support Vector Machine
  • KNN p-Tree Algebra & vertical mining
  • Neural Network
  • Big Data large variable problem – Random forest (RF)
  • Big Data Automation problem – Multi-model ensemble RF
  • Automation through Soft10-M
  • Text analytic tool – Treeminer
  • Agile learning
  • Agent based learning
  • Distributed learning
  • Introduction to Open Source Tools for predictive analytics: R, RapidMiner, Mahout

Day-2: Session-4: Predictive Analytics Ecosystem-2: Common predictive analytic problems in Govt.

  • Insight analytics
  • Visualization analytics
  • Structured predictive analytics
  • Unstructured predictive analytics
  • Threat/fraudster/vendor profiling
  • Recommendation Engine
  • Pattern detection
  • Rule/Scenario discovery – failure, fraud, optimization
  • Root cause discovery
  • Sentiment analysis
  • CRM analytics
  • Network analytics
  • Text Analytics
  • Technology assisted review
  • Fraud analytics
  • Real Time Analytics

Day-3: Session-1: Real Time and Scalable Analytics Over Hadoop

  • Why common analytic algorithms fail in Hadoop/HDFS
  • Apache Hama – for Bulk Synchronous distributed computing
  • Apache Spark – for cluster computing and real time analytics
  • CMU Graphics Lab2 – Graph based asynchronous approach to distributed computing
  • KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation

Day-3: Session-2: Tools for eDiscovery and Forensics

  • eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
  • Predictive coding and technology assisted review (TAR)
  • Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
  • Faster indexing through HDFS – velocity of data
  • NLP (Natural Language Processing) – various techniques and open source products
  • eDiscovery in foreign languages – technology for foreign language processing

Day-3: Session-3: Big Data BI for Cyber Security – Understanding the whole 360-degree view, from speedy data collection to threat identification

  • Understanding the basics of security analytics – attack surface, security misconfiguration, host defenses
  • Network infrastructure / Large datapipe / Response ETL for real time analytics
  • Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from metadata

Day-3: Session-4: Big Data in USDA: Applications in Agriculture

  • Introduction to IoT (Internet of Things) for agriculture – sensor-based Big Data and control
  • Introduction to Satellite imaging and its application in agriculture
  • Integrating sensor and image data for soil fertility, cultivation recommendations and forecasting
  • Agriculture insurance and Big Data
  • Crop Loss forecasting

Day-4: Session-1: Fraud prevention BI from Big Data in Govt – Fraud analytics:

  • Basic classification of Fraud analytics – rule based vs predictive analytics
  • Supervised vs unsupervised Machine learning for Fraud pattern detection
  • Vendor fraud/over charging for projects
  • Medicare and Medicaid fraud – fraud detection techniques for claim processing
  • Travel reimbursement frauds
  • IRS refund frauds
  • Case studies and live demo will be given wherever data is available.

Day-4: Session-2: Social Media Analytics – Intelligence gathering and analysis

  • Big Data ETL API for extracting social media data
  • Text, image, meta data and video
  • Sentiment analysis from social media feed
  • Contextual and non-contextual filtering of social media feed
  • Social Media Dashboard to integrate diverse social media
  • Automated profiling of social media profile
  • Live demo of each analytic will be given through Treeminer Tool.

Day-4: Session-3: Big Data Analytics in image processing and video feeds

  • Image Storage techniques in Big Data – Storage solution for data exceeding petabytes
  • LTFS and LTO
  • GPFS-LTFS (layered storage solution for Big image data)
  • Fundamentals of image analytics
  • Object recognition
  • Image segmentation
  • Motion tracking
  • 3-D image reconstruction

Day-4: Session-4: Big Data applications in NIH:

  • Emerging areas of Bio-informatics
  • Meta-genomics and Big Data mining issues
  • Big Data Predictive analytic for Pharmacogenomics, Metabolomics and Proteomics
  • Big Data in downstream Genomics process
  • Application of Big data predictive analytics in Public health

Big Data Dashboard for quick accessibility of diverse data and display:

  • Integration of existing application platform with Big Data Dashboard
  • Big Data management
  • Case Study of Big Data Dashboard: Tableau and Pentaho
  • Use Big Data app to push location based services in Govt.
  • Tracking system and management

Day-5: Session-1: How to justify Big Data BI implementation within an organization:

  • Defining ROI for Big Data implementation
  • Case studies of saving analyst time in the collection and preparation of data – increase in productivity
  • Case studies of revenue gain from saving licensed database costs
  • Revenue gain from location based services
  • Savings from fraud prevention
  • An integrated spreadsheet approach to calculate approximate expenses vs. revenue gains/savings from Big Data implementation.

Day-5: Session-2: Step-by-step procedure to replace a legacy data system with a Big Data system:

  • Understanding practical Big Data Migration Roadmap
  • What information is needed before architecting a Big Data implementation
  • What are the different ways of calculating volume, velocity, variety and veracity of data
  • How to estimate data growth
  • Case studies

Day-5: Session-4: Review of Big Data Vendors and their products. Q/A session:

  • Accenture
  • APTEAN (Formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (Formerly 10Gen)
  • MU Sigma
  • Netapp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • Qliktech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (Part of EMC)

Python and Spark for Big Data (PySpark) Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • General programming skills

Audience

  • Developers
  • IT Professionals
  • Data Scientists

Overview

Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.

In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.

By the end of this training, participants will be able to:

  • Learn how to use Spark with Python to analyze Big Data.
  • Work on exercises that mimic real world cases.
  • Use different tools and techniques for big data analysis using PySpark.

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

Introduction

Understanding Big Data

Overview of Spark

Overview of Python

Overview of PySpark

  • Distributing Data Using Resilient Distributed Datasets Framework
  • Distributing Computation Using Spark API Operators (a minimal RDD sketch follows this list)
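
As a minimal illustration of these two ideas, the following PySpark sketch distributes a small collection as an RDD and computes over it with Spark operators; the data values are made up.

```python
# Minimal RDD sketch: distribute a collection and compute on it with Spark operators.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a small Python collection across the cluster (or local cores).
numbers = sc.parallelize(range(1, 11))

# Transformations are lazy; the action (sum) triggers the computation.
squares = numbers.map(lambda n: n * n)
print("sum of squares:", squares.sum())

spark.stop()
```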

Setting Up Python with Spark

Setting Up PySpark

Using Amazon Web Services (AWS) EC2 Instances for Spark

Setting Up Databricks

Setting Up the AWS EMR Cluster

Learning the Basics of Python Programming

  • Getting Started with Python
  • Using the Jupyter Notebook
  • Using Variables and Simple Data Types
  • Working with Lists
  • Using if Statements
  • Using User Inputs
  • Working with while Loops
  • Implementing Functions
  • Working with Classes
  • Working with Files and Exceptions
  • Working with Projects, Data, and APIs

Learning the Basics of Spark DataFrame

  • Getting Started with Spark DataFrames
  • Implementing Basic Operations with Spark
  • Using Groupby and Aggregate Operations
  • Working with Timestamps and Dates (see the DataFrame sketch after this list)
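
A minimal DataFrame sketch covering these operations is shown below; the sales records and column names are made up for illustration.

```python
# Minimal Spark DataFrame sketch: basic operations, groupBy/agg, and dates.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Made-up sales data.
df = spark.createDataFrame(
    [("2024-01-05", "books", 120.0),
     ("2024-01-17", "books", 80.0),
     ("2024-02-02", "games", 200.0)],
    ["sale_date", "category", "amount"],
)

# Work with dates: parse the string column and extract the month.
df = df.withColumn("sale_date", F.to_date("sale_date")) \
       .withColumn("month", F.month("sale_date"))

# Group by category and aggregate.
df.groupBy("category").agg(F.sum("amount").alias("total"),
                           F.avg("amount").alias("avg")).show()

spark.stop()
```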

Working on a Spark DataFrame Project Exercise

Understanding Machine Learning with MLlib

Working with MLlib, Spark, and Python for Machine Learning

Understanding Regressions

  • Learning Linear Regression Theory
  • Implementing a Regression Evaluation Code (a minimal MLlib sketch follows this list)
  • Working on a Sample Linear Regression Exercise
  • Learning Logistic Regression Theory
  • Implementing a Logistic Regression Code
  • Working on a Sample Logistic Regression Exercise
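
Below is a minimal sketch of fitting and evaluating a linear regression with Spark MLlib; the tiny training set and column names are made up, and a real exercise would evaluate on a held-out set.

```python
# Minimal Spark MLlib linear regression sketch with a regression evaluator.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("regression-demo").getOrCreate()

# Made-up training data.
data = spark.createDataFrame(
    [(1.0, 2.0, 5.1), (2.0, 1.0, 6.9), (3.0, 4.0, 12.8), (4.0, 3.0, 13.2)],
    ["x1", "x2", "label"],
)

# MLlib expects the features packed into a single vector column.
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = features.transform(data)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

# Evaluate on the training data (a held-out set would be used in practice).
predictions = model.transform(train)
rmse = RegressionEvaluator(labelCol="label", metricName="rmse").evaluate(predictions)
print("RMSE:", rmse)

spark.stop()
```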

Understanding Random Forests and Decision Trees

  • Learning Tree Methods Theory
  • Implementing Decision Trees and Random Forest Codes
  • Working on a Sample Random Forest Classification Exercise

Working with K-means Clustering

  • Understanding K-means Clustering Theory
  • Implementing a K-means Clustering Code (see the sketch after this list)
  • Working on a Sample Clustering Exercise
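
A minimal K-means sketch with Spark MLlib is shown below; the 2-D points are made up for illustration.

```python
# Minimal Spark MLlib K-means sketch.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# Made-up 2-D points forming two obvious groups.
points = spark.createDataFrame(
    [(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)],
    ["x", "y"],
)
data = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(points)

# Fit a model with k=2 clusters and inspect the learned centers.
model = KMeans(k=2, seed=1).fit(data)
for center in model.clusterCenters():
    print(center)

# Assign each point to a cluster.
model.transform(data).select("x", "y", "prediction").show()

spark.stop()
```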

Working with Recommender Systems

Implementing Natural Language Processing

  • Understanding Natural Language Processing (NLP)
  • Overview of NLP Tools
  • Working on a Sample NLP Exercise

Streaming with Spark on Python

  • Overview of Streaming with Spark
  • Sample Spark Streaming Exercise (a minimal Structured Streaming sketch follows this list)
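
As a taste of the exercise, here is a minimal Structured Streaming sketch that counts words arriving on a local socket; the host and port are assumptions (for example, text served with "nc -lk 9999"), and the course may equally use the older DStream API.

```python
# Minimal Structured Streaming sketch: word counts over a socket source.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a stream of text lines from a local socket (assumed to be serving text).
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Split lines into words and keep a running count per word.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console until stopped.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```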

Closing Remarks

A Practical Introduction to Data Analysis and Big Data Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

  • A general understanding of math.
  • A general understanding of programming.
  • A general understanding of databases.

Audience

  • Developers / programmers
  • IT consultants

Overview

Participants who complete this instructor-led, live training will gain a practical, real-world understanding of Big Data and its related technologies, methodologies and tools.

Participants will have the opportunity to put this knowledge into practice through hands-on exercises. Group interaction and instructor feedback make up an important component of the class.

The course starts with an introduction to elemental concepts of Big Data, then progresses into the programming languages and methodologies used to perform Data Analysis. Finally, we discuss the tools and infrastructure that enable Big Data storage, Distributed Processing, and Scalability.

Format of the Course

  • Part lecture, part discussion, hands-on practice and implementation, occasional quizzing to measure progress.

Course Outline

Introduction to Data Analysis and Big Data

  • What Makes Big Data “Big”?
    • Velocity, Volume, Variety, Veracity (VVVV)
  • Limits to Traditional Data Processing
  • Distributed Processing
  • Statistical Analysis
  • Types of Machine Learning Analysis
  • Data Visualization

Big Data Roles and Responsibilities

  • Administrators
  • Developers
  • Data Analysts

Languages Used for Data Analysis

  • R Language
    • Why R for Data Analysis?
    • Data manipulation, calculation and graphical display
  • Python
    • Why Python for Data Analysis?
    • Manipulating, processing, cleaning, and crunching data

Approaches to Data Analysis

  • Statistical Analysis
    • Time Series analysis
    • Forecasting with Correlation and Regression models
    • Inferential Statistics (estimating)
    • Descriptive Statistics in Big Data sets (e.g. calculating mean; see the sketch after this list)
  • Machine Learning
    • Supervised vs unsupervised learning
    • Classification and clustering
    • Estimating cost of specific methods
    • Filtering
  • Natural Language Processing
    • Processing text
    • Understanding the meaning of the text
    • Automatic text generation
    • Sentiment analysis / topic analysis
  • Computer Vision
    • Acquiring, processing, analyzing, and understanding images
    • Reconstructing, interpreting and understanding 3D scenes
    • Using image data to make decisions
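
To ground the statistical-analysis topics above, the short sketch below computes descriptive statistics, a correlation, and a simple regression-based forecast in Python; pandas and NumPy are assumed (the course does not mandate them) and the monthly figures are made up.

```python
# Minimal statistical-analysis sketch: descriptive statistics, correlation,
# and a simple regression-based forecast on made-up monthly figures.
import numpy as np
import pandas as pd

sales = pd.Series([110, 125, 130, 142, 155, 160],
                  index=pd.period_range("2024-01", periods=6, freq="M"))
ads = pd.Series([10, 12, 13, 15, 17, 18], index=sales.index)

# Descriptive statistics on a (here, tiny) data set.
print("mean:", sales.mean(), "std:", sales.std())

# Correlation between advertising spend and sales.
print("correlation:", sales.corr(ads))

# Forecast next month's sales with a least-squares linear trend.
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, sales.values, 1)
print("forecast:", slope * len(sales) + intercept)
```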

Big Data Infrastructure

  • Data Storage
    • Relational databases (SQL)
      • MySQL
      • Postgres
      • Oracle
    • Non-relational databases (NoSQL)
      • Cassandra
      • MongoDB
      • Neo4j
    • Understanding the nuances
      • Hierarchical databases
      • Object-oriented databases
      • Document-oriented databases
      • Graph-oriented databases
      • Other
  • Distributed Processing
    • Hadoop
      • HDFS as a distributed filesystem
      • MapReduce for distributed processing
    • Spark
      • All-in-one in-memory cluster computing framework for large-scale data processing
      • Structured streaming
      • Spark SQL
      • Machine Learning libraries: MLlib
      • Graph processing with GraphX
  • Scalability
    • Public cloud
      • AWS, Google, Aliyun, etc.
    • Private cloud
      • OpenStack, Cloud Foundry, etc.
    • Auto-scalability

Choosing the Right Solution for the Problem

The Future of Big Data

Summary and Conclusion

Python, Spark, and Hadoop for Big Data Training Course

Duration

21 hours (usually 3 days including breaks)

Requirements

  • Experience with Spark and Hadoop
  • Python programming experience

Audience

  • Data scientists
  • Developers

Overview

Python is a scalable, flexible, and widely used programming language for data science and machine learning. Spark is a data processing engine used in querying, analyzing, and transforming big data, while Hadoop is a software library framework for large-scale data storage and processing.

This instructor-led, live training (online or onsite) is aimed at developers who wish to use and integrate Spark, Hadoop, and Python to process, analyze, and transform large and complex data sets.

By the end of this training, participants will be able to:

  • Set up the necessary environment to start processing big data with Spark, Hadoop, and Python.
  • Understand the features, core components, and architecture of Spark and Hadoop.
  • Learn how to integrate Spark, Hadoop, and Python for big data processing.
  • Explore the tools in the Spark ecosystem (Spark MLlib, Spark Streaming, Kafka, Sqoop, and Flume).
  • Build collaborative filtering recommendation systems similar to Netflix, YouTube, Amazon, Spotify, and Google.
  • Use Apache Mahout to scale machine learning algorithms.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.

Course Outline

Introduction

  • Overview of Spark and Hadoop features and architecture
  • Understanding big data
  • Python programming basics

Getting Started

  • Setting up Python, Spark, and Hadoop
  • Understanding data structures in Python
  • Understanding PySpark API
  • Understanding HDFS and MapReduce

Integrating Spark and Hadoop with Python

  • Implementing Spark RDD in Python
  • Processing data using MapReduce
  • Creating distributed datasets in HDFS (a minimal sketch follows this list)
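
A minimal sketch of these three steps is shown below: it reads a file from HDFS into an RDD and runs a MapReduce-style word count with PySpark. The HDFS URI and paths are hypothetical and assume a running cluster.

```python
# Minimal sketch: read a file from HDFS into an RDD and run a MapReduce-style
# word count with Spark. The HDFS URI and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()
sc = spark.sparkContext

# Create a distributed dataset from a file stored in HDFS.
lines = sc.textFile("hdfs://namenode:9000/data/logs.txt")

# Map each word to (word, 1), then reduce by key to sum the counts.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Write the results back to HDFS (the output directory must not already exist).
counts.saveAsTextFile("hdfs://namenode:9000/output/wordcounts")

spark.stop()
```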

Machine Learning with Spark MLlib

Processing Big Data with Spark Streaming

Working with Recommender Systems

Working with Kafka, Sqoop, and Flume

Apache Mahout with Spark and Hadoop

Troubleshooting

Summary and Next Steps

Big Data Business Intelligence for Criminal Intelligence Analysis Training Course

Duration

35 hours (usually 5 days including breaks)

Requirements

  • Knowledge of law enforcement processes and data systems
  • Basic understanding of SQL/Oracle or relational database
  • Basic understanding of statistics (at Spreadsheet level)

Overview

Advances in technologies and the increasing amount of information are transforming how law enforcement is conducted. The challenges that Big Data poses are nearly as daunting as Big Data’s promise. Storing data efficiently is one of these challenges; effectively analyzing it is another.

In this instructor-led, live training, participants will learn the mindset with which to approach Big Data technologies, assess their impact on existing processes and policies, and implement these technologies for the purpose of identifying criminal activity and preventing crime. Case studies from law enforcement organizations around the world will be examined to gain insights on their adoption approaches, challenges and results.

By the end of this training, participants will be able to:

  • Combine Big Data technology with traditional data gathering processes to piece together a story during an investigation
  • Implement industrial big data storage and processing solutions for data analysis
  • Prepare a proposal for the adoption of the most adequate tools and processes for enabling a data-driven approach to criminal investigation

Audience

  • Law Enforcement specialists with a technical background

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Course Outline

=====
Day 01
=====
Overview of Big Data Business Intelligence for Criminal Intelligence Analysis

  • Case Studies from Law Enforcement – Predictive Policing
  • Big Data adoption rate in Law Enforcement Agencies and how they are aligning their future operation around Big Data Predictive Analytics
  • Emerging technology solutions such as gunshot sensors, surveillance video and social media
  • Using Big Data technology to mitigate information overload
  • Interfacing Big Data with Legacy data
  • Basic understanding of enabling technologies in predictive analytics
  • Data Integration & Dashboard visualization
  • Fraud management
  • Business Rules and Fraud detection
  • Threat detection and profiling
  • Cost benefit analysis for Big Data implementation

Introduction to Big Data

  • Main characteristics of Big Data — Volume, Variety, Velocity and Veracity.
  • MPP (Massively Parallel Processing) architecture
  • Data Warehouses – static schema, slowly evolving dataset
  • MPP Databases: Greenplum, Exadata, Teradata, Netezza, Vertica etc.
  • Hadoop Based Solutions – no conditions on structure of dataset.
  • Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS
  • Apache Spark for stream processing
  • Batch – suited for analytical/non-interactive workloads
  • Velocity: CEP streaming data
  • Typical choices – CEP products (e.g. Infostreams, Apama, MarkLogic etc)
  • Less production ready – Storm/S4
  • NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database

NoSQL solutions

  • KV Store – Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store – Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
  • KV Store (Hierarchical) – GT.M, Caché
  • KV Store (Ordered) – TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache – Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
  • Tuple Store – Gigaspaces, Coord, Apache River
  • Object Database – ZopeDB, db4o, Shoal
  • Document Store – CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
  • Wide Columnar Store – BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to Data Cleaning issues in Big Data

  • RDBMS – static structure/schema, does not promote agile, exploratory environment.
  • NoSQL – semi structured, enough structure to store data without exact schema before storing data
  • Data cleaning issues

Hadoop

  • When to select Hadoop?
  • STRUCTURED – Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
  • SEMI STRUCTURED data – difficult to carry out using traditional solutions (DW/DB)
  • Warehousing data = HUGE effort and static even after implementation
  • For variety & volume of data, crunched on commodity hardware – HADOOP
  • Commodity H/W needed to create a Hadoop Cluster

Introduction to Map Reduce /HDFS

  • MapReduce – distribute computing over multiple servers
  • HDFS – make data available locally for the computing process (with redundancy)
  • Data – can be unstructured/schema-less (unlike RDBMS)
  • Developer responsibility to make sense of data
  • Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS

=====
Day 02
=====
Big Data Ecosystem — Building Big Data ETL (Extract, Transform, Load) — Which Big Data Tools to use and when?

  • Hadoop vs. Other NoSQL solutions
  • For interactive, random access to data
  • HBase (column-oriented database) on top of Hadoop
  • Random access to data but restrictions imposed (max 1 PB)
  • Not good for ad-hoc analytics, good for logging, counting, time-series
  • Sqoop – Import from databases to Hive or HDFS (JDBC/ODBC access)
  • Flume – Stream data (e.g. log data) into HDFS

Big Data Management System

  • Moving parts, compute nodes start/fail: ZooKeeper – for configuration/coordination/naming services
  • Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
  • Deploy, configure, cluster management, upgrade etc (sys admin): Ambari
  • In Cloud: Whirr

Predictive Analytics — Fundamental Techniques and Machine Learning based Business Intelligence

  • Introduction to Machine Learning
  • Learning classification techniques
  • Bayesian Prediction — preparing a training file
  • Support Vector Machine
  • KNN p-Tree Algebra & vertical mining
  • Neural Networks
  • Big Data large variable problem — Random forest (RF)
  • Big Data Automation problem – Multi-model ensemble RF
  • Automation through Soft10-M
  • Text analytic tool – Treeminer
  • Agile learning
  • Agent based learning
  • Distributed learning
  • Introduction to Open Source Tools for predictive analytics: R, Python, RapidMiner, Mahout

Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis

  • Technology and the investigative process
  • Insight analytics
  • Visualization analytics
  • Structured predictive analytics
  • Unstructured predictive analytics
  • Threat/fraudster/vendor profiling
  • Recommendation Engine
  • Pattern detection
  • Rule/Scenario discovery – failure, fraud, optimization
  • Root cause discovery
  • Sentiment analysis
  • CRM analytics
  • Network analytics
  • Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
  • Technology assisted review
  • Fraud analytics
  • Real Time Analytics

=====
Day 03
=====
Real Time and Scalable Analytics Over Hadoop

  • Why common analytic algorithms fail in Hadoop/HDFS
  • Apache Hama – for Bulk Synchronous distributed computing
  • Apache Spark – for cluster computing and real time analytics
  • CMU Graphics Lab2 – Graph based asynchronous approach to distributed computing
  • KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation

Tools for eDiscovery and Forensics

  • eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
  • Predictive coding and Technology Assisted Review (TAR)
  • Live demo of vMiner for understanding how TAR enables faster discovery
  • Faster indexing through HDFS – Velocity of data
  • NLP (Natural Language processing) – open source products and techniques
  • eDiscovery in foreign languages — technology for foreign language processing

Big Data BI for Cyber Security – Getting a 360-degree view, speedy data collection and threat identification

  • Understanding the basics of security analytics — attack surface, security misconfiguration, host defenses
  • Network infrastructure / Large datapipe / Response ETL for real time analytic
  • Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from Meta data

Gathering disparate data for Criminal Intelligence Analysis

  • Using IoT (Internet of Things) as sensors for capturing data
  • Using Satellite Imagery for Domestic Surveillance
  • Using surveillance and image data for criminal identification
  • Other data gathering technologies — drones, body cameras, GPS tagging systems and thermal imaging technology
  • Combining automated data retrieval with data obtained from informants, interrogation, and research
  • Forecasting criminal activity

=====
Day 04
=====
Fraud prevention BI from Big Data – Fraud Analytics

  • Basic classification of Fraud Analytics — rules-based vs predictive analytics
  • Supervised vs unsupervised Machine learning for Fraud pattern detection
  • Business to business fraud, medical claims fraud, insurance fraud, tax evasion and money laundering

Social Media Analytics — Intelligence gathering and analysis

  • How Social Media is used by criminals to organize, recruit and plan
  • Big Data ETL API for extracting social media data
  • Text, image, meta data and video
  • Sentiment analysis from social media feed
  • Contextual and non-contextual filtering of social media feed
  • Social Media Dashboard to integrate diverse social media
  • Automated profiling of social media profile
  • Live demo of each analytic will be given through Treeminer Tool

Big Data Analytics in image processing and video feeds

  • Image Storage techniques in Big Data — Storage solution for data exceeding petabytes
  • LTFS (Linear Tape File System) and LTO (Linear Tape Open)
  • GPFS-LTFS (General Parallel File System – Linear Tape File System) — layered storage solution for Big image data
  • Fundamentals of image analytics
  • Object recognition
  • Image segmentation
  • Motion tracking
  • 3-D image reconstruction

Biometrics, DNA and Next Generation Identification Programs

  • Beyond fingerprinting and facial recognition
  • Speech recognition, keystroke analysis (analyzing a user's typing pattern) and CODIS (Combined DNA Index System)
  • Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples

Big Data Dashboard for quick accessibility of diverse data and display:

  • Integration of existing application platform with Big Data Dashboard
  • Big Data management
  • Case Study of Big Data Dashboard: Tableau and Pentaho
  • Use Big Data app to push location based services in Govt.
  • Tracking system and management

=====
Day 05
=====
How to justify Big Data BI implementation within an organization:

  • Defining the ROI (Return on Investment) for implementing Big Data
  • Case studies for saving Analyst Time in collection and preparation of Data – increasing productivity
  • Revenue gain from lower database licensing cost
  • Revenue gain from location based services
  • Cost savings from fraud prevention
  • An integrated spreadsheet approach for calculating approximate expenses vs. revenue gains/savings from Big Data implementation.

Step by Step procedure for replacing a legacy data system with a Big Data System

  • Big Data Migration Roadmap
  • What critical information is needed before architecting a Big Data system?
  • What are the different ways for calculating Volume, Velocity, Variety and Veracity of data
  • How to estimate data growth
  • Case studies

Review of Big Data Vendors and their products.

  • Accenture
  • APTEAN (Formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (Formerly 10Gen)
  • MU Sigma
  • Netapp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • Qliktech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (Part of EMC)

Q/A session