Introductory R for Biologists Training Course – Bluechip AI Asia, AI Development Company

Duration

28 hours (usually 4 days including breaks)

Overview

R is a free, open-source programming language for statistical computing, data analysis, and graphics. It is used by a growing number of managers and data analysts in corporations and academia. R has also found followers among statisticians, engineers, and scientists without computer programming skills, who find it easy to use. Its popularity is due to the increasing use of data mining for goals such as setting ad prices, finding new drugs more quickly, or fine-tuning financial models. R has a wide variety of packages for data mining.

Course Outline

I. Introduction and preliminaries

1. Overview

Making R more friendly, R and available GUIs
RStudio
Related software and documentation
R and statistics
Using R interactively
An introductory session
Getting help with functions and features
R commands, case sensitivity, etc.
Recall and correction of previous commands
Executing commands from or diverting output to a file
Data permanency and removing objects
Good programming practice: Self-contained scripts, good readability e.g. structured scripts, documentation, markdown
Installing packages; CRAN and Bioconductor

2. Reading data

Txt files (read.delim)
CSV files
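
As a rough illustration of this topic, the sketch below writes two tiny files to a temporary location and reads them back with read.delim() and read.csv(). The gene names and values are made up for the example and are not course data.

```r
# Write a small tab-delimited file to a temporary location (hypothetical data).
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("gene\texpression", "geneA\t5.2", "geneB\t7.9"), tsv_path)

# read.delim() assumes tab separators and a header row by default.
expr_tab <- read.delim(tsv_path)

# The same data as comma-separated text, read with read.csv().
csv_path <- tempfile(fileext = ".csv")
writeLines(c("gene,expression", "geneA,5.2", "geneB,7.9"), csv_path)
expr_csv <- read.csv(csv_path)

# Inspect the structure of the imported data frame.
str(expr_tab)
```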

3. Simple manipulations; numbers and vectors + arrays

Vectors and assignment
Vector arithmetic
Generating regular sequences
Logical vectors
Missing values
Character vectors
Index vectors; selecting and modifying subsets of a data set
- Arrays
Array indexing. Subsections of an array
Index matrices
The array() function + simple operations on arrays e.g. multiplication, transposition
Other types of objects
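
The topics in this unit can be sketched in a few lines of base R. The measurements below are invented for illustration only.

```r
# Vectors and assignment (hypothetical weights in grams, one missing).
weights <- c(21.5, 19.8, NA, 23.1)

# Vector arithmetic is element-wise.
weights_mg <- weights * 1000

# Generating a regular sequence: 1, 3, 5, 7.
idx <- seq(1, 7, by = 2)

# Logical vectors and missing values.
is_missing <- is.na(weights)
mean_weight <- mean(weights, na.rm = TRUE)

# Index vectors: keep the non-missing weights above 20.
heavy <- weights[!is_missing & weights > 20]

# Arrays: a 2 x 3 array, its transpose, and a matrix product.
m <- array(1:6, dim = c(2, 3))
m_t <- t(m)
m_mt <- m %*% m_t
```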

4. Lists and data frames

Lists
Constructing and modifying lists
- Concatenating lists
Data frames
- Making data frames
- Working with data frames
- Attaching arbitrary lists
- Managing the search path
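
A minimal sketch of lists and data frames, using made-up sample annotations and expression values:

```r
# Constructing and modifying a list (hypothetical sample annotation).
sample_info <- list(id = "S1", tissue = "liver")
sample_info$replicates <- 3

# Concatenating lists with c().
combined <- c(sample_info, list(batch = "A"))

# Making a data frame from equal-length vectors.
expr <- data.frame(
  gene    = c("geneA", "geneB", "geneC"),
  control = c(5.1, 7.3, 2.2),
  treated = c(6.0, 7.1, 4.8)
)

# Working with data frames: add a derived column.
expr$diff <- expr$treated - expr$control
```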

5. Data manipulation

Selecting, subsetting observations and variables
Filtering, grouping
Recoding, transformations
Aggregation, combining data sets
Forming partitioned matrices, cbind() and rbind()
The concatenation function, c(), with arrays
Character manipulation, stringr package
Short introduction to grep and regexpr
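
The sketch below illustrates several of these operations with base R only; the stringr package offers similar string helpers but is not used here so that the example runs without extra installation. All data are hypothetical.

```r
obs <- data.frame(
  sample = c("ctrl_1", "ctrl_2", "trt_1", "trt_2"),
  group  = c("control", "control", "treated", "treated"),
  value  = c(4.0, 6.0, 9.0, 11.0)
)

# Selecting observations (rows) and variables (columns).
high <- obs[obs$value > 5, c("sample", "value")]

# Grouping and aggregation: mean value per group.
group_means <- aggregate(value ~ group, data = obs, FUN = mean)

# Recoding a numeric variable into categories.
obs$level <- ifelse(obs$value > 7, "high", "low")

# Pattern matching: which sample names start with "ctrl"?
is_ctrl <- grepl("^ctrl", obs$sample)

# Character manipulation: keep only the replicate number after "_".
replicate_no <- sub(".*_", "", obs$sample)

# Combining data sets row-wise with rbind().
stacked <- rbind(obs[is_ctrl, ], obs[!is_ctrl, ])
```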

6. More on Reading data

XLS, XLSX files
readr and readxl packages
Data in SPSS, SAS, Stata, and other formats
Exporting data to txt, csv and other formats

7. Grouping, loops and conditional execution

Grouped expressions
Control statements
Conditional execution: if statements
Repetitive execution: for loops, repeat and while
Introduction to apply, lapply, sapply, tapply

8. Functions

Creating functions
Optional arguments and default values
Variable number of arguments
Scope and its consequences

9. Simple graphics in R

Creating a Graph
Density Plots
Dot Plots
Bar Plots
Line Charts
Pie Charts
Boxplots
Scatter Plots
Combining Plots
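
As one possible illustration of base graphics, the sketch below draws a boxplot and a scatter plot side by side. The plant measurements are invented, and pdf(NULL) is used only so the code runs without a display; remove that line and the final dev.off() call to see the plots interactively.

```r
# Draw to a null device so the sketch runs without opening a window.
pdf(NULL)

# Hypothetical plant measurements.
growth <- data.frame(
  group  = rep(c("control", "treated"), each = 5),
  height = c(10.1, 9.8, 10.5, 10.0, 9.9, 12.2, 11.8, 12.9, 12.4, 12.0),
  leaves = c(4, 4, 5, 4, 4, 6, 5, 7, 6, 6)
)

# Combining plots: one row, two columns.
op <- par(mfrow = c(1, 2))

# Boxplot of height by group; the returned list holds the plotted statistics.
bp <- boxplot(height ~ group, data = growth,
              main = "Height by group", ylab = "Height (cm)")

# Scatter plot of height against leaf count.
plot(growth$leaves, growth$height,
     xlab = "Leaves", ylab = "Height (cm)", main = "Height vs leaves")

# Restore the previous graphical parameters and close the device.
par(op)
invisible(dev.off())
```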

II. Statistical analysis in R

1. Probability distributions

R as a set of statistical tables
Examining the distribution of a set of data
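
A brief sketch of both ideas: the d/p/q function families stand in for printed statistical tables, and summary functions describe an observed sample. The built-in faithful data set is used here only as a convenient example.

```r
# R as a set of statistical tables.
p_below <- pnorm(1.96)                      # P(Z <= 1.96) for a standard normal
q_975   <- qnorm(0.975)                     # 97.5% quantile of a standard normal
d_bin   <- dbinom(3, size = 10, prob = 0.5) # P(X = 3) for Binomial(10, 0.5)

# Examining the distribution of a set of data.
eruptions <- faithful$eruptions
five_num  <- fivenum(eruptions)             # min, lower hinge, median, upper hinge, max
summary(eruptions)
```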

2. Testing of Hypotheses

Tests about a Population Mean
Likelihood Ratio Test
One- and two-sample tests
Chi-Square Goodness-of-Fit Test
Kolmogorov-Smirnov One-Sample Statistic
Wilcoxon Signed-Rank Test
Two-Sample Test
Wilcoxon Rank Sum Test
Mann-Whitney Test
Kolmogorov-Smirnov Test
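
Several of the tests listed above are available in base R's stats package. The sketch below applies them to small invented samples; note that wilcox.test() with two samples performs the Wilcoxon rank sum test, which is equivalent to the Mann-Whitney test.

```r
# Hypothetical measurements for two independent groups.
control <- c(5.1, 4.9, 5.6, 5.0, 5.3, 4.8)
treated <- c(6.2, 5.9, 6.8, 6.1, 6.5, 6.0)

# One-sample test about a population mean (null hypothesis: mean = 5).
t_one <- t.test(control, mu = 5)

# Two-sample t-test (Welch's version by default).
t_two <- t.test(treated, control)

# Wilcoxon rank sum (Mann-Whitney) test.
w_res <- wilcox.test(treated, control)

# Two-sample Kolmogorov-Smirnov test.
ks_res <- ks.test(treated, control)

# Chi-square goodness-of-fit test against equal proportions.
counts <- c(18, 22, 20)
gof <- chisq.test(counts, p = rep(1 / 3, 3))
```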

3. Multiple Testing of Hypotheses

Type I Error and FDR
ROC curves and AUC
Multiple Testing Procedures (BH, Bonferroni etc.)
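
The Bonferroni and Benjamini-Hochberg (BH) adjustments can be compared with p.adjust() from base R. The p-values below are made up for the sketch.

```r
# Hypothetical raw p-values from five tests.
pvals <- c(0.001, 0.008, 0.012, 0.041, 0.20)

# Bonferroni controls the family-wise error rate.
p_bonf <- p.adjust(pvals, method = "bonferroni")

# Benjamini-Hochberg controls the false discovery rate (FDR).
p_bh <- p.adjust(pvals, method = "BH")

# Number of tests called significant at the 0.05 level under each method.
n_bonf <- sum(p_bonf < 0.05)
n_bh   <- sum(p_bh < 0.05)
```

With these values the BH procedure retains one more test than Bonferroni, reflecting its less conservative error criterion.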

4. Linear regression models

Generic functions for extracting model information
Updating fitted models
Generalized linear models
- Families
- The glm() function
Classification
- Logistic Regression
- Linear Discriminant Analysis
Unsupervised learning
- Principal Components Analysis
- Clustering Methods (k-means, hierarchical clustering, k-medoids)
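
A compact sketch of this unit using data sets that ship with R (ToothGrowth and iris). The model formulas are chosen purely for illustration and are not taken from the course material.

```r
# Linear regression and generic extractor functions.
fit_lm <- lm(len ~ dose, data = ToothGrowth)
coef(fit_lm)
summary(fit_lm)

# Updating a fitted model by adding a predictor.
fit_lm2 <- update(fit_lm, . ~ . + supp)

# Generalized linear model: logistic regression via the binomial family.
# Modelling supplement type from tooth length is an arbitrary example.
fit_glm <- glm(supp ~ len, family = binomial, data = ToothGrowth)

# Principal components analysis on the scaled iris measurements.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Hierarchical clustering, cut into three clusters.
hc <- hclust(dist(scale(iris[, 1:4])))
clusters <- cutree(hc, k = 3)
```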

5. Survival analysis (survival package)

Survival objects in R
Kaplan-Meier estimate, log-rank test, parametric regression
Confidence bands
Censored (interval censored) data analysis
Cox PH models, constant covariates
Cox PH models, time-dependent covariates
Simulation: Model comparison (Comparing regression models)

6. Analysis of Variance

One-Way ANOVA
Two-Way Classification of ANOVA
MANOVA
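
The three analyses can be sketched with aov() and manova() from base R. The built-in ToothGrowth and iris data sets are used here only as convenient examples.

```r
# Treat dose as a categorical factor for ANOVA.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# One-way ANOVA: tooth length by dose.
one_way <- aov(len ~ dose, data = tg)
p_dose  <- summary(one_way)[[1]][["Pr(>F)"]][1]

# Two-way ANOVA with interaction: supplement type and dose.
two_way <- aov(len ~ supp * dose, data = tg)
summary(two_way)

# MANOVA: two response variables compared across species.
mv <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(mv)
```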

III. Worked problems in bioinformatics

Short introduction to limma package
Microarray data analysis workflow
Data download from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1397
Data processing (QC, normalisation, differential expression)
Volcano plot
Clustering examples + heatmaps