Introduction to Biostatistical Analysis
Using R
Statistics course for first-year PhD students
Session 2
Lecture: Introduction to statistical hypothesis testing
Null and alternate hypothesis. Types of error.
Two-sample hypotheses. Correlation. Analysis of frequency data.
Introduction to statistical modeling
Lecturer: Lorenzo Marini
DAFNAE,
University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
Inference
A statistical hypothesis test is a method of making statistical
decisions from and about experimental data. Null-hypothesis testing
just answers the question of "how well do the findings fit the
possibility that chance factors alone might be responsible?”.
sampling
Sample
Estimation
Population
(Uncertainty!!!)
testing
Statistical Model
Key concepts: Session 1
Statistical testing in five steps:
1. Construct a null hypothesis (H0) and alterantive hypothesis
2. Choose a statistical analysis (assumptions!!!)
3. Collect the data (sampling)
Remember the order!!!
4. Calculate P-value and test statistic
5. Reject/accept (H0) if P is small/large
Concept of replication vs. pseudoreplication
1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)
n=6
yi
Key quantities
mean 
var 
 yi
n
deviance  SS   ( yi  mean)
 ( yi  mean)
(n  1)
2
y
residual
mean
2
x
1. Costruire e testare un’ipotesi
Ipotesi: affermazione che ha come oggetto accadimenti nel
mondo reale, che si presta ad essere confermata o smentita
dai dati osservati sperimentalmente
Esempio: gli studenti maschi e femmine presentano gli stessi
voti
4
1. Costruire e testare un’ipotesi
Ipotesi nulla (H0): è un’affermazione riguardo alla popolazione che si
assume essere vera fino a che non ci sia una prova evidente del
contrario (status quo, mancanza di effetto etc.)
Ipotesi alterantiva (Ha): è un’affermazione riguardo alla popolazione
che è contraria all’ipotesi nulla e che viene accettata solo nel caso in
cui ci sia una prova evidente in suo favore
5
1. Costruire e testare un’ipotesi
1. Rifiutare H0 (e quindi
accettare Ha)
Test di ipotesi
consiste in una
decisione fra H0 e
Ha
2. Accettare H0 (e quindi rifiutare
Ha)
6
1. Costruire e testare un’ipotesi
1. Rifiutare H0
?
2. Accettare H0
La statistica inferenziale ci permette di quantificare delle
probabilità per decidere se accettare o rifiutare l’ipotesi nulla:
Quanto attendibile è H0?
7
Livello di significatività (alpha)
Devo definire a priori una probabilità (alpha) per rifiutare l’ipotesi
nulla
Il livello di significatività di un test: probabilità di rifiutare H0,
quando in realtà è vera (quanto confidenti siamo nelle nostre
conclusioni?)
Più piccola è alpha maggiore sarà la certezza nel rifiutare l’ipotesi
nulla
Valori usuali sono 10%, 5%, 1%, 0.1%
I valori più comuni
Hypothesis testing
• 1 – Hypothesis formulation (Null hypothesis H0 vs. alternative
hypothesis H1)
• 2 – Compute the probability P
P-value is often described as the probability of seeing results
as or more extreme as those actually observed if the null
hypothesis was true
• 3 – If this probability is lower than a defined threshold (level of
significance: 0.01, 0.05) we can reject the null hypothesis
Hypothesis testing: Types of error
STATISTICAL DECISION
REALITY
Reject H0
Retain H0
Effect
Correct
Effect detected
Type 2 error ()
Effect not detected
No effect
Type 1 error ()
Effect detected,
none exists
(P-value)
Correct,
No effect detected,
None exists
(POWER)
As power increases, the chances of a Type II error decreases
Statistical power depends on:
-The statistical significance criterion used in the test
-The size of the difference or the strength of the similarity (effect size)
-Variability of the population
-Sample size
-Type of test
Statistical analyses
Mean comparisons for 2 populations
Test the difference between the means drawn by two samples
Correlation
In probability theory and statistics, correlation, (often measured as a
correlation coefficient), indicates the strength and direction of a linear
relationship between two random variables. In general statistical usage,
correlation refers to the departure of two variables from independence.
Analysis of count or proportion data
Whole number or integer numbers (not continuous, different
distributional properties) or proportion
Mean comparisons for 2 samples
The t test
H0: means do not differ
H1: means differ
Assumptions
•
•
•
•
Independence of cases (work with true replications!!!) - this is a requirement
of the design.
Normality - the distributions in each of the groups are normal
Homogeneity of variances - the variance of data in groups should be the
same (use Fisher test or Fligner's test for homogeneity of variances).
These together form the common assumption that the errors are
independently, identically, and normally distributed
Normality
Before we can carry out a test assuming normality of the data we
need to test our distribution (not always before!!!)
Normal qqplot
RESIDUALS MUST
BE NORMAL
0
5
10
2.5
1.5
0.5
50
30
0 10
Frequency
In many cases we
must check this
assumption after
having fitted the
model
(e.g. regression or
multifactorial
ANOVA)
Observed quantiles
Graphics analysis
-2
15
0
1
2
norm quantiles
mass
hist(y)
lines(density(y))
-1
library(car)
qq.plot(y) or qqnorm(y)
Test for normality
Shapiro-Wilk Normality Test
Skew + kurtosis (t test)
shapiro.test()
Normality: Histogram
Histogram of log(fishes$mas)
30
20
0
10
Frequency
30
10
0
Frequency
50
40
Histogram of fishes$mas
0
5
10
fishes$mas
15
-0.5
0.5 1.0 1.5 2.0 2.5
log(fishes$mas)
400
Normality: Histogram
0
100
200
300
Normal distribution must be
symmetrical around the mean
2.5
4.5
6.5
8.5
10.5
12.5
library(animation)
ani.options(nmax = 2000 + 15 -2, interval = 0.003)
freq = quincunx(balls = 2000, col.balls = rainbow(1))
# frequency table
barplot(freq, space = 0)
2.0
0.0
1.0
log(fishes$mass)
10
5
fishes$mass
15
Normality: Q-Q Plot
-3
-2
-1
0
1
norm quantiles
2
3
-3
-2
-1
0
1
norm quantiles
2
3
Normality: Quantile-Quantile Plot
Quantiles are points taken at
regular intervals from the
cumulative distribution function
(CDF) of a random variable.
The quantiles are the data
values marking the boundaries
between consecutive subsets
Normality
In case of non-normality: 2 possible approaches
1. Change the distribution (use GLMs)
Advanced statistics
Probit (proportion)
Box-Cox transformation
30
20
10
Frequency
30
0
Arcsin (percentage)
0 10
Square-root
Frequency
Logaritmic (skewed data)
50
2. Data transformation
40
E.g. Poisson (count data)
E.g. Binomial (proportion)
0
5
10
mass
15
-0.5 0.5
1.5
2.5
fishes$logmass
Homogeneity of variance: two samples
Before we can carry out a test to compare two sample means,
we need to test whether the sample variances are significantly
different. The test could not be simpler. It is called Fisher’s F
To compare two variances, all you do is
divide the larger variance by the smaller variance.
E.g. Students from TESAF vs. Students from DAFNAE
F<-var(A)/var(B)
qf(0.975,nA-1,nB-1)
F calculated
F critical
if the calculated F is larger than
the critical value, we reject the null hypothesis
Test can be carried
out with the
var.test()
Homogeneity of variance : > two samples
It is important to know whether variance differs significantly
from sample to sample. Constancy of variance
(homoscedasticity) is the most important assumption
underlying regression and analysis of variance. For multiple
samples you can choose between the
Bartlett test and the Fligner–Killeen test.
Bartlett.test(response,factor)
Fligner.test(response,factor)
There are differences between the tests: Fisher and Bartlett are
very sensitive to outliers, whereas Fligner–Killeen is not
Mean comparison
In many cases, a researcher is interesting in gathering information about
two populations in order to compare them. As in statistical inference for
one population parameter, confidence intervals and tests of significance
are useful statistical tools for the difference between two population
parameters.
Ho: the two means are the same
H1: the two means differ
- All Assumptions met? Parametric t.test()
- t test with independent or paired sample
-Some assumptions not met? Non-parametric Wilcox.test()
- The Wilcoxon signed-rank test is a non-parametric alternative to the
Student's t-test for the case of two samples.
Il test t
tcalcolato=
Misura legata alla differenza fra le medie
Misura di variabilità dentro i gruppi
Differenza medie
Variabilità
dei gruppi
22
Variabile
Il test t
Caso
1
Caso
2
Differenza
fra le medie
Variabile
Variabilità A
B
A
Caso
3
B
A
Caso
4
B
B
A
A
Variabilità B
23
Il test t
tcalcolato=
Differenza fra le medie
t
t
Errore standard della differenza
Differenza fra
medie
Variabilità dentro i
gruppi
Più estremo sarà t calcolato minore sarà P
maggiore sarà la probabilità di rifiutare H0
24
Il test t
tcalcolato=
Differenza fra le medie
Errore standard della differenza
P
+ estremo sarà tcalcolato
maggiore la probabilità di
rifiutare H0
-Tcritico
Tcritico
25
Come scegliere il test t giusto a partire dalle assunzioni
Indipendenza
NO
SÌ
Test t appaiato
Test t non appaiati
D

t
i
SD
n
s22  s12
Test t per
pop. omoschedastiche
t
( x1  x2 )
1 1
S   
 n1 n2 
2
p
s22  s12
Test t per
pop. eteroschedastiche
Welch t-test
(formula complessa
richiesto un PC)
26
Campioni independenti omoschedastici: Test t!
tcalcolato 
( x1  x2 )
1 1
S   
 n1 n2 
2
p
( n1  1)S12  ( n2  1)S22
S 
)
( n1  1)  ( n2  1)
2
p
?
Varianza combinata
(”pooled”)
I gradi di libertà sono n1 + n2-2 per Tcritico
27
Campioni independenti omoschedastici: Test t!
H0: le due medie sono uguali
Ha: le due medie sono diverse
Test di ipotesi:
1. Calcolo la varianza combinata dei due campioni
2. Determino il valore di tcalcolato
3. Decido il livello di significatività (alpha, 1 o 2 code?)
4. Determino il valore di tcritico
5. Se |tcalcolato|> |t critico| rifiuto H0
6. Conclusione: le medie sono DIVERSE!
I gradi di libertà sono n1+n2-2 per Tcritico
28
Campioni appaiati: 2 casi
1. Misure ripetute
Studente
A
B
C
D
E
F
G
H
Prima Dopo
22
23
23
24
24
24
25
25
20
21
18
18
18
18
19
20
2. Correlazione nello spazio
Misura
a monte
Misura
a valle
Fiume B
Fiume A
Fiume C
Industria tessile
[Ammoniaca] in acqua
29
Campioni appaiati: Test t
D 
D
t
SD
n
SD 
n
Studente Prima Dopo
A
22
23
B
23
24
C
24
24
D
25
25
E
20
21
F
18
18
G
18
18
H
19
20
Di
1
1
0
0
1
0
0
1
Di
n
Media delle differenze
(D  D )
i
n 1
2
Deviazione standard delle
differenze
Numero di coppie
I gradi di libertà sono n-1 per tcritico
30
Campioni appaiati: Test t
H0: le due medie sono uguali
Ha: le due medie sono diverse
?
Test di ipotesi:
1. Determino il valore di tcalcolato
2. Decido il livello di significatività (alpha, 1 o 2 code?)
3. Determino il valore di tcritico
4. Se |tcalcolato|> |tcritico| rifiuto H0
5. Conclusione: le medie sono DIVERSE!
I gradi di libertà sono n-1 per tcritico
31
Non parametrica: Wilcoxon
A
3
4
4
3
2
3
1
3
5
2
I ranghi
U  n1n2 
n1 (n1  1)
 R1
2
n1 and n2 sono I numeri delle osservazioni
R1 è la somma dei rnaghi nel campione 1
Test can be carried out with
the wilcox.test()
function
B
5
5
6
7
4
4
3
5
6
5
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
ozone
1
2
2
3
3
3
3
3
4
4
4
4
5
5
5
5
5
6
6
7
label ranks
A
1
A
2.5
A
2.5
A
6
A
6
A
6
A
6
B
6
A
10.5
A
10.5
B
10.5
B
10.5
A
15
B
15
B
15
B
15
B
15
B
18.5
B
18.5
B
20
Correlation
Correlation, (often measured as a correlation coefficient),
indicates the strength and direction of a linear relationship
between two random variables
Bird species Plant species
richness
richness
1
x1
l1
2
x2
l2
3
x3
l3
4
x4
l4
…
…
…
458
x458
l458
Sampling unit
Three alternative approaches
1. Parametric - cor()
2. Nonparametric - cor()
3. Bootstrapping - replicate(), boot()
Correlation: causal relationship?
Which is the response variable in a correlation analysis?
Bird species Plant species
richness
richness
1
2
3
4
…
458
x1
x2
x3
x4
…
x458
l1
l2
l3
l4
…
l458
Sampling unit
Correlation
Plot the two variables in a Cartesian space
A correlation of +1 means that there is a perfect positive LINEAR
relationship between variables.
A correlation of -1 means that there is a perfect negative LINEAR
relationship between variables.
A correlation of 0 means there is no LINEAR relationship between the two
variables.
Correlation
Same correlation coefficient!
r= 0.816
Parametric correlation: when is it significant?
Pearson product-moment correlation coefficient
Correlation coefficient:
cor 
 ( xy)
x y
2
2
Hypothesis testing using the t distribution:
Ho: Is cor = 0
H1: Is cor ≠ 0
cor
t
SEcor
t critic value for d.f. = n-2
Assumptions
-Two random variables from a random populations
- cor() detects ONLY linear relationships
SEcor
(1  cor 2 )

n2
Nonparametric correlation
Rank procedures
Distribution-free but
less power
Spearman correlation index
cor.spearman 
 (rank rank )
 rank  rank
x
y
2
x
2
y
The Kendall tau rank correlation coefficient
cor.kendall 
4P
1
n(n  1)
P is the number of concordant pairs
n is the total number of pairs
Issues related to correlation
1. Temporal autocorrelation
Values in close years are more similar
Dependence of the data
2. Spatial autocorrelation
Values in close sites are more similar
Dependence of the data
Moran's I or Geary’s C
Measures of global spatial autocorrelation
Moran's I = 0
Moran's I = 1
Three issues related to correlation
2. Temporal autocorrelation
Values in close years are more similar
Dependence of the data
Working with time series is likely
to have temporal pattern in the
data
E.g. Ring width series
Autoregressive models
(not covered!)
Three issues related to correlation
3. Spatial autocorrelation
Values in close sites are more similar
Dependence of the data
ISSUE: can we explain the
spatial autocorrelation with our
models?
Moran's I or Geary’s C (univariate response) Measures of global spatial
autocorrelation
Raw response
Residuals after
model fitting
Hint: If you find spatial autocorrelation in your residuals, you
should start worrying
Estimate correlation with bootstrap
BOOTSTRAP
Bootstrapping is a statistical method for estimating the sampling
distribution of an estimator by sampling with replacement from the
original sample, most often with the purpose of deriving robust
estimates of SEs and CIs of a population parameter
Sampling with replacement
>a<-c(1:5)
> a
1 original sample
[1] 1 2 3 4 5
10 bootstrapped samples
> replicate(10, sample(a, replace=TRUE))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]
2
3
2
1
4
2
1
2
1
3
[2,]
1
5
2
3
5
3
1
1
3
2
[3,]
4
4
4
5
4
4
5
1
1
5
[4,]
4
1
1
3
3
2
3
1
5
2
[5,]
5
5
1
3
5
2
4
1
5
4
Estimate correlation with bootstrap
Why bootstrap?
It doesn’t depend on normal distribution assumption
It allows the computation of unbiased SE and CIs
Sample
Bootstrap
N samples
with
replacement
…
Statistic
distribution
Quantiles
Estimate correlation with bootstrap
CIs are asymmetric because our
distribution reflects the structure
of the data and not a defined
probability distribution
If we repeat the sample n time we
will find 0.95*n values included in
the CIs
Frequency data
Properties of frequency data:
-Count data
-Proportion data
Count data: where we count how many times something
happened, but we have no way of knowing how often it did
not happen (e.g. number of students coming at the first
lesson)
Proportion data: where we know the number doing a
particular thing, but also the number not doing that thing (e.g.
‘mortality’ of the students who attend the first lesson, but not the
second)
Count data
Straightforward linear methods (assuming constant variance,
normal errors) are not appropriate for count data for four main
reasons:
• The linear model might lead to the prediction of negative counts.
• The variance of the response variable is likely to increase with
the mean.
• The errors will not be normally distributed.
• Many zeros are difficult to handle in transformations.
- Classical test with contingency tables
- Generalized linear models with Poisson distribution
and log-link function (extremely powerful and flexible!!!)
Count data: contingency tables
We can assess the significance of the differences between
observed and expected frequencies in a variety of ways:
- Pearson’s chi-squared (χ2)
- G test
- Fisher’s exact test
Group 1
Group 2
Row total
Trait 1
a
b
a+b
Trait 2
c
d
c+d
Column total
a+c
b+d
a+b+c+d
H0: frequencies found in rows are independent from frequencies
in columns
Count data: contingency tables
- Pearson’s chi-squared (χ2)
We need a model to define the expected frequencies (E)
(many possibilities) – E.g. perfect independence
Ei 
(R i x C i )
G
df  (r - 1) x (c - 1)
(O i - E i ) 2 /E i
 
Ei
Critic value
2
Oak
Beech
Row total (Ri)
With ants
22
30
52
Without ants
31
18
49
Column total (Ci)
53
48
101 (G)
X
Count data: contingency tables
- G test
1. We need a model to define the expected frequencies (E)
(many possibilities) – E.g. perfect independence
(R i x C i )
Ei 
G
 Oi 
G  2  Oi ln  
 Ei 
χ2 distribution
- Fisher’s exact test fisher.test()
If expected values are less than 4 o 5
Proportion data
Proportion data have three important properties that affect the way
the data should be analyzed:
• the data are strictly bounded (0-1);
• the variance is non-constant (it depends on the mean);
• errors are non-normal.
- Classical test with probit or arcsin transformation
- Generalized linear models with binomial distribution
and logit-link function (extremely powerful and flexible!!!)
Proportion data: traditional approach
Transform the data!
Arcsine transformation
The arcsine transformation takes care of the error distribution
p' arcsin
p
p are percentages (0-100%)
Probit transformation
The probit transformation takes care of the non-linearity
p are proportions (0-1)
Proportion data: modern analysis
An important class of problems involves data on proportions such as:
• studies on percentage mortality (LD50),
• infection rates of diseases,
• proportion responding to clinical treatment (bioassay),
• sex ratios, or in general
• data on proportional response to an experimental treatment
2 approaches
1. It is often needed to transform both response and explanatory variables
or
2. To use Generalized Linear Models (GLM) using different error
distributions
Statistical modelling
MODEL
Generally speaking, a statistical model is a function of your
explanatory variables to explain the variation in your response
variable (y)
E.g. Y=a+bx1+cx2+ dx3
Y= response variable (performance of the students)
xi= explanatory variables (ability of the teacher, background, age)
The object is to determine the values of the parameters (a, b, c and
d) in a specific model that lead to the best fit of the model to the data
The best model is the model that produces the least unexplained
variation (the minimal residual deviance), subject to the
constraint that all the parameters in the model should be
statistically significant (many ways to reach this!)
deviance  SS   ( yi  mean)
2
Statistical modelling
Getting started with complex statistical modeling
It is essential, that you can answer the following questions:
• Which of your variables is the response variable?
• Which are the explanatory variables?
• Are the explanatory variables continuous or categorical, or a
mixture of both?
• What kind of response variable do you have: is it a
continuous measurement, a count, a proportion, a time at
death, or a category?
Statistical modelling
Getting started with complex statistical modeling
The explanatory variables
(a) All explanatory variables continuous - Regression
(b) All explanatory variables categorical - Analysis of variance
(ANOVA)
(c) Explanatory variables both continuous and categorical Analysis of covariance (ANCOVA)
The response variable
(a) Continuous - Normal regression, ANOVA or ANCOVA
(b) Proportion - Logistic regression, GLM logit-linear models
(c) Count - GLM Log-linear models
(d) Binary - GLM binary logistic analysis
(e) Time at death - Survival analysis
Statistical modelling: multicollinearity
1. Multicollinearity
Correlation between predictors in a non-orthogonal multiple linear
models
Confounding effects difficult to separate
Variables are not independent
This makes an important difference to our statistical modelling
because, in orthogonal designs, the variation that is attributed to a
given factor is constant, and does not depend upon the order in which
factors are removed from the model.
In contrast, with non-orthogonal data, we find that the variation
attributable to a given factor does depend upon the order in which
factors are removed from the model
The order of variable selection makes a huge difference
(please wait for session 4!!!)
Statistical modelling
Each analysis estimate a MODEL
You want the model to be minimal (parsimony), and adequate
(must describe a significant fraction of the variation in the data)
It is very important to understand that there is not just one
model.
• given the data,
• and given our choice of model,
• what values of the parameters of that model make the observed
data most likely?
Model building: estimate of parameters
(slopes and level of factors)
Occam’s Razor
Statistical modelling
Occam’s Razor
• Models should have as few parameters as possible;
• linear models should be preferred to non-linear models;
• experiments relying on few assumptions should be preferred to those
relying on many;
• models should be pared down until they are minimal adequate;
• simple explanations should be preferred to complex explanations.
MODEL SIMPLIFICATION
The process of model simplification is an integral part of
hypothesis testing in R. In general, a variable is retained
in the model only if it causes a significant increase in
deviance when it is removed from the current model.
Scarica

Lecture 2 - Lorenzo Marini