Statistics


2021

René Staritzbichler

Art of statistics

Describe reality by a few numbers

  • mean / median / mode
  • range / quantile / standard deviation
  • confidence / significance
  • correlation

Location

  • mean: average income
  • median: income of the person in the middle of the ranking
    • median = 50th percentile
  • mode: most common income
  • normal distribution: mean, median and mode are the same
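
A minimal sketch of the three location measures, using Python's standard `statistics` module. The income values are made up for illustration and are not from the slides.

```python
import statistics

# Hypothetical yearly incomes (illustrative values only)
incomes = [20_000, 25_000, 25_000, 30_000, 40_000, 1_000_000]

mean_income = statistics.mean(incomes)      # average income, pulled up by the one large value
median_income = statistics.median(incomes)  # 50th percentile
mode_income = statistics.mode(incomes)      # most common income

print(mean_income, median_income, mode_income)
```

With a single very large income in the list, the mean lands far above the median and the mode, which is why the three measures only coincide for symmetric distributions such as the normal distribution.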

Spread or width

  • range (sensitive to extreme values)
  • interquartile range
  • standard deviation (symmetric data)
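
The measures of spread can be sketched in the same way. The data are hypothetical, and `statistics.quantiles` is used with its default interpolation method, which is only one of several common quartile conventions.

```python
import statistics

# Hypothetical data with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 100]

value_range = max(data) - min(data)            # dominated by the extreme value
q1, q2, q3 = statistics.quantiles(data, n=4)   # the three quartiles
iqr = q3 - q1                                  # spread of the middle half of the data
stdev = statistics.stdev(data)                 # sample standard deviation

print(value_range, iqr, stdev)
```

The range and the standard deviation react strongly to the single value 100, while the interquartile range barely notices it.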

Dependencies

Two or more variables

  • Pearson correlation
  • Spearman rank correlation
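
As a rough sketch of the difference between the two coefficients: Pearson measures linear association, Spearman applies the same formula to the ranks. The helper names `pearson`, `ranks` and `spearman` are chosen for this example, and the data are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear

print(pearson(x, y), spearman(x, y))
```

For this monotonic but curved relation the rank correlation is 1, while the Pearson coefficient stays slightly below 1.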

Correlation is not causation!

Conclusions and predictions

Significance: p-, t-, z-values

  • z-test: normal distribution with known variance
  • t-test: normal distribution with unknown variance

Hypotheses

  • Similarity / difference / effect
  • Null hypothesis: no significant relation
  • Alternative hypothesis: significant relation
  • Test whether the null hypothesis can be rejected

Significance

  • Select significance level, generally $ \alpha = 0.05$ or 0.01
  • Perform e.g. t-test (returns p-value)
  • $p < \alpha:$ reject null hypothesis $\Rightarrow$ significant
  • $p \geq \alpha:$ null hypothesis not rejected $\Rightarrow$ not significant
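
The decision rule can be sketched with a z-test, since the normal distribution is available in Python's standard library. The sample, the hypothesised mean `mu0` and the known standard deviation `sigma` are assumptions made up for this example.

```python
import math
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided z-test for H0: population mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

alpha = 0.05  # significance level, chosen before looking at the data

# Hypothetical measurements
sample = [5.1, 4.9, 5.6, 5.8, 5.4, 5.7, 5.3, 5.5, 5.9]
z, p = z_test_p_value(sample, mu0=5.0, sigma=0.5)

if p < alpha:
    decision = "reject null hypothesis"
else:
    decision = "null hypothesis not rejected"

print(z, p, decision)
```

A t-test follows the same pattern, but estimates the variance from the sample and looks the p-value up in the t-distribution instead.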

Student's t-test

Compare mean values of 2 distributions

  • one-sample location test: compare the mean of one sample with a given value
  • two-sample location test: compare the means of two samples

Are men taller than women?

  • Two-sample test of independent samples
  • Null hypothesis: there is no difference in the mean values
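
A sketch of the two-sample Student's t statistic with pooled variance, assuming equal variances in both groups. The heights are invented for illustration. The standard library has no t-distribution, so this example stops at the t value and the degrees of freedom and compares against a tabulated critical value instead of computing a p-value.

```python
import math
import statistics

def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    df = na + nb - 2
    pooled_variance = ((na - 1) * va + (nb - 1) * vb) / df
    standard_error = math.sqrt(pooled_variance * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return t, df

# Hypothetical heights in cm (not real survey data)
men = [178, 182, 175, 185, 180, 177]
women = [165, 170, 168, 162, 172, 166]

t, df = student_t(men, women)

# Tabulated two-sided critical value for alpha = 0.05 and df = 10
critical_value = 2.228
significant = abs(t) > critical_value

print(t, df, significant)
```

In practice a statistics package would return the p-value directly; the comparison with the critical value is the equivalent decision for a fixed significance level.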

p-Value

Distributions

  • Normal
  • t-Distribution
  • Bernoulli
  • Binomial
  • Poisson
  • Exponential
  • Logarithmic

Crowd wisdom

Bean distribution

Outliers

Example: income of all citizens

  • many people with low to moderate income
  • a few exceedingly rich people
  • resulting issues:
    • mean value is bad descriptor
    • not possible to plot on a linear scale

Logarithm

  • Inverse of exponential function
  • $y = a^x \;\Rightarrow\; \log_a y = x$
  • Can show both very small and very large values

Logarithmic scale

  • In x or y or both
  • Can show both very small and very large values
  • Beggars and billionaires
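
A small sketch of why the logarithm helps: values that span nine orders of magnitude become evenly spaced. The incomes are made-up round numbers.

```python
import math

# Hypothetical incomes, from beggar to billionaire
incomes = [1, 1_000, 1_000_000, 1_000_000_000]

# Base-10 logarithm: the exponent x in 10**x == income
log_incomes = [math.log10(value) for value in incomes]

# The exponential function is the inverse and recovers the original values
recovered = [round(10 ** x) for x in log_incomes]

print(log_incomes, recovered)
```

On a linear axis the first three values would be indistinguishable from zero next to the last one; on a logarithmic axis they sit at equal distances.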

Confusion matrix

  • too much confusion!

Measures of trust

  • Sensitivity
  • Specificity
  • Accuracy
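
The three measures follow directly from the four cells of a confusion matrix. The function name `trust_measures` and the counts are assumptions for this sketch.

```python
def trust_measures(tp, fp, fn, tn):
    """Sensitivity, specificity and accuracy from confusion matrix counts.

    tp: true positives, fp: false positives,
    fn: false negatives, tn: true negatives.
    """
    sensitivity = tp / (tp + fn)                 # share of actual positives that are detected
    specificity = tn / (tn + fp)                 # share of actual negatives that are cleared
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all cases classified correctly
    return sensitivity, specificity, accuracy

# Hypothetical test applied to 1000 people, 100 of whom are actually positive
sens, spec, acc = trust_measures(tp=90, fp=50, fn=10, tn=850)

print(sens, spec, acc)
```

Note that in this example 50 of the 140 positive results are false alarms, even though all three measures are at or above 90%.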

Biases

  • framing
  • priming
  • rounding
  • social pressure

Framing

Wording has significant influence:

  • 'giving 16- and 17-year-olds the right to vote': 52% in favour, 41% against
  • 'reducing the voting age to 16': 37% in favour, 56% against

Priming

Answers depend on previous questions

  • 10% of young people feel lonely
  • BBC, after long list of questions: 42%

Rounding

In surveys people tend to use round numbers

Psychology of numbers

  • 98% survival versus
  • 2% death rate

Which sounds better?

  • BioNTech: 20% of German economic growth
  • BioNTech: 5 per mille of German GNP

Both are equivalent (at a 2.5% growth rate)
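
The arithmetic behind the equivalence, assuming the growth rate is measured relative to GNP:

$$ 20\% \times 2.5\% = 0.20 \times 0.025 = 0.005 = 0.5\% = 5 \text{ per mille} $$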

Correlation is not causation

  • 99% correlation: divorce rate in Maine and per capita consumption of margarine
  • 95% correlation: marriage rate in Kentucky and people drowning after falling out of a fishing boat

https://www.tylervigen.com

Reverse causation

  • "A nearby Waitrose adds £36,000 to house price"
  • Moderate drinkers live longer than non-drinkers

Lurking factors

  • Being a pope helps living longer?
  • Do right handers live longer?

Discrimination or adjustment?

Regression to the mean

Extreme observations tend to be followed by values closer to the mean

Soccer: new trainer, return to normal

Speed cameras after accidents

Placebo

  • Some may be healed by the belief in something
  • Some are healed by normal function of the body (return to mean)
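
Regression to the mean can be illustrated with a small simulation. The model is an assumption made for this sketch: each observation is a stable "true level" plus independent random noise, and nothing is done between the two measurements.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Assumed model: observation = stable true level + independent noise
true_level = [rng.gauss(0, 1) for _ in range(n)]
first = [level + rng.gauss(0, 1) for level in true_level]
second = [level + rng.gauss(0, 1) for level in true_level]

# Select the cases that looked most extreme in the first measurement (top 10%)
cutoff = sorted(first)[int(0.9 * n)]
selected = [i for i in range(n) if first[i] >= cutoff]

mean_first = statistics.mean(first[i] for i in selected)
mean_second = statistics.mean(second[i] for i in selected)

print(mean_first, mean_second)
```

The selected group scores lower on average in the second measurement without any intervention, which is the pattern that a new trainer, a speed camera or a placebo can appear to cause.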

Pitfalls

"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns — the ones we don't know we don't know,"

Donald Rumsfeld, 2002

Pitfalls

  • too few samples
  • relative vs absolute changes
  • deceptive representation
  • confusing representations

Too few samples

4 random normal distributions (logscale)

mean: 0, stdev: 1

20, 200, 2000, 20000 samples
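
A sketch of the same experiment in numbers rather than plots: draw samples of the four sizes from a normal distribution with mean 0 and standard deviation 1, and compare the estimates with the expected standard error of the mean, $1/\sqrt{n}$. The seed is arbitrary.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary fixed seed
results = {}

for n in (20, 200, 2000, 20000):
    sample = [rng.gauss(0, 1) for _ in range(n)]
    estimated_mean = statistics.mean(sample)
    estimated_stdev = statistics.stdev(sample)
    standard_error = 1 / n ** 0.5   # expected uncertainty of the mean for stdev 1
    results[n] = (estimated_mean, estimated_stdev, standard_error)
    print(n, estimated_mean, estimated_stdev, standard_error)
```

Each tenfold increase in sample size shrinks the uncertainty of the mean only by a factor of about 3.2, so small samples can look quite different from the distribution they were drawn from.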

Relative versus absolute

Pitfall 1: IARC 2015: processed meat classified as Group 1 carcinogen

$\Rightarrow$ Daily Record: 'Bacon, ham and sausages have the same cancer risk as cigarettes warn experts'

$\Rightarrow$ IARC: the group expresses confidence that there is an increased risk, not how large it is

Pitfall 2: 50 g/day: relative increase 18% (absolute: 6% $\rightarrow$ 7%)

$\Rightarrow$ Media read it as an absolute increase: 6% $\rightarrow$ 24%
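
The two readings of "18%" can be checked with a few lines of arithmetic. The variable names are chosen for this sketch; the numbers are the ones on the slide.

```python
baseline_risk = 0.06       # absolute risk without the exposure
relative_increase = 0.18   # reported increase for 50 g/day

# Correct reading: the baseline risk is multiplied by 1.18
risk_relative = baseline_risk * (1 + relative_increase)

# Misreading: 18 percentage points are added to the baseline
risk_misread = baseline_risk + relative_increase

print(f"relative reading: {risk_relative:.1%}")   # about 7%
print(f"absolute misreading: {risk_misread:.1%}")  # 24%
```

The relative reading gives roughly one additional case per 100 people, the misreading eighteen.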

Deceptive representation

Confusing representations

Reference

All scans were taken from:

David Spiegelhalter, 'The Art of Statistics'