class: title-slide, center, middle

# Statistical methods for archaeological data analysis I: Basic methods

## 04 - Descriptive statistics

### Martin Hinz

#### Institut für Archäologische Wissenschaften, Universität Bern

18.03.2025

.footnote[
.right[
.tiny[
You can download a [pdf of this presentation](smada04.pdf).
]
]
]

---

## Loading data for the following steps

### Download the data

* [muensingen_fib.csv](https://raw.githubusercontent.com/BernCoDALab/smada/refs/heads/main/lectures/04/muensingen_fib.csv)

### Read the data on the Muensingen fibulae

``` r
muensingen <- read.csv2("muensingen_fib.csv")
head(muensingen)
```

```
##    X Grave Mno FL BH BFA FA CD BRA ED FEL  C   BW  BT FEW Coils Length
## 1  1   121 348 28 17   1 10 10   2  8   6 20  2.5 2.6 2.2     4     53
## 2  2   130 545 29 15   3  8  6   3  6  10 17 11.7 3.9 6.4     6     47
## 3  3   130 549 22 15   3  8  7   3 13   1 17  5.0 4.6 2.5    10     47
## 4  8   157  85 23 13   3  8  6   2 10   7 15  5.2 2.7 5.4    12     41
## 5 11   181 212 94 15   7 10 12   5 11  31 50  4.3 4.3  NA     6    128
## 6 12   193 611 68 18   7  9  9   7  3  50 18  9.3 6.5  NA     4    110
##   fibula_scheme
## 1             B
## 2             B
## 3             B
## 4             B
## 5             C
## 6             C
```

---

## Descriptive Statistics

### Summary of a set of observed data

The distribution of the data in the sample is displayed.

### Ways of display

* Table – contingency table
* Graphical – charts
* Numeric – with specific parameters of the distribution

Descriptive statistics (effectively) make no statements about the population; they describe the sample (in contrast to statistical inference).

---

## Parameters of distributions

### Central tendency

What is the typical individual?

Mean, median, mode

### Dispersion

How much variation is there?

Range, variance, standard deviation, coefficient of variation

### Shape

Shape of the distribution curve: symmetric/asymmetric

Skewness and kurtosis

---

.caption[source: Phillips 1997]

---

## Central tendency [1]

### Mean

The classic.
Suitable for metric data (interval or ratio scale).

Sum of values / number of values, or

$$ \bar{x} = \frac {\sum_{i=1}^{n} x_i} {n} $$

``` r
sum(muensingen$Length) / length(muensingen$Length)
```

```
## [1] 57.58824
```

``` r
mean(muensingen$Length)
```

```
## [1] 57.58824
```

---

## Central tendency [2]

### Median

.small[
Suitable for metric and ordinal variables.

Odd number of values: the central value of the sorted vector.

```
1 2 3 4 5 6 7
      |
```

R:

``` r
median(c(1,2,3,4,5,6,7))
```

```
## [1] 4
```

Even number of values: the mean of the two central values of the sorted vector.

```
1 2 3 4 5 6 7 8
       |
```

R:

``` r
median(c(1,2,3,4,5,6,7,8))
```

```
## [1] 4.5
```
]

---

## Central tendency [3]

### Mode

The most frequent value of a vector.

Suitable for metric, ordinal and nominal variables.

goat sheep goat cattle cattle goat pig goat

Mode: goat

In R (the printed name is the mode, the number below it is its position in the frequency table):

``` r
which.max(
  table(
    c("goat", "sheep", "goat", "cattle", "cattle", "goat", "pig", "goat")
  )
)
```

```
## goat 
##    2
```

---

## Central tendency [4]

### Variable is

| nominal | ordinal | interval+ |
|-|-|-|
| mode | mode | mode |
| - | median | median |
| - | - | mean |

.caption[after: Dolić 2004]

---

## Central tendency [5]

### Comparison of central values:

.small[
Sensitivity to outliers: the mean is very sensitive to outliers, the median less so, the mode hardly at all.

``` r
test <- c(1,2,2,3,3,3,4,4,5,5,6,7,8,8,8,9,120)
mean(test)
```

```
## [1] 11.64706
```

``` r
median(test)
```

```
## [1] 5
```

``` r
which.max(table(test))
```

```
## 3 
## 3
```

The mode is of limited value for describing metric or ordinal data; it is informative only when the distribution is more or less symmetric.

``` r
which.max(table(c(1,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7)))
```

```
## 4 
## 4
```
]

---
class:center, middle

---
class:inverse

## Central tendency exercise

### Describe the central tendency

Analyse the measurements of the width of cups (in cm) from the burial ground of Walternienburg (Müller 2001, 534; selection):

* 
[tassen.csv](https://raw.githubusercontent.com/BernCoDALab/smada/refs/heads/main/lectures/04/tassen.csv)

``` r
tassen <- read.csv2("tassen.csv", row.names = 1)
tassen$x
```

```
##  [1] 12.0 19.5 18.6 12.9 13.2  9.9 19.5  8.4 21.0 18.9  7.5 18.9  8.1  9.0  7.8
## [16]  9.9 10.2  8.1 12.0  9.0 26.1 20.4
```

Identify the mode, median and mean, and determine whether the distribution is symmetric, positively skewed or negatively skewed.

---
class:center, middle

.caption[source: Phillips 1997]

---

## Dispersion [1]

### Range

Simply the span between the smallest and the largest value of a data vector.

``` r
range(muensingen$Length)
```

```
## [1]  26 128
```

``` r
range(tassen$x)
```

```
## [1]  7.5 26.1
```

Because this measure depends only on the extreme values, it is very sensitive to outliers.

---
class: middle, center

### How far do the individual values deviate from the mean on average?

<img src="data:image/png;base64,#smada04_files/figure-html/unnamed-chunk-15-1.png" width="100%" />

---

## Dispersion [2]

### (empirical) variance

Measure of the variability of the data, less sensitive to outliers than the range.

Equals the sum of the squared distances from the mean, divided by the number of observations minus one:

$$ s^2 = \frac {\sum_{i=1}^{n} (x_i - \bar{x})^2} {n-1} $$

In R:

``` r
sum((tassen$x - mean(tassen$x))^2) / (length(tassen$x) - 1)
```

```
## [1] 31.11136
```

``` r
var(tassen$x)
```

```
## [1] 31.11136
```

.footnote[.tiny[
Attention: there is another variance σ² (with n instead of n-1), which is only appropriate when the whole population is analysed (and the population is rarely known), not for samples.
]]

---

## Dispersion [3]

### (empirical) standard deviation

Because of the squaring, the variance has squared units (mm → mm²).

For a parameter in the original units: take the square root → standard deviation

$$ s = \sqrt{\frac {\sum_{i=1}^{n} (x_i - \bar{x})^2} {n-1}} $$

``` r
sqrt(sum((tassen$x - mean(tassen$x))^2) / (length(tassen$x) - 1))
```

```
## [1] 5.577756
```

``` r
sd(tassen$x)
```

```
## [1] 5.577756
```

Can be read as a typical distance of the values from the mean.

.footnote[.tiny[
Attention: there is another standard deviation σ (with n instead of n-1), which is only appropriate when the whole population is analysed (and the population is rarely known), not for samples.
]]

---

## Dispersion [4]

### Coefficient of variation

The standard deviation has the unit of the original data (e.g. mm).

To compare the dispersion of distributions with different units or magnitudes, use the dimensionless coefficient of variation:

coefficient of variation = standard deviation / mean

Example: do foot length and total length vary equally?

``` r
sd(muensingen$Length) / mean(muensingen$Length)
```

```
## [1] 0.4508988
```

``` r
sd(muensingen$FL) / mean(muensingen$FL)
```

```
## [1] 0.7732486
```

Foot length varies more than total length.

---

## Dispersion [5]

### Quantiles

.small[
Oh, we've done that one...

The 1st, 2nd, 3rd and 4th quarter of the data (sorted and counted), or rather their boundaries (the quartiles).

.caption[Phillips 1997]
]

---

## Dispersion [5]

### Quantiles

.small[
Oh, we've done that one...

The 1st, 2nd, 3rd and 4th quarter of the data (sorted and counted), or rather their boundaries (the quartiles).

``` r
quantile(tassen$x)
```

```
##   0%  25%  50%  75% 100% 
##  7.5  9.0 12.0 18.9 26.1
```

New: percentiles (the same idea in percent steps, here in steps of 10 %)

``` r
quantile(tassen$x, probs = seq(0, 1, 0.1))
```

```
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##  7.50  8.10  8.52  9.27 10.02 12.00 13.08 18.81 19.38 20.31 26.10
```

Dispersion measure: interquartile range (IQR), the distance between the 25 % and the 75 % quantile

``` r
IQR(tassen$x)
```

```
## [1] 9.9
```

Less sensitive to outliers than the standard deviation, but information is lost.
]

---
class: inverse

## Dispersion exercise

### Determine the dispersion of the data

Analyse the sizes of the areas visible from different megalithic graves of the Altmark (Demnick 2009):

* [altmark_denis2.csv](https://raw.githubusercontent.com/BernCoDALab/smada/refs/heads/main/lectures/04/altmark_denis2.csv)

``` r
altmark <- read.csv2("altmark_denis2.csv", row.names = 1)
head(altmark)
```

```
##         sichtflaeche region
## La01            2.72  Mitte
## Lg1            26.78  Mitte
## Li02           26.96  Mitte
## Sa01           27.05  Mitte
## Li06           32.93  Mitte
## K\xf601        34.76  Mitte
```

Evaluate in
which region the visible area is more uniform (less dispersed).

---

## Shape of the distribution [1]

.pull-left[
### Important Parameters

Number of peaks of the distribution: unimodal, bimodal, multimodal

Skewness of the distribution: positive, negative

Kurtosis (curvature) of the distribution: flat, medium, steep
]

.pull-right[
.caption[Shape of distributions (after Bortz 2006)]
]

---
class:center, middle

---

## Shape of the distribution [2]

### Skewness

Is the mean to the right or to the left of the median?

Read it from the chart ;-) or calculate it:

$$ \hat{S} = \frac {\sum_{i=1}^n (x_i - \bar{x})^3} {n \cdot s^3} $$

A positive value indicates positive skew, a negative value negative skew.

---

## Shape of the distribution [2]

### Skewness

Base R does not provide a function to calculate this, so we build our own:

``` r
skewness <- function(x) {
  m3 <- sum((x - mean(x))^3)             # numerator: sum of cubed deviations
  skew <- m3 / ((sd(x)^3) * length(x))   # divide by n * s^3
  skew
}
```

Test:

``` r
test <- c(1,1,1,1,1,1,1,1,1,1,2,3,4,5)
skewness(test)
```

```
## [1] 1.406826
```

``` r
test <- c(3,3,3,3,3,3,3,3,3,3,3,3,2,1)
skewness(test)
```

```
## [1] -2.231232
```

---

## Shape of the distribution [3]

### Kurtosis

.pull-left[
The curvature of the distribution

Read it from the chart ;-) or calculate it (3 is subtracted so that the normal distribution has the value 0):

$$ K = \frac {\sum_{i=1}^n (x_i - \bar{x})^4} {n \cdot s^4} - 3 $$

Positive if the curve is steeper, negative if it is flatter than the normal distribution
]

.pull-right[
]

---

## Shape of the distribution [3]

### Kurtosis

We write a function for that, too:

``` r
kurtosis <- function(x) {
  m4 <- sum((x - mean(x))^4)                 # numerator: sum of deviations to the 4th power
  kurt <- m4 / ((sd(x)^4) * length(x)) - 3   # divide by n * s^4, subtract 3 (normal distribution = 0)
  kurt
}
```

Test:

``` r
test <- c(1,2,3,4,4,5,6,7)
kurtosis(test)
```

```
## [1] -1.46875
```

``` r
test <- c(1,2,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,6,7)
kurtosis(test)
```

```
## [1] 2.011364
```
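
---

## Shape of the distribution [4]

### Applying both measures

.small[
As a small sketch of how the two helper functions could be applied to real values, the block below computes skewness and kurtosis for the Walternienburg cup widths and puts them next to the mean and the median. The names `cup_width` and `shape_summary` are chosen for this sketch only; the values are copied from the `tassen$x` output shown earlier, so the block runs without the CSV file. The two functions are repeated in compact form only to keep the block self-contained.

``` r
# Skewness and kurtosis with the same formulas as the helper
# functions defined on the previous slides.
skewness <- function(x) {
  sum((x - mean(x))^3) / (length(x) * sd(x)^3)
}
kurtosis <- function(x) {
  sum((x - mean(x))^4) / (length(x) * sd(x)^4) - 3
}

# Cup widths in cm, copied from the tassen$x output
# (assumption: typed in by hand instead of reading tassen.csv).
cup_width <- c(12.0, 19.5, 18.6, 12.9, 13.2, 9.9, 19.5, 8.4, 21.0, 18.9, 7.5,
               18.9, 8.1, 9.0, 7.8, 9.9, 10.2, 8.1, 12.0, 9.0, 26.1, 20.4)

# Collect the shape measures next to mean and median for comparison.
shape_summary <- c(
  mean     = mean(cup_width),
  median   = median(cup_width),
  skewness = skewness(cup_width),
  kurtosis = kurtosis(cup_width)
)
round(shape_summary, 3)
```

A positive skewness together with a mean above the median would point to a positively skewed distribution; a negative kurtosis would point to a curve flatter than the normal distribution.
]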