See this datascienceplus.com post https://datascienceplus.com/one-way-anova-in-r/ for the original work that inspired my efforts. Credit to Bidyut Ghosh (Assistant Professor at Symbiosis International University, India) for the original posting.

Special thanks also to Dani Navarro, The University of New South Wales (Sydney), for the book Learning Statistics with R http://dj-navarro.appspot.com/lsr/lsr-0.5.1.pdf (hereafter simply LSR) and the `lsr` package available through CRAN. I highly recommend it.

Let’s load the required R packages. See https://github.com/cran/lsr for more on lsr.

```
library(ggplot2)
library(lsr)
library(psych)
library(car)
library(tidyverse)
library(dunn.test)
library(BayesFactor)
library(scales)
library(knitr)
library(kableExtra)
options(width = 130)
options(knitr.table.format = "html")
```

The Oneway ANOVA is a statistical technique that allows us to compare mean differences of one outcome (dependent) variable across two or more groups (levels) of one independent variable (factor). If there are only two levels (e.g. Male/Female) of the independent (predictor) variable the results are analogous to Student's t test. ANOVA is also a special case of the GLM or regression models, so as the number of levels increases it may make more sense to try one of those approaches. ANOVA also allows for comparisons of mean differences across multiple factors (Factorial or N-way ANOVA), which we won't address here.
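To see the regression connection concretely, here's a minimal sketch (using simulated data, since we haven't loaded the tyre data yet) showing that `aov()` and `lm()` on the same factor produce identical F statistics:

```
# Simulated example (not the tyre data): ANOVA as a special case of regression
set.seed(42)
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  y     = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 11))
)

fit_aov <- aov(y ~ group, data = dat)  # classic one-way ANOVA
fit_lm  <- lm(y ~ group, data = dat)   # same model fit as a regression

summary(fit_aov)  # ANOVA table
anova(fit_lm)     # same F statistic and p-value from the regression fit
```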

Professor Ghosh’s original scenario can be summarized this way. Imagine that you are interested in understanding whether knowing the brand of car tire can help you predict whether you will get more or less mileage before you need to replace them. We’ll draw what is hopefully a random sample of 60 tires from four different manufacturers and use the mean mileage by brand to help inform our thinking. While we expect variation across our sample we’re interested in whether the differences between the tire brands (the groups) is significantly different than what we would expect in random variation within the groups.
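That between-group versus within-group comparison is exactly what the F statistic measures. As a quick sketch (again with simulated data, not the tyre data set), we can build the decomposition by hand and recover the same F that `aov` reports:

```
# Simulated mileage for four made-up brands, 15 tires each
set.seed(1)
mileage <- c(rnorm(15, 34), rnorm(15, 32), rnorm(15, 35), rnorm(15, 37))
brand   <- factor(rep(c("W", "X", "Y", "Z"), each = 15))

grand_mean  <- mean(mileage)
group_means <- tapply(mileage, brand, mean)

ss_between <- sum(15 * (group_means - grand_mean)^2) # variation between brands
ss_within  <- sum((mileage - group_means[brand])^2)  # variation within brands

# F is the ratio of the corresponding mean squares (df = 3 between, 56 within)
f_stat <- (ss_between / 3) / (ss_within / 56)
f_stat

summary(aov(mileage ~ brand))  # reports the identical F value
```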

Our research or testable hypothesis is that at least one of the tire brand population means differs from the others: \[\mu_i \ne \mu_j \text{ for at least one pair of brands } i, j\] Our null is basically “nope, brand doesn’t matter in predicting tire mileage – all brands are the same”: \[\mu_{Apollo} = \mu_{Bridgestone} = \mu_{CEAT} = \mu_{Falken}\]

He provides the following data set with 60 observations. I’ve chosen to download directly but you could of course park the file locally and work from there.

Column | Contains | Type |
---|---|---|
Brands | What brand tyre | factor |
Mileage | Tyre life in thousands | num |

```
tyre <- read.csv("https://datascienceplus.com/wp-content/uploads/2017/08/tyre.csv")
# tyre <- read.csv("tyre.csv") # if you have it on your local hard drive
str(tyre)
```

```
## 'data.frame': 60 obs. of 2 variables:
## $ Brands : Factor w/ 4 levels "Apollo","Bridgestone",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Mileage: num 33 36.4 32.8 37.6 36.3 ...
```

```
summary(tyre)
```

```
## Brands Mileage
## Apollo :15 Min. :27.88
## Bridgestone:15 1st Qu.:32.69
## CEAT :15 Median :34.84
## Falken :15 Mean :34.74
## 3rd Qu.:36.77
## Max. :41.05
```

```
head(tyre)
```

```
## Brands Mileage
## 1 Apollo 32.998
## 2 Apollo 36.435
## 3 Apollo 32.777
## 4 Apollo 37.637
## 5 Apollo 36.304
## 6 Apollo 35.915
```

```
# View(tyre) # if you use RStudio this is a nice way to see the data in spreadsheet format
```

The data set contains what we expected. The dependent variable `Mileage` is numeric and the independent variable `Brands` is of type factor. R is usually adept at coercing a chr string or an integer into a factor, but I find it best to explicitly make it a factor when you're working on ANOVAs.
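If `read.csv` had handed us a character column instead (the default behavior in R 4.0 and later), the explicit coercion would look like this (a tiny made-up data frame stands in for the real one):

```
# Hypothetical illustration: Brands arrives as character, we coerce to factor
tyre_chr <- data.frame(
  Brands  = rep(c("Apollo", "Bridgestone", "CEAT", "Falken"), each = 2),
  Mileage = c(33, 36.4, 31.8, 32.0, 34.8, 34.7, 37.6, 37.4),
  stringsAsFactors = FALSE
)
str(tyre_chr$Brands)                       # chr
tyre_chr$Brands <- factor(tyre_chr$Brands) # explicit coercion before modelling
str(tyre_chr$Brands)                       # Factor w/ 4 levels
```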

Let’s graph and describe the basics. First a simple `boxplot` of all 60 data points along with a summary using the `describe` command from the `psych` package. Then, in reverse order, a breakdown by group (in our case tire brand) with `describeBy` followed by a grouped boxplot.

```
boxplot(tyre$Mileage,
        horizontal = TRUE,
        main = "Mileage distribution across all brands",
        col = "blue")
```

```
describe(tyre) # the * behind Brands reminds us it's a factor and some of the numbers are nonsensical
```

```
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Brands* 1 60 2.50 1.13 2.50 2.50 1.48 1.00 4.00 3.00 0.00 -1.41 0.15
## Mileage 2 60 34.74 2.98 34.84 34.76 3.09 27.88 41.05 13.17 -0.11 -0.44 0.38
```

```
describeBy(tyre$Mileage, group = tyre$Brands, mat = TRUE, digits = 2)
```

```
## item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
## X11 1 Apollo 1 15 34.80 2.22 34.84 34.85 2.37 30.62 38.33 7.71 -0.18 -1.24 0.57
## X12 2 Bridgestone 1 15 31.78 2.20 32.00 31.83 1.65 27.88 35.01 7.13 -0.29 -1.05 0.57
## X13 3 CEAT 1 15 34.76 2.53 34.78 34.61 2.03 30.43 41.05 10.62 0.64 0.33 0.65
## X14 4 Falken 1 15 37.62 1.70 37.38 37.65 1.18 34.31 40.66 6.35 0.13 -0.69 0.44
```

```
boxplot(tyre$Mileage ~ tyre$Brands,
        main = "Boxplot comparing Mileage of Four Brands of Tyre",
        col = rainbow(4),
        horizontal = TRUE)
```

Let’s format the table `describeBy` generates to make it a little nicer using the `kable` function from `knitr` along with the `kableExtra` package. Luckily `describeBy` returns a dataframe when `mat = TRUE`, and after that we can select which columns to publish (dropping some of the less used) as well as changing the column labels as desired.

```
describeBy(tyre$Mileage, group = tyre$Brands, mat = TRUE) %>% # create dataframe
  select(Brand = group1, N = n, Mean = mean, SD = sd, Median = median, Min = min, Max = max,
         Skew = skew, Kurtosis = kurtosis, SEM = se) %>%
  kable(align = "lrrrrrrrrr", digits = 2, row.names = FALSE,
        caption = "Tire Mileage Brand Descriptive Statistics") %>%
  kable_styling(bootstrap_options = c("bordered", "responsive", "striped"), full_width = FALSE)
```

Brand | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | SEM |
---|---|---|---|---|---|---|---|---|---|
Apollo | 15 | 34.80 | 2.22 | 34.84 | 30.62 | 38.33 | -0.18 | -1.24 | 0.57 |
Bridgestone | 15 | 31.78 | 2.20 | 32.00 | 27.88 | 35.01 | -0.29 | -1.05 | 0.57 |
CEAT | 15 | 34.76 | 2.53 | 34.78 | 30.43 | 41.05 | 0.64 | 0.33 | 0.65 |
Falken | 15 | 37.62 | 1.70 | 37.38 | 34.31 | 40.66 | 0.13 | -0.69 | 0.44 |

Certainly much nicer looking and I only scratched the surface of the options available. We can certainly look at the numbers and learn a lot. But let’s see if we can also improve our plotting to be more informative.

The more I use `ggplot2` the more I love the ability to use it to customize the presentation of the data to optimize understanding! The next plot might be accused of being a little “busy” but it essentially answers our Oneway ANOVA question in one picture (note that I have stayed with the original decision to set \(\alpha\) = 0.01, i.e. 99% confidence intervals).

```
ggplot(tyre, aes(reorder(Brands, Mileage), Mileage, fill = Brands)) +
# ggplot(tyre, aes(Brands, Mileage, fill = Brands)) + # if you want to leave them alphabetic
  geom_jitter(colour = "dark gray", width = .1) +
  stat_boxplot(geom = 'errorbar', width = 0.4) +
  geom_boxplot() +
  labs(title = "Boxplot, dotplot and SEM plot of mileage for four brands of tyres",
       x = "Brands (sorted)",
       y = "Mileage (in thousands)",
       subtitle = "Gray dots=sample data points, Black dot=outlier, Blue dot=mean, Red=99% confidence interval",
       caption = "Data from https://datascienceplus.com/one-way-anova-in-r/") +
  guides(fill = "none") + # guides(fill = FALSE) is deprecated in current ggplot2
  # mean_cl_normal requires the Hmisc package to be installed
  stat_summary(fun.data = "mean_cl_normal", colour = "red", size = 1.5, fun.args = list(conf.int = .99)) +
  stat_summary(geom = "point", fun = mean, color = "blue") + # use fun.y = mean in ggplot2 < 3.3
  theme_bw()
```
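For the curious, the red bars that `mean_cl_normal` draws are ordinary t-based confidence intervals around each brand's mean, so we can reproduce the arithmetic by hand. A quick sketch (one simulated brand stands in for the actual tyre data):

```
# Hand-computed 99% CI for a mean, matching what mean_cl_normal(conf.int = .99) draws
set.seed(7)
x  <- rnorm(15, mean = 37.6, sd = 1.7) # stand-in for one brand's 15 mileage values
n  <- length(x)
se <- sd(x) / sqrt(n)                  # standard error of the mean
ci <- mean(x) + c(-1, 1) * qt(0.995, df = n - 1) * se
ci                                     # lower and upper 99% limits
```

The same limits come back from `t.test(x, conf.level = 0.99)$conf.int`, which is a handy cross-check.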