1 Introduction

This is an example for how to get started with analysing a road crash dataset (on a dummy/fake dataset). I’ll be using predominantly dplyr since it’s an intuitive package for carrying out data manipulations that I really like to use!

Note: I reduced the size of the dataset for speed

2 Load libraries

library(xlsx)
library(dplyr)

3 Load data

# Reads sheet 1 of the spreadsheet
data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE)

Throws an error because of lack of java memory, probably because it a large file?

Solution found on stack overflow

options(java.parameters = "-Xmx1000m")
data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE)

Still getting an error…found another solution on stack overflow using an additional library to load the data

library(openxlsx)
data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE)

Great that loaded the data!

However, the Date.of.crash column isn’t formatted quite right:

head(data)
##   year Date.of.crash   month Time Day.of.Week    District speed
## 1 1981         29591 january 1300    Saturday Los Angeles     x
## 2 1981         29592 january 1300    Saturday  Washington   100
## 3 1981         29593 january 1300    Saturday       Texas     x
## 4 1981         29594 january 1300      Sunday     Florida   100
## 5 1981         29595 january 1300    Saturday    New York   100
## 6 1981         29596 january 1300   Wednesday        Ohio   100
##   deceased's.vehicle other.vehicle #.vehicles.involved #.deceased Age Gender
## 1                car         truck                   2          1  99      M
## 2          motorbike             x                   1          1  27      M
## 3          motorbike           car                   2          1  39      M
## 4          motorbike     HMV truck                   2          1  21      M
## 5                car             x                   1          1  60      M
## 6                car     HMV truck                   2          1  40      F
##         culpability       Occupation Ethnicity Blood.alcohol      THC
## 1 deceased at fault                x      Mars             0       nd
## 2                na dairy farm owner     Venus           207        x
## 3 deceased at fault        farm hand     pluto             0 cannabis
## 4 deceased at fault          tourist      Mars             0       nd
## 5                na       unemployed     Venus             0 cannabis
## 6 deceased at fault                x     pluto            78       nd
##        stimulants opioids  sedatives Other.drugs hospital.meds
## 1              nd      nd zopiclone  paracetamol             x
## 2               x       x          x           x             x
## 3              nd      nd         nd          nd             x
## 4              nd      nd         nd          nd             x
## 5 methamphetamine      nd         nd          nd             x
## 6              nd      nd         nd          nd             x

We might be able to deal with this when loading the data using the detectDates = TRUE argument

data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE, detectDates = TRUE)

See if it worked:

head(data)
##   year Date.of.crash   month Time Day.of.Week    District speed
## 1 1981    1981-01-05 january 1300    Saturday Los Angeles     x
## 2 1981    1981-01-06 january 1300    Saturday  Washington   100
## 3 1981    1981-01-07 january 1300    Saturday       Texas     x
## 4 1981    1981-01-08 january 1300      Sunday     Florida   100
## 5 1981    1981-01-09 january 1300    Saturday    New York   100
## 6 1981    1981-01-10 january 1300   Wednesday        Ohio   100
##   deceased's.vehicle other.vehicle #.vehicles.involved #.deceased Age Gender
## 1                car         truck                   2          1  99      M
## 2          motorbike             x                   1          1  27      M
## 3          motorbike           car                   2          1  39      M
## 4          motorbike     HMV truck                   2          1  21      M
## 5                car             x                   1          1  60      M
## 6                car     HMV truck                   2          1  40      F
##         culpability       Occupation Ethnicity Blood.alcohol      THC
## 1 deceased at fault                x      Mars             0       nd
## 2                na dairy farm owner     Venus           207        x
## 3 deceased at fault        farm hand     pluto             0 cannabis
## 4 deceased at fault          tourist      Mars             0       nd
## 5                na       unemployed     Venus             0 cannabis
## 6 deceased at fault                x     pluto            78       nd
##        stimulants opioids  sedatives Other.drugs hospital.meds
## 1              nd      nd zopiclone  paracetamol             x
## 2               x       x          x           x             x
## 3              nd      nd         nd          nd             x
## 4              nd      nd         nd          nd             x
## 5 methamphetamine      nd         nd          nd             x
## 6              nd      nd         nd          nd             x

Looks like it did!

Note. you might need to convert the “x” cell values to “NA” for handeling them in R

4 Data cleaning

Convert cells with x to NA for better handling in R

data <- data %>%
  mutate(across(where(is.character), ~na_if(., "x")))

Check it worked ok

data %>%
  head()
##   year Date.of.crash   month Time Day.of.Week    District speed
## 1 1981    1981-01-05 january 1300    Saturday Los Angeles  <NA>
## 2 1981    1981-01-06 january 1300    Saturday  Washington   100
## 3 1981    1981-01-07 january 1300    Saturday       Texas  <NA>
## 4 1981    1981-01-08 january 1300      Sunday     Florida   100
## 5 1981    1981-01-09 january 1300    Saturday    New York   100
## 6 1981    1981-01-10 january 1300   Wednesday        Ohio   100
##   deceased's.vehicle other.vehicle #.vehicles.involved #.deceased Age Gender
## 1                car         truck                   2          1  99      M
## 2          motorbike          <NA>                   1          1  27      M
## 3          motorbike           car                   2          1  39      M
## 4          motorbike     HMV truck                   2          1  21      M
## 5                car          <NA>                   1          1  60      M
## 6                car     HMV truck                   2          1  40      F
##         culpability       Occupation Ethnicity Blood.alcohol      THC
## 1 deceased at fault             <NA>      Mars             0       nd
## 2                na dairy farm owner     Venus           207     <NA>
## 3 deceased at fault        farm hand     pluto             0 cannabis
## 4 deceased at fault          tourist      Mars             0       nd
## 5                na       unemployed     Venus             0 cannabis
## 6 deceased at fault             <NA>     pluto            78       nd
##        stimulants opioids  sedatives Other.drugs hospital.meds
## 1              nd      nd zopiclone  paracetamol          <NA>
## 2            <NA>    <NA>       <NA>        <NA>          <NA>
## 3              nd      nd         nd          nd          <NA>
## 4              nd      nd         nd          nd          <NA>
## 5 methamphetamine      nd         nd          nd          <NA>
## 6              nd      nd         nd          nd          <NA>

Looks good!

Re-name the column names with # in them (R reads this as a comments and won’t evaluate the code)

data <- data %>%
  rename(no.deceased = "#.deceased")

data <- data %>%
  rename(no.vehicles.involved = "#.vehicles.involved")

Now I want to change some of the variables

data$speed <- as.numeric(data$speed)
data$no.deceased <- as.numeric(data$no.deceased)
data$year <- as.character(data$year)

5 Summarise the data

5.1 Total number of drivers

Get the number of rows in the dataframe, and therefore the total number of drivers (assuming each row represents a distinct driver)

data %>%
  count()
##    n
## 1 99

99 individuals were involved in road crashes (note. I reduced the dataset so this is less)

5.2 Number of drivers over a certain age

This time we do the same thing (counting the number of rows) but after we have “filtered” (removed the rows) for which Age is greater than 60

data %>%
  filter(Age > 60) %>%
  count()
##    n
## 1 20

20 of these 99 people involved in road crashes were over 60 years old

Another thing of interest, “cutting” up a variable (in this case “Age”) into portions and summarising over it

data %>%
  group_by(Age.range = cut(Age, breaks= seq(0, 120, by = 10))) %>%
  summarise(num_individuals = n()) %>%
  arrange(as.numeric(Age.range))
## # A tibble: 9 x 2
##   Age.range num_individuals
##   <fct>               <int>
## 1 (10,20]                17
## 2 (20,30]                21
## 3 (30,40]                13
## 4 (40,50]                11
## 5 (50,60]                17
## 6 (60,70]                 6
## 7 (70,80]                 7
## 8 (80,90]                 5
## 9 (90,100]                2

Try the same but looking at a different variable (Blood.alcohol) and different age ranges

data %>%
  group_by(Age.range = cut(Age, breaks= seq(0, 120, by = 20))) %>%
  summarise(mean_blood_alcohol = mean(Blood.alcohol)) %>%
  arrange(as.numeric(Age.range))
## # A tibble: 5 x 2
##   Age.range mean_blood_alcohol
##   <fct>                  <dbl>
## 1 (0,20]                  59.2
## 2 (20,40]                 67.1
## 3 (40,60]                 29.3
## 4 (60,80]                 18.4
## 5 (80,100]                 0

5.3 Mean and standard deviations

data %>%
  # Choose which variables to look at (you can use more or less variables)
  group_by(District, Gender) %>%
  # Choose the variable to summarise (in this case I used "Age")
  summarise(mean_age = mean(Age), sd_age = sd(Age))
## # A tibble: 12 x 4
## # Groups:   District [6]
##    District    Gender mean_age sd_age
##    <chr>       <chr>     <dbl>  <dbl>
##  1 Florida     F          41.5   30.4
##  2 Florida     M          37.2   21.6
##  3 Los Angeles F          49.6   23.0
##  4 Los Angeles M          45.8   27.4
##  5 New York    F          32     NA  
##  6 New York    M          49.7   22.6
##  7 Ohio        F          36.6   19.2
##  8 Ohio        M          34.8   15.2
##  9 Texas       F          38.7   20.1
## 10 Texas       M          46.2   21.5
## 11 Washington  F          58.7   20.8
## 12 Washington  M          39.3   20.7
data %>%
  # Try looking at only "Day.of.Week" 
  group_by(Day.of.Week) %>%
  # Try looking at a different variable (in this case I used "speed")
  summarise(mean_Blood.alcohol = mean(Blood.alcohol), sd_Blood.alcohol = sd(Blood.alcohol))
## # A tibble: 7 x 3
##   Day.of.Week mean_Blood.alcohol sd_Blood.alcohol
##   <chr>                    <dbl>            <dbl>
## 1 Friday                   30.9              73.8
## 2 Monday                    5.53             21.4
## 3 Saturday                 76                88.9
## 4 Sunday                   85.1              95.6
## 5 Thursday                 17.8              60.2
## 6 Tuesday                  42.1              85.8
## 7 Wednesday                21                42.0

These results can be written to file:

data %>%
  group_by(District) %>%
  summarise(mean_speed = mean(speed), sd_speed = sd(speed)) %>%
  write.csv("speed_by_district.csv", row.names = FALSE)

5.4 Max and min

data %>%
  group_by(District, Gender) %>%
  summarise(min_blood_alcohol = min(Blood.alcohol), max_blood_alcohol = max(Blood.alcohol))
## # A tibble: 12 x 4
## # Groups:   District [6]
##    District    Gender min_blood_alcohol max_blood_alcohol
##    <chr>       <chr>              <dbl>             <dbl>
##  1 Florida     F                      0                 0
##  2 Florida     M                      0               209
##  3 Los Angeles F                      0                 0
##  4 Los Angeles M                      0               226
##  5 New York    F                      0                 0
##  6 New York    M                      0               217
##  7 Ohio        F                      0               239
##  8 Ohio        M                      0               207
##  9 Texas       F                      0               235
## 10 Texas       M                      0                68
## 11 Washington  F                      0               121
## 12 Washington  M                      0               228

5.5 Number of deceased drivers with alcohol in their blood

Create a new column/variable that is “yes” or “no” to presence of alcohol in blood based on the Blood.alcohol column

data <- data %>%
  mutate(alcohol_in_system = case_when(
    Blood.alcohol > 0 ~ "yes",
    Blood.alcohol == 0 ~ "no"))

Get the number of rows in the dataframe, and therefore the total number of drivers (assuming each row represents a distinct driver) after we have “filtered” (removed the rows) for which alcohol_in_system is yes

data %>%
  filter(alcohol_in_system == "yes") %>%
  count()
##    n
## 1 30

An alternative way of doing this, filter for rows/cases where at least one person was deceased and get the number of rows/cases where there was/wasn’t any alcohol in their system

data %>%
  filter(no.deceased > 0) %>%
  group_by(alcohol_in_system) %>%
  summarise(count = n())
## # A tibble: 2 x 2
##   alcohol_in_system count
##   <chr>             <int>
## 1 no                   68
## 2 yes                  30