This is an example for how to get started with analysing a road crash dataset (on a dummy/fake dataset). I’ll be using predominantly dplyr since it’s an intuitive package for carrying out data manipulations that I really like to use!
Note: I reduced the size of the dataset for speed
library(xlsx)
library(dplyr)
# Reads sheet 1 of the spreadsheet
data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE)
Throws an error because of lack of java memory, probably because it a large file?
Solution found on stack overflow
options(java.parameters = "-Xmx1000m")
data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE)
Still getting an error…found another solution on stack overflow using an additional library to load the data
library(openxlsx)
data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE)
Great that loaded the data!
However, the Date.of.crash
column isn’t formatted quite right:
head(data)
## year Date.of.crash month Time Day.of.Week District speed
## 1 1981 29591 january 1300 Saturday Los Angeles x
## 2 1981 29592 january 1300 Saturday Washington 100
## 3 1981 29593 january 1300 Saturday Texas x
## 4 1981 29594 january 1300 Sunday Florida 100
## 5 1981 29595 january 1300 Saturday New York 100
## 6 1981 29596 january 1300 Wednesday Ohio 100
## deceased's.vehicle other.vehicle #.vehicles.involved #.deceased Age Gender
## 1 car truck 2 1 99 M
## 2 motorbike x 1 1 27 M
## 3 motorbike car 2 1 39 M
## 4 motorbike HMV truck 2 1 21 M
## 5 car x 1 1 60 M
## 6 car HMV truck 2 1 40 F
## culpability Occupation Ethnicity Blood.alcohol THC
## 1 deceased at fault x Mars 0 nd
## 2 na dairy farm owner Venus 207 x
## 3 deceased at fault farm hand pluto 0 cannabis
## 4 deceased at fault tourist Mars 0 nd
## 5 na unemployed Venus 0 cannabis
## 6 deceased at fault x pluto 78 nd
## stimulants opioids sedatives Other.drugs hospital.meds
## 1 nd nd zopiclone paracetamol x
## 2 x x x x x
## 3 nd nd nd nd x
## 4 nd nd nd nd x
## 5 methamphetamine nd nd nd x
## 6 nd nd nd nd x
We might be able to deal with this when loading the data using the detectDates = TRUE
argument
data <- read.xlsx("/home/lkemp/forensics_road_crashes_starter/dummy.xlsx", 1, colNames = TRUE, rowNames = TRUE, detectDates = TRUE)
See if it worked:
head(data)
## year Date.of.crash month Time Day.of.Week District speed
## 1 1981 1981-01-05 january 1300 Saturday Los Angeles x
## 2 1981 1981-01-06 january 1300 Saturday Washington 100
## 3 1981 1981-01-07 january 1300 Saturday Texas x
## 4 1981 1981-01-08 january 1300 Sunday Florida 100
## 5 1981 1981-01-09 january 1300 Saturday New York 100
## 6 1981 1981-01-10 january 1300 Wednesday Ohio 100
## deceased's.vehicle other.vehicle #.vehicles.involved #.deceased Age Gender
## 1 car truck 2 1 99 M
## 2 motorbike x 1 1 27 M
## 3 motorbike car 2 1 39 M
## 4 motorbike HMV truck 2 1 21 M
## 5 car x 1 1 60 M
## 6 car HMV truck 2 1 40 F
## culpability Occupation Ethnicity Blood.alcohol THC
## 1 deceased at fault x Mars 0 nd
## 2 na dairy farm owner Venus 207 x
## 3 deceased at fault farm hand pluto 0 cannabis
## 4 deceased at fault tourist Mars 0 nd
## 5 na unemployed Venus 0 cannabis
## 6 deceased at fault x pluto 78 nd
## stimulants opioids sedatives Other.drugs hospital.meds
## 1 nd nd zopiclone paracetamol x
## 2 x x x x x
## 3 nd nd nd nd x
## 4 nd nd nd nd x
## 5 methamphetamine nd nd nd x
## 6 nd nd nd nd x
Looks like it did!
Note. you might need to convert the “x” cell values to “NA” for handeling them in R
Convert cells with x
to NA
for better handling in R
data <- data %>%
mutate(across(where(is.character), ~na_if(., "x")))
Check it worked ok
data %>%
head()
## year Date.of.crash month Time Day.of.Week District speed
## 1 1981 1981-01-05 january 1300 Saturday Los Angeles <NA>
## 2 1981 1981-01-06 january 1300 Saturday Washington 100
## 3 1981 1981-01-07 january 1300 Saturday Texas <NA>
## 4 1981 1981-01-08 january 1300 Sunday Florida 100
## 5 1981 1981-01-09 january 1300 Saturday New York 100
## 6 1981 1981-01-10 january 1300 Wednesday Ohio 100
## deceased's.vehicle other.vehicle #.vehicles.involved #.deceased Age Gender
## 1 car truck 2 1 99 M
## 2 motorbike <NA> 1 1 27 M
## 3 motorbike car 2 1 39 M
## 4 motorbike HMV truck 2 1 21 M
## 5 car <NA> 1 1 60 M
## 6 car HMV truck 2 1 40 F
## culpability Occupation Ethnicity Blood.alcohol THC
## 1 deceased at fault <NA> Mars 0 nd
## 2 na dairy farm owner Venus 207 <NA>
## 3 deceased at fault farm hand pluto 0 cannabis
## 4 deceased at fault tourist Mars 0 nd
## 5 na unemployed Venus 0 cannabis
## 6 deceased at fault <NA> pluto 78 nd
## stimulants opioids sedatives Other.drugs hospital.meds
## 1 nd nd zopiclone paracetamol <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 nd nd nd nd <NA>
## 4 nd nd nd nd <NA>
## 5 methamphetamine nd nd nd <NA>
## 6 nd nd nd nd <NA>
Looks good!
Re-name the column names with #
in them (R reads this as a comments and won’t evaluate the code)
data <- data %>%
rename(no.deceased = "#.deceased")
data <- data %>%
rename(no.vehicles.involved = "#.vehicles.involved")
Now I want to change some of the variables
data$speed <- as.numeric(data$speed)
data$no.deceased <- as.numeric(data$no.deceased)
data$year <- as.character(data$year)
Get the number of rows in the dataframe, and therefore the total number of drivers (assuming each row represents a distinct driver)
data %>%
count()
## n
## 1 99
99 individuals were involved in road crashes (note. I reduced the dataset so this is less)
This time we do the same thing (counting the number of rows) but after we have “filtered” (removed the rows) for which Age
is greater than 60
data %>%
filter(Age > 60) %>%
count()
## n
## 1 20
20 of these 99 people involved in road crashes were over 60 years old
Another thing of interest, “cutting” up a variable (in this case “Age”) into portions and summarising over it
data %>%
group_by(Age.range = cut(Age, breaks= seq(0, 120, by = 10))) %>%
summarise(num_individuals = n()) %>%
arrange(as.numeric(Age.range))
## # A tibble: 9 x 2
## Age.range num_individuals
## <fct> <int>
## 1 (10,20] 17
## 2 (20,30] 21
## 3 (30,40] 13
## 4 (40,50] 11
## 5 (50,60] 17
## 6 (60,70] 6
## 7 (70,80] 7
## 8 (80,90] 5
## 9 (90,100] 2
Try the same but looking at a different variable (Blood.alcohol) and different age ranges
data %>%
group_by(Age.range = cut(Age, breaks= seq(0, 120, by = 20))) %>%
summarise(mean_blood_alcohol = mean(Blood.alcohol)) %>%
arrange(as.numeric(Age.range))
## # A tibble: 5 x 2
## Age.range mean_blood_alcohol
## <fct> <dbl>
## 1 (0,20] 59.2
## 2 (20,40] 67.1
## 3 (40,60] 29.3
## 4 (60,80] 18.4
## 5 (80,100] 0
data %>%
# Choose which variables to look at (you can use more or less variables)
group_by(District, Gender) %>%
# Choose the variable to summarise (in this case I used "Age")
summarise(mean_age = mean(Age), sd_age = sd(Age))
## # A tibble: 12 x 4
## # Groups: District [6]
## District Gender mean_age sd_age
## <chr> <chr> <dbl> <dbl>
## 1 Florida F 41.5 30.4
## 2 Florida M 37.2 21.6
## 3 Los Angeles F 49.6 23.0
## 4 Los Angeles M 45.8 27.4
## 5 New York F 32 NA
## 6 New York M 49.7 22.6
## 7 Ohio F 36.6 19.2
## 8 Ohio M 34.8 15.2
## 9 Texas F 38.7 20.1
## 10 Texas M 46.2 21.5
## 11 Washington F 58.7 20.8
## 12 Washington M 39.3 20.7
data %>%
# Try looking at only "Day.of.Week"
group_by(Day.of.Week) %>%
# Try looking at a different variable (in this case I used "speed")
summarise(mean_Blood.alcohol = mean(Blood.alcohol), sd_Blood.alcohol = sd(Blood.alcohol))
## # A tibble: 7 x 3
## Day.of.Week mean_Blood.alcohol sd_Blood.alcohol
## <chr> <dbl> <dbl>
## 1 Friday 30.9 73.8
## 2 Monday 5.53 21.4
## 3 Saturday 76 88.9
## 4 Sunday 85.1 95.6
## 5 Thursday 17.8 60.2
## 6 Tuesday 42.1 85.8
## 7 Wednesday 21 42.0
These results can be written to file:
data %>%
group_by(District) %>%
summarise(mean_speed = mean(speed), sd_speed = sd(speed)) %>%
write.csv("speed_by_district.csv", row.names = FALSE)
data %>%
group_by(District, Gender) %>%
summarise(min_blood_alcohol = min(Blood.alcohol), max_blood_alcohol = max(Blood.alcohol))
## # A tibble: 12 x 4
## # Groups: District [6]
## District Gender min_blood_alcohol max_blood_alcohol
## <chr> <chr> <dbl> <dbl>
## 1 Florida F 0 0
## 2 Florida M 0 209
## 3 Los Angeles F 0 0
## 4 Los Angeles M 0 226
## 5 New York F 0 0
## 6 New York M 0 217
## 7 Ohio F 0 239
## 8 Ohio M 0 207
## 9 Texas F 0 235
## 10 Texas M 0 68
## 11 Washington F 0 121
## 12 Washington M 0 228
Create a new column/variable that is “yes” or “no” to presence of alcohol in blood based on the Blood.alcohol
column
data <- data %>%
mutate(alcohol_in_system = case_when(
Blood.alcohol > 0 ~ "yes",
Blood.alcohol == 0 ~ "no"))
Get the number of rows in the dataframe, and therefore the total number of drivers (assuming each row represents a distinct driver) after we have “filtered” (removed the rows) for which alcohol_in_system
is yes
data %>%
filter(alcohol_in_system == "yes") %>%
count()
## n
## 1 30
An alternative way of doing this, filter for rows/cases where at least one person was deceased and get the number of rows/cases where there was/wasn’t any alcohol in their system
data %>%
filter(no.deceased > 0) %>%
group_by(alcohol_in_system) %>%
summarise(count = n())
## # A tibble: 2 x 2
## alcohol_in_system count
## <chr> <int>
## 1 no 68
## 2 yes 30