Who is going to finish the Köln Marathon?
By Lukas Steger
Inspired by this question on Quora, I wanted to dig into marathon results, and in particular into the “DNFs” (participants who started the marathon but didn’t make it to the finish line).
I would have liked to do my little analysis on the results from the Berlin Marathon, but unfortunately I couldn’t find any results including the DNFs (if you have a hint where to find them please let me know 😄). So instead I used results from the Köln Marathon.
In Python I made a rather simple web scraper using the selenium library to get the 2018 and 2019 results, including all split times, from the official website and implemented some rough preprocessing. Both notebooks and the results as CSV files can be found in the GitHub repo. First I will load the roughly preprocessed data from there and do some more preprocessing on it:
library(tidyverse)
library(janitor)
library(knitr)
library(kableExtra)
library(sf)
library(osmdata)
library(ggrepel)
library(patchwork)
library(openrouteservice)
library(geosphere)
library(ggmap)
#load data from GitHub and clean the column names with the awesome janitor::clean_names() function
raw_result <- read.csv(file = "https://github.com/quickcoffee/koeln_marathon/raw/master/Data/marathon_cologne_complete.csv") %>%
janitor::clean_names() %>%
select(-x)
head(raw_result) %>%
kable() %>%
kable_styling(font_size = 7)
overall_place | bib | name | birth_year | nationality | club | gender_place | age_group_place | netto_time | brutto_time | x5km | x10km | x15km | hm | x25km | x30km | x35km | x40km | year | dnf | gender |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Tobias BLUM* | 1995 | GER | LC Rehlingen | | | 8215 | 8217 | 944 | 1904 | 2862 | 4040 | 4780 | 5743 | 6701 | 7751 | 2018 | 0 | M |
2 | 4 | Marcin BLAZINSKI* | 1988 | GER | LG farbtex Nordschwarzwald | | | 8404 | 8406 | 940 | 1899 | 2854 | 4034 | 4784 | 5806 | 6835 | 7938 | 2018 | 0 | M |
3 | 2 | Dominik FABIANOWSKI* | 1989 | GER | ASV Köln | | | 8477 | 8479 | 997 | 1990 | 2983 | 4211 | 4992 | 6009 | 6986 | 8022 | 2018 | 0 | M |
4 | 3 | Christian SCHREINER* | 1986 | GER | LAZ Puma Rhein-Sieg | | | 8626 | 8628 | 998 | 1990 | 2983 | 4211 | 4992 | 6009 | 7026 | 8146 | 2018 | 0 | M |
5 | 7 | Nikki JOHNSTONE* | 1984 | GBR | ART-Düsseldorf | | | 8743 | 8746 | 1033 | 2087 | 3141 | 4421 | 5217 | 6249 | 7244 | 8298 | 2018 | 0 | M |
6 | 8 | Philippe GILLEN* | 1989 | LUX | Run Squad CGN | | | 8772 | 8774 | 1033 | 2087 | 3142 | 4421 | 5220 | 6249 | 7257 | 8312 | 2018 | 0 | M |
results <- raw_result %>%
#extract city names from "club" variable via regular expression and remove the city from club
mutate(city = trimws(str_extract(club, "(?<=\\().+?(?=\\))")),
city = na_if(city, ""), #replace empty city strings with na
club = trimws(str_replace(club, "\\(.+\\)", replacement = NA_character_)),
club = na_if(club, ""), #replace empty club strings with na
#rough estimate of the participant's age
age = year-birth_year,
#extract age group and age group place
gender_place = str_extract(gender_place, "[:digit:]{1,6}(?=\\.)|DNF"),
age_group = str_extract(age_group_place, "(?<=([:digit:]{1,6}\\.[:blank:])|DNF[:blank:]).{3,5}"),
age_group_place = str_extract(age_group_place, "[:digit:]{1,6}(?=\\.[:blank:].{3,5})|DNF"),
#calculate split times
start_delay = brutto_time-netto_time,
split_5km = x5km,
split_10km = x10km-x5km,
split_15km = x15km-x10km,
split_hm = hm-x15km,
split_25km = x25km-hm,
split_30km = x30km-x25km,
split_35km = x35km-x30km,
split_40km = x40km-x35km,
split_finish = netto_time-x40km,
#remove * from name
name = str_replace(name, "\\*", ""),
#add age bins
age_bins = cut(age, breaks = c(-Inf, 25, 35, 45, 55, 65, Inf),
labels = c("0-25", "26-35", "36-45", "46-55", "56-65", "66+"),
include.lowest = FALSE),
#lump top 12 nationalities together
nationality = fct_lump_n(nationality, 12),
#create dummy variable for dnf
dnf = case_when(is.na(overall_place) ~ "DNF",
!is.na(overall_place) ~ "Finisher")) %>%
#filter out pace runners
filter(str_detect(name, "Pace", negate = TRUE)) %>%
#count the number of participations (only 2018 and 2019)
add_count(name, nationality, birth_year, gender, name = "participations") %>%
#drop bib and name
select(-bib, -name) %>%
#add runner_id
mutate(runner_id = row_number())
#load geo coordinates and distances to the start csv file (requested via Google Maps Geocode API)
city_dist <- read_csv("../../static/data/koeln_marathon/koeln_marathon_cities.csv") %>%
select(city, dist)
results <- results %>%
left_join(city_dist, by = "city")
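The distances in that file were requested from the Google Maps Geocode API beforehand. As a rough sketch of how such a file could be built (not the exact code used; it assumes the already loaded {ggmap} and {geosphere} packages, a registered Google API key, and uses the straight-line distance as a simple stand-in):
#distinct city names extracted from the club field above
cities <- results %>%
  distinct(city) %>%
  drop_na(city)
#geocode the city names (returns a tibble with lon/lat columns)
city_coords <- ggmap::geocode(cities$city)
#approximate lon/lat of the start area in Cologne (assumption for illustration)
cologne_start <- c(6.9603, 50.9375)
#straight-line distance from each city to the start in km
city_dist_sketch <- cities %>%
  bind_cols(city_coords) %>%
  mutate(dist = geosphere::distHaversine(cbind(lon, lat), cologne_start) / 1000)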
After that is done, let’s have a look at where the participants “drop out” during the race:
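A minimal sketch (not the original plotting code) of how the drop-out point can be derived, assuming the last recorded split of a DNF runner marks where they left the race:
#take the last split a DNF runner recorded as the point where they dropped out
dnf_points <- results %>%
  filter(dnf == "DNF") %>%
  mutate(last_split = case_when(!is.na(x40km) ~ "40km",
                                !is.na(x35km) ~ "35km",
                                !is.na(x30km) ~ "30km",
                                !is.na(x25km) ~ "25km",
                                !is.na(hm) ~ "HM",
                                !is.na(x15km) ~ "15km",
                                !is.na(x10km) ~ "10km",
                                !is.na(x5km) ~ "5km",
                                TRUE ~ "< 5km"),
         last_split = factor(last_split, levels = c("< 5km", "5km", "10km", "15km", "HM",
                                                    "25km", "30km", "35km", "40km")))
#count DNFs per last recorded split and plot
dnf_points %>%
  count(last_split) %>%
  ggplot(aes(x = last_split, y = n)) +
  geom_col() +
  labs(x = "Last recorded split", y = "Number of DNFs") +
  theme_light()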
There is a rapid increase after the halfway point. To visualise this increase in relation to the actual marathon course, see the map below:

Map of DNFs at Cologne Marathon
Female starters tend to stop earlier, while male starters tend to drop out of the race closer to the finish:
The share of DNFs among all participants in an age group is highest for the oldest and the youngest starters:
When comparing finishers and DNFs for whom it was possible to extract a city (usually runners who did not sign up as part of a club), it looks like the median distance from their hometown to Köln is lower for the DNFs than for the finishers.
Across the top 12 nationalities by number of participants, runners from Spain and the USA have the lowest share of DNFs, while the Dutch rank the highest.
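The shares behind these comparisons are computed outside the snippets shown above; a small sketch of the pattern (illustrated for nationality, the age-group and gender shares work the same way) could look like this:
#share of DNFs per nationality (same pattern works for age_bins or gender)
results %>%
  count(nationality, dnf) %>%
  group_by(nationality) %>%
  mutate(share_dnf = n / sum(n)) %>%
  ungroup() %>%
  filter(dnf == "DNF") %>%
  arrange(desc(share_dnf))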
Based on all of this information, it could be interesting to see whether it is possible to predict DNFs at the halfway point. In the next step I will engineer some additional features and prepare a dataset to train with {tidymodels}.
Feature Engineering and Preprocessing
The first feature I would like to add is the club size for the participants who signed up as part of a club, as it seems like there is some difference between finishers and DNFs:
#engineer club size
results_preprocessed <- results %>%
#filter out runners who dropped out before KM21
filter(!(dnf=="DNF" & is.na(hm))) %>%
#drop not-used features or features with data leakage (eg. start delay is only available for finishers)
select(-overall_place, -birth_year, -gender_place, -age_group_place, -brutto_time, -netto_time, -city, -start_delay, -age_bins,
-x25km, -x30km, -x35km, -x40km, -split_25km, -split_30km, -split_35km, -split_40km, -split_finish, -age_group) %>%
#group by year and club to get the club size per year
group_by(year, club) %>%
#set club size to 0 for people without a club
mutate(club_starters = case_when(!is.na(club) ~ as.numeric(n()),
is.na(club) ~ 0)) %>%
ungroup() %>%
select(-club)
#plot regarding clubsize
club_size_p <- results_preprocessed %>%
ggplot(aes(x=dnf, y=club_starters, fill=dnf))+
geom_boxplot()+
scale_y_continuous(trans='log10')+
theme_light()+
labs(y="Number of starters from the same club",
x=NULL)+
theme(plot.background = element_rect(fill = "#F4F4F4"),
panel.background = element_rect(fill = "#F4F4F4",
colour = "#F4F4F4",
size = 0.5, linetype = "solid"))
#plot
club_size_p
As the next feature, the standard deviation across the relative split times is calculated for each runner:
results_preprocessed <- results_preprocessed %>%
mutate(pace_variation = pmap_dbl(.l = select(., starts_with("split_")), .f = ~sd(c(...), na.rm = TRUE)))
results_preprocessed %>%
ggplot(aes(x=dnf, y=pace_variation, fill=dnf))+
geom_boxplot()+
theme_light()+
labs(y="Relative split times standard deviation",
x=NULL)+
theme(plot.background = element_rect(fill = "#F4F4F4"),
panel.background = element_rect(fill = "#F4F4F4",
colour = "#F4F4F4",
size = 0.5, linetype = "solid"))
In the final feature engineering step I focus on the relative and absolute split times. First I compare the pace of each relative split to the average pace during the first half of the marathon. On these new rel_pace_* features I fit a linear regression to get the slope across the splits (i.e. did a runner keep a steady pace, or did the pace increase/decrease during the first half of the marathon?).
A similar approach is applied to the percentile rank of the relative and absolute split times in order to see whether a runner got faster or slower compared to the other runners.
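To illustrate what these slope features capture, here is a tiny example with made-up numbers (not real data): a runner whose relative pace drifts upwards across the four splits, i.e. who slows down, gets a positive slope, while a perfectly even runner gets a slope of zero:
#made-up relative paces for the four splits of one (slowing) runner
example_runner <- tibble(split = 1:4, rel_pace = c(0.97, 0.99, 1.02, 1.05))
lm(rel_pace ~ split, data = example_runner)$coeff[[2]] #slope of roughly 0.027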
As the last preprocessing step, the distance from a participant’s hometown to the start is discretized into four roughly equal-sized groups, and a “missing” category is added.
results_preprocessed <- results_preprocessed %>%
#drop observations with missing split times in order to be able to do linear regression to get relative pace slope
drop_na(starts_with("split_")) %>%
#get the relative pace compared to the average pace of the first half of the marathon
mutate(rel_pace_5km = (split_5km/5)/(hm/21.0975),
rel_pace_10km = (split_10km/5)/(hm/21.0975),
rel_pace_15km = (split_15km/5)/(hm/21.0975),
rel_pace_hm = (split_hm/6.0975)/(hm/21.0975)) %>%
#calculate the slope of the relative pace
mutate(rel_pace_slope = pmap_dbl(.l = select(., starts_with("rel_pace_")), .f = ~lm(rel_pace~split,
tibble(rel_pace = c(...),
split = c(1:4)))$coeff[[2]])) %>%
#calculate the percentile per year on absolute and relative split times
group_by(year) %>%
mutate(rel_percentile_5km = percent_rank(split_5km),
rel_percentile_10km = percent_rank(split_10km),
rel_percentile_15km = percent_rank(split_15km),
rel_percentile_hm = percent_rank(split_hm),
percentile_5km = percent_rank(split_5km),
percentile_10km = percent_rank(x10km),
percentile_15km = percent_rank(x15km),
percentile_hm = percent_rank(hm)) %>%
ungroup() %>%
select(-year) %>%
#calculate the slope of the percentiles
mutate(rel_percentile_slope = pmap_dbl(.l = select(., starts_with("rel_percentile_")), .f = ~lm(rel_percentile~split,
tibble(rel_percentile = c(...),
split = c(1:4)))$coeff[[2]]),
percentile_slope = pmap_dbl(.l = select(., starts_with("percentile_")), .f = ~lm(percentile~split,
tibble(percentile = c(...),
split = c(1:4)))$coeff[[2]]))
#discretize distance into 4 equal sized groups and add "missing" as an explicit group
results_preprocessed <- results_preprocessed %>%
mutate(dist_cat = cut_number(dist, n=4) %>% fct_explicit_na()) %>%
#drop no longer used variables
select(-dist, -x5km, -x10km, -x15km, -hm, -starts_with("split_"))
Modelling!!
Splitting the data, creating the recipe (with SMOTE via the {themis} package due to heavy class imbalance) and creating folds for resampling:
library(tidymodels)
library(themis) #for SMOTE
#create initial split with strata
set.seed(1234)
marathon_split <- initial_split(results_preprocessed, prop = .8, strata = dnf)
marathon_train <- training(marathon_split)
marathon_test <- testing(marathon_split)
#create folds for resampling
set.seed(1234)
marathon_folds <- vfold_cv(marathon_train, strata = dnf, v = 5)
#create recipe
marathon_rec <- recipe(dnf ~ ., data = marathon_train) %>%
step_rm(runner_id) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_smote(dnf) %>%
step_zv(all_numeric()) %>%
step_normalize(all_numeric()) %>%
prep()
We will create two classification models: one using logistic regression and the other using eXtreme Gradient Boosting (xgboost).
Both models will be tuned over several hyperparameters via a grid search using the {tune} package.
library(workflows)
library(tune)
#set model specification
glm_spec <- logistic_reg(mode = "classification",
penalty = tune(),
mixture = tune()) %>%
set_engine("glmnet")
#workflow
glm_wflow <-
workflow() %>%
add_model(glm_spec) %>%
add_recipe(marathon_rec)
#train resamples
set.seed(1291)
glm_res <-
tune_grid(
glm_wflow,
resamples = marathon_folds,
grid = 5,
metrics = metric_set(roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE, verbose = TRUE, allow_par = TRUE)
)
glm_res %>%
show_best(metric = "roc_auc") %>%
kable() %>%
kable_styling(font_size = 8)
penalty | mixture | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|
0.0000008 | 0.7329869 | roc_auc | binary | 0.7111074 | 5 | 0.0134748 |
0.0000000 | 0.4571387 | roc_auc | binary | 0.7110537 | 5 | 0.0135265 |
0.0000044 | 0.9276452 | roc_auc | binary | 0.7106831 | 5 | 0.0137492 |
0.0007173 | 0.4105434 | roc_auc | binary | 0.7105717 | 5 | 0.0139064 |
0.1769055 | 0.1284538 | roc_auc | binary | 0.6893192 | 5 | 0.0067731 |
Across the resamples, the best hyperparameter combination for the logistic regression model reached an AUC of about 0.71.
Let’s see if the xgboost performs better.
#set model specification
xgb_spec <- boost_tree(mode = "classification",
mtry = tune(),
trees = tune(),
min_n = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune(),
sample_size = tune()) %>%
set_engine("xgboost")
#workflow
xgb_wflow <-
workflow() %>%
add_model(xgb_spec) %>%
add_recipe(marathon_rec)
#train resamples
#doParallel for paralell processing:
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = TRUE)-1)
registerDoParallel(cl) #register the cluster so the folds can actually be tuned in parallel
set.seed(1291)
xgb_res <-
tune_grid(
xgb_wflow,
resamples = marathon_folds,
grid = 20,
metrics = metric_set(roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE, verbose = TRUE, allow_par = TRUE)
)
xgb_res %>%
show_best(metric = "roc_auc") %>%
kable() %>%
kable_styling(font_size = 8)
mtry | trees | min_n | tree_depth | learn_rate | loss_reduction | sample_size | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|---|---|---|---|---|
24 | 816 | 22 | 10 | 0.0401959 | 0.0000000 | 0.0894969 | roc_auc | binary | 0.7728297 | 5 | 0.0099047 |
17 | 1772 | 14 | 8 | 0.0266403 | 0.0000000 | 0.0380707 | roc_auc | binary | 0.7648839 | 5 | 0.0072832 |
9 | 595 | 4 | 11 | 0.0000033 | 0.0000000 | 0.0505695 | roc_auc | binary | 0.7406154 | 5 | 0.0067896 |
2 | 1909 | 11 | 13 | 0.0000001 | 0.0003618 | 0.0846281 | roc_auc | binary | 0.7330481 | 5 | 0.0101271 |
36 | 1859 | 12 | 8 | 0.0032850 | 0.1676054 | 0.0711439 | roc_auc | binary | 0.7315023 | 5 | 0.0115941 |
It seems like the AUC is a bit better than that of the logistic regression. The ROC plot shows the difference between the two models across all hyperparameter combinations:
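The ROC plot is not reproduced as code here; a sketch of how it could be drawn from the saved resampling predictions (assuming the default prediction columns from {tune}, i.e. the truth column dnf and the class probability column .pred_DNF) might look like this:
#combine the saved out-of-fold predictions of both models
roc_data <- bind_rows(collect_predictions(glm_res) %>% mutate(model = "Logistic regression"),
                      collect_predictions(xgb_res) %>% mutate(model = "xgboost")) %>%
  #one ROC curve per model and hyperparameter combination
  group_by(model, .config) %>%
  roc_curve(truth = dnf, .pred_DNF)
#plot all curves, coloured by model
roc_data %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity,
             group = interaction(model, .config), colour = model)) +
  geom_path(alpha = 0.4) +
  geom_abline(lty = 2) +
  coord_equal() +
  theme_light()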
Across all hyperparameter combinations on all folds, however, the logistic regression seems to perform better than the xgboost. Let’s fit the best combination of each model on the complete training data and get the predictions for the test data:
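A sketch of that final step (using select_best(), finalize_workflow() and last_fit() from {tune}; not necessarily the exact code used) could look like this:
#finalize both workflows with their best hyperparameters, refit on the full training data
#and predict on the test data held back by the initial split
best_glm <- select_best(glm_res, metric = "roc_auc")
glm_final <- finalize_workflow(glm_wflow, best_glm) %>%
  last_fit(marathon_split, metrics = metric_set(roc_auc, sens, spec))
best_xgb <- select_best(xgb_res, metric = "roc_auc")
xgb_final <- finalize_workflow(xgb_wflow, best_xgb) %>%
  last_fit(marathon_split, metrics = metric_set(roc_auc, sens, spec))
#test set metrics for both models
collect_metrics(glm_final)
collect_metrics(xgb_final)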
Both models perform quite similarly on the test data.
Below you can find the confusion matrix for the logistic regression model:
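A conf_mat() call along these lines (a sketch based on the glm_final object from the last_fit() sketch above) produces the matrix shown:
#confusion matrix of the logistic regression model on the test data
glm_final %>%
  collect_predictions() %>%
  conf_mat(truth = dnf, estimate = .pred_class)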
## Truth
## Prediction DNF Finisher
## DNF 47 525
## Finisher 39 1306
Among the DNFs, the model made most of its false predictions for runners between the 0.25 and 0.75 percentiles, while among the finishers it didn’t perform so well for the very low and very high ranked runners at the halfway point.
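This breakdown is not part of the snippets shown above; a rough sketch (assuming the .row column returned by collect_predictions() indexes results_preprocessed, and binning on the percentile_hm feature) could look like this:
#join the test set predictions back to the half-marathon percentile and bin it
collect_predictions(glm_final) %>%
  mutate(percentile_hm = results_preprocessed$percentile_hm[.row],
         percentile_bin = cut(percentile_hm, breaks = c(0, 0.25, 0.75, 1), include.lowest = TRUE),
         correct = .pred_class == dnf) %>%
  count(dnf, percentile_bin, correct)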
Overall the logistic regression model does quite a good job at identifying DNFs, but it also falsely flags a large portion of runners who actually finished.
Such a model could be further enhanced with booking data (e.g. how many days before the marathon did a participant sign up?) and historical data (e.g. when did a participant last run the Köln Marathon?), but it already offers a great opportunity to identify struggling participants and give them a bit of extra support in the form of energy gels, drinks or cheering!
Let me know how you think this model could be improved! 😄
/Lukas