Who is going to finish the Köln Marathon?
By Lukas Steger
Inspired by this question on Quora, I wanted to dig into marathon results, and in particular into the “DNFs” (participants who started the marathon but didn’t make it to the finish line).
I would have liked to do my little analysis on the results from the Berlin Marathon, but unfortunately I couldn’t find any results including the DNFs (if you have a hint where to find them please let me know 😄). So instead I used results from the Köln Marathon.
In Python I made a rather simple web scraper using the selenium library to get the 2018 and 2019 results, including all split times, from the official website and implemented some rough preprocessing. Both notebooks and the results as CSV files can be found in the GitHub repo. First I will load the roughly preprocessed data from there and do some more preprocessing on it:
library(tidyverse)
library(janitor)
library(knitr)
library(kableExtra)
library(sf)
library(osmdata)
library(ggrepel)
library(patchwork)
library(openrouteservice)
library(geosphere)
library(ggmap)
#load data from GitHub and clean the column names with the awesome janitor::clean_names() function
raw_result <- read.csv(file = "https://github.com/quickcoffee/koeln_marathon/raw/master/Data/marathon_cologne_complete.csv") %>%
janitor::clean_names() %>%
select(-x)
head(raw_result) %>%
kable() %>%
kable_styling(font_size = 7)
overall_place | bib | name | birth_year | nationality | club | gender_place | age_group_place | netto_time | brutto_time | x5km | x10km | x15km | hm | x25km | x30km | x35km | x40km | year | dnf | gender |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Tobias BLUM* | 1995 | GER | LC Rehlingen | | | 8215 | 8217 | 944 | 1904 | 2862 | 4040 | 4780 | 5743 | 6701 | 7751 | 2018 | 0 | M |
2 | 4 | Marcin BLAZINSKI* | 1988 | GER | LG farbtex Nordschwarzwald | | | 8404 | 8406 | 940 | 1899 | 2854 | 4034 | 4784 | 5806 | 6835 | 7938 | 2018 | 0 | M |
3 | 2 | Dominik FABIANOWSKI* | 1989 | GER | ASV Köln | | | 8477 | 8479 | 997 | 1990 | 2983 | 4211 | 4992 | 6009 | 6986 | 8022 | 2018 | 0 | M |
4 | 3 | Christian SCHREINER* | 1986 | GER | LAZ Puma Rhein-Sieg | | | 8626 | 8628 | 998 | 1990 | 2983 | 4211 | 4992 | 6009 | 7026 | 8146 | 2018 | 0 | M |
5 | 7 | Nikki JOHNSTONE* | 1984 | GBR | ART-Düsseldorf | | | 8743 | 8746 | 1033 | 2087 | 3141 | 4421 | 5217 | 6249 | 7244 | 8298 | 2018 | 0 | M |
6 | 8 | Philippe GILLEN* | 1989 | LUX | Run Squad CGN | | | 8772 | 8774 | 1033 | 2087 | 3142 | 4421 | 5220 | 6249 | 7257 | 8312 | 2018 | 0 | M |
results <- raw_result %>%
#extract city names from "club" variable via regular expression and remove the city from club
mutate(city = trimws(str_extract(club, "(?<=\\().+?(?=\\))")),
city = na_if(city, ""), #replace empty city strings with na
club = trimws(str_replace(club, "\\(.+\\)", replacement = NA_character_)),
club = na_if(club, ""), #replace empty club strings with na
#rough estimate of the participant's age
age = year-birth_year,
#extract age group and age group place
gender_place = str_extract(gender_place, "[:digit:]{1,6}(?=\\.)|DNF"),
age_group = str_extract(age_group_place, "(?<=([:digit:]{1,6}\\.[:blank:])|DNF[:blank:]).{3,5}"),
age_group_place = str_extract(age_group_place, "[:digit:]{1,6}(?=\\.[:blank:].{3,5})|DNF"),
#calculate split times
start_delay = brutto_time-netto_time,
split_5km = x5km,
split_10km = x10km-x5km,
split_15km = x15km-x10km,
split_hm = hm-x15km,
split_25km = x25km-hm,
split_30km = x30km-x25km,
split_35km = x35km-x30km,
split_40km = x40km-x35km,
split_finish = netto_time-x40km,
#remove * from name
name = str_replace(name, "\\*", ""),
#add age bins
age_bins = cut(age, breaks = c(-Inf, 25, 35, 45, 55, 65, Inf),
labels = c("0-25", "26-35", "36-45", "46-55", "56-65", "66+"),
include.lowest = FALSE),
#lump top 12 nationalities together
nationality = fct_lump_n(nationality, 12),
#create dummy variable for dnf
dnf = case_when(is.na(overall_place) ~ "DNF",
!is.na(overall_place) ~ "Finisher")) %>%
#filter out pace runners
filter(str_detect(name, "Pace", negate = TRUE)) %>%
#count the number of participations (only 2018 and 2019)
add_count(name, nationality, birth_year, gender, name = "participations") %>%
#drop bib and name
select(-bib, -name) %>%
#add runner_id
mutate(runner_id = row_number())
#load geo coordinates and distances to the start csv file (requested via Google Maps Geocode API)
city_dist <- read_csv("../../static/data/koeln_marathon/koeln_marathon_cities.csv") %>%
select(city, dist)
results <- results %>%
left_join(city_dist, by = "city")
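The distances in that file were requested from the Google Maps Geocode API beforehand. As a rough sketch of how such a file could be built (not the exact code used; it assumes the already loaded {ggmap} and {geosphere} packages, a registered Google API key, and uses the straight-line distance as a simple stand-in):
#distinct city names extracted from the club field above
cities <- results %>%
  distinct(city) %>%
  drop_na(city)
#geocode the city names (returns a tibble with lon/lat columns)
city_coords <- ggmap::geocode(cities$city)
#approximate lon/lat of the start area in Cologne (assumption for illustration)
cologne_start <- c(6.9603, 50.9375)
#straight-line distance from each city to the start in km
city_dist_sketch <- cities %>%
  bind_cols(city_coords) %>%
  mutate(dist = geosphere::distHaversine(cbind(lon, lat), cologne_start) / 1000)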
After that is done, let’s have a look at where the participants “drop out” during the race:
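A minimal sketch (not the original plotting code) of how the drop-out point can be derived, assuming the last recorded split of a DNF runner marks where they left the race:
#take the last split a DNF runner recorded as the point where they dropped out
dnf_points <- results %>%
  filter(dnf == "DNF") %>%
  mutate(last_split = case_when(!is.na(x40km) ~ "40km",
                                !is.na(x35km) ~ "35km",
                                !is.na(x30km) ~ "30km",
                                !is.na(x25km) ~ "25km",
                                !is.na(hm) ~ "HM",
                                !is.na(x15km) ~ "15km",
                                !is.na(x10km) ~ "10km",
                                !is.na(x5km) ~ "5km",
                                TRUE ~ "< 5km"),
         last_split = factor(last_split, levels = c("< 5km", "5km", "10km", "15km", "HM",
                                                    "25km", "30km", "35km", "40km")))
#count DNFs per last recorded split and plot
dnf_points %>%
  count(last_split) %>%
  ggplot(aes(x = last_split, y = n)) +
  geom_col() +
  labs(x = "Last recorded split", y = "Number of DNFs") +
  theme_light()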
There is a rapid increase after the halfway point. To visualise this increase in relation to the actual marathon course, see the map below:

Map of DNFs at Cologne Marathon
Female starters tend to stop earlier, while male starters tend to drop out of the race closer to the finish:
The share of DNFs among all participants in an age group is highest for the oldest and the youngest starters:
When comparing finishers and DNFs for whom it was possible to extract a city (usually runners who did not sign up as part of a club), it looks like the median distance from their hometown to Köln is lower for the DNFs than for the finishers.
Across the top 12 nationalities by number of participants, runners from Spain and the USA have the lowest share of DNFs, while the Dutch rank the highest.
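The shares behind these comparisons are computed outside the snippets shown above; a small sketch of the pattern (illustrated for nationality, the age-group and gender shares work the same way) could look like this:
#share of DNFs per nationality (same pattern works for age_bins or gender)
results %>%
  count(nationality, dnf) %>%
  group_by(nationality) %>%
  mutate(share_dnf = n / sum(n)) %>%
  ungroup() %>%
  filter(dnf == "DNF") %>%
  arrange(desc(share_dnf))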
Based on all of this information, it could be interesting to see whether it is possible to predict DNFs at the halfway point. In the next step I will engineer some additional features and prepare a dataset to train with {tidymodels}.
Feature Engineering and Preprocessing
The first feature I would like to add is the club size for the participants who signed up as part of a club, as it seems like there is some difference between finishers and DNFs:
#engineer club size
results_preprocessed <- results %>%
#filter out runners who dropped out before KM21
filter(!(dnf=="DNF" & is.na(hm))) %>%
#drop not-used features or features with data leakage (eg. start delay is only available for finishers)
select(-overall_place, -birth_year, -gender_place, -age_group_place, -brutto_time, -netto_time, -city, -start_delay, -age_bins,
-x25km, -x30km, -x35km, -x40km, -split_25km, -split_30km, -split_35km, -split_40km, -split_finish, -age_group) %>%
#group by year and club to get the club size per year
group_by(year, club) %>%
#set club size to 0 for people without a club
mutate(club_starters = case_when(!is.na(club) ~ as.numeric(n()),
is.na(club) ~ 0)) %>%
ungroup() %>%
select(-club)
#plot regarding clubsize
club_size_p <- results_preprocessed %>%
ggplot(aes(x=dnf, y=club_starters, fill=dnf))+
geom_boxplot()+
scale_y_continuous(trans='log10')+
theme_light()+
labs(y="Number of starters from the same club",
x=NULL)+
theme(plot.background = element_rect(fill = "#F4F4F4"),
panel.background = element_rect(fill = "#F4F4F4",
colour = "#F4F4F4",
size = 0.5, linetype = "solid"))
#plot
club_size_p
As the next feature, the standard deviation across the relative split times is calculated for each runner:
results_preprocessed <- results_preprocessed %>%
mutate(pace_variation = pmap_dbl(.l = select(., starts_with("split_")), .f = ~sd(c(...), na.rm = TRUE)))
results_preprocessed %>%
ggplot(aes(x=dnf, y=pace_variation, fill=dnf))+
geom_boxplot()+
theme_light()+
labs(y="Relative split times standard deviation",
x=NULL)+
theme(plot.background = element_rect(fill = "#F4F4F4"),
panel.background = element_rect(fill = "#F4F4F4",
colour = "#F4F4F4",
size = 0.5, linetype = "solid"))
In the final feature engineering step I focus on the relative and absolute split times. First I compare the pace of each relative split to the average pace during the first half of the marathon. On these new rel_pace_* features I fit a linear regression to get the slope across the splits (i.e. did a runner keep a steady pace, or did the pace increase/decrease during the first half of the marathon?).
A similar approach is applied to the percentile rank of the relative and absolute split times in order to see whether a runner got faster or slower compared to the other runners.
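To illustrate what these slope features capture, here is a tiny example with made-up numbers (not real data): a runner whose relative pace drifts upwards across the four splits, i.e. who slows down, gets a positive slope, while a perfectly even runner gets a slope of zero:
#made-up relative paces for the four splits of one (slowing) runner
example_runner <- tibble(split = 1:4, rel_pace = c(0.97, 0.99, 1.02, 1.05))
lm(rel_pace ~ split, data = example_runner)$coeff[[2]] #slope of roughly 0.027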
As the last preprocessing step, the distance from a participant’s hometown to the start is discretized into four roughly equal-sized groups, and a “missing” category is added.
results_preprocessed <- results_preprocessed %>%
#drop observations with missing split times in order to be able to do linear regression to get relative pace slope
drop_na(starts_with("split_")) %>%
#get the relative pace compared to the average pace of the first half of the marathon
mutate(rel_pace_5km = (split_5km/5)/(hm/21.0975),
rel_pace_10km = (split_10km/5)/(hm/21.0975),
rel_pace_15km = (split_15km/5)/(hm/21.0975),
rel_pace_hm = (split_hm/6.0975)/(hm/21.0975)) %>%
#calculate the slope of the relative pace
mutate(rel_pace_slope = pmap_dbl(.l = select(., starts_with("rel_pace_")), .f = ~lm(rel_pace~split,
tibble(rel_pace = c(...),
split = c(1:4)))$coeff[[2]])) %>%
#calculate the percentile per year on absolute and relative split times
group_by(year) %>%
mutate(rel_percentile_5km = percent_rank(split_5km),
rel_percentile_10km = percent_rank(split_10km),
rel_percentile_15km = percent_rank(split_15km),
rel_percentile_hm = percent_rank(split_hm),
percentile_5km = percent_rank(split_5km),
percentile_10km = percent_rank(x10km),
percentile_15km = percent_rank(x15km),
percentile_hm = percent_rank(hm)) %>%
ungroup() %>%
select(-year) %>%
#calculate the slope of the percentiles
mutate(rel_percentile_slope = pmap_dbl(.l = select(., starts_with("rel_percentile_")), .f = ~lm(rel_percentile~split,
tibble(rel_percentile = c(...),
split = c(1:4)))$coeff[[2]]),
percentile_slope = pmap_dbl(.l = select(., starts_with("percentile_")), .f = ~lm(percentile~split,
tibble(percentile = c(...),
split = c(1:4)))$coeff[[2]]))
#discretize distance into 4 equal sized groups and add "missing" as an explicit group
results_preprocessed <- results_preprocessed %>%
mutate(dist_cat = cut_number(dist, n=4) %>% fct_explicit_na()) %>%
#drop no longer used variables
select(-dist, -x5km, -x10km, -x15km, -hm, -starts_with("split_"))
Modelling!!
Splitting the data, creating the recipe (with SMOTE via the {themis} package due to heavy class imbalance) and creating folds for resampling:
library(tidymodels)
library(themis) #for SMOTE
#create initial split with strata
set.seed(1234)
marathon_split <- initial_split(results_preprocessed, prop = .8, strata = dnf)
marathon_train <- training(marathon_split)
marathon_test <- testing(marathon_split)
#create folds for resampling
set.seed(1234)
marathon_folds <- vfold_cv(marathon_train, strata = dnf, v = 5)
#create recipe
marathon_rec <- recipe(dnf ~ ., data = marathon_train) %>%
step_rm(runner_id) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_smote(dnf) %>%
step_zv(all_numeric()) %>%
step_normalize(all_numeric()) %>%
prep()
We will create two classification models: one using logistic regression and the other using eXtreme Gradient Boosting (xgboost).
Both models will be tuned over several hyperparameters via a grid search using the {tune} package.
library(workflows)
library(tune)
#set model specification
glm_spec <- logistic_reg(mode = "classification",
penalty = tune(),
mixture = tune()) %>%
set_engine("glmnet")
#workflow
glm_wflow <-
workflow() %>%
add_model(glm_spec) %>%
add_recipe(marathon_rec)
#train resamples
set.seed(1291)
glm_res <-
tune_grid(
glm_wflow,
resamples = marathon_folds,
grid = 5,
metrics = metric_set(roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE, verbose = TRUE, allow_par = TRUE)
)
glm_res %>%
show_best(metric = "roc_auc") %>%
kable() %>%
kable_styling(font_size = 8)
penalty | mixture | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|
0.0000008 | 0.7329869 | roc_auc | binary | 0.7111074 | 5 | 0.0134748 |
0.0000000 | 0.4571387 | roc_auc | binary | 0.7110537 | 5 | 0.0135265 |
0.0000044 | 0.9276452 | roc_auc | binary | 0.7106831 | 5 | 0.0137492 |
0.0007173 | 0.4105434 | roc_auc | binary | 0.7105717 | 5 | 0.0139064 |
0.1769055 | 0.1284538 | roc_auc | binary | 0.6893192 | 5 | 0.0067731 |
Across the resamples, the best hyperparameter combination for the logistic regression model reached an AUC of about 0.71.
Let’s see if the xgboost performs better.
#set model specification
xgb_spec <- boost_tree(mode = "classification",
mtry = tune(),
trees = tune(),
min_n = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune(),
sample_size = tune()) %>%
set_engine("xgboost")
#workflow
xgb_wflow <-
workflow() %>%
add_model(xgb_spec) %>%
add_recipe(marathon_rec)
#train resamples
#doParallel for paralell processing:
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = TRUE)-1)
registerDoParallel(cl) #register the cluster so the folds can actually be tuned in parallel
set.seed(1291)
xgb_res <-
tune_grid(
xgb_wflow,
resamples = marathon_folds,
grid = 20,
metrics = metric_set(roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE, verbose = TRUE, allow_par = TRUE)
)
xgb_res %>%
show_best(metric = "roc_auc") %>%
kable() %>%
kable_styling(font_size = 8)
mtry | trees | min_n | tree_depth | learn_rate | loss_reduction | sample_size | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|---|---|---|---|---|
24 | 816 | 22 | 10 | 0.0401959 | 0.0000000 | 0.0894969 | roc_auc | binary | 0.7728297 | 5 | 0.0099047 |
17 | 1772 | 14 | 8 | 0.0266403 | 0.0000000 | 0.0380707 | roc_auc | binary | 0.7648839 | 5 | 0.0072832 |
9 | 595 | 4 | 11 | 0.0000033 | 0.0000000 | 0.0505695 | roc_auc | binary | 0.7406154 | 5 | 0.0067896 |
2 | 1909 | 11 | 13 | 0.0000001 | 0.0003618 | 0.0846281 | roc_auc | binary | 0.7330481 | 5 | 0.0101271 |
36 | 1859 | 12 | 8 | 0.0032850 | 0.1676054 | 0.0711439 | roc_auc | binary | 0.7315023 | 5 | 0.0115941 |
It seems like the AUC is a bit better than that of the logistic regression. The ROC plot shows the difference between the two models across all hyperparameter combinations:
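The ROC plot is not reproduced as code here; a sketch of how it could be drawn from the saved resampling predictions (assuming the default prediction columns from {tune}, i.e. the truth column dnf and the class probability column .pred_DNF) might look like this:
#combine the saved out-of-fold predictions of both models
roc_data <- bind_rows(collect_predictions(glm_res) %>% mutate(model = "Logistic regression"),
                      collect_predictions(xgb_res) %>% mutate(model = "xgboost")) %>%
  #one ROC curve per model and hyperparameter combination
  group_by(model, .config) %>%
  roc_curve(truth = dnf, .pred_DNF)
#plot all curves, coloured by model
roc_data %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity,
             group = interaction(model, .config), colour = model)) +
  geom_path(alpha = 0.4) +
  geom_abline(lty = 2) +
  coord_equal() +
  theme_light()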
Across all hyperparameter combinations on all folds, however, the logistic regression seems to perform better than the xgboost. Let’s fit the best combination of each model on the complete training data and get the predictions for the test data:
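A sketch of that final step (using select_best(), finalize_workflow() and last_fit() from {tune}; not necessarily the exact code used) could look like this:
#finalize both workflows with their best hyperparameters, refit on the full training data
#and predict on the test data held back by the initial split
best_glm <- select_best(glm_res, metric = "roc_auc")
glm_final <- finalize_workflow(glm_wflow, best_glm) %>%
  last_fit(marathon_split, metrics = metric_set(roc_auc, sens, spec))
best_xgb <- select_best(xgb_res, metric = "roc_auc")
xgb_final <- finalize_workflow(xgb_wflow, best_xgb) %>%
  last_fit(marathon_split, metrics = metric_set(roc_auc, sens, spec))
#test set metrics for both models
collect_metrics(glm_final)
collect_metrics(xgb_final)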
Both models perform quite similarly on the test data.
Below you can find the confusion matrix for the logistic regression model:
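A conf_mat() call along these lines (a sketch based on the glm_final object from the last_fit() sketch above) produces the matrix shown:
#confusion matrix of the logistic regression model on the test data
glm_final %>%
  collect_predictions() %>%
  conf_mat(truth = dnf, estimate = .pred_class)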
## Truth
## Prediction DNF Finisher
## DNF 47 525
## Finisher 39 1306
Among the DNFs, the model made most of its false predictions for runners between the 0.25 and 0.75 percentiles, while among the finishers it didn’t perform so well for the very low and very high ranked runners at the halfway point.
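This breakdown is not part of the snippets shown above; a rough sketch (assuming the .row column returned by collect_predictions() indexes results_preprocessed, and binning on the percentile_hm feature) could look like this:
#join the test set predictions back to the half-marathon percentile and bin it
collect_predictions(glm_final) %>%
  mutate(percentile_hm = results_preprocessed$percentile_hm[.row],
         percentile_bin = cut(percentile_hm, breaks = c(0, 0.25, 0.75, 1), include.lowest = TRUE),
         correct = .pred_class == dnf) %>%
  count(dnf, percentile_bin, correct)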
Overall the logistic regression model does quite a good job at identifying DNFs, but it also falsely flags a large portion of runners who actually finished.
Such a model could be further enhanced with booking data (e.g. how many days before the marathon did a participant sign up?) and historical data (e.g. when did a participant last run the Köln Marathon?), but it already offers a great opportunity to identify struggling participants and give them a bit of extra support in the form of energy gels, drinks or cheering!
Let me know how you think this model could be improved! 😄
/Lukas