Web scraping refers to the various methods used to collect data from web pages. The {rvest} package, which is part of the tidyverse, lets you scrape web pages with R. The data can then be analyzed with the usual R workflows.
In this tutorial, we will see how we can use {rvest} to answer one question: how do women perform compared to men in ultra-trail running races? We will focus on one of the most famous races of the discipline: the UTMB, in the French Alps (170km, +10,000m of elevation gain).
1. “Scrape” the data
Open the results of the 2021 UTMB on the website of the International Trail Running Association (ITRA), which compiles the results of trail running races from 2012 to 2021. Then copy the URL of this page and pass it to read_html().
# Load the tidyverse ({rvest} is installed with it but not attached, hence the rvest:: prefix below)
library(tidyverse)
# Read HTML page with read_html()
utmb_2021 <- rvest::read_html('https://itra.run/Races/RaceResults?raceYearId=72496')
utmb_2021
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="container-fluid p-0 m-0">\r\n <header><div class="">\r\n ...
Data is easier to extract when it sits inside an HTML table, which is fortunately the case for the ranking. We will select this table with html_element(), then convert it to a tibble with html_table().
ranking_2021 <- utmb_2021 %>%
  rvest::html_element(".table") %>%
  rvest::html_table()
ranking_2021
## # A tibble: 1,526 x 7
## `` Runner Time Score Age Gender Nationality
## <int> <chr> <chr> <chr> <int> <chr> <chr>
## 1 1 D HAENE Fr~ 20:45~ "Become an ITRA member for~ 36 H FRA
## 2 2 DUNAND PAL~ 20:58~ "Become an ITRA member for~ 29 H FRA
## 3 3 BLANCHARD ~ 21:12~ "Become an ITRA member for~ 34 H FRA
## 4 4 POMMERET L~ 21:38~ "Become an ITRA member for~ 46 H FRA
## 5 5 GRANGIER G~ 21:52~ "Become an ITRA member for~ 31 H FRA
## 6 6 Namberger ~ 22:22~ "Become an ITRA member for~ 32 H GER
## 7 7 DAUWALTER ~ 22:30~ "Become an ITRA member for~ 36 F USA
## 8 8 CURMER Gre~ 23:00~ "Become an ITRA member for~ 31 H FRA
## 9 8 PAZOS Diego 23:00~ "Become an ITRA member for~ 37 H SUI
## 10 10 CLEMENT Ma~ 23:08~ "Become an ITRA member for~ 26 H SUI
## # ... with 1,516 more rows
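To see what these two calls do without querying the website, here is a minimal sketch on a hand-written HTML fragment. The table content and the runner names are made up for illustration; only the "table" class mirrors the selector used above. Note that html_element() returns the first element matching the selector.

```r
library(rvest)

# Hypothetical HTML fragment standing in for the ITRA results page
page <- minimal_html('
  <table class="table">
    <tr><th>Runner</th><th>Time</th><th>Gender</th></tr>
    <tr><td>RUNNER One</td><td>20:45:59</td><td>H</td></tr>
    <tr><td>RUNNER Two</td><td>22:30:54</td><td>F</td></tr>
  </table>
')

toy_ranking <- page %>%
  # Select the first element with class "table"
  html_element(".table") %>%
  # Convert the HTML table to a tibble (header row becomes column names)
  html_table()

toy_ranking
```

The same pipeline applies to the real page: only the object passed to html_element() changes.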
We need a few more steps to format this table:
ranking_2021 <- ranking_2021 %>%
  # Rename first column
  rename(Rank = 1) %>%
  mutate(Rank = as.numeric(Rank)) %>%
  # Remove column with ITRA score (only available if subscription)
  select(-Score) %>%
  # Change initial for gender
  mutate(Gender = case_when(
    Gender == "F" ~ "Women",
    Gender == "H" ~ "Men"
  )) %>%
  # Add year of the race in first position
  add_column(Year = 2021, .before = 1)
ranking_2021
## # A tibble: 1,526 x 7
## Year Rank Runner Time Age Gender Nationality
## <dbl> <dbl> <chr> <chr> <int> <chr> <chr>
## 1 2021 1 D HAENE Francois 20:45:59 36 Men FRA
## 2 2021 2 DUNAND PALLAZ Aurelien 20:58:31 29 Men FRA
## 3 2021 3 BLANCHARD Mathieu 21:12:43 34 Men FRA
## 4 2021 4 POMMERET Ludovic 21:38:44 46 Men FRA
## 5 2021 5 GRANGIER Germain 21:52:47 31 Men FRA
## 6 2021 6 Namberger Hannes 22:22:06 32 Men GER
## 7 2021 7 DAUWALTER Courtney 22:30:54 36 Women USA
## 8 2021 8 CURMER Gregoire 23:00:10 31 Men FRA
## 9 2021 8 PAZOS Diego 23:00:10 37 Men SUI
## 10 2021 10 CLEMENT Mathieu 23:08:05 26 Men SUI
## # ... with 1,516 more rows
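One detail of the recoding step is worth checking: case_when() returns NA for any value that matches none of the conditions. The sketch below uses made-up data, where "X" stands for a hypothetical unexpected code, to show how such rows can be spotted.

```r
library(dplyr)

# Toy data: "X" is a hypothetical code that is neither "F" nor "H"
toy <- tibble(
  Runner = c("A", "B", "C"),
  Gender = c("F", "H", "X")
)

recoded <- toy %>%
  mutate(Gender = case_when(
    Gender == "F" ~ "Women",
    Gender == "H" ~ "Men"
  ))

# Rows whose code was not matched end up with a missing gender
unmatched <- recoded %>% filter(is.na(Gender))
unmatched
```

Rows like these are the ones that drop_na() would remove in a later analysis step.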
2. Define a function
Now that the procedure has been defined, we will wrap these steps in a function, so that we can easily extract the results of other years from their URL.
# html: page read with read_html(); year: year of the race
FunRank <- function(html, year) {
  rank <- html %>%
    rvest::html_element(".table") %>%
    rvest::html_table() %>%
    rename(Rank = 1) %>%
    mutate(Rank = as.numeric(Rank)) %>%
    select(-Score) %>%
    mutate(Gender = case_when(
      Gender == "F" ~ "Women",
      Gender == "H" ~ "Men"
    )) %>%
    add_column(Year = year, .before = 1)
  return(rank)
}
Let’s apply this function to extract the UTMB ranking for 2013.
# Read HTML page for 2013
utmb_2013 <- rvest::read_html('https://itra.run/Races/RaceResults?raceYearId=3940')
# Apply custom function
ranking_2013 <- FunRank(html=utmb_2013, year=2013)
ranking_2013
## # A tibble: 1,687 x 7
## Year Rank Runner Time Age Gender Nationality
## <dbl> <dbl> <chr> <chr> <int> <chr> <chr>
## 1 2013 1 THEVENARD Xavier 20:34:57 33 Men FRA
## 2 2013 2 HERAS Miguel 20:54:08 46 Men ESP
## 3 2013 3 DOMINGUEZ LEDO Javier 21:17:38 47 Men ESP
## 4 2013 4 OLSON Tim 21:38:23 38 Men USA
## 5 2013 5 FOOTE Mike 21:53:19 38 Men USA
## 6 2013 6 CHORIER Julien 22:08:11 41 Men FRA
## 7 2013 7 BOSIO Rory 22:37:26 37 Women USA
## 8 2013 8 Collomb Patton Bertrand 23:14:16 46 Men FRA
## 9 2013 9 LEJEUNE Arnaud 23:18:05 42 Men FRA
## 10 2013 10 TIDD John 23:18:27 58 Men ESP
## # ... with 1,677 more rows
We can now stack the results for both years in a single table.
# Bind the rows of both rankings
ranking <- bind_rows(ranking_2013,ranking_2021)
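With more than two years, the same idea can be applied in a loop over pages and years. The following is only a sketch under assumptions: fun_rank_sketch() is a hypothetical, shortened stand-in for FunRank(), and the two pages are hand-written HTML fragments with made-up runners, so that the example runs without network access. With real data, each page would come from read_html() and its URL.

```r
library(rvest)
library(dplyr)
library(purrr)

# Hypothetical, shortened stand-in for FunRank()
fun_rank_sketch <- function(html, year) {
  html %>%
    html_element(".table") %>%
    html_table() %>%
    mutate(Year = year, .before = 1)
}

# Made-up pages standing in for two result pages
page_a <- minimal_html('<table class="table">
  <tr><th>Runner</th><th>Gender</th></tr>
  <tr><td>RUNNER One</td><td>H</td></tr>
  <tr><td>RUNNER Two</td><td>F</td></tr>
</table>')
page_b <- minimal_html('<table class="table">
  <tr><th>Runner</th><th>Gender</th></tr>
  <tr><td>RUNNER Three</td><td>F</td></tr>
  <tr><td>RUNNER Four</td><td>H</td></tr>
</table>')

pages <- list(page_a, page_b)
years <- c(2013, 2021)

# Apply the function to each (page, year) pair, then stack the results
ranking_all <- map2(pages, years, fun_rank_sketch) %>%
  bind_rows()

ranking_all
```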
3. Analyze the data
Now that the data is formatted in a tibble, the usual processing procedures can be implemented. Let’s start by answering a question: did the percentage of women among finishers increase between 2013 and 2021?
# Percentage of women among finishers, by year
gender_ratio <- ranking %>%
  group_by(Year, Gender) %>%
  # Add variable to count participants
  mutate(ct = 1) %>%
  # Sum by gender and years
  summarize(
    Finishers = sum(ct)
  ) %>%
  ungroup() %>%
  group_by(Year) %>%
  # Percentage of women by year
  summarize(
    PercentageWomen = Finishers[Gender == 'Women'] / sum(Finishers) * 100
  ) %>%
  drop_na()
gender_ratio
## # A tibble: 2 x 2
## # Groups: Year [2]
## Year PercentageWomen
## <dbl> <dbl>
## 1 2013 8.30
## 2 2021 7.21
For both years, the percentage of women among the finishers is low (this is also the case among the participants). This percentage was lower in 2021 than in 2013.
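The same percentage can be computed more compactly as the mean of a logical vector. This is a sketch on made-up data, not the computation used above: here rows with a missing gender are removed first, so they are excluded from the denominator, which is a choice made for this example.

```r
library(dplyr)

# Toy data with one missing gender in 2021
toy <- tibble(
  Year   = c(2013, 2013, 2013, 2013, 2021, 2021, 2021, 2021, 2021),
  Gender = c("Women", "Men", "Men", "Men",
             "Women", "Women", "Men", "Men", NA)
)

gender_ratio_sketch <- toy %>%
  # Exclude rows with unknown gender (assumption of this sketch)
  filter(!is.na(Gender)) %>%
  group_by(Year) %>%
  summarize(
    Finishers = n(),
    # mean() of a logical vector is the share of TRUE values
    PercentageWomen = mean(Gender == "Women") * 100
  )

gender_ratio_sketch
```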
Next, we will see how women perform compared to men.
# Load lubridate for time manipulation
library(lubridate)
gender_time <- ranking %>%
  group_by(Year, Gender) %>%
  # Convert hour:minute:second to second
  mutate(
    Time = period_to_seconds(hms(Time))
  ) %>%
  # Mean time for finisher by year and gender
  summarize(
    MeanTime = mean(Time)
  ) %>%
  ungroup() %>%
  drop_na()
# Plot results
ggplot(
  data = gender_time,
  aes(y = as.factor(Year), x = MeanTime, color = Gender)) +
  geom_point(size = 5) +
  labs(
    title = 'Mean time for UTMB finishers',
    subtitle = 'Comparison by genders for years 2013 and 2021',
    y = "",
    x = "Mean finishing time"
  ) +
  scale_x_continuous(breaks = c(39.5 * 3600, 40 * 3600, 40.5 * 3600), labels = c("39h30min", "40h", "40h30min")) +
  theme_minimal()
The plot shows that in 2021, the women who finished the UTMB were, on average, faster than the men!
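The time conversion behind this comparison can be checked on a couple of values. The sketch below uses the winning time of 2021 from the table above and a made-up time longer than 24 hours, to confirm that hms() handles hour counts above 24.

```r
library(lubridate)

# First value taken from the 2021 ranking; second value is made up
times <- c("20:45:59", "46:30:00")

# hms() parses "hours:minutes:seconds" into a period,
# period_to_seconds() converts that period to a number of seconds
seconds <- period_to_seconds(hms(times))
seconds

# Back to hours, as used for the axis breaks of the plot
hours <- seconds / 3600
hours
```

Working in seconds keeps the mean well defined, and dividing by 3600 gives values on the same scale as the axis labels (39.5, 40 and 40.5 hours).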
References
Wickham H., 2021. {rvest}: Easily Harvest (Scrape) Web Pages.