[R] Web scraping with {rvest}

Web scraping refers to the various methods used to collect data from web pages. The {rvest} package, part of the tidyverse, lets you scrape web pages with R. The data can then be analyzed with the usual R workflows.

In this tutorial, we will see how we can use {rvest} to answer one question: how do women perform compared to men in ultra-trail running races? We will focus on one of the most famous races of the discipline: the UTMB, in the French Alps (170 km, 10,000 m of elevation gain).

1. “Scrape” the data

Open the results of the 2021 edition of the UTMB on the website of the International Trail Running Association (ITRA), which compiles the results of trail running races from 2012 to 2021. Then copy the URL of this page and pass it to read_html().

# Load the tidyverse ({rvest} is installed with it but not attached, hence the rvest:: prefix below)
library(tidyverse)

# Read HTML page with read_html()
utmb_2021 <- rvest::read_html('https://itra.run/Races/RaceResults?raceYearId=72496')

utmb_2021
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="container-fluid p-0 m-0">\r\n    <header><div class="">\r\n  ...

Data is easier to extract when it sits in an HTML table, which is fortunately the case for the ranking. We will select this table with html_element(), then convert it to a tibble with html_table().

ranking_2021 <- utmb_2021 %>%
    rvest::html_element(".table") %>% 
    rvest::html_table()

ranking_2021
## # A tibble: 1,526 x 7
##       `` Runner      Time   Score                         Age Gender Nationality
##    <int> <chr>       <chr>  <chr>                       <int> <chr>  <chr>      
##  1     1 D HAENE Fr~ 20:45~ "Become an ITRA member for~    36 H      FRA        
##  2     2 DUNAND PAL~ 20:58~ "Become an ITRA member for~    29 H      FRA        
##  3     3 BLANCHARD ~ 21:12~ "Become an ITRA member for~    34 H      FRA        
##  4     4 POMMERET L~ 21:38~ "Become an ITRA member for~    46 H      FRA        
##  5     5 GRANGIER G~ 21:52~ "Become an ITRA member for~    31 H      FRA        
##  6     6 Namberger ~ 22:22~ "Become an ITRA member for~    32 H      GER        
##  7     7 DAUWALTER ~ 22:30~ "Become an ITRA member for~    36 F      USA        
##  8     8 CURMER Gre~ 23:00~ "Become an ITRA member for~    31 H      FRA        
##  9     8 PAZOS Diego 23:00~ "Become an ITRA member for~    37 H      SUI        
## 10    10 CLEMENT Ma~ 23:08~ "Become an ITRA member for~    26 H      SUI        
## # ... with 1,516 more rows
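
To see what html_element() and html_table() do without depending on the live ITRA page, here is a minimal offline sketch. The HTML snippet and its values are made up for illustration; only the "table" class mirrors the selector used above.

```r
library(rvest)

# Hypothetical page: a tiny stand-in for a results page (not real ITRA markup)
page <- minimal_html('
  <table class="table">
    <tr><th>Runner</th><th>Time</th><th>Gender</th></tr>
    <tr><td>RUNNER One</td><td>20:45:59</td><td>H</td></tr>
    <tr><td>RUNNER Two</td><td>22:30:54</td><td>F</td></tr>
  </table>
')

# ".table" is a CSS selector: html_element() returns the first element
# whose class attribute contains "table"
toy_ranking <- page %>%
  html_element(".table") %>%
  html_table()

toy_ranking
```

The header cells become the column names and each following row becomes one row of the tibble.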

We need a few more steps to format this table:

ranking_2021<-ranking_2021 %>%
  # Rename first column
  rename(Rank=1) %>%
  mutate(Rank=as.numeric(Rank)) %>%
  # Remove column with ITRA score (only available if subscription)
  select(-Score) %>%
  # Recode gender initials ("H" is presumably the French "Homme")
  mutate(Gender=case_when(
    Gender=="F"~"Women",
    Gender=="H"~"Men"
  ))%>%
  # Add year of the race in first position
  add_column(Year=2021,.before = 1)

ranking_2021
## # A tibble: 1,526 x 7
##     Year  Rank Runner                 Time       Age Gender Nationality
##    <dbl> <dbl> <chr>                  <chr>    <int> <chr>  <chr>      
##  1  2021     1 D HAENE Francois       20:45:59    36 Men    FRA        
##  2  2021     2 DUNAND PALLAZ Aurelien 20:58:31    29 Men    FRA        
##  3  2021     3 BLANCHARD Mathieu      21:12:43    34 Men    FRA        
##  4  2021     4 POMMERET Ludovic       21:38:44    46 Men    FRA        
##  5  2021     5 GRANGIER Germain       21:52:47    31 Men    FRA        
##  6  2021     6 Namberger Hannes       22:22:06    32 Men    GER        
##  7  2021     7 DAUWALTER Courtney     22:30:54    36 Women  USA        
##  8  2021     8 CURMER Gregoire        23:00:10    31 Men    FRA        
##  9  2021     8 PAZOS Diego            23:00:10    37 Men    SUI        
## 10  2021    10 CLEMENT Mathieu        23:08:05    26 Men    SUI        
## # ... with 1,516 more rows
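
One detail of the recoding step is worth checking: case_when() returns NA for any value that matches none of its conditions. A minimal sketch on made-up data (the "X" code is hypothetical) shows how to spot such values before they are dropped later in the analysis.

```r
library(dplyr)
library(tibble)

# Made-up data; "X" stands for any unexpected gender code
toy <- tibble(Runner = c("A", "B", "C"), Gender = c("F", "H", "X"))

recoded <- toy %>%
  mutate(Gender = case_when(
    Gender == "F" ~ "Women",
    Gender == "H" ~ "Men"
  ))

# Count rows per recoded value; unmatched codes show up as NA
count(recoded, Gender)
```

Running the same count() on the real ranking is a quick way to see how many rows, if any, would later be removed by drop_na().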

2. Define a function

Now that the procedure has been defined, we will wrap these steps in a function, so that the data for other years can easily be extracted from their URLs.

FunRank <- function(html,year){

  rank <- html %>%
    rvest::html_element(".table") %>% 
    rvest::html_table() %>%
    rename(Rank=1) %>%
    mutate(Rank=as.numeric(Rank)) %>%
    select(-Score) %>%
    mutate(Gender=case_when(
      Gender=="F"~"Women",
      Gender=="H"~"Men"
    ))%>%
    add_column(Year=year,.before = 1)
  
  return(rank)

}

Let’s apply this function to extract the UTMB ranking for 2013.

# Read HTML page for 2013
utmb_2013 <- rvest::read_html('https://itra.run/Races/RaceResults?raceYearId=3940')

# Apply custom function
ranking_2013 <- FunRank(html=utmb_2013, year=2013)
ranking_2013
## # A tibble: 1,687 x 7
##     Year  Rank Runner                  Time       Age Gender Nationality
##    <dbl> <dbl> <chr>                   <chr>    <int> <chr>  <chr>      
##  1  2013     1 THEVENARD Xavier        20:34:57    33 Men    FRA        
##  2  2013     2 HERAS Miguel            20:54:08    46 Men    ESP        
##  3  2013     3 DOMINGUEZ LEDO Javier   21:17:38    47 Men    ESP        
##  4  2013     4 OLSON Tim               21:38:23    38 Men    USA        
##  5  2013     5 FOOTE Mike              21:53:19    38 Men    USA        
##  6  2013     6 CHORIER Julien          22:08:11    41 Men    FRA        
##  7  2013     7 BOSIO Rory              22:37:26    37 Women  USA        
##  8  2013     8 Collomb Patton Bertrand 23:14:16    46 Men    FRA        
##  9  2013     9 LEJEUNE Arnaud          23:18:05    42 Men    FRA        
## 10  2013    10 TIDD John               23:18:27    58 Men    ESP        
## # ... with 1,677 more rows

We may now merge the results for both years in one table.

# Merging ranking for both years
ranking <- bind_rows(ranking_2013,ranking_2021)
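
With more than two editions, calling the function once per year becomes repetitive. A possible sketch, using {purrr} (part of the tidyverse), maps over pages and years and binds the results in one step. To keep it runnable offline, the pages below are made-up HTML built with rvest::minimal_html(), and toy_page() and fun_rank_toy() are hypothetical helpers, the latter a simplified stand-in for FunRank().

```r
library(rvest)
library(dplyr)
library(purrr)

# Hypothetical helper that builds a one-row results page (not real ITRA markup)
toy_page <- function(runner, time) {
  minimal_html(paste0(
    '<table class="table"><tr><th>Runner</th><th>Time</th></tr>',
    '<tr><td>', runner, '</td><td>', time, '</td></tr></table>'
  ))
}

# Simplified stand-in for FunRank(): extract the table and tag it with the year
fun_rank_toy <- function(html, year) {
  html %>%
    html_element(".table") %>%
    html_table() %>%
    mutate(Year = year, .before = 1)
}

pages <- list(
  toy_page("RUNNER One", "20:45:59"),
  toy_page("RUNNER Two", "22:30:54")
)
years <- c(2013, 2021)

# map2() calls the function on each (page, year) pair;
# bind_rows() stacks the resulting tibbles
ranking_toy <- map2(pages, years, fun_rank_toy) %>% bind_rows()
ranking_toy
```

With real data, the list of pages would come from calling read_html() on each results URL.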

3. Analyze the data

Now that the data is formatted in a tibble, the usual processing procedures can be implemented. Let’s start by answering a question: did the percentage of women among finishers increase between 2013 and 2021?

# Percentage of women among finishers, by year
gender_ratio <- ranking%>%
  group_by(Year,Gender)%>%
  # Add variable to count participants
  mutate(ct=1)%>%
  # Sum by gender and years
  summarize(
    Finishers=sum(ct)
  )%>%
  ungroup()%>%
  group_by(Year)%>%
  # Percentage of women by year
  summarize(
    PercentageWomen = Finishers[Gender=='Women']/sum(Finishers)*100
  )%>%
  drop_na()

gender_ratio
## # A tibble: 2 x 2
## # Groups:   Year [2]
##    Year PercentageWomen
##   <dbl>           <dbl>
## 1  2013            8.30
## 2  2021            7.21

For both years, the percentage of women among the finishers is low (this is also the case among the participants). This percentage was lower in 2021 than in 2013.
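
The same percentage can be obtained more compactly, because the mean of a logical vector is the proportion of TRUE values. A minimal sketch on made-up data (the counts below are illustrative, not the UTMB results):

```r
library(dplyr)
library(tibble)

# Made-up finishers: 1 woman out of 4 in 2013, 1 out of 2 in 2021
toy <- tibble(
  Year   = c(2013, 2013, 2013, 2013, 2021, 2021),
  Gender = c("Men", "Men", "Men", "Women", "Men", "Women")
)

toy_ratio <- toy %>%
  group_by(Year) %>%
  # Gender == "Women" is TRUE/FALSE; its mean is the share of women
  summarize(PercentageWomen = mean(Gender == "Women") * 100)

toy_ratio
```

Note that a missing Gender in a given year would make that year's mean NA here, so rows with missing values need to be handled first.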

Next, we will see how women perform compared to men.

# Load lubridate for time manipulation
library(lubridate)

gender_time <-ranking%>%
  group_by(Year,Gender)%>%
  # Convert hour:minute:second to second
  mutate(
    Time=period_to_seconds(hms(Time))
  )%>%
  # Mean time for finisher by year and gender
  summarize(
    MeanTime=mean(Time)
  )%>%
  ungroup()%>%
  drop_na()

# Plot results
ggplot(
  data=gender_time,
  aes(y=as.factor(Year),x=MeanTime,color=Gender))+
  geom_point(size=5)+
  labs(
    title='Mean time for UTMB finishers',
    subtitle='Comparison by gender for 2013 and 2021',
    y="",
    x="Mean finishing time"
  )+
  scale_x_continuous(breaks=c(39.5*3600,40*3600,40.5*3600),labels=c("39h30min","40h","40h30min"))+
  theme_minimal()

The plot shows that in 2021, on average, women finished the UTMB faster than men!
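
Because this comparison rests on the conversion of finishing times to seconds, it can help to check that step in isolation. The sketch below uses two made-up times, one of them above 24 hours, to confirm that hms() treats the first field as a plain count of hours.

```r
library(lubridate)

# Made-up finishing times in hours:minutes:seconds
times <- c("20:45:59", "40:12:30")

# hms() parses the strings to a Period; period_to_seconds() returns seconds
secs <- period_to_seconds(hms(times))
secs

# Back to hours, for a quick sanity check
secs / 3600
```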

References

Wickham H., 2021. {rvest}: Easily Harvest (Scrape) Web Pages. R package.