CrossFit 2015 Leaderboard

by Vitor Bernardes

Introduction to CrossFit and the Data Set

CrossFit is a popular fitness program and fitness sport created in 2000. It combines elements of aerobic exercise, calisthenics (body weight exercises), and Olympic weightlifting with the goal of improving overall fitness.

On the sport side, CrossFit has held an annual competition since 2007, open to athletes from all over the world, called the CrossFit Games. The Games have three stages of qualification: the Open, Regionals, and the Games themselves.

The Open, which gets its name because participation is open to anyone, is held over five weeks at the beginning of the competition season. Each week contains a workout that must be completed by athletes. The athletes can complete the workout at their local box (as CrossFit gyms are called) and submit their scores online. The workouts are referenced by their year and a number corresponding to the order in which they were presented. For example, the first workout of the 2015 Open is called 15.1, the second one is called 15.2, and so forth.

The data set we are going to analyze is the 2015 Open leaderboard. It contains data from athletes from all over the world and the results they submitted for each completed workout.

Summary of the Data Set

Let’s review the data set we are working with.

## [1] 250717

As we can see, it contains observations on just over 250,000 athletes who competed in the 2015 Open.
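One plausible way to obtain a count like the one above is to count distinct athlete IDs, since each athlete can appear once per workout. This is a minimal sketch on a made-up stand-in data frame; the name `leaderboard` and all values are assumptions, not the original code.

```r
# Toy stand-in for the real leaderboard (hypothetical values):
# one row per athlete per workout.
leaderboard <- data.frame(
  athlete_id = c(1690, 1690, 1998, 1998, 2206),
  stage      = c(1, 2, 1, 2, 1),
  score      = c(180, 140, 150, 120, 170)
)

# Number of distinct athletes, regardless of how many workouts each one logged.
n_athletes <- length(unique(leaderboard$athlete_id))
print(n_athletes)
```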

Let’s see what data we have about each athlete.

## 'data.frame':    1504303 obs. of  28 variables:
##  $ division  : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ stage     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ athlete_id: int  1690 1998 2206 2559 2811 3008 3021 3938 4006 4156 ...
##  $ rank      : int  154 5950 768 294 1946 245 1105 2260 880 2989 ...
##  $ score     : int  366 497 404 379 437 375 415 444 408 457 ...
##  $ howlong   : Factor w/ 5 levels "Less than 6 months|",..: 4 4 NA 5 4 4 1 4 NA 4 ...
##  $ category  : Factor w/ 2 levels "Rx","Scaled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scaled    : Factor w/ 2 levels "false","true": 1 1 1 1 1 1 1 1 1 1 ...
##  $ name      : Factor w/ 236538 levels " Bill Sheehan",..: 227131 109885 166370 215890 29501 31246 235109 92131 1300 165324 ...
##  $ region    : Factor w/ 18 levels "","Africa","Asia",..: 16 18 10 8 12 10 3 18 15 1 ...
##  $ team      : Factor w/ 4531 levels "","#CF9J","#CFFPNation",..: 1 3521 4389 1645 4033 1342 1 1 3083 1 ...
##  $ affiliate : Factor w/ 9776 levels "","100 Pourcent CrossFit",..: 5453 8721 9533 4114 5441 3114 2367 2261 8003 1 ...
##  $ gender    : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age       : int  24 31 25 26 28 38 29 24 26 24 ...
##  $ height    : int  71 70 72 68 68 72 65 70 72 68 ...
##  $ weight    : int  198 170 197 176 195 219 165 170 185 170 ...
##  $ fran      : int  135 157 139 140 152 124 NA 756 154 147 ...
##  $ helen     : int  394 422 410 410 NA 481 NA 458 440 NA ...
##  $ grace     : int  80 116 NA 100 NA 77 NA 110 NA 132 ...
##  $ filthy50  : int  819 1099 NA NA NA 1080 NA 1271 986 NA ...
##  $ fgonebad  : int  447 NA NA NA NA 451 NA 387 NA 413 ...
##  $ run400    : int  54 60 NA NA NA 62 NA 128 NA NA ...
##  $ run5k     : int  1083 1350 NA 1140 NA 1260 NA 1800 NA NA ...
##  $ candj     : int  345 280 335 309 335 305 285 120 265 275 ...
##  $ snatch    : int  275 225 265 234 245 255 235 85 215 215 ...
##  $ deadlift  : int  550 425 NA 441 545 485 425 135 485 385 ...
##  $ backsq    : int  455 345 455 390 465 465 385 225 465 380 ...
##  $ pullups   : int  75 50 NA 45 NA 53 NA 56 65 NA ...

We can see we have several variables with data on the athletes themselves (such as name, region, age, height, and weight), some variables related to the Open workouts and results (such as stage, category, score, and rank), and also some results for benchmark workouts by each athlete (such as fran, helen, snatch, and deadlift).

We are primarily interested in which factors are related to the athletes’ results, contained in the variables score and rank.

Summary of Features

Let’s briefly examine the features we will be using in our analysis in order to identify their distribution, any outliers, and also improve our knowledge of the data we will be working with.

We can see there are some pretty extreme values for height, weight, snatch, deadlift, and pullups that are getting in the way of our understanding the data. Let’s identify those outliers and remove them in order to make our analysis more robust.
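As an illustration of this kind of cleanup, the sketch below replaces implausible values with NA. The bounds are hypothetical (assuming heights in inches and weights in pounds) and are not the thresholds used in the original analysis; the toy data is made up.

```r
# Toy data with a few implausible entries (all values are made up).
athletes <- data.frame(
  height   = c(71, 70, 3, 68, 500),        # assumed to be inches
  weight   = c(198, 170, 197, 1500, 185),  # assumed to be pounds
  deadlift = c(550, 425, 441, 485, 9000)   # assumed to be pounds
)

# Hypothetical plausibility bounds (lower, upper) for each variable.
bounds <- list(
  height   = c(48, 96),
  weight   = c(80, 500),
  deadlift = c(45, 1100)
)

# Replace out-of-range values with NA instead of dropping whole rows,
# so the athlete's other measurements remain usable.
for (col in names(bounds)) {
  x <- athletes[[col]]
  out_of_range <- !is.na(x) & (x < bounds[[col]][1] | x > bounds[[col]][2])
  athletes[[col]][out_of_range] <- NA
}

print(athletes)
```

Setting outliers to NA rather than deleting rows keeps an athlete's valid score even when one self-reported benchmark is clearly wrong.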

Now this looks much better and provides us with a first look at the data we will be working with and its distribution.

Getting to know our athletes

Let’s get to know a little about the athletes we will be exploring further in this analysis.

Age

We see the distribution of the number of athletes by age is pretty similar for both men and women. We can also see the number of male athletes is larger in the 2015 Open.

Category

The competition is divided into two categories: Rx and Scaled. In the Rx category, athletes must complete the workouts exactly as prescribed. The Scaled category was created so the Open would be more accessible to a larger number of athletes, and has scaled-down versions of the Rx workouts.

Let’s see how the athletes are divided into both categories.

## 
##        Rx    Scaled 
## 0.8505327 0.1494673

This plot shows us that the vast majority of athletes (85%) in the 2015 Open are in the Rx category. The plot also shows the proportion of male and female athletes in both categories. While men outnumber women in the Rx category, the Scaled category includes more women than men.
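Proportions like these can be computed in base R with `table()` and `prop.table()`. The sketch below uses a small made-up sample; the data frame name and values are assumptions for illustration only.

```r
# Toy sample (made-up values), one row per athlete.
athletes <- data.frame(
  category = c("Rx", "Rx", "Rx", "Scaled", "Rx", "Scaled", "Rx", "Rx"),
  gender   = c("Male", "Female", "Male", "Female", "Male", "Male", "Female", "Male")
)

# Share of athletes in each category.
category_share <- prop.table(table(athletes$category))
print(category_share)

# Gender split within each category (margin = 1 makes each row sum to 1).
gender_by_category <- prop.table(
  table(athletes$category, athletes$gender),
  margin = 1
)
print(gender_by_category)
```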

Now let’s create a single plot where we will be able to see the distribution of age per category.

It is interesting to note that the center of the age distribution of Scaled athletes appears slightly higher than for Rx athletes. That is especially noticeable for men. It seems reasonable, because older athletes might find it more difficult to complete Rx workouts.

Let’s plot the histograms and some summary statistics for each category and gender to check that observation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   26.00   30.00   31.53   36.00   54.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   27.00   31.00   31.98   37.00   54.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   33.00   34.29   40.00   54.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   30.00   36.00   36.11   43.00   54.00

Indeed, we can see that while the average age for men and women in the Rx category is very close (about 31.5 and 32.0), the Scaled category has higher average ages for both genders (about 34.3 and 36.1).
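Grouped summaries like these can be produced with `by()` and `aggregate()`. This is a sketch on made-up data, not the original code; the column names mirror the data set, but the values are hypothetical.

```r
# Toy sample (made-up values), one row per athlete.
athletes <- data.frame(
  category = c("Rx", "Rx", "Rx", "Scaled", "Scaled", "Scaled", "Rx", "Scaled"),
  gender   = c("Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female"),
  age      = c(24, 31, 28, 40, 36, 33, 26, 45)
)

# Full five-number summary plus mean for every category/gender combination.
print(by(athletes$age,
         list(category = athletes$category, gender = athletes$gender),
         summary))

# Just the mean age per group, as a tidy data frame.
mean_age <- aggregate(age ~ category + gender, data = athletes, FUN = mean)
print(mean_age)
```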

CrossFit experience

Let’s find out how much CrossFit experience the athletes had before joining the Open.

Although we do not have this information for many athletes, we can see that many of those who reported it joined the 2015 Open with less than a year of CrossFit experience, which might show eagerness to participate in the event.

One aspect that might influence the choice of Scaled vs. Rx category is how long the athlete has been practicing CrossFit. It seems reasonable that more experienced athletes might be more inclined to opt for the Rx category.

We can see the proportion of Rx athletes is larger for athletes with over 2 years of experience, and smaller for athletes with between 6 months and 2 years of experience. One interesting observation is that most athletes with less than 6 months of CrossFit experience chose the Rx category. It is certainly a curious fact; however, since we unfortunately don’t have experience data for many of our athletes, we can’t draw many conclusions from it.
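Because the missing experience data limits what we can conclude, it helps to quantify the missing share alongside the Rx proportion per experience level. The sketch below does both on made-up data; the experience labels are hypothetical placeholders, not the actual factor levels of howlong.

```r
# Toy sample (made-up values); NA marks athletes who did not report experience.
athletes <- data.frame(
  howlong  = c("Less than 6 months", "1-2 years", NA, "2-4 years",
               NA, "1-2 years", "2-4 years", NA),
  category = c("Rx", "Scaled", "Rx", "Rx", "Rx", "Rx", "Rx", "Scaled")
)

# Fraction of athletes with no experience information.
missing_share <- mean(is.na(athletes$howlong))
print(missing_share)

# Proportion of Rx athletes within each reported experience level.
# tapply() silently drops rows whose grouping value is NA.
rx_share_by_experience <- tapply(athletes$category == "Rx", athletes$howlong, mean)
print(rx_share_by_experience)
```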

Regions

Finally, let’s take a look at where our athletes come from.

This plot shows that, despite being open to athletes from all over the world, the competition’s popularity is still heavily centered in North America, followed by Europe. Huge continents such as Africa and Asia still show very little participation in the Open.

Taking a look at the workouts

Now let’s take a look at some results.

As we mentioned, the workouts are referenced by their year and number, such as 15.1, 15.2, etc. We will refer to them using this pattern.

NOTE: The 2015 Open had a special workout in the first week, which we will refer to as 15.1a. In our data set, it is referred to as 1.1.

Also, it is important to mention the workouts can be one of two kinds. In the first kind, the athlete tries to achieve the highest possible number of repetitions, or reps, in the given timeframe. In the second kind, the athlete must complete the workout as fast as possible. The scores for the first kind of workout are measured in number of reps (which means the higher, the better), and for the second kind are measured in seconds (which means the lower, the better).

All but the last workout of the 2015 Open are of the first kind. Only the 15.5 workout score is measured in seconds.
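The direction of "better" matters whenever scores are compared or ranked. As a rough sketch, a hypothetical helper `rank_scores()` could rank athletes within one workout as shown below. It is a simplification: the official leaderboard may apply tie-break rules that are not modelled here, and the scores are made up.

```r
# Rank athletes within a single workout.
# Rep-based workouts: higher score is better.
# Timed workouts (such as 15.5): lower score is better.
rank_scores <- function(score, timed = FALSE) {
  if (timed) {
    rank(score, ties.method = "min")
  } else {
    rank(-score, ties.method = "min")
  }
}

reps  <- c(180, 150, 200, 150)  # made-up rep counts
times <- c(540, 495, 610, 495)  # made-up times in seconds

print(rank_scores(reps))                 # 2 3 1 3
print(rank_scores(times, timed = TRUE))  # 3 1 4 1
```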

Now let’s plot the distribution of scores by workout and division. Unless otherwise mentioned, we will focus on the Rx category for the analysis.

Several plots show peaks. The peaks are present in 15.1, 15.2, and 15.4, but they are particularly sharp and intriguing in workout 15.3. I should investigate further to find out what happened there.

That is a very interesting chart, but it warrants a closer look at each workout so we can better understand the story they are telling.

Workout 15.1

Workout 15.1 consisted of:

Complete as many rounds and reps as possible in 9 minutes of:
15 toes-to-bars
10 deadlifts (115 / 75 lb.)
5 snatches (115 / 75 lb.)

Every 30 reps represent a completed round of exercises. This particular workout is interesting because the first movement is a relatively easy gymnastics movement compared to the other two, which are weightlifting exercises. So the peaks in this plot show how many people struggled to perform the weightlifting exercises in each round. The dips at 15, 45, 75, and so on show that athletes rarely ended their workouts on the gymnastics movement, but rather on the heavier exercises.
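To make the round arithmetic concrete, a 15.1 score can be split into completed rounds and leftover reps, and the leftover reps tell us which movement the athlete was on when time ran out. The helper below, `describe_15_1()`, is a hypothetical name introduced for illustration, and the scores are made up.

```r
# One round of 15.1 = 15 toes-to-bars + 10 deadlifts + 5 snatches = 30 reps.
describe_15_1 <- function(score) {
  rounds <- score %/% 30   # fully completed rounds
  extra  <- score %% 30    # reps into the unfinished round

  # Reps 1-15 of a round are toes-to-bars, 16-25 deadlifts, 26-30 snatches.
  last_movement <- ifelse(extra == 0, "round completed",
                   ifelse(extra <= 15, "toes-to-bars",
                   ifelse(extra <= 25, "deadlifts", "snatches")))

  data.frame(score, rounds, extra_reps = extra, last_movement)
}

print(describe_15_1(c(45, 80, 118, 120)))
```

A score of 45, for instance, is one full round plus all 15 toes-to-bars of the second round, which is exactly the kind of score the dips in the plot correspond to.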

Since the last 2 exercises for each round are the deadlift and the snatch, and our data set contains benchmarks for those exercises for some athletes, let’s check if their maximum weight lifted had any relationship to their results in this workout.

Even though the data is very dispersed, we can see a positive trend between the athlete’s record deadlift and their score on this workout, both for male and female athletes.
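One way to put a number on a trend like this is a correlation coefficient or the slope of a simple linear fit. The sketch below is not part of the original analysis, which relies on the plot; the data is made up, and NA stands for athletes who did not report a deadlift.

```r
# Toy sample (made-up values): record deadlift in pounds and 15.1 score in reps.
athletes <- data.frame(
  deadlift = c(300, 350, NA, 400, 450, 500, 275, NA),
  score    = c(120, 135, 150, 150, 170, 165, 110, 140)
)

# Pearson correlation, using only athletes with both values present.
r <- cor(athletes$deadlift, athletes$score, use = "complete.obs")
print(r)

# Simple linear fit; lm() drops rows with missing values by default.
fit <- lm(score ~ deadlift, data = athletes)
print(coef(fit))
```

A positive correlation and a positive deadlift coefficient would both agree with the upward trend seen in the scatter plot, though neither says anything about causation.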

Now let’s run the same analysis for the snatch.

The same can be said for the snatch. Though the data is also very dispersed, we can see a positive trend between the athlete’s record snatch and their score on this workout.

Workout 15.1a

Workout 15.1a consisted of:

1-rep-max clean and jerk
6-minute time cap

In other words, the athlete had 6 minutes to perform the heaviest clean and jerk he or she could manage. This is a workout where strength is critical.

Let’s see the distribution of scores for this workout.

The distribution of scores looks very similar for both divisions.

Since strength is vital for this workout, I wonder if bodyweight has any relation to the score. Let’s make a plot of result by weight and find out.