How to Scrape Sports Data with Python

Felicitations, malefactors! This is my first post ever, so thank you for joining me today. The objective of this post is to help you hit the ground running with quick data analysis and visualization of whatever sports data interests you. My programming language of choice is Python, because it's easy to learn and extremely useful for data manipulation.


I firmly believe that the best way to learn is by doing. The goal here is to give you a taste of what it takes to accomplish this particular task, and if you're interested enough, hopefully you'll revisit the fundamentals of programming with a passion.


Here's a high-level overview of what we're going to accomplish in this tutorial. We're going to:


  1. Choose what type of data we want to collect and investigate. In this tutorial, I'm going to look at fantasy football data from the 2019 NFL season.
  2. Scrape that data and store it somewhere.
  3. Create a very basic visualization of it.
  4. Make some inferences from our dataset.

1. Choose data to collect

In this tutorial, I'm choosing to collect fantasy football data from the 2019 NFL season. I'm going to assume that you already have Python installed on your computer, but for the sake of this tutorial, we'll use a Google Colaboratory notebook. Google Colab is an interactive Python notebook that allows you to write and execute Python code in your browser. If you are having trouble installing Python, please use Colab for the time being; it will let you try some code out right away.


Let's start by targeting some fantasy football data to play with. I'm choosing to gather some data from FantasyPros.com. If we navigate to their Fantasy Leaders page, we can see the top fantasy scorers from the most recent season. The table in the middle of the page is the one we're interested in.

2. Scrape our data

 

Python is extremely powerful because of the abundance of open source libraries that exist to tackle any particular task. Pandas is a Python library for data manipulation and analysis that is ridiculously useful for this scenario, and Seaborn is a data visualization library that makes displaying charts and graphs extremely easy. There are tons of other Python libraries to leverage, but we'll start with these.

In your Colab notebook, execute the following code in a code cell.

The pandas function read_html, when provided with a url, will find every <table> on that page and return a list of dataframes (tabular data structures), one per table. It returns a list even when the page has only one <table>; in that case the list simply holds a single dataframe. To check how many dataframes came back, examine the contents of our df variable, for example by checking its length.
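Here is a minimal sketch of that behavior. It parses a tiny inline HTML snippet instead of the real page so it runs without a network connection; the two tables and every value in them are placeholders I made up, not FantasyPros data. For the real thing you would pass the page's address to read_html instead.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in page with two tables; all names and values are placeholders.
html = """
<table>
  <tr><th>Player</th><th>Points</th></tr>
  <tr><td>Player A</td><td>300.5</td></tr>
  <tr><td>Player B</td><td>290.1</td></tr>
</table>
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Team X</td><td>10</td></tr>
</table>
"""

# With a live page you would write: df = pd.read_html(url)
df = pd.read_html(StringIO(html))

print(type(df))  # a plain Python list
print(len(df))   # how many <table> elements were parsed
```

Note that read_html needs an HTML parser such as lxml to be installed; Colab ships with one.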

Here we can see that a list of seven dataframes was returned. If we examine the contents of this list, we can see that the first dataframe holds the data we're looking for, and the other six tables are outside the scope of our example. I'm going to set our df variable to only the first element in our list. Take a look at df and see what it looks like. Your dataframe may look something like this:

Our dataframe appears to be over 800 rows long and contain 7 data fields (columns). Here we can start analyzing our dataset with some basic graphical visualizations.
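Picking the first table and checking its size could look roughly like this. The list below is a hand-built stand-in for what read_html returns, with placeholder values, so the numbers it prints will not match the real 800-plus rows and 7 columns.

```python
import pandas as pd

# Stand-in for the list returned by read_html; all values are placeholders.
df = [
    pd.DataFrame({"Player": ["Player A", "Player B"], "Points": [300.5, 290.1]}),
    pd.DataFrame({"Team": ["Team X"], "Wins": [10]}),
]

# Keep only the first table in the list.
df = df[0]

# shape is a (rows, columns) tuple; columns lists the field names.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
print(df.columns.tolist())
```

In a notebook, putting df.head() on the last line of a cell displays the first five rows as a formatted table.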


Let's say we want to look at the top 10 overall fantasy performers from 2019. Execute the following code:

Executing this code results in the following graph:

The seaborn library contains the function .barplot() that, when provided a dataframe and data fields, will generate a bar plot based on those arguments. In this scenario, we are only concerned with the top 10 scorers - applying the .head() method with a value of 10 will use only the top 10 entries in our dataframe. We then specify "Player" and "Points" to plot by.
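The call described above might look something like the sketch below. The dataframe here is made-up placeholder data, and I'm assuming the table is already sorted by points from highest to lowest, which is what makes .head(10) equal to "the top 10".

```python
import pandas as pd
import seaborn as sns

# Placeholder table, already sorted by points (highest first).
# Names and values are made up for illustration.
df = pd.DataFrame({
    "Player": [f"Player {i}" for i in range(1, 13)],
    "Points": [400 - 10 * i for i in range(1, 13)],
})

# Keep only the first ten rows.
top10 = df.head(10)

# One bar per player, bar height taken from the Points column.
ax = sns.barplot(x="Player", y="Points", data=top10)

# Rotate the x-axis labels so longer names do not overlap.
ax.tick_params(axis="x", labelrotation=90)
```

Colab displays the figure automatically when the cell finishes running.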


If you're a relatively knowledgeable FF player, you'll notice that 9 of the top 10 overall fantasy producers are quarterbacks (Christian McCaffrey, a running back, is the lone exception because he's a fucking beast). To show each player's position alongside the name, we can create a new column in our dataframe that combines the two. We can then view our top ten fantasy performers with this additional context.
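One way to build that combined column is plain string concatenation on the two columns. In this sketch the column name "Position" is my assumption about how the scraped table labels it, "Player (Pos)" is a name I picked for the new column, and the rows are placeholders.

```python
import pandas as pd

# Placeholder rows standing in for the scraped table; values are made up.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Position": ["QB", "RB", "QB"],  # assumed column name
    "Points": [300.0, 290.5, 280.2],
})

# Combine name and position into a label such as "Player A (QB)".
df["Player (Pos)"] = df["Player"] + " (" + df["Position"] + ")"

print(df[["Player (Pos)", "Points"]])
```

The new column can then be used as the x-axis field in the bar plot in place of "Player".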

And there we go! We can now see that of the top 10 fantasy scorers in 2019, 9 were quarterbacks, with only 1 running back landing in the top 10.


It seems like I've hit a limit on this page, so for continued analysis of our dataset, follow me to part 2 of this tutorial.