I am a data scientist working on time series forecasting (using R and Python 3) at the London Ambulance Service NHS Trust. I earned my PhD in cognitive neuroscience at the University of Glasgow, working with fMRI data and neural networks. I favour Linux machines and working in the terminal, with Vim as my editor of choice.
Creating a searchable database of UFC contests using web-scraping in Python 3
The full code I’ve written so far can be found here.
The Goal
I'm a fan of the Ultimate Fighting Championship (UFC), and often I want to
watch an old fight of a particular fighter but I don't remember all their
previous opponents or which UFC events they fought in.
I decided to create a simple database containing a row for each fight in
each UFC event, with the weight category and the fighter names. I can then pipe
the file contents into fzf for an interactive search that quickly shows me
all the UFC cards a fighter has fought on and who their opponent was.
With the following alias in my .bashrc file:
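The alias itself wasn't preserved here, but it might look something like the following sketch - the alias name, file path, and fzf flags are my guesses, not the original:

```shell
# Hypothetical alias: pipe the fight database into fzf for interactive search
# (the file name and the options used are assumptions, not from the original post)
alias ufc="cat ~/ufc_fights.txt | fzf --multi | sort"
```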
I can easily whittle down the results to a specific fighter:
The selected results are sorted and sent to STDOUT:
The Plan
Wikipedia has a page listing all UFC
events, featuring a table with links to a page about each event. On
these individual event pages there is a table containing the information I want. So
I want a Python script that can go to the list of events page, follow
each link to the individual event pages, and pull out the correct table.
We'll get the link URLs using BeautifulSoup. Handily, there is a dedicated
module just for accessing the HTML of any Wikipedia page. Once we have the
HTML, we can use the pandas module to read the table of interest into a
dataframe object. Then it's a matter of cleaning the data and entering it into
the database. The goal is the following txt file (containing 1175 fights):
The Code Implementation
After importing the modules we initialise a table:
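The setup might look like this - the variable name `table` is my assumption:

```python
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Each entry will become one comma-separated row of the database
# (the name 'table' is an assumption, not necessarily the original)
table = []
```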
Next we send an HTTP request to get the list of events page using the
requests module and then scrape the content with BeautifulSoup:
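A minimal sketch of that step, assuming the standard requests/BeautifulSoup pattern (the function name and URL constant are mine):

```python
import requests
from bs4 import BeautifulSoup

# The Wikipedia page listing every UFC event
EVENTS_URL = "https://en.wikipedia.org/wiki/List_of_UFC_events"

def fetch_soup(url):
    """Send an HTTP GET request and parse the response body with BeautifulSoup."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a bad status code
    return BeautifulSoup(response.text, "html.parser")

# soup = fetch_soup(EVENTS_URL)  # a live network call, so commented out here
```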
The main body of the script is a loop where we scan through each link on the
list of events page and determine whether it is a link to an individual UFC
event. This is done with a simple regular expression: we search each link for
the string 'UFC_' followed by some number of digits. There can be more than one
link to the same page, so we make sure not to try the same link twice. If
we find the string, and we've not yet tried the link, we check that the event is
later than the existing entries in the database ('latest') and that it's not a
future scheduled event ('future'). If that's the case, we go ahead and read the
page HTML and pull the 3rd table, which contains the information we want, into
a pandas dataframe:
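The link-filtering logic can be sketched like this on a few illustrative hrefs (the variable names and sample links are mine, not from the original); reading the 3rd table from a fetched event page would then be a call like `pd.read_html(html)[2]`:

```python
import re

# Illustrative hrefs standing in for the links found on the events page
hrefs = ["/wiki/UFC_295", "/wiki/UFC_295", "/wiki/Dana_White", "/wiki/UFC_296"]

tried = set()       # links we have already processed
event_links = []    # (event number, href) pairs worth fetching
for href in hrefs:
    match = re.search(r"UFC_(\d+)", href)
    if match and href not in tried:  # an event link we haven't seen yet
        tried.add(href)
        event_links.append((int(match.group(1)), href))
```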
Next we prepend a zero-padded UFC event number - this will allow us to sort the
database chronologically when we're done. After re-indexing the dataframe we
extract the data we want and label the columns.
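A sketch of that step with a one-row dataframe (the column names and layout are my assumptions):

```python
import pandas as pd

# One illustrative fight row pulled from an event table
df = pd.DataFrame([["Lightweight", "Fighter A", "Fighter B"]])

# Zero-padding means plain lexicographic sorting is also chronological
event_number = 9
df.insert(0, "event", f"UFC {event_number:03d}")
df.columns = ["event", "weight_class", "fighter_1", "fighter_2"]
```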
Then we extract the section of the table that lists the fights that appeared
on the main card (skipping the preliminary fights) and convert the data to
comma-separated strings:
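On the event tables, 'Main card' and 'Preliminary card' appear as marker rows, so one way to sketch this step (the table layout and names here are illustrative, not the original data):

```python
import pandas as pd

# A toy event table: section markers are rows, as on the Wikipedia pages
df = pd.DataFrame({
    0: ["Main card", "Lightweight", "Welterweight", "Preliminary card", "Flyweight"],
    1: ["", "Fighter A", "Fighter C", "", "Fighter E"],
    2: ["", "Fighter B", "Fighter D", "", "Fighter F"],
})

# Keep only the rows between the 'Main card' and 'Preliminary card' markers
prelim_idx = df[df[0] == "Preliminary card"].index[0]
main_card = df.iloc[1:prelim_idx]

# One comma-separated string per fight
rows = [",".join(fight) for fight in main_card.astype(str).values]
```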
If I don't remember the winner of a fight, I would rather not have that spoiler
before I re-watch it. And since the tables on Wikipedia invariably list the
victor followed by the loser, I shuffle the order for my database:
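A small sketch of the shuffle, assuming each fight is held as a (winner, loser) pair - the helper name is mine:

```python
import random

def hide_winner(winner, loser):
    """Return the two names in random order so the row doesn't spoil the result."""
    pair = [winner, loser]
    random.shuffle(pair)  # in-place shuffle: either order with equal probability
    return pair

fighters = hide_winner("Fighter A", "Fighter B")
```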
I noticed that a couple of links give me an error, so in those cases we just
move on to the next link. If we are just adding in some recent events to an
existing database, then we remove the header from the table since we already
have that. Then we sort the list by the event number and write the data to a
txt file:
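The final steps might look like this - skipping a failed link would just be a `try`/`except` that `continue`s to the next iteration, and the file name and row format below are my assumptions:

```python
# Illustrative rows; the zero-padded event number makes sorting chronological
table = [
    "UFC 010,Lightweight,Fighter C,Fighter D",
    "UFC 002,Welterweight,Fighter A,Fighter B",
]

table.sort()  # lexicographic order == chronological order, thanks to zero-padding
with open("ufc_fights.txt", "w") as out:
    out.write("\n".join(table) + "\n")
```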