Using Bash and grep to scrape the AP Poll

In a previous post, I discussed how I love using the Unix pipeline to combine the small tools readily available on most *nix systems. Sometimes I need just a bit more power than that. That's where Bash comes in. And while Bash is certainly a programming language in its own right, I find that I most commonly use it the same way that I use pipelines: as a connection between the other tools that already exist on my system.

As a moderator on Reddit's /r/CFB college football subreddit, there are a great many pieces of information that I want to be able to acquire quickly and get posted for user discussion. The first of these that I automated was The Associated Press' Top 25 Poll. Unfortunately, there's no free API for pulling this data. That left me to scrape the website itself.

While much of the mod team's automation relies on Python and other more advanced toolsets, I chose to instead use Bash. Since scraping the site is mostly simple text searching and tying the results together, the script largely takes the form of a pile of regular expressions run through grep.

For reference, I'm trying to make something that comes out looking like this, but I need to output it in Markdown:

Rank  Team        Rec   Δ   Points
1     Alabama     13-0  -   1,525(61)
2     Clemson     13-0  -   1,460
3     Notre Dame  12-0  -   1,405
4     Oklahoma    12-1  +1  1,327
5     Ohio State  12-1  +1  1,254

The Regular Expressions

I'll skip to scraping the information from the page, as that's far more interesting than the code that puts everything together. I start by downloading the latest edition of the poll rankings and putting it in a temporary file.

The code for a single row on the table looks like the following.

                              <tr class="team-row row-3829" id="row-3829">
              <td class="trank">1</td>
              <td class="trend arrow-"><i class="ico-arrow- ap-icon-arrow"></i>&#45; </td>
              <td class="tname"><img class="tlogo" src="https://collegefootball.ap.org/sites/default/files/taxonomy/logo/AlabamaCrimsonTide.png" alt="Alabama" title="Alabama"/> <a href="/utk/teams/alabama">Alabama</a></td>
                            <td class="tconf">SEC</td>
                            <td class="ovr-rec">13-0</td>
              <td class="tpoints">1,525(61) </td>
              <td class="gscore ">
                                    @ Georgia <span class="ap-poll-W">W</span> 35-28
                              </td>
              <td class="team-share single-item-share ap-ga-track">
                <a href="javascript:void(0);">Share</a>
                <div class="single-item-share-wrap">
                  <a title="Facebook" href="http://www.facebook.com/sharer.php?u=http://collegefootball.ap.org/utk/poll/2018/15?id=3829" target="_blank"><i class="ico-facebook-square"></i></a>
                  <a title="Twitter" href="http://twitter.com/share?url=http://collegefootball.ap.org/utk/poll/2018/15?id=3829" target="_blank"><i class="ico-twitter"></i></a>
                  <a title="Email" class="icon-mail-share" href="mailto:?&subject=College Football:&body=http://collegefootball.ap.org/utk/poll/2018/15?id=3829"><i class="ico-envelope"></i></a>
                  <div class="item-permalink">    
                    <i class="ico-permalink"></i> 
                    <a title="Copy Link" class="item-copy" id="copy" href="javascript:void(0);">Copy link</a>
                    <input type="text" value="http://collegefootball.ap.org/utk/poll/2018/15?id=3829" class="copy-link">
                  </div>
                </div>
              </td>
            </tr>

The script loops over the numbers from 1 to 25 and searches for that trank class. All the information I need for the remainder of this row is in the next five lines, so I grab the following five lines and dump them to another temporary file.

grep -P "(?<=trank\"\>)$i(?=</td>)" -A 5 raw.txt > tmp

This gives the following output, which is far easier to read:

              <td class="trank">1</td>
              <td class="trend arrow-"><i class="ico-arrow- ap-icon-arrow"></i>&#45; </td>
              <td class="tname"><img class="tlogo" src="https://collegefootball.ap.org/sites/default/files/taxonomy/logo/AlabamaCrimsonTide.png" alt="Alabama" title="Alabama"/> <a href="/utk/teams/alabama">Alabama</a></td>
                            <td class="tconf">SEC</td>
                            <td class="ovr-rec">13-0</td>
              <td class="tpoints">1,525(61) </td>
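
The lookahead on `</td>` in the row-matching pattern is what keeps the search for rank 1 from also matching ranks 10 through 19. Here is a minimal sketch of that, assuming GNU grep with `-P` support. The sample data is made up for illustration and is much shorter than the real rows:

```shell
#!/usr/bin/env bash
# Hypothetical sample data: two ranks that both start with the digit 1.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td class="trank">1</td>
<td class="ovr-rec">13-0</td>
<td class="trank">10</td>
<td class="ovr-rec">10-2</td>
EOF

i=1
# Without the lookahead, "1" also matches the start of "10".
loose=$(grep -cP "(?<=trank\">)$i" "$sample")
# With the lookahead, only the cell whose whole content is "1" matches.
strict=$(grep -cP "(?<=trank\">)$i(?=</td>)" "$sample")

echo "loose matches: $loose, strict matches: $strict"
rm -f "$sample"
```

On this sample the loose pattern matches two lines and the strict one matches a single line, which is why the loop can safely reuse the same pattern for every rank.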

Working from left to right in the table I want to generate, the first thing I want to find is the team's name. I pull that from the title of the image of the team's logo.

TEAM=$(grep -oP '(?<=title=").*?(?="/>)' tmp)

The next thing I want to find is the team's win/loss record. This is in the ovr-rec field of the table.

RECORD=$(grep -oP '(?<=ovr-rec\">).*?(?=<\/td)' tmp)
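
To see what `-o` and the lookarounds are doing in those two commands, here is a small sketch that runs them against two lines copied from the sample row above. It assumes GNU grep with `-P`; the scratch file and the lowercase variable names are only for this illustration:

```shell
#!/usr/bin/env bash
# Two lines copied from the sample row, stored in a scratch file.
row=$(mktemp)
cat > "$row" <<'EOF'
<td class="tname"><img class="tlogo" src="https://collegefootball.ap.org/sites/default/files/taxonomy/logo/AlabamaCrimsonTide.png" alt="Alabama" title="Alabama"/> <a href="/utk/teams/alabama">Alabama</a></td>
<td class="ovr-rec">13-0</td>
EOF

# -o prints only the matched text. The lookbehind and lookahead must be
# present around the match, but they are not part of what gets printed.
team=$(grep -oP '(?<=title=").*?(?="/>)' "$row")
record=$(grep -oP '(?<=ovr-rec">).*?(?=</td)' "$row")

echo "team=$team record=$record"
rm -f "$row"
```

This prints `team=Alabama record=13-0`, with none of the surrounding markup.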

Now for the most annoying part of this whole thing, the delta. The AP website represents the change in rankings with an arrow indicating which direction the team has moved, followed by a number indicating how far they've moved. This means that we're going to have to use \*gasp\* logic!

If a team has moved up, the arrow will be ico-arrow-up. If the team has moved down, the arrow will be ico-arrow-down. If the team hasn't moved, I don't need to find anything because there are only three cases and I've already tested for the other two. (Hooray for laziness!)

I search for both the arrow and the number that follows it, limiting it to a maximum of two digits, just in case. I don't remember exactly why but I've stored the plus sign ( + ) and the minus sign ( - ) in their own variables. I think that was to make it easier to add those to the change number.

        if grep -q 'ico-arrow-up' tmp; then
                # Team moved up: grab the number after the up arrow and prefix it with "+"
                DELTA=$(grep -oP '(?<=ico-arrow-up ap-icon-arrow\"><\/i>)[0-9]{1,2}' tmp)
                DELTA=$PLUSSIGN$DELTA
        elif grep -q 'ico-arrow-down' tmp; then
                # Team moved down: grab the number after the down arrow and prefix it with "-"
                DELTA=$(grep -oP '(?<=ico-arrow-down ap-icon-arrow\"><\/i>)[0-9]{1,2}' tmp)
                DELTA=$MINUSSIGN$DELTA
        else
                # Team did not move
                DELTA="-"
        fi
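
The same three-way logic can be tried against single lines without downloading anything. The sketch below wraps it in a hypothetical `get_delta` helper that reads one trend cell on standard input. Only the first test line comes from the sample row; the "up" and "down" lines are assumed shapes inferred from the patterns above, not copied from the AP site:

```shell
#!/usr/bin/env bash
# get_delta is a hypothetical helper: it reads one "trend" cell on stdin
# and prints +N, -N, or - for no movement.
get_delta() {
    local cell direction n
    cell=$(cat)
    for direction in up down; do
        if grep -q "ico-arrow-$direction" <<<"$cell"; then
            # Same lookbehind as the script, with the direction filled in.
            n=$(grep -oP "(?<=ico-arrow-$direction ap-icon-arrow\"></i>)[0-9]{1,2}" <<<"$cell")
            if [ "$direction" = up ]; then echo "+$n"; else echo "-$n"; fi
            return
        fi
    done
    # Neither arrow was found, so the team did not move.
    echo "-"
}

# First line: from the sample row. Other two: assumed markup.
get_delta <<<'<td class="trend arrow-"><i class="ico-arrow- ap-icon-arrow"></i>&#45; </td>'
get_delta <<<'<td class="trend arrow-up"><i class="ico-arrow-up ap-icon-arrow"></i>1 </td>'
get_delta <<<'<td class="trend arrow-down"><i class="ico-arrow-down ap-icon-arrow"></i>12 </td>'
```

With those three inputs the helper prints `-`, `+1`, and `-12` in that order.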

Finally, I need to find how many points each team got in the poll. If you're interested in how the points are assigned, the methodology is listed at the bottom of each poll post. The number of points, along with the number of first-place votes a team received, is in the tpoints field.

POINTS=$(grep -oP '(?<=tpoints\">).*(?=<\/td)' tmp)
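
My script keeps the points and first-place votes together as one string, but if they were ever needed separately, Bash parameter expansion could pull them apart. This is only a sketch: `split_points` is a hypothetical helper, and treating a missing parenthesis as zero first-place votes is an assumption on my part.

```shell
#!/usr/bin/env bash
# split_points is a hypothetical helper that separates a tpoints value such as
# "1,525(61) " into total points and first-place votes.
split_points() {
    local raw=$1 points votes
    raw=${raw//[[:space:]]/}     # drop the stray trailing space
    points=${raw%%\(*}           # everything before the first "("
    if [[ $raw == *"("*")" ]]; then
        votes=${raw#*\(}         # everything after the "("
        votes=${votes%\)}        # minus the closing ")"
    else
        votes=0                  # assumption: no parenthesis means no first-place votes
    fi
    echo "$points $votes"
}

split_points "1,525(61) "
split_points "1,460"
```

The two calls print `1,525 61` and `1,460 0`.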

That's everything I need from the table. After getting this data for each team, I append the information to a temporary file which is output after all data has been acquired.

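There is, however, one more piece of information to scrape: the other teams that received votes. That list is not part of any team's row, so I search the full download rather than the per-team file. This one is quite easy:

grep -oP 'Others receiving votes:.*(?=\.<)' raw.txt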

Putting it all together

The rest of the code is pretty simple. Grab the source material. Loop across it. Dump things into files. Output those files. Rather than describe it in depth, I've included the full commented code below.

# To use, pass the current AP poll URL as the only argument, e.g.:
# bash /path/to/this/file https://collegefootball.ap.org/poll/2019/3

# Check if a URL was input. Exit if no input.
if [ "$1" = "" ]; then
        echo "NO URL PROVIDED. EXITING."
        exit 1
fi

# Create temporary working directory.
TMPPATH=$(date +%s)
mkdir /tmp/$TMPPATH
cd /tmp/$TMPPATH

# Grabs everything from the beginning of the rankings table to the end of the Others receiving votes
# The URL is quoted so characters such as ? and & are passed to curl untouched.
curl -s "$1" | grep -A 616 'poll-content' > raw.txt

# Setup header
echo "## [AP](#l/ap) [AP Poll]($1)" > output.txt
echo "" >> output.txt
echo "|Rank|Team|Rec|&Delta;|Points|" >> output.txt
echo "|-|-|-|-|-|" >> output.txt

# Loop through rankings, put them into a temporary file
for i in {1..25}; do
        grep -P "(?<=trank\"\>)$i(?=</td>)" -A 5 raw.txt > tmp
        RANK=$i
        TEAM=$(grep -oP '(?<=title=\").*?(?=\"\/>)' tmp) 
        RECORD=$(grep -oP '(?<=ovr-rec\">).*?(?=<\/td)' tmp)

        PLUSSIGN="+"
        MINUSSIGN="-"

        if grep -q 'ico-arrow-up' tmp; then
                #Code for if the team moved up
                DELTA=$(grep -oP '(?<=ico-arrow-up ap-icon-arrow\"><\/i>)[0-9]{1,2}' tmp)
                DELTA=$PLUSSIGN$DELTA
        elif grep -q 'ico-arrow-down' tmp; then
                #Code for if the team moved down
                DELTA=$(grep -oP '(?<=ico-arrow-down ap-icon-arrow\"><\/i>)[0-9]{1,2}' tmp)
                DELTA=$MINUSSIGN$DELTA
        else
                DELTA="-"
        fi

        POINTS=$(grep -oP '(?<=tpoints\">).*(?=<\/td)' tmp)

        echo "| $RANK | $TEAM | $RECORD | $DELTA | $POINTS |" >> output.txt
done

echo "" >> output.txt
# Grab "Others receiving votes"
grep -oP 'Others receiving votes:.*(?=\.<)' raw.txt >> output.txt

echo "" #Blank space for easier copying and pasting
cat output.txt
echo "" #Blank space for easier copying and pasting

cd - > /dev/null
echo "Working files and output are in /tmp/$TMPPATH if needed."
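
One thing I might change if I rewrote this: naming the working directory after the current second works, but two runs in the same second would collide. A hedged alternative, not what the script above does, is to let `mktemp -d` pick a unique name. The `workdir` variable is just for this sketch:

```shell
#!/usr/bin/env bash
# Sketch only: an alternative to naming the directory after the current time.
# mktemp -d creates a uniquely named directory and prints its path.
workdir=$(mktemp -d) || exit 1

# Optional cleanup on exit. Leave this line out to keep the files around
# for inspection, as the script above does.
trap 'rm -rf "$workdir"' EXIT

cd "$workdir" || exit 1
echo "working in $workdir"
```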

Some other thoughts

Versions matter

I did encounter one problem when coding all this. I was working from my favorite ice house, so I was using my laptop. macOS ships BSD grep rather than the GNU grep included with most Linux systems, and the BSD version does not support the -P (Perl-compatible regular expression) option that these patterns rely on. As a result, I had to do all testing on our automation server rather than locally.

Sams-MBP:~ sam$ grep --version
grep (BSD grep) 2.5.1-FreeBSD
Sams-MBP:~ sam$
sam@CFB-server:~$ grep --version
grep (GNU grep) 3.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
sam@CFB-server:~$
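
A script like this could check for the feature it needs up front instead of failing partway through. The sketch below tests whether a given grep accepts `-P` and handles a lookahead. `supports_pcre` is a hypothetical helper name, not something built into grep or Bash:

```shell
#!/usr/bin/env bash
# supports_pcre is a hypothetical helper: it succeeds only if the given grep
# binary accepts -P and handles a lookahead correctly.
supports_pcre() {
    local grep_bin=${1:-grep}
    [ "$(printf 'abc\n' | "$grep_bin" -oP 'a(?=b)' 2>/dev/null)" = "a" ]
}

if supports_pcre grep; then
    echo "grep supports -P"
else
    echo "grep lacks -P; run this on a machine with GNU grep" >&2
fi
```

Testing the behaviour directly avoids having to parse the `--version` output, which differs between the two implementations as shown above.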

Don't assume anything!

My code assumes a lot of things.

It assumes consistency in the AP's website. This is something that's hard to code around. Changes to the AP's website forced me to recode this script twice this season, as the AP first changed the design of their site and then later tweaked some of the table output.
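
The script can't predict a redesign, but it could at least notice one. The sketch below is a guard that complains when any scraped field comes back empty, which is what happens when a pattern stops matching. `require_fields` is a hypothetical helper, and the values are set by hand here to simulate a failed match:

```shell
#!/usr/bin/env bash
# require_fields is a hypothetical guard: given a rank followed by
# name=value pairs, it reports every empty value and returns non-zero.
require_fields() {
    local rank=$1 pair status=0
    shift
    for pair in "$@"; do
        if [ -z "${pair#*=}" ]; then
            echo "rank $rank: could not find ${pair%%=*}; has the page layout changed?" >&2
            status=1
        fi
    done
    return "$status"
}

# Same variable names as the script. RECORD is left empty on purpose
# to simulate a pattern that no longer matches.
TEAM="Alabama" RECORD="" POINTS="1,525(61)"
if ! require_fields 1 "TEAM=$TEAM" "RECORD=$RECORD" "POINTS=$POINTS"; then
    echo "skipping rank 1"
fi
```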

It also assumes that there will only be one team at each rank. Multiple times this season, there were ties in the rankings. Fortunately, the output was mostly correct, so it was easier to fix the output by hand than it was to build a version of this that could handle ties.
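
Short of handling ties properly, the script could flag them so I know which rows to fix by hand. The sketch below counts how many rows carry each rank, using the same pattern as the main loop. `count_rank` is a hypothetical helper, and the sample data is made up: I'm assuming a tie shows up as two rows with the same rank and no row for the next one.

```shell
#!/usr/bin/env bash
# count_rank is a hypothetical helper: it prints how many rows in a file
# carry the given rank, using the same pattern as the main loop.
count_rank() {
    local rank=$1 file=$2
    grep -cP "(?<=trank\">)$rank(?=</td>)" "$file"
}

# Made-up sample in which two teams share rank 24 and rank 25 is missing.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td class="trank">23</td>
<td class="trank">24</td>
<td class="trank">24</td>
EOF

for i in 23 24 25; do
    n=$(count_rank "$i" "$sample")
    case $n in
        1) ;;   # the normal case: exactly one team at this rank
        0) echo "rank $i: no row found (possible tie above)" ;;
        *) echo "rank $i: $n rows found (tie), fix by hand" ;;
    esac
done
rm -f "$sample"
```

On this sample it reports two rows at rank 24 and none at rank 25, and stays quiet about rank 23.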