Redesigned Course Catalog Project: The Parser

The Fall 2018 Bard College Course List (with a small edit)

The Context

One year ago I was preparing to enter my senior year at Bard College. Computer science majors at Bard are expected to pursue a research-based senior project within their adviser’s area of expertise. My adviser, Professor Sven Anderson, encouraged me to do some work over the summer of 2017 in preparation for the coming semester.

In Professor Anderson’s Mobile Application Development course from Fall 2017, my partner and I were assigned to develop a health and fitness tracker for Bard students. It would use the dining commons menu, which included caloric data and serving sizes, to make informed suggestions about users’ fitness routines and meals.

I contacted the official dining services staff hoping to gain access to the API they used to push daily menu updates to their web site, but nobody I spoke to could confirm that any API existed at all. Meanwhile, just inspecting the HTML of their site was enough to tell me that some “public” API was being used in the background.

Because the application depended on daily menu access, I developed a system using Selenium to automatically scrape menu items on a daily basis and Flask to serve them in a JSON API. With that experience in web automation and scraping, I knew I’d be returning to web development.

The Project

The official Bard College course list is built upon a web reporting platform called WebFOCUS from around 2003. Compared to the course lists of competing liberal arts institutions, its design and features are lacking. I did not initially intend to redesign the course list; I just meant to serialize it for my own records. As a self-described datahoarder, I like having backups.

I went so far as to try manually transcribing the online course lists for each semester into JSON format, and after around 40 hours of copy-pasting I decided there had to be a better solution. BeautifulSoup is a Python library for extracting data from messy web pages. I knew its limits and decided it would be of use in this project.

By far the most frustrating part of using BeautifulSoup to parse the course list is that the tables containing course data are formatted inconsistently, both across semesters and often within individual pages. In 2014 a new naming system for course distribution fulfillments was introduced, further complicating the mess of old HTML tags and the widely varied table dimensions caused by differences in scheduling format.

Courses featured on the official Mathematics course list for Spring 2018. Notice that the first two are Learning Commons courses and thus do not fulfill any distribution requirements. The issue is that the tables are different sizes.
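One way to cope with tables of different sizes is to pick a column layout from the number of cells in each row. The sketch below is only an illustration, not the parser’s actual code: the column names, the two layouts, and the sample rows are all made up, and it assumes the cell text has already been pulled out of the HTML (for example with BeautifulSoup).

```python
# Hypothetical column layouts keyed by the number of cells in a row.
# The real course list tables may use different columns and widths.
LAYOUTS = {
    5: ["code", "title", "professor", "schedule", "distribution"],
    4: ["code", "title", "professor", "schedule"],  # no distribution cell
}


def row_to_course(cells):
    """Map a list of cell strings to a course dict based on row width."""
    layout = LAYOUTS.get(len(cells))
    if layout is None:
        raise ValueError(f"unrecognized row width: {len(cells)}")
    course = dict(zip(layout, (cell.strip() for cell in cells)))
    # Rows without a distribution cell get an explicit None,
    # so every course ends up with the same set of keys.
    course.setdefault("distribution", None)
    return course


# Made-up sample rows standing in for cells extracted from a page.
regular = row_to_course(["MATH 141", "Calculus I", "Staff", "Tu Th 10:10 am", "MC"])
learning_commons = row_to_course(["BLC 190", "Algebra Workshop", "Staff", "Tu 7:00 pm"])
```

Raising an error on an unknown width, rather than guessing, makes a new table format show up immediately instead of silently producing misaligned fields.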

Because the course list is built from embedded iframes, each containing the individual “course list” for one department in one semester, I wrote the program to take as its input a set of text files, each containing a list of individual course list URLs. This way, all the course list pages which used the same distribution naming system and scheduling format could be downloaded and parsed separately from those which did not.
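Reading those input files could look roughly like the following. This is a minimal sketch under assumptions of my own: one URL per line, blank lines and `#` comment lines ignored, and each file’s name used as a label for its format group. The function names are hypothetical and not taken from the project.

```python
from pathlib import Path


def read_url_list(path):
    """Return the URLs in a text file, one per line, skipping blanks and # comments."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def load_batches(list_files):
    """Map each list file's stem (used here as a format label) to its URLs."""
    return {Path(f).stem: read_url_list(f) for f in list_files}
```

Each batch can then be downloaded and handed to whichever parsing rules match its format.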

Soon after starting my senior year, I chose to put this on hold in favor of another topic for my senior project. I came back to it in August 2018, following my graduation from Bard College, because in just that single year further inconsistencies had broken the parser. I made some fixes, but one problem remains, and solving it would require a complete rewrite of the way the parser accesses individual course descriptions from course list pages.

The parser was designed to take entire HTML pages as its input and serialize each course within each page into a JSON object. Unfortunately, because some course list pages no longer utilize the same table format across all courses they contain, the parser fails on pages with those inconsistencies. Thankfully, that only accounts for around 30 courses out of over 3900 in total from Fall 2014 to Fall 2018.
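One possible direction for such a rewrite, sketched here under assumptions, is to parse each course row on its own so that a single unexpected row is recorded and skipped instead of failing the whole page. `parse_row` is a hypothetical stand-in that expects four cells; it is not the project’s real parsing logic, and the field names are invented.

```python
import json


def parse_row(cells):
    """Hypothetical row parser: expects exactly four cells."""
    if len(cells) != 4:
        raise ValueError(f"expected 4 cells, got {len(cells)}")
    code, title, professor, schedule = (c.strip() for c in cells)
    return {"code": code, "title": title, "professor": professor, "schedule": schedule}


def serialize_page(rows):
    """Parse rows independently; return (JSON text, list of skipped rows)."""
    courses, skipped = [], []
    for index, cells in enumerate(rows):
        try:
            courses.append(parse_row(cells))
        except ValueError as err:
            # Keep going: note which row failed and why.
            skipped.append({"row": index, "reason": str(err)})
    return json.dumps(courses, indent=2), skipped
```

The list of skipped rows doubles as a report of which courses still need attention.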

Feel free to read more about this project’s code and documentation on GitHub.
