Last spring I read a book called Daemon by Daniel Suarez. At one level, it's just a throw-away thriller, which was exactly what I was looking for in a spring break diversion. But its core premise is pretty interesting. The novel starts just after the death of a prominent billionaire computer game developer who turns out to have left a lot of malicious software running in the background on computers around the world. In particular, he has a lot of software set up to read on-line news feeds, triggering various nasty actions when the software detects certain news events. The action begins (with a couple of murders, of course) as soon as the daemon detects the death of its author.
Some of the technology imagined in this book is pretty far-fetched. But the idea that you could write a program that would detect the occurrence of events in the news (such as your own death) is both intriguing and plausible. For this assignment, you're going to write a filtering tool that could be used to look for specific events in on-line news feeds. I recommend that you stop your development short of murder and the destruction of the global economy, however. Other people are handling both of those just fine without your assistance.
I suspect that most of you use or have used some sort of RSS feed reader to organize your on-line reading. If not, you might want to take a look at Google Reader as an example tool. The Wikipedia article on RSS has some helpful information on the history and structure of RSS.
The Universal Feed Parser is an open source Python library that parses RSS feeds. Written by Mark Pilgrim, the same guy who wrote Dive Into Python, the feedparser.py module gives you an easy way to get information from an RSS server by specifying the URL of the server.
To get started, go to http://feedparser.org/, read the sample code on the front page, and download and unzip the package. If you feel like installing feedparser on your computer's Python installation, you can follow the installation instructions. Alternatively, you can just put the feedparser.py file in your working directory.
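If you'd like a feel for what working with feedparser looks like before reading its documentation, here is a minimal sketch. The helper names entry_fields and fetch_entries are my own inventions for illustration, not part of the library, and the fallback from "published" to "updated" is an assumption about which date field a given feed and feedparser version will supply. The only library call used is feedparser.parse(url), whose result has an "entries" list of dictionary-like objects.

```python
def entry_fields(entry):
    """Return (title, date, link) from a feedparser-style entry.

    Works on any dict-like object. Missing fields become empty strings,
    since feeds vary in which fields they supply.
    """
    title = entry.get("title", "")
    # Assumption: try "published" first, then fall back to "updated".
    date = entry.get("published") or entry.get("updated") or ""
    link = entry.get("link", "")
    return title, date, link


def fetch_entries(url):
    """Download and parse one feed; return a list of (title, date, link)."""
    # Third-party module, imported here so entry_fields works without it.
    import feedparser
    parsed = feedparser.parse(url)
    return [entry_fields(e) for e in parsed.entries]
```

Calling fetch_entries with one of the feed URLs should give you a list of tuples you can print or filter. Treat this as a starting point, not the required design.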
We're going to need a collection of RSS news feeds to help us test our filters. Please find a couple of news feeds that you find interesting, and send their URLs to me via e-mail before class on Friday, January 15. I'll pass the collection of feeds along to the whole class.
Your filter will need to do regular expression matching with the contents, titles, and summaries of the news feeds, so you will need to use Python's re module. You don't have to download or install it, since it's a standard part of the Python installation, but you should check out the re module documentation.
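As a rough illustration of the kind of matching involved, the sketch below checks whether a regular expression matches any of several fields of an entry. The function name entry_matches, the default field list, and the choice to ignore case are all assumptions made for the example; your program may reasonably do it differently.

```python
import re


def entry_matches(entry, pattern, fields=("title", "summary")):
    """Return True if pattern matches anywhere in any of the given fields.

    entry is any dict-like object; missing or empty fields never match.
    """
    # Assumption: matching is case-insensitive.
    regex = re.compile(pattern, re.IGNORECASE)
    return any(regex.search(entry.get(f) or "") for f in fields)
```

For example, entry_matches(entry, r"\bdie[sd]\b") would be true for an entry titled "Senator dies at 80". Note the use of search rather than match, so the pattern can hit anywhere in the text instead of only at the start.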
For this assignment, you will write a program to filter RSS feeds based on a variety of criteria.
Your program should be invokable from a Unix command line, like so:
Here, the mandatory "feedfile" argument specifies a file consisting of one RSS URL per line.
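Reading the feedfile might look something like the sketch below. The function names are hypothetical, and skipping blank lines and stripping surrounding whitespace are assumptions on my part, since a feedfile edited by hand can easily pick up a trailing empty line.

```python
def urls_from_lines(lines):
    """Return the non-blank lines, stripped of surrounding whitespace."""
    stripped = (line.strip() for line in lines)
    return [url for url in stripped if url]


def read_feed_urls(path):
    """Read a feedfile containing one RSS URL per line."""
    with open(path) as feedfile:
        return urls_from_lines(feedfile)
```

Splitting the work into two functions lets you test the line handling on a plain list of strings without creating a file.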
If you run rssfilter.py without any command-line options, the program should retrieve all the entries from each of the RSS feeds listed in the feedfile. Output should be printed to stdout, one entry per line. Each entry should be printed like so:
title [tab] publication date [tab] link
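One way to produce a line in that format is sketched here. The function name format_entry is made up for the example, and collapsing whitespace inside each field is an assumption: a title that contains a stray tab or newline would otherwise break the one-entry-per-line layout.

```python
def format_entry(title, date, link):
    """Return 'title<TAB>date<TAB>link' as a single line with no newline."""

    def clean(text):
        # Collapse runs of whitespace (tabs and newlines included)
        # into single spaces.
        return " ".join(str(text).split())

    return "\t".join(clean(field) for field in (title, date, link))
```

You would then print(format_entry(title, date, link)) once per entry.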
The output produced by rssfilter.py can be modified by any combination of the following options.
Test your program. Do you think you could figure out who died yesterday using appropriate regular expressions? Could you figure out where bombs exploded, who won last night's basketball games, etc.?
Write a brief readme.txt file describing your program's status (what works and what doesn't).
Put all your source files (maybe just one, maybe more) plus readme.txt into a folder called rssfilter, and submit it using the Collab/Courses system.
Questions? Let me know.
Have fun.