Perl is a powerful and unique interpreted language specifically suited for text processing. It is used
across the world for CGI applications and system administration tasks. It has incorporated many features
from various languages, including dynamic typing, objects, and first-class functions (closures). This
versatility extends to the syntax as well, allowing programmers to write code in the style that feels
most natural to them.
Students will learn the basics of programming in Perl through writing weekly programs and readings about
the style and structure of the language. Much of the focus will be on processing text using regular expressions.
The final project will involve writing a web crawler to traverse a local repository of web pages.
Week one will focus on introducing students to Perl and getting them ready for many of the
idiosyncrasies of Perl. This will include the values Perl considers false (there is no boolean
data type), the lack of an integer type, dynamic arrays, namespaces, scalar vs. list contexts, and default
variables.
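For example, a short sketch (illustrative, not part of the assigned coursework) of the week-one topics: falsy values, scalar vs. list context, and the default variable $_.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Perl has no boolean type: 0, "0", "", and undef are false;
# everything else (including "00" and "0.0") is true.
my @falsy  = (0, "0", "", undef);
my @truthy = (1, "00", "0.0", " ");

# Scalar vs. list context: an array in scalar context yields its length.
my @words = ("apple", "banana", "cherry");
my $count = @words;      # scalar context: 3
my ($first) = @words;    # list context: "apple"

# The default variable $_ is used implicitly by many built-ins.
for (@words) {
    print length, "\n";  # length() defaults to $_
}
```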
Week two will focus on subroutines (functions) and file handles. Students will also learn different ways
of reading input from a file and the advantages of each method.
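A sketch of the week-two material: a subroutine that reads a file line by line (constant memory) versus one that slurps the whole file at once (convenient for whole-file regexes). The file names here are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A subroutine receives its arguments in @_.
sub line_count {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my $count = 0;
    $count++ while <$fh>;    # read line by line: constant memory
    close $fh;
    return $count;
}

# Slurping reads the whole file at once: convenient, but memory-hungry.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;                # undefine the input record separator
    my $contents = <$fh>;
    close $fh;
    return $contents;
}
```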
Week three will cover hashes (associative arrays) and a quick introduction to regular expressions (more
on these next week). Students will learn about character classes and simple quantifiers before getting a
chance to write some simple regular expressions.
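A minimal sketch of the week-three topics: a hash used for word counting, and a first regular expression using a character class shorthand and a quantifier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hash (associative array) maps keys to values; order is not preserved.
my %count;
my $text = "the cat and the hat";
$count{$_}++ for split ' ', $text;    # word-frequency count

# Character classes and quantifiers:
#   \d  any digit        [aeiou]  any vowel
#   +   one or more      *        zero or more
my $matched = ("room 42" =~ /\d+/) ? $& : '';   # $& holds the matched text
```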
Week four goes into much more depth on regular expressions and the engine that supports them. Students
will learn about option modifiers, text anchors, match variables, more quantifiers, precedence within a
regular expression, and greediness.
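A sketch of several week-four ideas together: the /i option modifier, the ^ anchor, the numbered match variables, and greedy versus non-greedy quantifiers. The sample strings are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "From: alice\@example.com";

# /i makes the match case-insensitive; ^ anchors it to the start of the string.
my ($user, $host) = ('', '');
if ($line =~ /^from:\s*(\w+)\@([\w.]+)/i) {
    ($user, $host) = ($1, $2);    # match variables capture the groups
}

# Greediness: .* grabs as much as possible, .*? as little as possible.
my $html = "<b>bold</b> and <i>italic</i>";
my ($greedy)    = $html =~ /<(.*)>/;     # spans to the last '>'
my ($nongreedy) = $html =~ /<(.*?)>/;    # stops at the first '>'
```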
Week five will explain additional control structures, expression modifiers, loop controls, and advanced
sorting techniques that let students specify an ordering without writing their own sorting algorithm.
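A sketch of the week-five topics: an expression modifier, the next and last loop controls, and sort with a custom comparison block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expression modifiers put the condition after the statement.
my @nums = (3, 1, 4, 1, 5, 9, 2, 6);
print "found a nine\n" if grep { $_ == 9 } @nums;

# Loop controls: next skips an iteration, last exits the loop.
my @evens;
for my $n (@nums) {
    next if $n % 2;    # skip odd numbers
    last if $n > 5;    # stop at the first even number above 5
    push @evens, $n;
}

# sort takes a comparison block; $a and $b are the two elements compared.
my @ascending  = sort { $a <=> $b } @nums;    # numeric ascending
my @descending = sort { $b <=> $a } @nums;    # numeric descending
```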
Week six will give students a cursory overview of modules, file tests, and directory operations.
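A brief sketch of the week-six topics: loading a core module, file-test operators, and listing a directory's entries.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;    # a core module: no installation needed

# File tests check properties without opening the file:
#   -e exists   -f plain file   -d directory   -s size in bytes
print "current directory exists\n" if -e '.';
print "and it is a directory\n"    if -d '.';

# Directory operations: opendir/readdir list a directory's entries.
opendir my $dh, '.' or die "Can't open current directory: $!";
my @entries = grep { !/^\./ } readdir $dh;    # skip dotfiles
closedir $dh;

# basename() comes from File::Basename.
print basename('/tmp/report.txt'), "\n";
```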
Week seven will cover some advanced topics including references. Students will learn how to create complex
data structures as well as how memory is managed internally using reference counting.
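A sketch of the week-seven material: taking and dereferencing a reference, building a nested structure (a hash of arrays), and the scoping behavior that drives reference counting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A reference is a scalar that points at another value.
my @scores = (90, 85, 77);
my $aref   = \@scores;        # reference to the array
my $first  = $aref->[0];      # dereference one element: 90
my $len    = scalar @$aref;   # dereference the whole array: 3

# References make nested structures possible: here, a hash of arrays.
my %grades = (
    alice => [90, 85],
    bob   => [77],
);
push @{ $grades{bob} }, 82;   # add a grade to bob's list

# Perl frees a value once its reference count drops to zero.
{
    my $temp = [1, 2, 3];     # anonymous array: refcount 1
}                             # $temp leaves scope: refcount 0, memory freed
```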
Week eight will involve reading several papers about strategies for crawling the web, as well as the importance
of downloading high-quality pages early in the process (and what constitutes a 'high-quality' page). Students
will also learn about ethical crawling practices.
The final project will involve writing a web crawler to crawl a local repository of web pages. This
avoids potential network issues while the code is under development and allows for a more controlled
test environment. Students will not need to know any network programming, but much of the practice of
writing a real-world web crawler will be the same.
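The shape of such a crawler can be sketched as a breadth-first traversal over local HTML files. This is only an illustration of the approach, not the assignment's required design: the starting file name and the href pattern are assumptions, and a production crawler would use a real HTML parser rather than a regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Breadth-first crawl of a local repository of HTML files.
sub crawl {
    my ($start) = @_;
    my @queue   = ($start);
    my %seen    = ($start => 1);   # avoid visiting a page twice
    my @visited;

    while (my $page = shift @queue) {
        push @visited, $page;
        next unless -f $page;      # file test: skip missing pages
        open my $fh, '<', $page or next;
        my $html = do { local $/; <$fh> };   # slurp the page
        close $fh;

        # Extract local links; a real crawler would use an HTML parser.
        while ($html =~ /href="([^"]+\.html)"/g) {
            my $link = $1;
            next if $seen{$link}++;
            push @queue, $link;
        }
    }
    return @visited;
}
```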