Assignment 8 - File Analysis

Due: Thursday, May 16, 2024, at 10pm

You may work alone or with a partner, but you must type up the code yourself. You may also discuss the assignment at a high level with other students. You should list any student with whom you discussed each part, and the manner of discussion (high-level, partner, etc.) in a comment at the top of each file. You should only have one partner for an entire assignment.

You should submit your assignment as an a8.zip file on Moodle.

Parts of this assignment:

There is skeleton code available to you; you should extract the .zip file in a new folder and open that folder in VS Code.

As with some of our other assignments, the starter code has a lot of parts. Ask if there’s something you’re curious about; otherwise, I’ll try to make clear what to focus on for each part of this assignment. You should find these files:

  • fileAnalysis.py - you’ll put your code here
  • hello.py - a small example Python file
  • senseAndSensibility.txt - the entire text of the novel Sense and Sensibility
  • senseAndSensibility_citation.txt - the URL citation for Sense and Sensibility, from Project Gutenberg
  • test.txt - a small example text file
  • textlib.py - a library of helper code, adapted from Jessen Havill’s Discovering Computer Science textbook

Note on style:

The following style guidelines are expected moving forward, and will typically constitute 5-10 points of each assignment (out of 100 points).

  • Variable names should be clear and easy to understand, should not start with a capital letter, and should be a single letter only when appropriate (usually i, j, and k as indices, potentially x and y as coordinates, and perhaps p for a point, c for a circle, r for a rectangle, etc.).
  • It’s good to use empty lines to break code into logical chunks.
  • Comments should be used for anything complex, and typically for chunks of 3-5 lines of code, but not every line.
  • Don’t leave extra print statements in the code, even if you left them commented out.
  • Make sure your code doesn’t compute the right answer by doing extra work (e.g., don’t leave a computation inside a for loop when it could have occurred after the for loop, only once).
  • Use lowercase first letters for variables and methods, and uppercase first letters for classes.

Note: The example triangle-drawing program on page 108 of the textbook demonstrates a great use of empty lines and comments, and has very clear variable names. It is a good model to follow for style.

Getting started

This assignment is all about analyzing text files and getting some practice with dictionaries. To get started, take a look at the constructor (__init__) and setup methods of the FileAnalyzer class.

The constructor takes in a path to a file, and then calls self.setup(), which calls several methods (some of which you need to fill in) to create a number of instance variables:

    def __init__(self, filepath):
        """
        Creates a FileAnalyzer object for the provided file.

        filepath: path to the file (a string)
        """
        # Store the filepath
        self.filepath = filepath

        # Parse the file into the text, list of lines,
        # list of words, etc.
        self.setup()

    def setup(self):
        """
        Reads in the file, storing its text, lines, and words.
        """
        # Create empty instance variables (they'll exist everywhere
        # once created, even if they're not created in __init__)
        self.text = ""
        self.lines = []
        self.wordList = []
        self.wordCountDict = {}
        self.bigramCountDict = {}
        self.wordBigramMap = {}
        self.bigramProbDict = {}
    ...

For now, the most important thing is that you know that self.text contains one big string representing the contents of a file, and that self.lines contains a list of the lines in the file (each one a string).

To get started, let’s make an instance of FileAnalyzer and use it to display the lines of hello.py. Open the fileAnalysis.py starter code in VS Code, and in the Terminal (the window that opens when you hit the Play button), type python3 and hit “Return”. (Note that if you’re on Windows, you should instead type python.exe and hit “Enter”. Also, if python3 doesn’t work on your Mac, try python instead.)

python3

This will start running Python interactively. For example, it may look like this for a slightly older version of Python:

Python 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

If you’re in the same folder as the code, you should be able to type the following (note that >>> is the prompt; don’t type that):

>>> from fileAnalysis import *

Now you can make an instance of our class and print out its text and lines:

>>> analyzer = FileAnalyzer("hello.py")
>>> print(analyzer.text)
# File: hello.py
# Purpose: show off our Python skillz
# Author: Cheddar
# Collaborator: Lulu

def main():
    print("Hello, world!")

if __name__ == "__main__":
    main()
>>> print(analyzer.lines)
['# File: hello.py', '# Purpose: show off our Python skillz', '# Author: Cheddar', '# Collaborator: Lulu', '', 'def main():', '    print("Hello, world!")', '', 'if __name__ == "__main__":', '    main()']

You can type exit() to exit this interactive Python mode.

If you run the actual starter code in fileAnalysis.py, you should get the following output:


### 4 most popular words

### 10 most popular words

### 4 most popular bigrams

### Starting with 'she':
she

Let’s fill in some of this output!

Part 1: Analyzing line types

# You should be fully equipped to complete this part after Lesson 20 (Friday May 10).

For this first part, you’ll analyze the types of lines in a text file.

Part 1a: Counting line types

First, you’ll implement the getLineTypeDict method of the FileAnalyzer class. This method should look through the lines of the file (remember, you have self.lines as an instance variable), and classify each line as:

  1. "comment" if its first non-whitespace character is #, or
  2. "empty" if it is empty except for whitespace, or
  3. "regular" otherwise.

For example, hello.py (shown above) contains four lines that start with #, two that are empty, and four other lines.
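
If you haven’t used them before, the string methods strip (which removes leading and trailing whitespace) and startswith may be handy here. For example:

>>> line = "    # set up the turtle"
>>> stripped = line.strip()
>>> stripped
'# set up the turtle'
>>> stripped.startswith("#")
True
>>> "   ".strip() == ""
True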

Once you’ve implemented this method, you can test it by itself by running the following in Python interactively:

>>> from fileAnalysis import *
>>> analyzer = FileAnalyzer("hello.py")
>>> d = analyzer.getLineTypeDict()
>>> print(d)
{'comment': 4, 'empty': 2, 'regular': 4}

Part 1b: Plotting line type distribution

Next, you should implement analyzeLineTypes. This method should:

  1. build a dictionary mapping the line types to their counts (hint: use your method from the previous step!),
  2. print out the counts in a table format, and
  3. make a pie chart.

To get you started, remember that we’ve seen pie chart code before, in an Exercise from Lesson 16.
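
For reference, here is a minimal sketch of both the table printout and the pie chart on example data, assuming the Lesson 16 exercise used matplotlib (adapt it to whatever that exercise actually used):

import matplotlib.pyplot as plt

# Example counts for hello.py; in your method, use getLineTypeDict()
counts = {"comment": 4, "empty": 2, "regular": 4}
total = sum(counts.values())

# Print one "type: count/total" row per line type
print()
print("### Distribution of line types")
for lineType in counts:
    print(lineType + ": " + str(counts[lineType]) + "/" + str(total))

# Draw the pie chart, one wedge per line type
plt.pie(list(counts.values()), labels=list(counts.keys()))
plt.title("Distribution of line types")
plt.show()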

Once you have implemented this method, you should be able to test it as follows:

>>> from fileAnalysis import *
>>> analyzer = FileAnalyzer("hello.py")
>>> analyzer.analyzeLineTypes()

### Distribution of line types
comment: 4/10
empty: 2/10
regular: 4/10

When you run this for hello.py, it should also display the following graph: <image: line type pie chart>

If you run the fileAnalysis.py file itself, you should now see a pie chart and the following text output:

### Distribution of line types
comment: 0/12624
empty: 2028/12624
regular: 10596/12624

<image: line type pie chart for Sense and Sensibility>

Part 2: Most popular keys

# You should be fully equipped to complete this part after Lesson 20 (Friday May 10).

Next, you’ll get some practice with dictionaries that map their keys to numerical values. For example, a dictionary may map words in a file to the number of times each word appears:

>>> from fileAnalysis import *
>>> analyzer = FileAnalyzer("test.txt")
>>> print(analyzer.wordCountDict)
{'to': 2, 'be': 3, 'or': 1, 'not': 1, 'this': 1, 'should': 1, 'a': 1, 'simple': 1, 'file': 1}

For this part, you’ll implement the function getMostPopularKeys:

def getMostPopularKeys(d, n):
    """
    Finds the n most popular keys in the dictionary.
    Assumes there are at least n unique keys in the dictionary.

    Note: The returned list may contain more than n keys if there are ties.

    d: a dictionary with values that are numbers
    n: the number of keys to find (an int)
    returns: a list of the n most popular keys
    """
    # TODO: Part 2
    return [] # replace with your code

You should not make any assumptions about the type of the keys, but you can assume that all dictionary values are integers. Note that you may find the .values() method of dicts handy:

>>> d = {'a': 1, 'b': 3, 'c': 14}
>>> vals = list(d.values())
>>> print(vals)
[1, 3, 14]
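
As a building block (not the full solution!), here is one way to find the single most popular key with a loop:

>>> d = {'a': 1, 'b': 3, 'c': 14}
>>> biggest = None
>>> for key in d:
...     if biggest is None or d[key] > d[biggest]:
...         biggest = key
...
>>> biggest
'c'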

Here’s a simple test you can do for this function:

>>> d = {'a': 1, 'b': 3, 'c': 14}
>>> getMostPopularKeys(d, 1)
['c']
>>> getMostPopularKeys(d, 2)
['b', 'c']

Once you’ve got it working, the printMostPopularKeys function should also be helpful (note that it takes in a string to use in its printed output):

>>> d = {'a': 1, 'b': 3, 'c': 14}
>>> printMostPopularKeys(d, "chars", 1)

### 1 most popular chars
c: 14
>>> printMostPopularKeys(d, "chars", 2)

### 2 most popular chars
c: 14
b: 3

When you run fileAnalysis.py, it should now include the following output:

### 4 most popular words
to: 4085
the: 4085
of: 3566
and: 3371

### 10 most popular words
to: 4085
the: 4085
of: 3566
and: 3371
her: 2510
a: 2042
in: 1927
i: 1921
was: 1843
it: 1697

Part 3: Word bigrams

# You should be fully equipped to complete this part after Lesson 20 (Friday May 10).

Bigrams are consecutive pairs of words. If we look at the frequency of bigrams instead of the frequency of individual words, we can begin to model the structure of a sentence, paragraph, or entire file. The bigrams are already parsed for you into a dictionary mapping each bigram to the number of times it appears in the file.

For example, we can inspect the parsed bigrams for the file test.txt:

>>> from fileAnalysis import *
>>> analyzer = FileAnalyzer("test.txt")
>>> for bigram in analyzer.bigramCountDict:
...     print(bigram)
...
('to', 'be')
('be', 'or')
('or', 'not')
('not', 'to')
('be', 'this')
('this', 'should')
('should', 'be')
('be', 'a')
('a', 'simple')
('simple', 'file')
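
In case you’re curious where a dictionary like this comes from, here is one way such bigram counts could be built from a list of words (the starter code’s parsing may differ; this is just for illustration):

>>> words = ['to', 'be', 'or', 'not', 'to', 'be']
>>> counts = {}
>>> for i in range(len(words) - 1):
...     bigram = (words[i], words[i + 1])
...     if bigram not in counts:
...         counts[bigram] = 0
...     counts[bigram] = counts[bigram] + 1
...
>>> counts
{('to', 'be'): 2, ('be', 'or'): 1, ('or', 'not'): 1, ('not', 'to'): 1}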

For this part, you’ll implement the buildWordBigramMap method. For each word in the file, you should store a key-value pair whose value is a list of the bigrams that have that word as their first word.

Hint: The words are already available to you as elements of the list self.wordList or keys of the dictionary self.wordCountDict.
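
A general dictionary pattern that may help here, shown on toy data rather than the assignment’s dictionaries: to group items into lists, create an empty list the first time you see a key, then append:

>>> pairs = [('to', 'be'), ('be', 'or'), ('be', 'a')]
>>> groups = {}
>>> for pair in pairs:
...     first = pair[0]
...     if first not in groups:
...         groups[first] = []
...     groups[first].append(pair)
...
>>> groups
{'to': [('to', 'be')], 'be': [('be', 'or'), ('be', 'a')]}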

Here is an example of the dictionary the buildWordBigramMap method should return:

>>> from fileAnalysis import *
>>> analyzer = FileAnalyzer("test.txt")
>>> d = analyzer.buildWordBigramMap()
>>> print(d)
{'to': [('to', 'be')], 'be': [('be', 'or'), ('be', 'this'), ('be', 'a')], 'or': [('or', 'not')], 'not': [('not', 'to')], 'this': [('this', 'should')], 'should': [('should', 'be')], 'a': [('a', 'simple')], 'simple': [('simple', 'file')]}
>>> for key in d:
...     print(key, d[key])
...
to [('to', 'be')]
be [('be', 'or'), ('be', 'this'), ('be', 'a')]
or [('or', 'not')]
not [('not', 'to')]
this [('this', 'should')]
should [('should', 'be')]
a [('a', 'simple')]
simple [('simple', 'file')]

When you run fileAnalysis.py, it should now include the following output:

### 4 most popular bigrams
('to', 'be'): 431
('of', 'the'): 430
('in', 'the'): 356
('it', 'was'): 273

Part 4: Word prediction

# You should be fully equipped to complete this part after Lesson 20 (Friday May 10).

Have you used a Large Language Model (LLM) like ChatGPT? If not, have you ever tapped a suggested next word when texting a friend or when typing your code in VS Code?

These predictions rely on the assumption that certain words typically follow others. We’ll now do the same thing at a much simpler scale.

Start by looking at the method buildBigramProbabilityDict, which has been implemented for you. This method turns the count of each bigram into a probability: each bigram is mapped to its count divided by the total count of all bigrams that start with the same first word.

For example, for test.txt we have self.bigramCountDict containing (I’ve truncated the output):

>>> from fileAnalysis import *
>>> analyzer = FileAnalyzer("test.txt")
>>> print(analyzer.bigramCountDict)
{..., ('be', 'or'): 1, ...,  ('be', 'this'): 1, ..., ('be', 'a'): 1, ...}

Since three bigrams start with 'be' and each appears once, each is assigned probability 1/3, so the result of buildBigramProbabilityDict contains:

{..., ('be', 'or'): 0.3333333333333333, ..., ('be', 'this'): 0.3333333333333333, ..., ('be', 'a'): 0.3333333333333333, ...}

Your job is to implement getNextWords. You will find the bigram-probability map described above in self.bigramProbDict. For a given word, you should look through all bigrams starting with that word, and build and return a new dictionary that maps each possible second word to its corresponding bigram’s probability.
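
As a toy illustration of filtering a dictionary with tuple keys (on made-up data, not the real bigramProbDict):

>>> probs = {('to', 'be'): 1.0, ('be', 'or'): 0.5, ('be', 'a'): 0.5}
>>> word = 'be'
>>> result = {}
>>> for bigram in probs:
...     if bigram[0] == word:
...         result[bigram[1]] = probs[bigram]
...
>>> result
{'or': 0.5, 'a': 0.5}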

Here is what test.txt should give you:

>>> from fileAnalysis import *
>>> analyzer = FileAnalyzer("test.txt")
>>> print(analyzer.getNextWords("to"))
{'be': 1.0}
>>> print(analyzer.getNextWords("be"))
{'or': 0.3333333333333333, 'this': 0.3333333333333333, 'a': 0.3333333333333333}

Once this is implemented, you should be able to run fileAnalysis.py and have it predict a bunch of words for any text file. Here are the 12 words predicted starting with "she" for the text of Sense and Sensibility:

### Starting with 'she':
she was not be a very well as she was not be
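
In case you’re wondering how a sentence like that arises: one simple prediction loop greedily picks the most probable next word over and over. Here is a sketch of that idea (the starter code’s actual loop may differ; this assumes Parts 3 and 4 are finished):

from fileAnalysis import *

analyzer = FileAnalyzer("senseAndSensibility.txt")
word = "she"
prediction = word
for i in range(11):   # predict 11 more words, for 12 total
    # getNextWords maps each candidate next word to its probability;
    # greedily pick the most probable one
    nextWords = analyzer.getNextWords(word)
    best = None
    for candidate in nextWords:
        if best is None or nextWords[candidate] > nextWords[best]:
            best = candidate
    word = best
    prediction = prediction + " " + word
print(prediction)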

Note: To do this well, we’d use n-grams instead of bigrams, meaning our tuples would contain n words rather than just pairs.

Reflection

# You should be equipped to complete this part after finishing your assignment.

Were there any particular issues or challenges you dealt with in completing this assignment? How long did you spend on this assignment? Write a brief discussion (a sentence or two is fine) in your readme.txt file.

Grading

This assignment will be graded out of 100 points, as follows:

  • 5 points - submit a valid a8.zip file with all files correctly named

  • 5 points - your fileAnalysis.py code file contains top-level comments with file name, purpose, and author names

  • 5 points - your fileAnalysis.py code file’s top-level comments contain collaboration statement

  • 10 points - code style enables readable programs

  • 20 points - getLineTypeDict and analyzeLineTypes (Part 1)

  • 20 points - getMostPopularKeys (Part 2)

  • 15 points - buildWordBigramMap (Part 3)

  • 15 points - getNextWords (Part 4)

  • 5 points - readme.txt file contains reflection

What you should submit

You should submit a single .zip file on Moodle. It should contain the following files:

  • readme.txt (reflection)
  • fileAnalysis.py (Parts 1-4)
  • any cool text files you want to show off testing with (check out the Project Gutenberg website)