That’s still a stop sign!: Adversarial examples and machine learning

Final Results


Advisor: Anna Rafferty

Times: Winter 4a

Background

Machine learning classifiers are increasingly common and effective: from apps that identify pictures of plants to autonomous vehicle systems that interpret road signs, these classifiers often identify objects in images with high accuracy. However, they can also be “tricked” by images that, to a human, look nearly identical to images they classify correctly. For example, the picture on the left is correctly identified by a machine learning classifier as a panda with 58% confidence, but the almost-identical picture on the right is incorrectly identified by that same classifier as a gibbon with 99% confidence (images from Goodfellow, Shlens, & Szegedy 2015):


Panda (58% confidence)	Gibbon (99% confidence)

This behavior is even more problematic in safety-critical systems such as road sign recognition. For instance, the stop sign below was identified as a 45 mph speed limit sign (Eykholt et al. 2018):

These types of perturbed images are known as adversarial examples: they’re designed to fool the classifier while remaining intelligible to humans. Adversarial examples have been an area of significant study in the last ten years, with work both identifying how to create adversarial examples and exploring how to make classifiers more robust to them.
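As a rough illustration of how such a perturbation can be constructed, the sketch below applies the fast gradient sign method from Goodfellow, Shlens, & Szegedy (2015) to a toy logistic regression classifier. The weights, input, and step size `eps` are made-up values chosen for illustration, not taken from any real model; a project implementation would instead attack a pre-trained image or text classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Move each input feature by eps in the direction that increases the loss.

    For cross-entropy loss on a logistic model, the gradient of the loss
    with respect to the input is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (illustrative values only).
w = [2.0, -3.0, 1.0, 0.5]
b = 0.0
x = [0.5, 0.2, 0.3, 0.4]
y = 1          # true label
eps = 0.2      # maximum change allowed per feature

x_adv = fgsm(w, b, x, y, eps)
p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)

print(f"clean input:     P(class 1) = {p_clean:.3f}")
print(f"perturbed input: P(class 1) = {p_adv:.3f}")
```

With these values, no feature changes by more than `eps`, yet the predicted class flips from 1 to 0. On real images the same idea is applied per pixel, usually with a much smaller `eps` so that the change is hard for a human to see.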

The project

In this project, you’ll dive into the rich literature on adversarial examples, focusing on two primary areas:

Types of attacks: Your group will investigate different types of adversarial attacks, implement several of these attacks, and examine the attacks’ effectiveness on several pre-trained machine learning models.
Defense strategies: Your group will search the literature for possible defenses against these attacks and, time permitting, empirically evaluate how well at least one of these defense strategies works.

Your goal will be to understand both the theory behind these attacks and defenses and their practical implementation.

The progression of the project will look something like the following:

You’ll begin by reading several papers about adversarial examples and deciding what type of classifier to focus on (image or text-based). You’ll identify existing trained models of that type that are available for you to use for your experiments.
Your group will decide on 2-3 types of adversarial attacks to focus on, and implement these attacks against the existing models and evaluate their success. You’ll document your work so that it’s replicable and understandable by others.
Your group will investigate 1-2 defense strategies, and decide on one of them to evaluate empirically. (This part of the project is likely to be less in depth than the investigation of the attacks.)
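When evaluating the success of an attack in the steps above, one possible metric is the attack success rate: among examples the model originally classifies correctly, the fraction that the attack causes it to misclassify. The sketch below is one way to compute it, assuming hypothetical `predict` and `attack` callables; the threshold classifier and shift-based attack are toy stand-ins, not real models or attacks from the literature.

```python
def attack_success_rate(predict, attack, examples):
    """Fraction of correctly classified examples that the attack flips."""
    attempted = 0
    fooled = 0
    for x, y in examples:
        if predict(x) != y:
            continue  # already misclassified, so not counted as an attack target
        attempted += 1
        if predict(attack(x, y)) != y:
            fooled += 1
    return fooled / attempted if attempted else 0.0

# Toy stand-ins: a threshold classifier on a single number, and an
# "attack" that shifts the input by 0.2 toward the decision boundary.
def toy_predict(x):
    return int(x > 0.5)

def toy_attack(x, y):
    return x - 0.2 if y == 1 else x + 0.2

examples = [(0.6, 1), (0.9, 1), (0.4, 0), (0.1, 0), (0.3, 1)]
rate = attack_success_rate(toy_predict, toy_attack, examples)
print(f"attack success rate: {rate:.2f}")
```

Excluding examples that were already misclassified keeps the metric from crediting the attack for errors the model makes on its own. The same function could be reused to compare an undefended model against one with a defense applied.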

Deliverables

Your well-documented source code for implementing the attacks and conducting your analyses.
A paper or webpage that describes the attacks and defenses you explored, your process for investigating their effectiveness, your results, and a discussion of the implications of this work for the use of machine learning classifiers in different situations.

Recommended experience

It would be helpful, but is not required, for at least one member of the group to have taken a machine learning or artificial intelligence course. Linear algebra is also likely to be helpful, and willingness to engage with mathematical content is necessary.

References/inspiration

I’ll provide additional references when we start the project, but here are some relevant papers: