That’s still a stop sign!: Adversarial examples and machine learning
Advisor: Anna Rafferty
Times: Winter 4a
Machine learning classifiers are increasingly common and effective: from apps that identify pictures of plants to autonomous vehicle systems that interpret road signs, these classifiers often identify objects in images with high accuracy. However, they can also be “tricked” by images that, to a human, look almost identical to ones they classify correctly. For example, the picture on the left is correctly identified by a machine learning classifier as a panda with 58% confidence, but the almost-identical picture on the right is incorrectly identified by that same classifier as a gibbon with 99% confidence (images from Goodfellow, Shlens, & Szegedy 2015):
Panda (58% confidence)
Gibbon (99% confidence)
This behavior is even more problematic in critical systems such as
road sign recognition, where, for instance, the stop sign below was
identified as a 45 mph speed limit sign (Eykholt et al. 2018):
These types of perturbed images are known as adversarial examples: they’re designed to fool the classifier while still being intelligible to humans. Adversarial examples have been an area of significant study in the last ten years, with work exploring both how to create adversarial examples and how to make classifiers more robust to them.
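To make the idea concrete, here is a minimal sketch of one well-known attack, the fast gradient sign method (FGSM) from Goodfellow, Shlens, & Szegedy (2015), which shifts every input feature by a small amount `eps` in the direction that increases the classifier’s loss. The toy logistic-regression model, its weights, and the helper names (`predict_proba`, `sign`, `fgsm`) are assumptions made for illustration only; the project itself will work with pre-trained models.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression model (assumed for illustration)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Perturb x by eps per feature in the direction that increases the loss.

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical weights and input, chosen so the clean input is classified as 1.
w = [1.0, -2.0, 0.5]
b = 0.0
x = [0.2, -0.1, 0.4]
y = 1  # true label

x_adv = fgsm(w, b, x, y, eps=0.25)

print("clean probability of class 1:      ", predict_proba(w, b, x))
print("adversarial probability of class 1:", predict_proba(w, b, x_adv))
```

With these made-up numbers, no feature moves by more than 0.25, yet the predicted class changes from 1 to 0. Real image attacks apply the same step to pixel values, using gradients computed by the deep learning framework.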
In this project, you’ll be diving into this rich literature about
adversarial examples focusing on two primary areas:
- Types of attacks: Your group will investigate different types of adversarial attacks, implement several of these attacks, and examine the attacks’ effectiveness on several pre-trained machine learning models.
- Defense strategies: Your group will search the literature for
possible defenses against these attacks and, time permitting,
empirically evaluate how well at least one of these defense strategies works.
Your goal will be to understand both the theory behind these attacks
and defenses and their practical implementation.
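As a rough sketch of what examining an attack’s effectiveness might involve, the snippet below computes one common measure: among the inputs a model classifies correctly, the fraction whose prediction becomes wrong after the attack. The function name `attack_success_rate`, the toy classifier, and the toy attack are all hypothetical stand-ins; your group may settle on a different definition or metric after reading the literature.

```python
def attack_success_rate(predict, attack, inputs, labels):
    """Fraction of correctly classified inputs that the attack causes to be misclassified."""
    attempted = 0
    fooled = 0
    for x, y in zip(inputs, labels):
        if predict(x) != y:
            continue  # skip inputs the model already gets wrong
        attempted += 1
        if predict(attack(x, y)) != y:
            fooled += 1
    return fooled / attempted if attempted else 0.0

# Toy stand-ins (assumptions for illustration, not real models or attacks).
def toy_predict(x):
    """Classify as 1 when the features sum to a positive number, else 0."""
    return 1 if sum(x) > 0 else 0

def toy_attack(x, y, eps=0.3):
    """Shift every feature by eps away from the region of the true label."""
    step = -eps if y == 1 else eps
    return [xi + step for xi in x]

inputs = [[0.1, 0.1], [1.0, 1.0], [-0.2, -0.1], [-1.0, -1.0], [0.5, -0.6]]
labels = [1, 1, 0, 0, 1]

rate = attack_success_rate(toy_predict, toy_attack, inputs, labels)
print("attack success rate:", rate)
```

The same function can be reused to compare an undefended model with a defended one by passing in a different `predict`, which gives a simple way to check whether a defense reduces the success rate.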
The progression of the project will look something like the following:
- You’ll begin by reading several papers about adversarial examples and deciding what type of classifier to focus on (image or text-based). You’ll identify existing trained models of that type that are available for you to use for your experiments.
- Your group will decide on 2-3 types of adversarial attacks to focus on, and implement these attacks against the existing models and evaluate their success. You’ll document your work so that it’s replicable and understandable by others.
- Your group will investigate 1-2 defense strategies, and decide on
one of them to evaluate empirically. (This part of the project is
likely to be less in depth than the investigation of the attacks.)
- Your well-documented source code for implementing the attacks and conducting your analyses.
- A paper or webpage that describes the attacks and defenses you
explored, your process for investigating their effectiveness, your
results, and a discussion of the implications of this work for the
use of machine learning classifiers in different situations.
It would be helpful but not required if at least one member of the group has taken a machine learning or artificial intelligence course. Linear algebra is also likely to be helpful, and willingness to engage with mathematical content is necessary.
I’ll provide additional references when we start the project, but
here are some relevant papers:
- Carlini, N., & Wagner, D. (2017, November). Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM workshop on artificial intelligence and security (pp. 3-14).
- Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., ... & Song, D. (2018). Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1625-1634).
- Goodfellow, I., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Proceedings of ICLR 2015.
- Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017, April). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security (pp. 506-519).
- Ren, S., Deng, Y., He, K., & Che, W. (2019, July). Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th annual meeting of the association for computational linguistics.
- Zhang, J., & Li, C. (2019). Adversarial examples: Opportunities and challenges. IEEE Transactions on neural networks and learning systems, 31(7), 2578-2593.