Resources
These resources are intended to introduce newcomers to ML safety and to help researchers stay up to date with the latest research.
Online Course
This course covers technical topics in machine learning safety, including Risk Management, Robustness, Monitoring, Alignment, and Systemic Safety.
Readings by topic
Here are some papers that we recommend to researchers and practitioners who want to learn more about ML safety.
Robustness
Adversarial Robustness
- Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
- Towards Deep Learning Models Resistant to Adversarial Attacks
- Universal Adversarial Triggers for Attacking and Analyzing NLP
- Data Augmentation Can Improve Robustness
- Adversarial Examples for Evaluating Reading Comprehension Systems
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT (GitHub)
- Gradient-based Adversarial Attacks against Text Transformers
- Smooth Adversarial Training
- Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks (website)
- Certified Adversarial Robustness via Randomized Smoothing
- Adversarial Examples Are a Natural Consequence of Test Error in Noise
- Using Pre-Training Can Improve Model Robustness and Uncertainty
- Motivating the Rules of the Game for Adversarial Example Research
- Certified Defenses against Adversarial Examples
- Towards Evaluating the Robustness of Neural Networks
Long Tails and Distribution Shift
- The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
- Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
- PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
- WILDS: A Benchmark of in-the-Wild Distribution Shifts
- ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
- Adversarial NLI: A New Benchmark for Natural Language Understanding
- Natural Adversarial Examples
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
Monitoring
OOD and Malicious Behavior Detection
- Deep Anomaly Detection with Outlier Exposure
- A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
- ViM: Out-Of-Distribution with Virtual-logit Matching
- VOS: Learning What You Don’t Know by Virtual Outlier Synthesis
- Scaling Out-of-Distribution Detection for Real-World Settings
- A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks
Interpretable Uncertainty
- On Calibration of Modern Neural Networks
- Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
- PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
- Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
- Posterior calibration and exploratory analysis for natural language processing models
- Accurate Uncertainties for Deep Learning Using Calibrated Regression
Transparency
- The Mythos of Model Interpretability
- Sanity Checks for Saliency Maps
- Interpretable Explanations of Black Boxes by Meaningful Perturbation
- Locating and Editing Factual Knowledge in GPT
- Acquisition of Chess Knowledge in AlphaZero
- Feature Visualizations and OpenAI Microscope
- Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization
- Network Dissection: Quantifying Interpretability of Deep Visual Representations
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
- Convergent Learning: Do different neural networks learn the same representations?
Trojans
- Poisoning and Backdooring Contrastive Learning
- Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs
- Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
- TrojAI
- Detecting AI Trojans Using Meta Neural Analysis
- STRIP: A Defence Against Trojan Attacks on Deep Neural Networks
- Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
- BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Detecting and Forecasting Emergent Behavior
Alignment
Honest AI
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
- Truthful AI: Developing and governing AI that does not lie
Machine Ethics
- What Would Jiminy Cricket Do? Towards Agents That Behave Morally
- Ethics Background (Introduction through “Absolute Rights or Prima Facie Duties”)
- Aligning AI With Shared Human Values
- Avoiding Side Effects in Complex Environments
- Conservative Agency via Attainable Utility Preservation
- The Structure of Normative Ethics
Systemic Safety
Forecasting
- Forecasting Future World Events with Neural Networks
- On Single Point Forecasts for Fat-Tailed Variables
- On the Difference between Binary Prediction and True Exposure With Implications For Forecasting Tournaments and Decision Making Research
- Superforecasting – Philip Tetlock