Adversarial machine learning is the study of vulnerabilities in machine learning systems and of the attacks and defenses built around them. It examines how AI models can be fooled, manipulated, or exploited by adversaries and develops techniques to make AI systems more robust.
Major attack categories include: evasion attacks (crafting inputs at inference time that cause misclassification — e.g., adding imperceptible noise to images to fool classifiers), poisoning attacks (corrupting training data to compromise model behavior during learning), model extraction attacks (using API queries to steal a model's functionality by training a substitute model), model inversion attacks (reconstructing training data from model outputs), and membership inference attacks (determining whether specific data was included in the training set).
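The evasion category is often illustrated with the Fast Gradient Sign Method (FGSM), which perturbs an input in the direction that most increases the model's loss. The sketch below is a minimal, non-authoritative example written against PyTorch; `model`, `image`, and `label` are hypothetical placeholders for a trained classifier, an input tensor with pixel values in [0, 1], and its true class.

```python
# Minimal FGSM evasion-attack sketch (PyTorch assumed; model, image,
# and label are hypothetical placeholders, not names from this text).
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Craft an adversarial example with the Fast Gradient Sign Method."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss, bounded by epsilon,
    # then clip back to the valid pixel range.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```

The perturbation is bounded by `epsilon`, which is why such noise can remain imperceptible to humans while still flipping the classifier's prediction.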
Real-world implications are significant: autonomous vehicle vision systems can be fooled by adversarial patches, spam and malware detectors can be evaded with crafted inputs, facial recognition systems can be defeated or spoofed, AI content moderation can be bypassed, and financial AI models can be manipulated for fraudulent gain.
Defense strategies include: adversarial training (training models on adversarial examples to build robustness; see the sketch below), input preprocessing (sanitizing or filtering inputs, for example by denoising or re-encoding, before they reach the model), model ensembles (using multiple models to reduce vulnerability to any single attack), certified robustness (mathematical guarantees of model behavior within defined input ranges), and runtime monitoring (detecting anomalous inputs that may indicate adversarial activity).
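A minimal sketch of adversarial training, assuming the same PyTorch setting and the `fgsm_attack` helper above; `model`, `train_loader`, and `optimizer` are hypothetical placeholders for a classifier, a data loader of (images, labels) batches, and its optimizer.

```python
# Adversarial-training sketch: train on clean and FGSM-perturbed batches
# (assumes the fgsm_attack helper above; model, train_loader, and
# optimizer are hypothetical placeholders).
import torch.nn.functional as F

def adversarial_train_epoch(model, train_loader, optimizer, epsilon=0.03):
    model.train()
    for images, labels in train_loader:
        # Generate adversarial counterparts of the current batch.
        adv_images = fgsm_attack(model, images, labels, epsilon)
        optimizer.zero_grad()
        # Combine clean and adversarial losses so the model learns to
        # classify both correctly.
        loss = (F.cross_entropy(model(images), labels)
                + F.cross_entropy(model(adv_images), labels))
        loss.backward()
        optimizer.step()
```

Stronger variants typically replace FGSM with an iterative attack when generating the training-time adversarial examples, trading compute for robustness.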
