Understanding AI Vulnerabilities

As artificial intelligence capabilities evolve, so too will the tactics used to exploit them. 

One major security concern is the susceptibility of AI to “adversarial attacks,” or breaches from exploitative parties.  | MONTAGE BY NIKO YAITANES/HARVARD MAGAZINE; IMAGES BY UNSPLASH

How safe is generative AI today? Yaron Singer, former McKay professor of computer science and applied mathematics and currently vice president of AI and security at Cisco, has spent the past six years developing guardrails to protect AI systems. In 2019, he co-founded Robust Intelligence with Kojin Oshiba ‘18. The startup, acquired by Cisco in August 2024, evaluates commercial AI models at scale, looking for vulnerabilities and providing protection against abuse or privacy breaches. Many of these AI safety practices were uncommon when Robust Intelligence was founded.

To understand the present landscape of AI security, it’s useful to look back. Traditional artificial intelligence, including machine learning (ML) models, functions by analyzing data to make predictions or classifications. These models take inputs and produce corresponding outputs based on patterns learned from historical data. “Ten years ago,” Singer says, “what we would think of as AI, we used to call ‘common machine learning.’” For example, spam filters using machine learning classify emails based on their probability of being junk. Similarly, in healthcare, AI might analyze medical records to predict the likelihood of a patient being hospitalized. Importantly, many traditional AI systems do take actions—self-driving cars make real-time navigation decisions, hiring algorithms filter job applications, and recidivism models influence judicial sentencing. However, these actions are typically constrained to predefined tasks or decision-making rules, operating within the boundaries of specific domains.

Generative AI—which found its popular foothold with OpenAI’s ChatGPT 3 (now in version 4o)—represents a paradigm shift. Unlike traditional models, which primarily make predictions or classifications based on data, generative AI models, particularly large language models (LLMs), create new content based on learned patterns. Chatbots like ChatGPT can generate text, images, and more (often in real time) in response to user prompts. This opens a wide range of possibilities but also introduces novel security risks. Whereas an ML-based spam filter might misclassify an email, a generative AI system could be manipulated into generating misleading, harmful, or sensitive content. One major security concern is the susceptibility of AI to “adversarial attacks,” or breaches from exploitative parties.

 

As Singer explains, in traditional machine learning, even minor changes in input data could significantly alter outputs, sometimes in unintended ways. For example, in the case of a spam filter, “small changes in text” could allow an email that should be classified as spam to bypass detection. With generative AI, the risks escalate. Subtle modifications in prompts or training data could lead to significant vulnerabilities, such as generating inappropriate content, leaking confidential information, or even executing unauthorized actions. For example, in 2023, Robust Intelligence researchers discovered security vulnerabilities in supposed AI-safety guardrails used by chipmaker Nvidia—in one scenario, the team successfully manipulated LLMs into overcoming existing restrictions and releasing personally identifiable information from an assumed-secure database.

Since the launch of ChatGPT, companies have begun integrating these AI capabilities into their products and applications using LLM application programming interfaces (APIs). These interfaces allow developers to connect AI capabilities to their software by sending inputs and receiving outputs from pre-trained models. But when hidden vulnerabilities in models exist, such integrations can cause security issues at scale.

“We’re seeing a lot of companies who are facing this dilemma now,” Singer says. “On the one hand, they’re saying, ‘Well, we have a lot of data here. That can give us a business advantage. We can then customize our AI models according to the data to benefit our customers.’ Right. But when vulnerabilities exist and there’s potential for privacy concerns and data leakage, that becomes a huge problem.”

Bias in AI systems further compounds these risks. Poorly calibrated models can unintentionally perpetuate harmful stereotypes or produce “discriminatory, racist, or sexist outputs,” he says (see “Artificial Intelligence and Ethics,” January-February 2019, and “Bias in Artificial Intelligence”). As AI systems take on increasingly sensitive roles in society, rigorous testing and validation of data become more important to mitigate these risks.

As of 2025, businesses have embraced APIs more than ever, weaving them into their operations as essential tools for staying competitive in the digital age. A 2024 Gartner survey indicates that 71 percent of digital businesses are consuming APIs created by third parties, highlighting the widespread reliance on external API integrations to enhance functionality. “Businesses must be educated about the safety of their models and the data they use from patients or customers,” Singer says, “And they need to [implement] the appropriate guardrails and the right decisions for releasing models for commercial use.” These techniques are designed to help prevent failures and attacks across the data, model, and deployment stages.

Solutions to Safeguarding AI

What if data could be validated before (or while) it’s fed to an AI model? Like a bouncer at a bar vetting guests—checking IDs, assessing behavior, making sure nothing dangerous is brought into the venue—a “software bouncer” could validate AI inputs before they influence model behavior. In real time, AI validation involves rigorous pre-deployment testing to identify vulnerabilities in both data and models, preventing adversarial manipulation. Before Singer and his team developed many of the validation techniques used by Robust Intelligence, such practices were “not a common practice [in AI security],” he says. “We pioneered both the validation [techniques] and what we’re calling ‘guardrails’ for models.”

Data-Related Vulnerabilities

“Prompt engineering”: This term refers to the practice of designing precise inputs to steer an AI model toward specific outputs. While it is primarily used to enhance performance and improve responses, it can also be used to manipulate AI behavior in unintended ways, including by malicious actors looking to exploit vulnerabilities to generate biased content, reveal private conversations, or even assist in carrying out cyberattacks. For instance, if an AI-powered customer service bot is manipulated, it could be tricked into revealing sensitive corporate information. The implications stretch across industries: in healthcare, Singer says, an improperly secured AI could provide incorrect medical advice; in finance, generative AI could be manipulated to approve fraudulent transactions or even release fiduciary information.

“Data poisoning”: If the data used to train or configure a model contains toxic content, bias, or personally identifiable information (PII), it can manifest as outputs that are harmful or violate privacy standards. Singer’s team works to identify and screen out data that could result in ethical issues.

Model-Related Vulnerabilities

“AI jailbreaking”: This term refers to setting an AI output “free” from its intended restrictions. Using jailbreaking techniques, bad actors could potentially manipulate prompts or exploit system vulnerabilities to bypass safeguards and make a model generate content it was designed to block (for example, outputting unsafe content or leaking sensitive information).

Adversarial testing”: As Singer explains, this is a method in which external experts simulate attacks to assess an AI model’s resistance to manipulation. Returning to the bouncer analogy, imagine an external party deliberately provoking an aggressive situation to see how well security staff can manage it. Similarly, adversarial testing exposes weaknesses in AI systems, helping developers anticipate and mitigate adversarial threats.

In 2023, Robust Intelligence used this “adversarial testing” on OpenAI’s Chat GPT-4 to search for weaknesses—although not all attempts worked, some did, including a jailbreak designed to generate phishing messages and one for producing ideas to help a malicious actor remain hidden on a government computer network. Although these jailbreaks have been addressed, new types of weaknesses appear as models evolve and as their use across a range of sectors expands. Unlike some models that have been open-sourced, OpenAI has kept its most advanced LLMs (like GPT-4) behind closed doors, necessitating security measures to mitigate misuse; other companies, like Meta, have open-sourced their models.

“AI firewall”: Another innovation developed by Singer’s team is the “AI firewall.” Just as a traditional firewall shields networks from cyber threats, AI firewalls scan incoming data to block malicious attacks, misinformation, and unethical content before it reaches the model. They also monitor a model’s responses, preventing biased or unsafe outputs. AI firewalls are used to verify that inputs are “safe” before they’re given to a model and that the model adheres to predefined ethical constraints.

A Cat and Mouse Game

Even as security measures improve, the question remains: can AI vulnerabilities ever be fully resolved? Singer says, in short…no. “As the technology evolves, so too will the techniques used for adversarial attacks, hacking, and manipulating data,” he says. It’s a cat-and-mouse game between developers and adversaries, with new vulnerabilities emerging as quickly as they are patched. “That question has a lot of merit, however,” Singer continues, “because—and especially when we’re looking at this from an academic lens—there’s unbelievable fundamental work that we’re doing in computer science and math presently.” He uses the example of quantum cryptography, a method of encryption that uses the principles of quantum mechanics to secure and transmit data, as an emerging field dedicated to giving “provable guarantees.” These guarantees can extend to the safety and data security of computation.

“So, there are very, very strong foundations that we can rely on to ensure information security. However, in the world of general security, there are a lot of things that just happen for odd reasons,” Singer says. He then gives a hypothetical: imagine a new version of software for which “somebody forgets to do an update for some library, or there’s a misconfiguration in a cloud—all these things that depend on human behavior. Unknown organizational gaps, and what have you. These are the things that invite and introduce risk to the use of artificial intelligence.”

The responsibility to protect AI systems, then, lies not only with developers, but also with businesses and regulators—who must ensure that the pressure to innovate doesn’t come at the expense of security and ethical integrity. Although he was approved to become a full, tenured professor at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) in 2020, Singer’s transition to Robust Intelligence—now Cisco—seems complete. He’s eyeing his next goal: helping to ensure that the future of AI is secure for its millions of users and billions of people around the world. “In society, we [constantly] look at tradeoffs and we accept them,” Singer says. “That’s going to be the case with AI: it’s going to introduce more risk. Some of that risk we will accept, and we’ll protect against whatever we can.”

Disclosure: Olivia Farrar, the author of this article, was employed at Robust Intelligence between 2021-2022.


 


 

Read more articles by Olivia Farrar

You might also like

Harvard Releases Antisemitism and Anti-Muslim Task Force Reports

University publishes findings from thorough examinations of campus conditions.

Harvard Renames Diversity Office

The decision follows pressure from the Trump administration to eliminate DEI practices. 

Centralizing University Discipline

Harvard establishes new disciplinary procedures for campus protest violations.

Most popular

Harvard Releases Antisemitism and Anti-Muslim Task Force Reports

University publishes findings from thorough examinations of campus conditions.

Harvard Renames Diversity Office

The decision follows pressure from the Trump administration to eliminate DEI practices. 

The New Gender Gaps

What to do as men and boys fall behind

Explore More From Current Issue

The Trump Administration's Impact on Higher Education

Unprecedented federal actions against research funding, diversity, speech, and more

89664

Jessica Shand—Math and Music at Harvard

Jessica Shand blends math and music.

89677

Paper Peepshows at Harvard's Baker Library

How “paper peepshows” brought distant realms to life

89684