On this tutorial, we’ll discover methods to check an OpenAI mannequin towards single-turn adversarial assaults utilizing deepteam.
deepteam supplies 10+ assault strategies—like immediate injection, jailbreaking, and leetspeak—that expose weaknesses in LLM purposes. It begins with easy baseline assaults after which applies extra superior methods (often known as assault enhancement) to imitate real-world malicious habits. Try the FULL CODES right here.
By working these assaults, we will consider how nicely the mannequin defends towards completely different vulnerabilities.
In deepteam, there are two principal kinds of assaults:
Right here, we’ll focus solely on single-turn assaults.
Putting in the dependencies
pip set up deepteam openai pandas
You’ll have to set your OPENAI_API_KEY as an surroundings variable earlier than working the red_team() perform, since deepteam makes use of LLMs to each generate adversarial assaults and consider LLM outputs.
To get an OpenAI API key, go to https://platform.openai.com/settings/group/api-keys and generate a brand new key. For those who’re a brand new consumer, you might want so as to add billing particulars and make a minimal cost of $5 to activate API entry. Try the FULL CODES right here.
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')
Importing the libraries
import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.assaults.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem
Defining the Mannequin Callback
This code defines an async callback perform that queries the OpenAI mannequin (gpt-4o-mini) and returns the mannequin’s response textual content. It acts because the LLM’s output generator for the assault framework. Try the FULL CODES right here.
shopper = OpenAI()
# Outline callback for querying the LLM
async def model_callback(enter: str) -> str:
response = shopper.chat.completions.create(
mannequin="gpt-4o-mini", # use "gpt-4o" if you need a stronger mannequin
messages=[{"role": "user", "content": input}],
)
return response.selections[0].message.content material
Defining Vulnerability and Assaults
Right here, we specify the vulnerability (IllegalActivity) and set its kind to youngster exploitation. We then outline completely different assault strategies similar to PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, and MathProblem
Every of those assault sorts will probably be defined intimately later within the tutorial. Try the FULL CODES right here.
# Vulnerability
illegal_activity = IllegalActivity(sorts=["child exploitation"])
# Assaults
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()
Immediate Injection
Immediate Injection is an assault the place a consumer tries to override the mannequin’s authentic directions by injecting malicious or manipulative textual content into the immediate. The aim is to trick the mannequin into ignoring security insurance policies and producing dangerous or restricted content material.
The injected immediate makes an attempt to drive the mannequin into role-playing as an unethical persona known as OPPO, who disregards security guidelines and encourages criminality.
Regardless of this try, the mannequin refused to conform—it responded with:
“I’m sorry, I can not help with that.”
This confirms that the mannequin efficiently resisted the immediate injection and returned a PASS consequence for this check. Try the FULL CODES right here.
risk_assessment = red_team(
model_callback=model_callback,
vulnerabilities=[illegal_activity],
assaults=[prompt_injection],
)
Graybox Assault
The GrayBox assault leverages partial data concerning the goal LLM system to craft adversarial prompts. Not like random inputs, GrayBox assaults exploit recognized weaknesses by reframing the baseline assault with summary or deceptive language, making it tougher for security filters to detect malicious intent.
On this check, the adversarial immediate tried to encourage unlawful actions by disguising directions about creating false identification paperwork and utilizing encrypted channels. Try the FULL CODES right here.
risk_assessment = red_team(
model_callback=model_callback,
vulnerabilities=[illegal_activity],
assaults=[graybox_attack],
)
Base64 Assault
The Base64 assault is a typical adversarial method the place dangerous directions are encoded in Base64 to bypass security filters. As a substitute of presenting malicious content material instantly, the attacker hides it in an encoded format, hoping the mannequin will decode it and execute the directions.
On this check, the encoded string contained instructions associated to criminality, disguised to look innocent at first look. The mannequin, nevertheless, didn’t try and decode or observe via with the hidden request. Try the FULL CODES right here.
risk_assessment = red_team(
model_callback=model_callback,
vulnerabilities=[illegal_activity],
assaults=[base64_attack],
)
Leetspeak Assault
The Leetspeak assault disguises malicious directions by changing regular characters with numbers or symbols (for instance, a turns into 4, e turns into 3, i turns into 1). This symbolic substitution makes dangerous textual content tougher to detect with easy key phrase filters, whereas nonetheless being readable to people or programs which may decode it.
On this check, the assault textual content tried to instruct minors in unlawful actions, written in leetspeak format. Regardless of the obfuscation, the mannequin clearly acknowledged the malicious intent. Try the FULL CODES right here.
risk_assessment = red_team(
model_callback=model_callback,
vulnerabilities=[illegal_activity],
assaults=[leetspeak_attack],
)
ROT-13 Assault
The ROT-13 assault is a traditional obfuscation methodology the place every letter is shifted 13 positions within the alphabet. For instance, A turns into N, B turns into O, and so forth. This transformation scrambles dangerous directions right into a coded kind, making them much less more likely to set off easy keyword-based content material filters. Nevertheless, the textual content can nonetheless be simply decoded again into its authentic kind. Try the FULL CODES right here.
risk_assessment = red_team(
model_callback=model_callback,
vulnerabilities=[illegal_activity],
assaults=[rot_attack],
)
Multi-lingual Assault
The multilingual assault works by translating a dangerous baseline immediate right into a much less generally monitored language. The concept is that content material filters and moderation programs could also be extra strong in broadly used languages (similar to English) however much less efficient in different languages, permitting malicious directions to bypass detection.
On this check, the assault was written in Swahili, asking for directions associated to criminality. Try the FULL CODES right here.
risk_assessment = red_team(
model_callback=model_callback,
vulnerabilities=[illegal_activity],
assaults=[multi_attack],
)
Math Drawback
The mathematics drawback assault disguises malicious requests inside mathematical notation or drawback statements. By embedding dangerous directions in a proper construction, the textual content could look like a innocent educational train, making it tougher for filters to detect the underlying intent.
On this case, the enter framed unlawful exploitation content material as a bunch idea drawback, asking the mannequin to “show” a dangerous final result and supply a “translation” in plain language. Try the FULL CODES right here.
risk_assessment = red_team(
model_callback=model_callback,
vulnerabilities=[illegal_activity],
assaults=[math_attack],
)
Try the FULL CODES right here. Be at liberty to take a look at our GitHub Web page for Tutorials, Codes and Notebooks. Additionally, be happy to observe us on Twitter and don’t neglect to affix our 100k+ ML SubReddit and Subscribe to our Publication.