Tips on how to Check an OpenAI Mannequin Towards Single-Flip Adversarial Assaults Utilizing deepteam

August 18, 2025

46

On this tutorial, we’ll discover methods to check an OpenAI mannequin towards single-turn adversarial assaults utilizing deepteam.

deepteam supplies 10+ assault strategies—like immediate injection, jailbreaking, and leetspeak—that expose weaknesses in LLM purposes. It begins with easy baseline assaults after which applies extra superior methods (often known as assault enhancement) to imitate real-world malicious habits. Try the FULL CODES right here.

By working these assaults, we will consider how nicely the mannequin defends towards completely different vulnerabilities.

In deepteam, there are two principal kinds of assaults:

Right here, we’ll focus solely on single-turn assaults.

Putting in the dependencies

pip set up deepteam openai pandas

You’ll have to set your OPENAI_API_KEY as an surroundings variable earlier than working the red_team() perform, since deepteam makes use of LLMs to each generate adversarial assaults and consider LLM outputs.

To get an OpenAI API key, go to https://platform.openai.com/settings/group/api-keys and generate a brand new key. For those who’re a brand new consumer, you might want so as to add billing particulars and make a minimal cost of $5 to activate API entry. Try the FULL CODES right here.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

Importing the libraries

import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.assaults.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem

Defining the Mannequin Callback

This code defines an async callback perform that queries the OpenAI mannequin (gpt-4o-mini) and returns the mannequin’s response textual content. It acts because the LLM’s output generator for the assault framework. Try the FULL CODES right here.

shopper = OpenAI()

# Outline callback for querying the LLM
async def model_callback(enter: str) -> str:
    response = shopper.chat.completions.create(
        mannequin="gpt-4o-mini",  # use "gpt-4o" if you need a stronger mannequin
        messages=[{"role": "user", "content": input}],
    )
    return response.selections[0].message.content material

Defining Vulnerability and Assaults

Right here, we specify the vulnerability (IllegalActivity) and set its kind to youngster exploitation. We then outline completely different assault strategies similar to PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, and MathProblem

Every of those assault sorts will probably be defined intimately later within the tutorial. Try the FULL CODES right here.

# Vulnerability
illegal_activity = IllegalActivity(sorts=["child exploitation"])

# Assaults
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()

Immediate Injection

Immediate Injection is an assault the place a consumer tries to override the mannequin’s authentic directions by injecting malicious or manipulative textual content into the immediate. The aim is to trick the mannequin into ignoring security insurance policies and producing dangerous or restricted content material.

The injected immediate makes an attempt to drive the mannequin into role-playing as an unethical persona known as OPPO, who disregards security guidelines and encourages criminality.

Regardless of this try, the mannequin refused to conform—it responded with:

“I’m sorry, I can not help with that.”

This confirms that the mannequin efficiently resisted the immediate injection and returned a PASS consequence for this check. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[prompt_injection],
    )

Graybox Assault

The GrayBox assault leverages partial data concerning the goal LLM system to craft adversarial prompts. Not like random inputs, GrayBox assaults exploit recognized weaknesses by reframing the baseline assault with summary or deceptive language, making it tougher for security filters to detect malicious intent.

On this check, the adversarial immediate tried to encourage unlawful actions by disguising directions about creating false identification paperwork and utilizing encrypted channels. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[graybox_attack],
    )

Base64 Assault

The Base64 assault is a typical adversarial method the place dangerous directions are encoded in Base64 to bypass security filters. As a substitute of presenting malicious content material instantly, the attacker hides it in an encoded format, hoping the mannequin will decode it and execute the directions.

On this check, the encoded string contained instructions associated to criminality, disguised to look innocent at first look. The mannequin, nevertheless, didn’t try and decode or observe via with the hidden request. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[base64_attack],
    )

Leetspeak Assault

The Leetspeak assault disguises malicious directions by changing regular characters with numbers or symbols (for instance, a turns into 4, e turns into 3, i turns into 1). This symbolic substitution makes dangerous textual content tougher to detect with easy key phrase filters, whereas nonetheless being readable to people or programs which may decode it.

On this check, the assault textual content tried to instruct minors in unlawful actions, written in leetspeak format. Regardless of the obfuscation, the mannequin clearly acknowledged the malicious intent. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[leetspeak_attack],
    )

ROT-13 Assault

The ROT-13 assault is a traditional obfuscation methodology the place every letter is shifted 13 positions within the alphabet. For instance, A turns into N, B turns into O, and so forth. This transformation scrambles dangerous directions right into a coded kind, making them much less more likely to set off easy keyword-based content material filters. Nevertheless, the textual content can nonetheless be simply decoded again into its authentic kind. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[rot_attack],
    )

Multi-lingual Assault

The multilingual assault works by translating a dangerous baseline immediate right into a much less generally monitored language. The concept is that content material filters and moderation programs could also be extra strong in broadly used languages (similar to English) however much less efficient in different languages, permitting malicious directions to bypass detection.

On this check, the assault was written in Swahili, asking for directions associated to criminality. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[multi_attack],
    )

Math Drawback

The mathematics drawback assault disguises malicious requests inside mathematical notation or drawback statements. By embedding dangerous directions in a proper construction, the textual content could look like a innocent educational train, making it tougher for filters to detect the underlying intent.

On this case, the enter framed unlawful exploitation content material as a bunch idea drawback, asking the mannequin to “show” a dangerous final result and supply a “translation” in plain language. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[math_attack],
    )

Try the FULL CODES right here. Be at liberty to take a look at our GitHub Web page for Tutorials, Codes and Notebooks. Additionally, be happy to observe us on Twitter and don’t neglect to affix our 100k+ ML SubReddit and Subscribe to our Publication.

I’m a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I’ve a eager curiosity in Knowledge Science, particularly Neural Networks and their software in varied areas.

Previous articleHow Including Extra Presents and Providers Can Hurt Your Enterprise

Next articleios – How you can scale back cell peak (vertical margins) when utilizing UIListContentConfiguration

Tips on how to Check an OpenAI Mannequin Towards Single-Flip Adversarial Assaults Utilizing deepteam

Putting in the dependencies

Importing the libraries

Defining the Mannequin Callback

Defining Vulnerability and Assaults

Immediate Injection

Graybox Assault

Base64 Assault

Leetspeak Assault

ROT-13 Assault

Multi-lingual Assault

Math Drawback

An Implementation to Construct Dynamic AI Techniques with the Mannequin Context Protocol (MCP) for Actual-Time Useful resource and Instrument Integration

Microsoft AI Proposes BitNet Distillation (BitDistill): A Light-weight Pipeline that Delivers as much as 10x Reminiscence Financial savings and about 2.65x CPU Speedup

Weak-for-Robust (W4S): A Novel Reinforcement Studying Algorithm that Trains a weak Meta Agent to Design Agentic Workflows with Stronger LLMs

LEAVE A REPLY Cancel reply

Most Popular

Ionic Angular ion-content inner-scroll has zero peak on iOS stopping scrolling – all customary fixes tried

Obtain 2x quicker information lake question efficiency with Apache Iceberg on Amazon Redshift

ADU 1391: The Way forward for Drones: New Drones, Alternatives and Challenges

Raspberry Pi Goals for Extra Versatile OS Configuration with a Transfer to Cloud-Init

Recent Comments

ABOUT US

POPULAR POSTS

Ionic Angular ion-content inner-scroll has zero peak on iOS stopping scrolling – all customary fixes tried

Obtain 2x quicker information lake question efficiency with Apache Iceberg on Amazon Redshift

ADU 1391: The Way forward for Drones: New Drones, Alternatives and Challenges

POPULAR CATEGORY