Publication Details

Automated adversarial prompts generation against black-box large language models

Title
Automated Adversarial Prompts Generation Against Black-Box Large Language Models
Author
Sun, Qiyang
Date
[2025-04-29 Tue]
Module
COMP3200
Session
2024/2025
Supervisor
Dr Erisa Karafili
Second Examiner
Professor Jonathon Hare
Third Examiner
Dr Basel Halak
Note
A project report submitted for the award of BSc Computer Science.
Pages
90
Publisher
Unpublished
Uniform Resource Locator
https://secure.ecs.soton.ac.uk/notes/comp3200/e_archive/COMP3200/2425/qs2g22/project.html
Digital Object Identifier
Not available
Content Warning
This dissertation contains examples of potentially harmful, offensive or upsetting language. Readers are advised to exercise discretion.
Abstract
Given their widespread use, efforts have been made to align large language models with human values and ethical standards. However, the persistence of various jailbreak attacks highlights the need for further research to strengthen their security. While prior work has explored manual crafting, helper-LLM prompt generation, and adversarial machine learning, few studies have investigated the use of evolutionary algorithms. This paper reviews existing attack techniques and proposes a black-box attack framework based on evolutionary algorithms, instantiated using genetic algorithms and evolution strategies. Key innovations include a novel and effective population initialisation policy, fitness evaluation without access to the internals of the victim model, chunk-level crossover, and word-level Gaussian-distributed mutation. Experiments achieve a 100% attack success rate on less aligned models, and success rates of 28.7% (genetic algorithm) and 37.7% (evolution strategy) on well-aligned models within a limited number of generations, comparable to the state-of-the-art white-box hierarchical genetic algorithm attack, which achieves a success rate of 32.7% under the same limit. The framework demonstrates high transferability to proprietary models such as ChatGPT, Gemini and DeepSeek. These results highlight that even safety-aligned large language models remain vulnerable to black-box jailbreak attacks.
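The chunk-level crossover and word-level Gaussian-distributed mutation named in the abstract could be sketched roughly as follows. This is a minimal illustration only, not the thesis's implementation: the function names, the chunk size, the coin-flip chunk selection, and the replacement vocabulary are all assumptions made here for the sake of a runnable example.

```python
import random


def chunk_crossover(parent_a: str, parent_b: str, chunk_size: int = 3) -> str:
    """Chunk-level crossover: split both parent prompts into fixed-size word
    chunks and build a child by picking each chunk from one parent at random."""
    words_a, words_b = parent_a.split(), parent_b.split()
    chunks_a = [words_a[i:i + chunk_size] for i in range(0, len(words_a), chunk_size)]
    chunks_b = [words_b[i:i + chunk_size] for i in range(0, len(words_b), chunk_size)]
    child: list[str] = []
    for ca, cb in zip(chunks_a, chunks_b):
        child.extend(ca if random.random() < 0.5 else cb)
    return " ".join(child)


def gaussian_word_mutation(prompt: str, vocabulary: list[str], sigma: float = 1.0) -> str:
    """Word-level mutation: draw the number of words to replace from |N(0, sigma)|,
    then substitute that many randomly chosen positions with words from a pool."""
    words = prompt.split()
    n_mutations = min(len(words), abs(round(random.gauss(0.0, sigma))))
    for idx in random.sample(range(len(words)), n_mutations):
        words[idx] = random.choice(vocabulary)
    return " ".join(words)
```

In a black-box loop, offspring produced this way would be scored only by querying the victim model's responses, consistent with the fitness evaluation described in the abstract.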
Citation (IEEE)
Q. Sun, "Automated adversarial prompts generation against black-box large language models," Bachelor's thesis, Univ. of Southampton, Southampton, U.K., 2025. [Online]. Available: https://secure.ecs.soton.ac.uk/notes/comp3200/e_archive/COMP3200/2425/qs2g22/project.html
Citation (BibLaTeX)
@thesis{sun2025automated,
  author = {Sun, Qiyang},
  title = {Automated Adversarial Prompts Generation Against Black-Box Large Language Models},
  type = {Bachelor's thesis},
  institution = {University of Southampton},
  date = {2025-04-29},
  url = {https://secure.ecs.soton.ac.uk/notes/comp3200/e_archive/COMP3200/2425/qs2g22/project.html},
}

Not publicly available.



Copyright © 2024–2025 | Author: Qiyang Sun <Q.Sun@soton.ac.uk> | Privacy | Last modified: 2025-08-12 Tue 22:49 | Built with Emacs 30.1 (Org mode 9.7.11)