Fight Back Against Jailbreaking via Prompt Adversarial Tuning

Published in NeurIPS 2024

This paper proposes prompt adversarial tuning as a defense against jailbreak attacks.

Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. (2024). "Fight Back Against Jailbreaking via Prompt Adversarial Tuning." NeurIPS 2024.