Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Published in TPAMI (Adopted at scale by Anthropic), 2026

The approach described in this journal article was later adopted by Anthropic.

Zeming Wei, Yifei Wang, Li Ang, Yichuan Mo, and Yisen Wang. (2026). "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations." TPAMI 2026.