LLM Jailbreak Defense Research Project
Published research on improving LLM security by analyzing model activations to defend against jailbreaking attacks.

About This Project
Conducted research on Large Language Model security as part of QMIND (Queen's Machine Intelligence and Data Science), focused on building robust defenses against jailbreaking methods that attempt to bypass AI safety measures. The project involved deep analysis of neural network internals to understand how malicious prompts circumvent model safeguards.

Developed a framework that significantly improves refusal rates against a variety of LLM jailbreaking techniques while maintaining the model's performance on legitimate tasks. The approach centers on analyzing the model's internal activations to identify groups of neurons (features) that are crucial to its safety mechanisms. By strategically boosting or suppressing these identified features, the system can better detect and refuse potentially harmful requests without degrading the model's helpful capabilities. The research involved extensive experimentation with different activation patterns, feature selection algorithms, and intervention strategies. Key technical contributions include novel methods for identifying safety-relevant neural pathways and efficient techniques for real-time feature manipulation during inference.

The work culminated in a peer-reviewed paper, co-authored and published at CUCAI (Canadian Undergraduate Conference on Artificial Intelligence), demonstrating the practical effectiveness and theoretical significance of the proposed defense mechanisms. This research contributes to the broader field of AI safety and responsible AI development.
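To make the activation-based idea concrete, the sketch below shows one common way such interventions are implemented: estimate a safety-relevant direction as the difference of mean hidden activations on harmful versus benign inputs, then add or subtract that direction via a forward hook at inference time. This is a minimal, self-contained illustration on a toy PyTorch module, not the project's actual code; the names (ToyLM, find_refusal_direction, steer, alpha) and the mean-difference heuristic are assumptions made for the example.

import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyLM(nn.Module):
    """Stand-in for one transformer block: input -> hidden -> output."""

    def __init__(self, d_in=32, d_hidden=64, d_out=32):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.act = nn.ReLU()
        self.decoder = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = self.act(self.encoder(x))
        return self.decoder(h)


def find_refusal_direction(model, harmful_batch, benign_batch):
    """Estimate a safety-relevant feature direction as the difference of
    mean hidden activations on harmful vs. benign inputs (a simple
    baseline, assumed here for illustration)."""
    captured = []

    def grab(_module, _inp, out):
        captured.append(out.detach())

    handle = model.encoder.register_forward_hook(grab)
    with torch.no_grad():
        model(harmful_batch)
        model(benign_batch)
    handle.remove()

    direction = captured[0].mean(dim=0) - captured[1].mean(dim=0)
    return direction / direction.norm()


def steer(model, direction, alpha):
    """Attach a hook that boosts (alpha > 0) or suppresses (alpha < 0)
    the identified feature direction during inference."""

    def intervene(_module, _inp, out):
        return out + alpha * direction

    return model.encoder.register_forward_hook(intervene)


model = ToyLM()
harmful = torch.randn(16, 32)  # stand-in inputs for harmful prompts
benign = torch.randn(16, 32)   # stand-in inputs for benign prompts

direction = find_refusal_direction(model, harmful, benign)

# Boost the safety feature at inference time, then remove the hook so
# the model's behaviour on legitimate tasks is untouched afterwards.
handle = steer(model, direction, alpha=4.0)
with torch.no_grad():
    steered_output = model(torch.randn(1, 32))
handle.remove()

In a real LLM the hook would be registered on a chosen transformer layer and the harmful/benign activations gathered from actual prompt sets; the toy module above only demonstrates the mechanics of identifying a feature direction and manipulating it in real time.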