Research · ⭐ Featured

LLM Jailbreak Defense Research Project

Published research on improving LLM security by analyzing model activations to defend against jailbreaking attacks.

About This Project

Conducted research on Large Language Model security as part of QMIND (Queen's Machine Intelligence and Data Science), focused on building robust defenses against jailbreaking methods that attempt to bypass AI safety measures. The project involved detailed analysis of neural network internals to understand how malicious prompts circumvent model safeguards.

Developed a framework that improves refusal rates against a range of LLM jailbreaking techniques while preserving the model's performance on legitimate tasks. The approach analyzes the model's internal activations to identify groups of neurons (features) that are central to its safety mechanisms. By selectively boosting or suppressing these features, the system more reliably detects and refuses potentially harmful requests without degrading the model's helpful capabilities.

The research involved extensive experimentation with activation patterns, feature selection algorithms, and intervention strategies. Key technical contributions include methods for identifying safety-relevant neural pathways and efficient techniques for manipulating those features in real time during inference.

The work culminated in a co-authored, peer-reviewed paper published at CUCAI (Canadian Undergraduate Conference on Artificial Intelligence), demonstrating the practical effectiveness of the proposed defense mechanisms and contributing to the broader field of AI safety and responsible AI development.
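To give a rough sense of the feature-intervention idea, the sketch below shows how an activation could be nudged along a "safety direction" with a PyTorch forward hook during inference. This is a minimal illustration, not the published method: the model (gpt2), the layer index, the steering strength, and the randomly initialized direction vector are all placeholder assumptions; in practice the direction would come from the feature-identification step described above.

```python
# Minimal sketch of activation steering, assuming a safety-relevant feature
# direction has already been identified (here it is a random placeholder).
# Model choice, layer index, and steering strength are illustrative, not the
# configuration used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden_size = model.config.hidden_size
layer_idx = 6   # hypothetical layer to intervene on
alpha = 4.0     # hypothetical steering strength

# Placeholder for a safety-relevant feature direction. In practice this would
# be derived from contrasting activations on harmful vs. benign prompts.
safety_direction = torch.randn(hidden_size)
safety_direction = safety_direction / safety_direction.norm()

def boost_safety_feature(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # with shape (batch, seq_len, hidden_size); add the steering vector there.
    hidden = output[0] + alpha * safety_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(boost_safety_feature)

prompt = "Explain how to stay safe online."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # restore unmodified inference when the intervention is no longer needed
```

Registering the intervention as a hook keeps the base model weights untouched, which matches the goal of adjusting safety behavior at inference time without retraining.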

Technologies Used

Python · PyTorch · Transformers · Neural Network Analysis · Feature Engineering · AI Safety · Research Methodology · Academic Writing

Project Links