| Date | Affiliation | Venue | Title | Keywords |
| --- | --- | --- | --- | --- |
| 20.10 | Facebook AI Research | arXiv | Recipes for Safety in Open-domain Chatbots | Toxic Behavior&Open-domain |
| 22.02 | DeepMind | EMNLP 2022 | Red Teaming Language Models with Language Models | Red Teaming&Harm Test |
| 22.03 | OpenAI | NeurIPS 2022 | Training language models to follow instructions with human feedback | InstructGPT&RLHF&Harmless |
| 22.04 | Anthropic | arXiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Helpful&Harmless |
| 22.05 | UCSD | EMNLP 2022 | An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models | Privacy Risks&Memorization |
| 22.09 | Anthropic | arXiv | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Red Teaming&Harmless&Helpful |
| 22.12 | Anthropic | arXiv | Constitutional AI: Harmlessness from AI Feedback | Harmless&Self-improvement&RLAIF |
| 23.07 | UC Berkeley | NeurIPS 2023 | Jailbroken: How Does LLM Safety Training Fail? | Jailbreak&Competing Objectives&Mismatched Generalization |
| 23.08 | The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong | arXiv | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Safety Alignment&Adversarial Attack |
| 23.08 | University College London, Tilburg University | arXiv | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | Security&AI Alignment |
| 23.09 | Peking University | arXiv | RAIN: Your Language Models Can Align Themselves without Finetuning | Self-boosting&Rewind Mechanisms |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arXiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | UC Riverside | arXiv | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Adversarial Attacks&Vulnerabilities&Model Security |
| 23.10 | Rice University | NAACL 2024 (Findings) | Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models | Key Prompt Protection&Large Language Models&Unauthorized Access Prevention |
| 23.11 | KAIST AI | arXiv | HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning | Hate Speech&Detection |
| 23.11 | CMU | AACL 2023 (ART of Safety Workshop) | Measuring Adversarial Datasets | Adversarial Robustness&AI Safety&Adversarial Datasets |
| 23.11 | UIUC | arXiv | Removing RLHF Protections in GPT-4 via Fine-Tuning | Remove Protection&Fine-Tuning |
| 23.11 | IT University of Copenhagen, University of Washington | arXiv | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | Red Teaming |
| 23.11 | Fudan University, Shanghai AI Lab | arXiv | Fake Alignment: Are LLMs Really Aligned Well? | Alignment Failure&Safety Evaluation |
| 23.11 | University of Southern California | arXiv | SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data | RLHF&Safety |
| 23.11 | Google Research | arXiv | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Adversarial Testing&AI-Assisted Red Teaming&Application Safety |
| 23.11 | Tencent AI Lab | arXiv | Adversarial Preference Optimization | Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction |
| 23.11 | Docta.ai | arXiv | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Data Credibility&Safety Alignment |
| 23.11 | CIIRC CTU in Prague | arXiv | A Security Risk Taxonomy for Large Language Models | Security Risks&Taxonomy&Prompt-based Attacks |
| 23.11 | Meta, University of Illinois Urbana-Champaign | NAACL 2024 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing |
| 23.11 | The Ohio State University, University of California, Davis | NAACL 2024 | How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Open-Source LLMs&Malicious Demonstrations&Trustworthiness |
| 23.12 | Drexel University | arXiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Tenyx | arXiv | Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation | Geometric Interpretation&Intrinsic Dimension&Toxicity Detection |
| 23.12 | Independent (now at Google DeepMind) | arXiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | University of Liechtenstein, University of Duesseldorf | arXiv | Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits | Negotiation&Reasoning&Prompt Hacking |
| 23.12 | University of Wisconsin-Madison, University of Michigan Ann Arbor, ASU, Washington University | arXiv | Exploring the Limits of ChatGPT in Software Security Applications | Software Security |
| 23.12 | GenAI at Meta | arXiv | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Human-AI Conversation&Safety Risk Taxonomy |
| 23.12 | University of California Riverside, Microsoft | arXiv | Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Safety Alignment&Summarization&Vulnerability |
| 23.12 | MIT, Harvard | NeurIPS 2023 (Workshop) | Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | Competing Objectives&Forbidden Fact Task&Model Decomposition |
| 23.12 | University of Science and Technology of China | arXiv | Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models | Text Protection&Silent Guardian |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems | Agentic AI Systems&LM-based Agent |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arXiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | Tencent AI Lab, The Chinese University of Hong Kong | arXiv | Aligning Language Models with Judgments | Judgment Alignment&Contrastive Unlikelihood Training |
| 24.01 | Delft University of Technology | arXiv | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Red Teaming&Hallucinations&Mathematics Tasks |
| 24.01 | Apart Research, University of Edinburgh, Imperial College London, University of Oxford | arXiv | Large Language Models Relearn Removed Concepts | Neuroplasticity&Concept Redistribution |
| 24.01 | Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University | arXiv | Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security | Intelligent Personal Assistant&LLM Agent&Security and Privacy |
| 24.01 | Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group | arXiv | Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems | Safety&Risk Taxonomy&Mitigation Strategies |
| 24.01 | Google Research | arXiv | Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | Interpretability |
| 24.01 | Ben-Gurion University of the Negev, Israel | arXiv | GPT in Sheep’s Clothing: The Risk of Customized GPTs | GPTs&Cybersecurity&ChatGPT |
| 24.01 | Shanghai Jiao Tong University | arXiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.01 | Ant Group | arXiv | A Fast, Performant, Secure Distributed Training Framework for LLM | Distributed LLM&Security |
| 24.01 | Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China | arXiv | PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety | Multi-agent Systems&Agent Psychology&Safety |
| 24.01 | Rochester Institute of Technology | arXiv | Mitigating Security Threats in LLMs | Security Threats&Prompt Injection&Jailbreaking |
| 24.01 | Johns Hopkins University, University of Pennsylvania, Ohio State University | arXiv | The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts | Multilingualism&Safety&Resource Disparity |
| 24.01 | University of Florida | arXiv | Adaptive Text Watermark for Large Language Models | Text Watermarking&Robustness&Security |
| 24.01 | The Hebrew University | arXiv | Tradeoffs Between Alignment and Helpfulness in Language Models | Language Model Alignment&AI Safety&Representation Engineering |
| 24.01 | Google Research, Anthropic | arXiv | Gradient-Based Language Model Red Teaming | Red Teaming&Safety&Prompt Learning |
| 24.01 | National University of Singapore, Pennsylvania State University | arXiv | Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code | Watermarking&Error Correction Code&AI Ethics |
| 24.01 | Tsinghua University, University of California Los Angeles, WeChat AI, Tencent Inc. | arXiv | Prompt-Driven LLM Safeguarding via Directed Representation Optimization | Safety Prompts&Representation Optimization |
| 24.02 | Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research | arXiv | Adaptive Primal-Dual Method for Safe Reinforcement Learning | Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates |
| 24.02 | Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute | arXiv | No More Trade-Offs: GPT and Fully Informative Privacy Policies | ChatGPT&Privacy Policies&Legal Requirements |
| 24.02 | Florida International University | arXiv | Security and Privacy Challenges of Large Language Models: A Survey | Security&Privacy Challenges&Survey |
| 24.02 | Rutgers University, University of California, Santa Barbara, NEC Labs America | arXiv | TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | LLM-based Agents&Safety&Trustworthiness |
| 24.02 | University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research | arXiv | Shadowcast: Stealthy Data Poisoning Attacks against VLMs | Vision-Language Models&Data Poisoning&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arXiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Fudan University | arXiv | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Tool Learning&Large Language Models (LLMs)&Safety Issues&ToolSword |
| 24.02 | Paul G. Allen School of Computer Science & Engineering, University of Washington | arXiv | SPML: A DSL for Defending Language Models Against Prompt Attacks | Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML) |
| 24.02 | Tsinghua University | arXiv | ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors | Safety Detectors&Customizable&Explainable |
| 24.02 | Dalhousie University | arXiv | Immunization Against Harmful Fine-tuning Attacks | Fine-tuning Attacks&Immunization |
| 24.02 | Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group | arXiv | SoFA: Shielded On-the-fly Alignment via Priority Rule Following | Priority Rule Following&Alignment |
| 24.02 | Universidade Federal de Santa Catarina | arXiv | A Survey of Large Language Models in Cybersecurity | Cybersecurity&Vulnerability Assessment |
| 24.02 | Zhejiang University | arXiv | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Prompt Reverse Stealing Attacks&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory | NAACL 2024 | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | Large Language Models&Conversation Safety&Survey |
| 24.03 | Tulane University | arXiv | Enhancing LLM Safety via Constrained Direct Preference Optimization | Reinforcement Learning&Human Feedback&Safety Constraints |
| 24.03 | University of Illinois Urbana-Champaign | arXiv | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Tool Integration&Security&Indirect Prompt Injection |
| 24.03 | Harvard University | arXiv | Towards Safe and Aligned Large Language Models for Medicine | Medical Safety&Alignment&Ethical Principles |
| 24.03 | Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab | arXiv | Aligners: Decoupling LLMs and Alignment | Alignment&Synthetic Data |
| 24.03 | MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, EleutherAI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC | arXiv | A Safe Harbor for AI Evaluation and Red Teaming | AI Evaluation&Red Teaming&Safe Harbor |
| 24.03 | University of Southern California | arXiv | Logits of API-Protected LLMs Leak Proprietary Information | API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection |
| 24.03 | University of Bristol | arXiv | Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention | Safety&Prompt Engineering |
| 24.03 | Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia | arXiv | Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models | Safety&Guidelines&Alignment |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arXiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.03 | Center for Cybersecurity Systems and Networks, AIShield, Bosch Global Software Technologies, Bengaluru, India | arXiv | Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal | LLM Security&Threat Modeling&Risk Assessment |
| 24.03 | Queen’s University Belfast | arXiv | AI Safety: Necessary but insufficient and possibly problematic | AI Safety&Transparency&Structural Harm |
| 24.04 | Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology | arXiv | Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs | Dialectical Alignment&3H Principle&Security Threats |
| 24.04 | LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI | arXiv | Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models | Red Teaming&Safety |
| 24.04 | University of California, Santa Barbara, Meta AI | arXiv | Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models | Safety&Helpfulness&Controllability |
| 24.04 | School of Information and Software Engineering, University of Electronic Science and Technology of China | arXiv | Exploring Backdoor Vulnerabilities of Chat Models | Backdoor Attacks&Chat Models&Security |
| 24.04 | Enkrypt AI | arXiv | Increased LLM Vulnerabilities from Fine-tuning and Quantization | Fine-tuning&Quantization&LLM Vulnerabilities |
| 24.04 | Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory | arXiv | Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Multimodal Large Language Models&Security Vulnerabilities&Image Inputs |
| 24.04 | University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI | arXiv | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge | AI-Assisted Red-Teaming&Multicultural Knowledge |
| 24.04 | Nanjing University | DLSP 2024 | Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts | Jailbreak&Subtoxic Questions&GAC Model |
| 24.04 | Innodata | arXiv | Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations | Evaluation&Safety |
| 24.04 | University of Cambridge, New York University, ETH Zurich | arXiv | Foundational Challenges in Assuring Alignment and Safety of Large Language Models | Alignment&Safety |
| 24.04 | Zhejiang University | arXiv | TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment | Intellectual Property Protection&Edge-deployed Transformer Model |
| 24.04 | Harvard University | arXiv | More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness | Reinforcement Learning from Human Feedback&Trustworthiness |
| 24.05 | University of Maryland | arXiv | Constrained Decoding for Secure Code Generation | Code Generation&Code LLM&Secure Code Generation&AI Safety |
| 24.05 | Huazhong University of Science and Technology | arXiv | Large Language Models for Cyber Security: A Systematic Literature Review | Cybersecurity&Systematic Review |
| 24.04 | CSIRO’s Data61 | ACM International Conference on AI-powered Software | An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping | AI Safety&Evaluation Framework&AI Lifecycle Mapping |
| 24.05 | CSAIL and CBMM, MIT | arXiv | SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data | SecureLLM&Compositionality |
| 24.05 | Carnegie Mellon University | arXiv | Human–AI Safety: A Descendant of Generative AI and Control Systems Safety | Human–AI Safety&Generative AI |
| 24.05 | University of York | arXiv | Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding | Safe Reinforcement Learning&Black-Box Environments&Adaptive Shielding |
| 24.05 | Princeton University | arXiv | AI Risk Management Should Incorporate Both Safety and Security | AI Safety&AI Security&Risk Management |
| 24.05 | University of Oslo | arXiv | AI Safety: A Climb to Armageddon? | AI Safety&Existential Risk&AI Governance |
| 24.06 | Zscaler, Inc. | arXiv | Exploring Vulnerabilities and Protections in Large Language Models: A Survey | Prompt Hacking&Adversarial Attacks&Survey |
| 24.06 | Texas A&M University-San Antonio | arXiv | Transforming Computer Security and Public Trust Through the Exploration of Fine-Tuning Large Language Models | Fine-Tuning&Cyber Security |
| 24.06 | Alibaba Group | arXiv | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | LLM Safety&Alignment&Jailbreak |
| 24.06 | UC Davis | arXiv | Security of AI Agents | Security&AI Agents&Vulnerabilities |
| 24.06 | University of Connecticut | USENIX Security ‘24 | An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection | Backdoor Attack&Code Completion Models&Vulnerability Detection |
| 24.06 | University of California, Irvine | arXiv | TorchOpera: A Compound AI System for LLM Safety | TorchOpera&LLM Safety&Compound AI System |
| 24.06 | NVIDIA Corporation | arXiv | garak: A Framework for Security Probing Large Language Models | garak&Security Probing |
| 24.06 | Carnegie Mellon University | arXiv | Current State of LLM Risks and AI Guardrails | LLM Risks&AI Guardrails |
| 24.06 | Johns Hopkins University | arXiv | Every Language Counts: Learn and Unlearn in Multilingual LLMs | Multilingual LLMs&Fake Information&Unlearning |
| 24.06 | Tsinghua University | arXiv | Finding Safety Neurons in Large Language Models | Safety Neurons&Mechanistic Interpretability&AI Safety |
| 24.06 | Center for AI Safety and Governance, Institute for AI, Peking University | arXiv | SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Safety Alignment&Text2Video Generation |
| 24.06 | Samsung R&D Institute UK, KAUST, University of Oxford | arXiv | Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Model Merging&Safety Alignment |
| 24.06 | Hofstra University | arXiv | Analyzing Multi-Head Attention on Trojan BERT Models | Trojan Attack&BERT Models&Multi-Head Attention |
| 24.06 | Fudan University | arXiv | SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Safety Alignment&Jailbreak Attacks&Response Disparity |
| 24.06 | Stony Brook University | NAACL 2024 Workshop | Automated Adversarial Discovery for Safety Classifiers | Safety Classifiers&Adversarial Attacks&Toxicity |
| 24.07 | University of Utah | arXiv | Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression | Model Compression&Safety Evaluation |
| 24.07 | University of Alberta | arXiv | Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | Multilingual Blending&LLM Safety Alignment&Language Mixture |
| 24.07 | Singapore National Eye Centre | arXiv | A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models – Safety, Consensus, Objectivity, Reproducibility and Explainability | Evaluation Framework |
| 24.07 | Microsoft | arXiv | SLIP: Securing LLM’s IP Using Weights Decomposition | Hybrid Inference&Model Security&Weights Decomposition |
| 24.07 | Microsoft | arXiv | Phi-3 Safety Post-Training: Aligning Language Models with a “Break-Fix” Cycle | Phi-3&Safety Post-Training |
| 24.07 | Tsinghua University | arXiv | Course-Correction: Safety Alignment Using Synthetic Preferences | Course-Correction&Safety Alignment&Synthetic Preferences |
| 24.07 | Northwestern University | arXiv | From Sands to Mansions: Enabling Automatic Full-Life-Cycle Cyberattack Construction with LLM | Cyberattack Construction&Full-Life-Cycle |
| 24.07 | Singapore University of Technology and Design | arXiv | AI Safety in Generative AI Large Language Models: A Survey | Generative AI&AI Safety |
| 24.07 | Lehigh University | arXiv | Blockchain for Large Language Model Security and Safety: A Holistic Survey | Blockchain&Security&Safety |
| 24.08 | OpenAI | OpenAI | Rule-Based Rewards for Language Model Safety | Reinforcement Learning&Safety&Rule-Based Rewards |
| 24.08 | University of Texas at Austin | arXiv | Hide and Seek: Fingerprinting Large Language Models with Evolutionary Learning | Model Fingerprinting&In-context Learning |
| 24.08 | Technical University of Munich | arXiv | Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study | Secure Code Assessment&Vulnerability Detection |
| 24.08 | Offenburg University of Applied Sciences | arXiv | "You still have to study" - On the Security of LLM generated code | Code Security&Prompting Techniques |
| 24.08 | University of Connecticut | arXiv | Clip2Safety: A Vision Language Model for Interpretable and Fine-Grained Detection of Safety Compliance in Diverse Workplaces | Vision Language Model&Safety Compliance&Personal Protective Equipment Detection |
| 24.08 | Pabna University of Science and Technology | arXiv | Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey | Privacy&Bias&Interpretability |
| 24.08 | Quinnipiac University | arXiv | Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks | Generative AI&Cybersecurity&Cyber Attacks |
| 24.08 | Nanyang Technological University | arXiv | Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations | AI Safety&Trustworthy&Responsible |
| 24.08 | King Abdullah University of Science and Technology | arXiv | Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models | Safety&Helpfulness&LLM Alignment |
| 24.08 | University of Calgary | arXiv | Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems | Trustworthy AI&Algorithmic Bias&Responsible AI |
| 24.08 | University of Oxford | arXiv | AI Security Audits: Challenges and Innovations in Assessing Large Language Models | AI Security Audits&Vulnerability Assessment&AI Ethics |
| 24.08 | University of Science and Technology of China | arXiv | Safety Layers of Aligned Large Language Models: The Key to LLM Security | Aligned LLM&Safety Layers&Security Degradation |
| 24.09 | University of Texas at San Antonio | arXiv | Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs | Source Code Security&LLMs&Reinforcement Learning |
| 24.09 | The Hong Kong Polytechnic University | arXiv | Alignment-Aware Model Extraction Attacks on Large Language Models | Model Extraction Attacks&LLM Alignment&Watermark Resistance |
| 24.09 | University of Oxford, Redwood Research | arXiv | Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | AI Control&Safety Protocols&Game Theory |
| 24.09 | University of Galway | ECAI AIEB Workshop | Ethical AI Governance: Methods for Evaluating Trustworthy AI | Trustworthy AI&Ethics&AI Evaluation |
| 24.09 | University of Texas at San Antonio | arXiv | AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | Multi-Agent Systems&Code Security&Fuzz Testing&Static Analysis |
| 24.09 | Tsinghua University | arXiv | Language Models Learn to Mislead Humans via RLHF | Reinforcement Learning from Human Feedback (RLHF)&U-SOPHISTRY&Misleading AI |
| 24.09 | Stevens Institute of Technology | arXiv | Measuring Copyright Risks of Large Language Model via Partial Information Probing | Copyright&Partial Information Probing |
| 24.09 | IBM Research | arXiv | Attack Atlas: A Practitioner’s Perspective on Challenges and Pitfalls in Red Teaming GenAI | Red Teaming&LLM Security&Adversarial Attacks |
| 24.09 | Pengcheng Laboratory | arXiv | Multi-Designated Detector Watermarking for Language Models | Watermarking&Claimability&Multi-designated Verifier Signature |
| 24.09 | ETH Zurich | arXiv | An Adversarial Perspective on Machine Unlearning for AI Safety | Machine Unlearning&Adversarial Attacks&Unlearning Robustness |
| 24.10 | Google DeepMind | arXiv | A Watermark for Black-Box Language Models | Watermarking&Black-Box Models&LLM Detection |
| 24.10 | Mohamed Bin Zayed University of Artificial Intelligence | arXiv | Optimizing Adaptive Attacks Against Content Watermarks for Language Models | Watermarking&Adaptive Attacks&LLM Security |
| 24.10 | Rice University, Rutgers University | arXiv | Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion | Taylor Expansion&Model Security |
| 24.10 | PeopleTec | arXiv | Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders | Cybersecurity&Hallucinations |
| 24.10 | Fondazione Bruno Kessler, Université Côte d’Azur | EMNLP 2024 | Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering | Counterspeech&Safety Guardrails |
| 24.10 | University of California, Davis, AWS AI Labs | arXiv | Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Safety Alignment&Vision-Language Models&Cross-modality Representation Manipulation |
| 24.10 | North Carolina State University | arXiv | Superficial Safety Alignment Hypothesis: The Need for Efficient and Robust Safety Mechanisms in LLMs | Superficial Safety Alignment&Safety Mechanisms&Safety-critical Components |
| 24.10 | Shanghai Jiao Tong University, Chinese University of Hong Kong (Shenzhen), Tsinghua University | arXiv | Achilles’ Heel in Semi-Open LLMs: Hiding Bottom against Recovery Attacks | Semi-open LLMs&Recovery Attacks&Model Resilience |
| 24.10 | University of Tulsa | arXiv | Weak-to-Strong Generalization beyond Accuracy: A Pilot Study in Safety, Toxicity, and Legal Reasoning | Weak-to-Strong Generalization&Safety&Toxicity |
| 24.10 | Aalborg University | arXiv | Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis | Language Confusion&Multilingual LLMs&Security Vulnerabilities |
| 24.10 | Carnegie Mellon University | arXiv | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | LLM Safety&Browser Agents&Red Teaming |
| 24.10 | Palisade Research | arXiv | LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild | LLM Agents&Honeypots&Cybersecurity |
| 24.10 | University of Pittsburgh | arXiv | Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents | Embodied Agents&Multimodal Safety&Active Learning |
| 24.10 | CSIRO’s Data61 | arXiv | From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting | Secure Code Generation&Encouragement Prompting |
| 24.10 | AppCubic | arXiv | Jailbreaking and Mitigation of Vulnerabilities in Large Language Models | Prompt Injection&Jailbreaking&AI Security |
| 24.10 | UC Berkeley | arXiv | SafetyAnalyst: Interpretable, Transparent, and Steerable LLM Safety Moderation | LLM Safety&Interpretability&Content Moderation |
| 24.10 | ShanghaiTech University | arXiv | Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Safety Alignment&Reinforcement Learning&Policy Optimization |
| 24.11 | Zhejiang University | arXiv | Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control | Trustworthiness&Sparse Activation Control&Representation Control |
| 24.11 | University of California, Riverside | arXiv | Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | Vision-Language Models&Safety Alignment&Cross-Layer Vulnerability |
| 24.11 | National University of Singapore | EMNLP 2024 | Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models | Multi-expert Prompting&LLM Safety&Reliability&Usefulness |
| 24.11 | OpenAI | NeurIPS 2024 | Rule-Based Rewards for Language Model Safety | Rule-Based Rewards&Safety Alignment&AI Feedback |
| 24.11 | Center for Automation and Robotics, Spanish National Research Council | arXiv | Can Adversarial Attacks by Large Language Models Be Attributed? | Adversarial Attribution&LLM Security&Formal Language Theory |
| 24.11 | McGill University | arXiv | Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset | Helpful and Harmless Dataset&Safety Trade-offs&Bias Analysis |
| 24.11 | Fudan University | arXiv | Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding | Text-to-Image Generation&Safety&Prompt Embedding Sanitization |
| 24.11 | Meta | arXiv | Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations | Multimodal LLM&Content Moderation&Adversarial Robustness |
| 24.11 | Columbia University | arXiv | When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations | Backdoor Attacks&Explainability |
| 24.11 | Ben-Gurion University of the Negev | arXiv | The Information Security Awareness of Large Language Models | Information Security Awareness&Benchmarking |
| 24.11 | Fordham University | arXiv | Next-Generation Phishing: How LLM Agents Empower Cyber Attackers | Phishing Detection&Cybersecurity |
| 24.12 | UC Berkeley | arXiv | Trust & Safety of LLMs and LLMs in Trust & Safety | Trust and Safety&Prompt Injection |
| 24.12 | Harvard Kennedy School, Avant Research Group | arXiv | Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects | Phishing Attacks&Human-in-the-loop |
| 24.12 | University of Massachusetts | arXiv | Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness | Instruction Tuning&Safety&Helpfulness |
| 24.11 | University of Pennsylvania, IBM T.J. Watson Research Center | arXiv | Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models | Cyber-Attack Classification&Two-Stage Training |
| 24.12 | University of New South Wales | arXiv | How Can LLMs and Knowledge Graphs Contribute to Robot Safety? A Few-Shot Learning Approach | Robot Safety&Few-Shot Learning&Knowledge Graph Prompting |
| 24.12 | Örebro University | arXiv | Large Language Models and Code Security: A Systematic Literature Review | LLM-Generated Code&Vulnerability Detection&Data Poisoning Attacks |
| 24.12 | Algiers Research Institute | arXiv | On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs | Adversarial Attacks&Vulnerability Metrics&Risk Assessment |
| 24.12 | Alan Turing Institute | arXiv | SoK: Mind the Gap—On Closing the Applicability Gap in Automated Vulnerability Detection | Automated Vulnerability Detection&Applicability Gap&Software Security |
| 25.01 | Meta | arXiv | MLLM-as-a-Judge for Image Safety without Human Labeling | Image Safety&Zero-Shot Judgment&Multimodal Large Language Models |
| 25.01 | FAU Erlangen-Nürnberg | arXiv | Refusal Behavior in Large Language Models: A Nonlinear Perspective | Refusal Behavior&Mechanistic Interpretability&AI Alignment |
| 25.01 | University of Waterloo | arXiv | Advanced Real-Time Fraud Detection Using RAG-Based LLMs | Fraud Detection&Retrieval-Augmented Generation&Real-Time AI Security |
| 25.01 | Mondragon University, University of Seville | arXiv | Early External Safety Testing of OpenAI’s O3-Mini: Insights from Pre-Deployment Evaluation | LLM Safety Testing&OpenAI O3-Mini |
| 25.02 | Nanyang Technological University | arXiv | Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning | LLM Alignment&Instruction Tuning&Reward Models |
| 25.02 | University of Bristol | arXiv | The Dark Deep Side of DeepSeek: Fine-Tuning Attacks Against the Safety Alignment of CoT-Enabled Models | Chain of Thought&Fine-Tuning Attack&LLM Safety |
| 25.02 | Marburg University | arXiv | Editing Large Language Models Poses Serious Safety Risks | Knowledge Editing&LLM Security Risks&Adversarial Manipulation |
| 25.02 | Technical University of Munich | AAAI 2025 | Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment | Medical Multimodal Models&Model Stealing&Adversarial Domain Alignment |
| 25.02 | Georgia Institute of Technology | arXiv | Enhancing Phishing Email Identification with Large Language Models | Phishing Detection&Cybersecurity |
| 25.02 | Fudan University | arXiv | Safety at Scale: A Comprehensive Survey of Large Model Safety | Large Model Safety&AI Security&Adversarial Attacks |
| 25.02 | University of Maryland | arXiv | Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Model Tampering Attacks&LLM Security&Adversarial Robustness |
| 25.02 | Penn State University | arXiv | Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet | Harmfulness Ranking&LLM Evaluation&AI Safety |
| 25.02 | Peking University | arXiv | Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning | LLM Safety&Reasoning Trade-off&Fine-Tuning |
| 25.02 | City University of Hong Kong | arXiv | The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | LLM Alignment&Safety Fine-Tuning&Jailbreak Attacks |
| 25.02 | Tsinghua University | arXiv | “Nuclear Deployed!”: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents | Autonomous LLM Agents&Catastrophic Risks&Decision-making |
| 25.02 | University of Washington | arXiv | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | LLM Safety&Chain-of-Thought Reasoning&Model Alignment |
| 25.02 | University of California, Santa Cruz | arXiv | The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | Large Reasoning Models&Safety Assessment&Adversarial Attacks |
| 25.02 | 34 Affiliates | arXiv | On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective | Safety Assessment&Guideline Paper |
| 25.02 | University of Cambridge | arXiv | Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection | Hateful Meme Detection&Multimodal Models&Contrastive Learning |
| 25.02 | Cooperative AI Foundation | arXiv | Multi-Agent Risks from Advanced AI | Multi-Agent Systems&AI Risk&AI Governance |
| 25.02 | Apart Research, University of Science and Technology of Hanoi | AAAI 2025 Workshop on Theory of Mind | A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks | Theory of Mind&AI Safety |
| 25.02 | Truthful AI, University College London | arXiv | Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | LLM Alignment&Fine-tuning Risks&Emergent Misalignment |
| 25.02 | Wuhan University | arXiv | A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations | LVLM Safety&Adversarial Attacks&Defense Mechanisms |
| 25.02 | Clark Atlanta University | arXiv | SoK: Exploring Hallucinations and Security Risks in AI-Assisted Software Development with Insights for LLM Deployment | Hallucinations&Security Risks&AI-Assisted Software Development |
| 25.02 | Stony Brook University, Michigan State University | arXiv | Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots | Cyber Threat Intelligence&Large Language Models&Threat Detection |
| 25.03 | HydroX AI | arXiv | Output Length Effect on DeepSeek-R1’s Safety in Forced Thinking | Output Length&LLM Safety&Forced Thinking |
| 25.03 | Tampere University | arXiv | Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to Practice | Trustworthiness&AI Ethics |
| 25.03 | University of California, Santa Barbara | arXiv | Graphormer-Guided Task Planning: Beyond Static Rules with LLM Safety Perception | LLM Planning&Graphormer&Risk-Aware Robotics |
| 25.03 | University of Pennsylvania | arXiv | Safety Guardrails for LLM-Enabled Robots | LLM-enabled Robotics&Jailbreaking Defense&Formal Safety Guarantees |
| 25.03 | Peking University | arXiv | Life-Cycle Routing Vulnerabilities of LLM Router | LLM Router&Adversarial Attack&Backdoor Attack |
| 25.03 | Squirrel AI Learning | arXiv | A Survey on Trustworthy LLM Agents: Threats and Countermeasures | Trustworthy Agent&LLM-based Agents&Multi-Agent System |
| 25.03 | Cornell Tech | arXiv | Multi-Agent Systems Execute Arbitrary Malicious Code | Multi-Agent Systems&Control-Flow Hijacking&Arbitrary Code Execution |
| 25.03 | University of Utah | arXiv | A Comprehensive Study of LLM Secure Code Generation | Secure Code Generation&Vulnerability Scanning&Functionality Evaluation |
| 25.03 | University of Minnesota | arXiv | Safety Aware Task Planning via Large Language Models in Robotics | LLM Robotics Planning&Safety-Aware Framework&Control Barrier Functions |
| 25.03 | Peking University, Zhongguancun Lab, Tsinghua University | arXiv | Large Language Models powered Network Attack Detection: Architecture, Opportunities and Case Study | Network Security&LLM for Security&Anomaly Detection |
| 25.03 | Aim Intelligence, Yonsei University, Seoul National University | arXiv | sudo rm -rf agentic_security | Agent Security&Multimodal Jailbreak&LLM Agent Exploitation |
| 25.03 | Georgia Institute of Technology, IMT Mines Albi | arXiv | Leveraging Large Language Models for Risk Assessment in Hyperconnected Logistic Hub Network Deployment | Risk Assessment&LLMs for Logistics&Supply Chain Resilience |
| 25.04 | University of Twente | arXiv | Safety and Security Risk Mitigation in Satellite Missions via Attack-Fault-Defense Trees | Cyber-Physical Systems&Attack-Fault-Defense Trees&Satellite Ground Segment |
| 25.04 | Google DeepMind | arXiv | An Approach to Technical AGI Safety and Security | AGI Safety&Misalignment Mitigation&Capability Control |
| 25.04 | Earlham College | ISDFS 2025 | Debate-Driven Multi-Agent LLMs for Phishing Email Detection | Phishing Detection&Multi-Agent LLMs&Debate Framework |
| 25.04 | Indian Institute of Technology Kanpur | MSR 2025 | MaLAware: Automating the Comprehension of Malicious Software Behaviours using Large Language Models (LLMs) | Malware Analysis&Behavior Explanation&LLMs for Cybersecurity |
| 25.04 | Leidos | arXiv | MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits | Model Context Protocol&Security Audit&Agentic LLM Exploits |
| 25.04 | Norwegian University of Science and Technology | arXiv | An LLM Framework For Cryptography Over Chat Channels | LLMs&Cryptography&Steganography |
| 25.04 | Peking University | arXiv | SaRO: Enhancing LLM Safety through Reasoning-based Alignment | Safety Alignment&Reasoning-based Alignment&LLMs |
| 25.04 | Johns Hopkins University | arXiv | An Investigation of Large Language Models and Their Vulnerabilities in Spam Detection | Spam Detection&Adversarial Attack&Data Poisoning |
| 25.04 | TU Wien | arXiv | Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design | Offensive Security&Benchmarking&LLM Penetration Testing |
| 25.04 | Fraunhofer Institute for Cognitive Systems IKS | arXiv | Towards Automated Safety Requirements Derivation Using Agent-based RAG | Agent-based RAG&Safety Requirements Derivation&Autonomous Driving |
| 25.04 | Nanjing University | arXiv | Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask | LLM-based Vulnerability Detection&Contextual Reasoning&Benchmark Evaluation |
| 25.04 | Nanyang Technological University | arXiv | A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment | LLM Safety&LLM Lifecycle&Agent Alignment |
| 25.04 | Arab American University | arXiv | Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends | Traffic Safety&Multimodal Large Language Models&ADAS |
| 25.04 | National University of Singapore | arXiv | Safety in Large Reasoning Models: A Survey | Large Reasoning Models&Safety Taxonomy&Adversarial Attacks |
| 25.04 | Amazon Web Services | arXiv | Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents | Agentic AI Security&Threat Modeling&Mitigation Framework |
| 25.04 | Alibaba Group | NAACL 2025 | DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Multimodal LLM&Safety Alignment&Risk Disentanglement |
| 25.04 | University of Maryland | NAACL 2025 | RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models | RAG&Safety Alignment&Red Teaming |
| 25.05 | University of Granada | arXiv | LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures | Large Language Models&Security&Defense Mechanisms |
| 25.05 | University of North Carolina at Chapel Hill | Transactions on Machine Learning Research | Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation | Multimodal LLMs&Information Unlearning&Security Evaluation |
| 25.05 | University of Oxford | arXiv | Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents | Multi-Agent Systems&AI Security&Emergent Threats |
| 25.05 | Huazhong University of Science and Technology | arXiv | Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study | LLM Deployment&Security Analysis&Empirical Study |
| 25.05 | Rutgers University | arXiv | Aligning Large Language Models with Healthcare Stakeholders: A Pathway to Trustworthy AI Integration | Healthcare&Alignment&Large Language Models |
| 25.05 | Metropolia University of Applied Sciences | arXiv | A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient | LLM Safety&Ethical Evaluation&Danger Coefficient |
| 25.05 | University of Kent | arXiv | Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | Safety Alignment&Pseudo-Malicious Data&Cybersecurity LLMs |
| 25.05 | Carnegie Mellon University | arXiv | A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron? | Computer-Using Agents&Security Threats&Safety Benchmarks |
| 25.05 | NYU Tandon | arXiv | MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models | RTL Security&Multi-Agent Systems&LLM for Hardware Verification |
| 25.05 | Jerusalem College of Technology | arXiv | Proposal for Improving Google A2A Protocol: Safeguarding Sensitive Data in Multi-Agent Systems | A2A Protocol&Sensitive Data Protection&Multi-Agent Security |
| 25.05 | Huazhong University of Science and Technology | arXiv | From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents | Mobile LLM Agents&Security Risks&AgentScan |
| 25.05 | Mohamed bin Zayed University of Artificial Intelligence | arXiv | Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | Safety Alignment&Subspace Geometry&Fine-Tuning Vulnerability |
| 25.05 | Amazon Web Services | arXiv | From nuclear safety to LLM security: Applying non-probabilistic risk management strategies to build safe and secure LLM-powered systems | Risk Management&LLM Security&Non-Probabilistic Strategies |
| 25.05 | Infinite Optimization AI Lab | arXiv | Security Concerns for Large Language Models: A Survey | LLM Security&Prompt Injection&Autonomous Agents |
| 25.05 | Nanyang Technological University | arXiv | Understanding Refusal in Language Models with Sparse Autoencoders | Refusal&Sparse Autoencoder&LLM Safety |
| 25.05 | Seoul National University | arXiv | Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems | Multi-agent LLM&Cross-domain Security&Threat Modeling |
| 25.05 | University of Washington | arXiv | OmniGuard: An Efficient Approach for AI Safety Moderation Across Modalities | AI Safety&Multimodal Moderation&Universal Representation |
| 25.06 | Tsinghua University | arXiv | The Security Threat of Compressed Projectors in Large Vision-Language Models | Vision-Language Model&Compressed Projector&Adversarial Attack |
| 25.06 | Michigan State University | arXiv | Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS | LLM-MAS&Vulnerability Analysis&Trustworthy AI |
| 25.06 | Singapore Management University | arXiv | Which Factors Make Code LLMs More Vulnerable to Backdoor Attacks? A Systematic Study | Code LLM&Backdoor Attack&Adversarial Robustness |
| 25.06 | University of Science and Technology of China | arXiv | SecNeuron: Reliable and Flexible Abuse Control in Local LLMs via Hybrid Neuron Encryption | Local LLM&Abuse Control&Neuron Encryption |
| 25.06 | Dartmouth College | arXiv | Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | LLM Safety&Alignment Robustness&Representation Similarity |
| 25.06 | Georgia Tech | arXiv | Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety | Interpretation&LLM Safety&Survey |
| 25.06 | George Mason University | ICML 2025 | StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models | Watermark&LLM&Stealthy&Multi-bit |
| 25.06 | SUNY-Albany, New Jersey Institute of Technology, Microsoft, Kent State University, University of Florida | arXiv | SoK: Are Watermarks in LLMs Ready for Deployment? | Watermark&LLM&Model Stealing&IP Protection |
| 25.06 | Tsinghua University, Apple, Beijing University of Posts and Telecommunications | arXiv | Enhancing Watermarking Quality for LLMs via Contextual Generation States Awareness | Watermarking&LLM&Generation Quality&Context Awareness |
| 25.06 | ShanghaiTech University, Sun Yat-sen University | arXiv | Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment | Multimodal RAG&Copyright Protection&Watermarking&Image Knowledge&Retrieval-Augmented Generation |
| 25.06 | University of Applied Sciences Northwestern Switzerland | arXiv | Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications | Multi-Agent System&LLM&Reasoning&Cybersecurity&Specification |
| 25.06 | Sungkyunkwan University, Microsoft Research Asia | ACL 2025 | Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights | Value Alignment&LLM Safety&Personalization&Harmful Behavior&Psychological Analysis |
| 25.06 | Virelya Intelligence Research Labs | arXiv | Risks & Benefits of LLMs & GenAI for Platform Integrity, Healthcare Diagnostics, Cybersecurity, Privacy & AI Safety: A Comprehensive Survey, Roadmap & Implementation Blueprint for Automated Review, Compliance Assurance, Moderation, Abuse & Fraud Detection, App Security, and Trust in Digital Ecosystems | Large Language Models&Generative AI&Platform Integrity&Cybersecurity&Compliance |
| 25.06 | University of Pennsylvania | arXiv | A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions | Programming Language&LLM Agents&Code Actions&Security&Parallelization |
| 25.06 | Universitas Muhammadiyah Surakarta | arXiv | Using LLMs for Security Advisory Investigations: How Far Are We? | Security Advisory&CVE ID&LLMs&Hallucination&Reliability |
| 25.06 | University of Texas at El Paso | arXiv | Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability | Phishing Detection&Large Language Models&Explainability&Self-Consistency&Fine-Tuning |
| 25.06 | Universiti Sains Malaysia | arXiv | PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection | Phishing Website Detection&Large Language Model&Multi-Agent System&Debate Framework&Explainability |
| 25.06 | NTT | arXiv | Towards Safety Evaluations of Theory of Mind in Large Language Models | Theory of Mind&LLM Safety&Evaluation |
| 25.06 | The Ohio State University | arXiv | AI Safety vs. AI Security: Demystifying the Distinction and Boundaries | AI Safety&AI Security&Risk Management |
| 25.06 | Zhejiang University | arXiv | A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures | LLM-Driven Agents&Agent Communication&Security Risks |
| 25.07 | University of Science and Technology of China, Douyin Co., Ltd. | arXiv | SAFER: Probing Safety in Reward Models with Sparse Autoencoder | Reward Model&Interpretability&Safety |
| 25.07 | Princeton University | arXiv | Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs | Alignment Erosion&Fine-Tuning&Safety |
| 25.07 | AI Risk and Vulnerability Alliance | arXiv | Red Teaming AI Red Teaming | Red Teaming&AI Security&Sociotechnical |
| 25.07 | Shandong University | arXiv | We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems | MCP Security&API Measurement&Privilege |
| 25.07 | University College London | arXiv | Emergent Misalignment as Prompt Sensitivity: A Research Note | Misalignment&Prompt Sensitivity&Finetuning |
| 25.07 | Ludwig-Maximilians-Universität München | arXiv | On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment | Alignment&Filtering&Intractability |
| 25.07 | University of Wisconsin-Madison | arXiv | Prompt-level Watermarking is Provably Impossible | Watermarking&Prompt Injection&Impossibility |
| 25.07 | UK AI Security Institute | arXiv | Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety | Chain of Thought&Monitorability&Safety |
| 25.07 | Northeastern University | arXiv | LLMs Encode Harmfulness and Refusal Separately | Harmfulness&Refusal&Safety |
| 25.07 | Aymara | arXiv | Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix | Safety Evaluation&LLM&Benchmark |
| 25.07 | Shanghai Artificial Intelligence Laboratory | arXiv | Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report | AI Risk&Safety&Benchmark |
| 25.07 | Shanghai Artificial Intelligence Laboratory | arXiv | SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law | Safety&Reinforcement Learning&Multimodal |
| 25.07 | University of Illinois Urbana-Champaign | arXiv | PurpCode: Reasoning for Safer Code Generation | SecureCode&Reasoning&Alignment |
| 25.07 | IBM Research | arXiv | OneShield - the Next Generation of LLM Guardrails | Guardrails&Safety&Compliance |
| 25.08 |
University of Maryland, College Park |
arxiv |
Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation |
Predictive Auditing&LLM APIs&Reasoning Token Count&Token Inflation |
| 25.08 |
Independent Researcher, Arizona State University, University of California, Berkeley |
arxiv |
Measuring Harmfulness of Computer-Using Agents |
Computer-Using Agents&Safety Risks&CUAHarm Benchmark&Language Models |
| 25.08 |
Jimei University, Wenzhou-Kean University, The Hong Kong University of Science and Technology (Guangzhou), New York University, Xiamen University |
arxiv |
A Survey on Data Security in Large Language Models |
Large language model (LLM)&Data security&LLM vulnerabilities&Prompt injection |
| 25.08 |
University of South Florida |
arxiv |
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM |
Harmful Content&LLM Safety&Jailbreak Mitigation |
| 25.08 |
Monash University |
ACM CCS 2025 |
Robust Anomaly Detection in O-RAN: Leveraging LLMs against Data Manipulation Attacks |
O-RAN Security&Anomaly Detection&Data Manipulation Attacks |
| 25.08 |
Zhejiang University |
arxiv |
Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends |
Copyright Protection&Model Fingerprinting&Text Watermarking |
| 25.08 |
Global Center on AI Governance |
arxiv |
Toward an African Agenda for AI Safety |
AI Safety in Africa&Governance&Socio-Technical Risks |
| 25.08 |
PeopleTec, Inc. |
arxiv |
SERVANT, STALKER, PREDATOR: How an Honest, Helpful, and Harmless (3H) Agent Unlocks Adversarial Skills |
Multi-Agent Systems&Service Orchestration&Composite Threats |
| 25.08 |
Nanyang Technological University |
EMNLP 2025 Findings |
Improving Alignment in LVLMs with Debiased Self-Judgment |
LVLM Alignment&Debiased Self-Judgment&Hallucination Mitigation |
| 25.09 |
Zhejiang University |
arxiv |
Web Fraud Attacks Against LLM-Driven Multi-Agent Systems |
Multi-Agent Systems&Web Fraud Attack&Security |
| 25.09 |
Alibaba AAIG |
arxiv |
Oyster-I: Beyond Refusal — Constructive Safety Alignment for Responsible Language Models |
Constructive Safety Alignment&Safety Benchmark&Game-Theoretic Modeling |
| 25.09 |
Kennesaw State University |
IEEE Internet of Things Journal |
A Survey: Towards Privacy and Security in Mobile Large Language Models |
Mobile LLMs&Privacy&Security |
| 25.09 |
University of Wisconsin-Madison |
arxiv |
Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs |
Prompt Injection&Threat Model&LLM Security |
| 25.09 |
Tennessee Tech University, University of Nebraska at Omaha |
arxiv |
Safety and Security Analysis of Large Language Models: Risk Profile and Harm Potential |
Safety and Security&Risk Profiling&Adversarial Prompts |
| 25.09 |
Instituto de Pesquisas Eldorado, SRI International |
arxiv |
LLM in the Middle: A Systematic Review of Threats and Mitigations to Real-World LLM-based Systems |
LLM Security&Threat Modeling&Systematic Review |
| 25.09 |
Alibaba Group, Zhejiang University |
arxiv |
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework |
Sparse Autoencoder&Safety Interpretation&LLM Interpretability |
| 25.09 |
Argonne National Laboratory |
arxiv |
Evaluating the Safety and Skill Reasoning of Large Reasoning Models Under Compute Constraints |
Reasoning Models&Safety Evaluation&Compute Constraints |
| 25.09 |
Chinese Academy of Sciences, Wuhan University, Renmin University of China, Macquarie University, Griffith University, Xiaomi Inc. |
arxiv |
LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions |
LLM-based Agents&Hallucinations&Trustworthiness |
| 25.09 |
Binghamton University, Duke University, University of Alabama at Birmingham |
arxiv |
Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting |
LLM Benchmark&Cybersecurity&Blue Teaming |
| 25.09 |
Universitat Pompeu Fabra |
INLG 2025 |
Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs |
Lexical Simplification&Small LLMs&Safety&Knowledge Distillation |
| 25.10 |
City University of Hong Kong, Johns Hopkins University, George Mason University |
arxiv |
Towards Human-Centered RegTech: Unpacking Professionals' Strategies and Needs for Using LLMs Safely |
RegTech&Human-Centered NLP&Compliance Risk&LLM Safety |
| 25.10 |
University of Mannheim |
arxiv |
A Granular Study of Safety Pretraining under Model Abliteration |
Safety Pretraining&Model Abliteration&Refusal Robustness |
| 25.10 |
University of California, Riverside |
arxiv |
Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection? |
Jailbreak Detection&Internal Representations&Tensor Decomposition |
| 25.10 |
OpenAI & Anthropic & Google DeepMind & ETH Zürich & Northeastern University & HackAPrompt & AI Security Company |
arxiv |
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections |
Adaptive Attacks&LLM Jailbreak Defense&Prompt Injection Robustness |
| 25.10 |
Ruhr-Universität Bochum & Universität Bonn & Lamarr Institute for Machine Learning and Artificial Intelligence |
arxiv |
AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? |
AI Alignment&Failure Modes&Risk Analysis |
| 25.10 |
Harbin Institute of Technology (Shenzhen) & Pengcheng Lab |
arxiv |
GRIDAI: Generating and Repairing Intrusion Detection Rules via Collaboration among Multiple LLM-based Agents |
Intrusion Detection&Rule Generation&Multi-Agent LLM System |
| 25.10 |
University of Massachusetts Amherst & ELLIS Institute Tübingen |
arxiv |
Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies |
Multi-Agent Systems&Security Evaluation&Blackboard Architecture |
| 25.10 |
Shanghai Jiao Tong University |
NeurIPS |
Stop DDoS Attacking the Research Community with AI-Generated Survey Papers |
AI-Generated Surveys&Research Integrity&Scholarly Oversight |
| 25.10 |
LMU Munich & TUM & Oxford & HKU |
NeurIPS Workshop |
Deep Research Brings Deeper Harm |
Deep Research Agents&LLM Safety Alignment&Biosecurity Risks |
| 25.10 |
University of Connecticut, University of Alabama at Birmingham |
arxiv |
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models |
Prompt Security&Jailbreak Taxonomy&Defense Evaluation |
| 25.10 |
École Normale Supérieure (ENS) - Université Paris Sciences et Lettres (PSL), CNRS, Université Sorbonne Nouvelle |
arxiv |
On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text? |
AI-Generated Text Detection&Watermarking&Ethical AI Evaluation |
| 25.11 |
University of Pennsylvania |
arxiv |
Watermarking Discrete Diffusion Language Models |
Discrete Diffusion&Watermarking&Generative Model Security |
| 25.11 |
CEA Paris-Saclay |
arxiv |
Watermarking Large Language Models in Europe: Interpreting the AI Act in Light of Technology |
Watermarking&AI Act&Compliance Evaluation |
| 25.11 |
Shandong University |
AAAI 2026 |
HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection |
Human Language Preference Optimization&Machine-Revised Text Detection&Adversarial Multi-Task Detection |
| 25.11 |
Massachusetts Institute of Technology |
arxiv |
Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming |
Model Context Protocol&Agentic Red Teaming&Command and Control |
| 25.11 |
Zhejiang University |
AAAI 2026 |
Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging |
Model Merging Stealing&LLM IP Protection&Proactive Defense |
| 25.11 |
Beijing Institute of Technology |
arxiv |
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion |
Existential Risk&Prefix Completion&LLM Safety Evaluation |
| 25.11 |
ETH Zurich, Huawei Technologies Switzerland AG |
arxiv |
Can LLMs Make (Personalized) Access Control Decisions? |
Access Control&Personalization&LLM Security |
| 25.11 |
Purdue University, Perplexity AI |
arxiv |
BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents |
Prompt Injection&AI Browser Agents&Benchmarking |
| 25.11 |
University of Tennessee, Sungkyunkwan University |
arxiv |
Supporting Students in Navigating LLM-Generated Insecure Code |
Insecure Code Generation&Cybersecurity Education&Bifröst Framework |
| 25.11 |
Vanta, MintMCP, Darktrace |
arxiv |
Securing the Model Context Protocol (MCP): Risks, Controls, and Governance |
Model Context Protocol&AI Governance&Agent Security |
| 25.11 |
Renmin University of China, Ant Group |
AAAI 2026 |
Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems |
LLM Multi-Agent Systems&Security Risks&Adversarial Defense |
| 25.11 |
Independent Researcher |
arxiv |
Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs |
Healthcare AI Safety&Adversarial Dataset&Context-Aware Alignment |
| 25.11 |
NVIDIA, Lakera AI |
arxiv |
A Safety and Security Framework for Real-World Agentic Systems |
Agentic Systems&Safety and Security Framework&AI Risk Taxonomy |
| 25.12 |
Sun Yat-sen University |
arxiv |
An Empirical Study on the Security Vulnerabilities of GPTs |
GPT Security&Prompt Injection&Tool Misuse |
| 25.12 |
Hiroshima University, The University of Tokyo, National Institute of Informatics |
IEEE ISPA 2025 |
Decentralized Multi-Agent System with Trust-Aware Communication |
Decentralized Multi-Agent Systems&Blockchain Communication&Trust-Aware Protocols |
| 25.12 |
China Telecom (TeleAI), Sichuan University, Peking University |
arxiv |
Aetheria: A Multimodal Interpretable Content Safety Framework Based on Multi-Agent Debate and Collaboration |
Content Safety&Multi-Agent Systems&Interpretable AI&Multimodal Analysis |
| 25.12 |
DEXAI – Icaro Lab, Sapienza University of Rome, Sant’Anna School of Advanced Studies |
arxiv |
Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions |
Multi-Agent Safety&Systemic Risk&Institutional AI |
| 25.12 |
Shandong University, Nanjing University |
arxiv |
“MCP Does Not Stand for Misuse Cryptography Protocol”: Uncovering Cryptographic Misuse in Model Context Protocol at Scale |
Model Context Protocol (MCP)&Cryptographic Misuse Detection&Program Analysis |
| 25.12 |
University of Pennsylvania, Carnegie Mellon University, Columbia University |
arxiv |
MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking |
LLM Watermarking&Model Fine-Tuning&Open-Weight Models |
| 25.12 |
University of Maryland, Oracle Labs, Oracle Health AI |
ML4H 2025 |
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment |
Healthcare AI Assistants&Iterative Alignment&Safety vs Helpfulness |
| 25.12 |
Old Dominion University |
arxiv |
ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications |
Threat Modeling&Agentic AI&Vision-Language Models |
| 25.12 |
Singapore Management University |
arxiv |
SoK: A Comprehensive Causality Analysis Framework for Large Language Model Security |
Causality Analysis&LLM Security&Jailbreak Detection |
| 25.12 |
FAIR at Meta |
arxiv |
Self-Improving AI & Human Co-Improvement for Safer Co-Superintelligence |
Self-Improving AI&Human-AI Collaboration&Co-Superintelligence |
| 25.12 |
University of North Carolina Wilmington |
arxiv |
From Description to Score: Can LLMs Quantify Vulnerabilities? |
Vulnerability Scoring&CVSS&Large Language Models |
| 25.12 |
Beihang University |
arxiv |
SoK: Trust-Authorization Mismatch in LLM Agent Interactions |
LLM Agents&Trust and Authorization&Agent Security |
| 25.12 |
Tribhuvan University |
arxiv |
Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem |
Model Context Protocol&LLM Security&Agentic AI Safety |
| 25.12 |
CISPA Helmholtz Center for Information Security |
NDSS 2026 |
Chasing Shadows: Pitfalls in LLM Security Research |
LLM Security Research&Reproducibility Pitfalls&Evaluation Methodology |
| 25.12 |
Cisco AI Threat and Security Research |
arxiv |
Cisco Integrated AI Security and Safety Framework Report |
AI Security&Threat Taxonomy&Governance |
| 25.12 |
The Beacom College of Computer & Cyber Sciences, Dakota State University |
arxiv |
Quantifying Return on Security Controls in LLM Systems |
Risk Modeling&Security Controls&LLM Safety |
| 25.12 |
National University of Defense Technology |
arxiv |
Exploring the Security Threats of Retriever Backdoors in Retrieval-Augmented Code Generation |
Retriever Backdoors&RAG Security&Code Generation |
| 25.12 |
BITS Pilani |
arxiv |
Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection |
Multi-Vulnerability&LLM Benchmarking&Code Security |
| 26.01 |
Zhejiang University |
arxiv |
RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation |
Domain-Specific Safety&Harmful Prompt Synthesis&Knowledge Graph |
| 26.01 |
Stanford University |
arxiv |
Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness |
Responsible AI&Agent Governance&Transparency |
| 26.01 |
Chinese Academy of Sciences |
arxiv |
Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs |
Secure Scripting&PowerShell&Lightweight Models |
| 26.01 |
Unknown |
arxiv |
SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations |
Prompt Injection&Constitutional AI&Cybersecurity |
| 26.01 |
Xi’an Jiaotong University |
arxiv |
Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models |
Emoticon Confusion&LLM Safety&Robustness |
| 26.01 |
Zhejiang University |
arxiv |
ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models |
Model Fingerprinting&Unlearning&Copyright |
| 26.01 |
Zhejiang University |
arxiv |
DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection |
Model Fingerprinting&IP Protection&Backdoor |
| 26.01 |
Peking University |
arxiv |
ToolSafe: Enhancing Tool Invocation Safety of LLM-based Agents via Proactive Step-level Guardrail and Feedback |
Tool Safety&Agent Guardrails&Prompt Injection |
| 26.01 |
Nanyang Technological University |
arxiv |
Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale |
Agent Security&Supply Chain Risk&Vulnerability Analysis |
| 26.01 |
Ben-Gurion University of the Negev |
arxiv |
AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior |
Agent Governance&Access Control&Execution Flow |
| 26.01 |
Fudan University |
arxiv |
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 |
Safety Evaluation&Multimodal LLM&Adversarial Testing |
| 26.01 |
Fujitsu Research of Europe |
arxiv |
AgenTRIM: Tool Risk Mitigation for Agentic AI |
Agentic AI&Tool Security&Least Privilege |
| 26.01 |
Nanjing University of Aeronautics and Astronautics |
arxiv |
SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment |
LLM Safety&Jailbreak Defense&Adaptive Alignment |
| 26.01 |
Unknown |
arxiv |
Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents |
Model Context Protocol&Prompt Injection&Agent Security |
| 26.01 |
School of Automation, Northwestern Polytechnical University, Xi'an, China |
arxiv |
FNF: Functional Network Fingerprint for Large Language Models |
Model Fingerprinting&Intellectual Property&Functional Networks |
| 26.01 |
University of Science and Technology of China |
arxiv |
Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures |
Misalignment&Persona&Safety |
| 26.02 |
eBay Inc |
arxiv |
ZERO-TRUST RUNTIME VERIFICATION FOR AGENTIC PAYMENT PROTOCOLS: MITIGATING REPLAY AND CONTEXT-BINDING FAILURES IN AP2 |
Agentic Payments&Runtime Verification&Replay Attacks |
| 26.02 |
Huazhong University of Science and Technology |
arxiv |
Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models |
Vulnerability Reasoning&Benchmarking&RLVR |
| 26.02 |
Technical University of Darmstadt |
arxiv |
GoodVibe: Security-by-Vibe for LLM-Based Code Generation |
Code Security&Neuron-Level Tuning&Code Generation |
| 26.02 |
Canadian Institute for Cybersecurity (CIC), University of New Brunswick, New Brunswick, Canada |
arxiv |
Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP |
AI-Agent Protocols&Threat Modeling&MCP |
| 26.02 |
DiSTA, University of Insubria, Italy |
arxiv |
LoRA-based Parameter-Efficient LLMs for Continuous Learning in Edge-based Malware Detection |
Malware Detection&Edge Computing&Continuous Learning |
| 26.02 |
Unknown |
arxiv |
Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy |
Cybersecurity&Agentic AI&Governable Autonomy |
| 26.02 |
University of Luxembourg, Interdisciplinary Center for Security, Reliability, and Trust (SnT), Trustworthy Software Engineering Group (TruX), Luxembourg |
arxiv |
Assessing Spear-Phishing Website Generation in Large Language Model Coding Agents |
Spear-Phishing&Coding Agents&Cyber Misuse |
| 26.02 |
ShanghaiTech University, Shanghai, China |
arxiv |
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) |
Agent Safety Audit&Trajectory Analysis&OpenClaw |
| 26.02 |
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China |
arxiv |
Intellicise Wireless Networks Meet Agentic AI: A Security and Privacy Perspective |
Agentic AI&Wireless Security&Privacy |
| 26.02 |
Applied Machine Learning Research |
arxiv |
Intent Laundering: AI Safety Datasets Are Not What They Seem |
Safety Datasets&Intent Laundering&Evaluation Robustness |
| 26.02 |
Fraunhofer ISST |
arxiv |
DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing |
Data Sharing&Policy Enforcement&LLM Spokesperson |
| 26.02 |
Department of Computer Science, National University of Singapore |
arxiv |
LLM-enabled Applications Require System-Level Threat Monitoring |
threat monitoring&incident response&LLM systems |
| 26.02 |
University of Technology Sydney |
arxiv |
SoK: Agentic Skills — Beyond Tool Use in LLM Agents |
agentic skills&supply chain&survey |
| 26.02 |
Amazon Web Services |
arxiv |
Manifold of Failure: Behavioral Attraction Basins in Language Models |
failure manifold&MAP-Elites&alignment deviation |
| 26.02 |
National University of Singapore |
arxiv |
IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation |
auditing&verifiable computation&API integrity |
| 26.03 |
Shanghai Innovation Institute |
arxiv |
From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions |
agentic AI&web security&survey |
| 26.03 |
Sahara AI |
arxiv |
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It |
agent guardrails&TEE attestation&verifiable safety |
| 26.03 |
Beihang University |
arxiv |
Evolving Deception: When Agents Evolve, Deception Wins |
deceptive agents&self-evolution&alignment drift |
| 26.03 |
Communication and Distributed Systems, RWTH Aachen University, Germany |
arxiv |
Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers |
artifact evaluation&reproducibility&cybersecurity |
| 26.03 |
Shandong University, Qingdao, Shandong, China |
arxiv |
Give Them an Inch and They Will Take a Mile: Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems |
MCP security&caller identity&authorization |
| 26.03 |
Crew Scaler |
arxiv |
Security Considerations for Multi-agent Systems |
multi-agent systems&security frameworks&threat taxonomy |
| 26.03 |
Institute of Information Engineering, Chinese Academy of Sciences |
arxiv |
ProvAgent: Threat Detection Based on Identity-Behavior Binding and Multi-Agent Collaborative Attack Investigation |
threat detection&provenance graphs&multi-agent investigation |
| 26.03 |
Shandong University |
arxiv |
Don’t Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw |
code agents&OpenClaw&human-in-the-loop defense |
| 26.03 |
School of Interactive Computing, Georgia Institute of Technology |
arxiv |
Safe and Scalable Web Agent Learning via Recreated Websites |
web agents&synthetic environments&self-evolution |
| 26.03 |
Ant Group & Tsinghua University, China |
arxiv |
Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats |
autonomous agents&OpenClaw&lifecycle security |
| 26.03 |
College of Intelligent Science and Engineering, Jinan University |
arxiv |
Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework |
agent memory&governance&semantic drift |
| 26.03 |
State Key Laboratory of Complex & Critical Software Environment, Beihang University |
arxiv |
Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw |
Autonomous Agents&Threat Modeling&Defense Architecture |
| 26.03 |
Unknown |
arxiv |
Evaluation of Audio Language Models for Fairness, Safety, and Security |
Audio LLMs&Safety Evaluation&Structural Taxonomy |
| 26.03 |
Centre for Philosophy and AI Research, Friedrich-Alexander-University Erlangen-Nuremberg |
arxiv |
Questionnaire Responses Do Not Capture the Safety of AI Agents |
AI Agents&Safety Assessment&Construct Validity |
| 26.03 |
University of Connecticut |
arxiv |
Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning |
Code LLMs&Data Poisoning&Vulnerability Scanning |
| 26.03 |
Dartmouth College |
arxiv |
Retrieval-Augmented LLMs for Security Incident Analysis |
Security Incident Analysis&RAG&MITRE ATTACK |
| 26.03 |
Purdue University |
arxiv |
Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety |
Agent Safety&Benchmark Auditing&Tool Calls |
| 26.03 |
University of Electronic Science and Technology of China, Chengdu, China |
arxiv |
Functional Subspace Watermarking for Large Language Models |
Model Watermarking&Functional Subspace&Ownership Verification |
| 26.03 |
UC Santa Cruz |
arxiv |
A Framework for Formalizing LLM Agent Security |
Agent security&Contextual security&Authorization |
| 26.03 |
City University of Hong Kong |
arxiv |
PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management |
Mobile power management&LLM agents&Personalization |
| 26.03 |
Shanghai Jiao Tong University, Shanghai, China |
arxiv |
Trojan’s Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance |
OpenClaw&Guidance injection&Autonomous coding agents |
| 26.03 |
BigCommerce |
arxiv |
An Agentic Multi-Agent Architecture for Cybersecurity Risk Management |
Cybersecurity risk&Multi-agent systems&Risk assessment |
| 26.03 |
Luleå University of Technology, Sweden |
arxiv |
Agentproof: Static Verification of Agent Workflow Graphs |
Static verification&Workflow graphs&Temporal safety |
| 26.03 |
Department of Computing Science, Umeå University, Umeå, Sweden |
arxiv |
Memory poisoning and secure multi-agent systems |
Memory poisoning&Multi-agent systems&Cryptographic mitigation |
| 26.03 |
University of Washington, USA |
arxiv |
AC4A: Access Control for Agents |
Access control&Agent permissions&API security |
| 26.03 |
Sun Yat-sen University |
arxiv |
SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration |
Skill marketplaces&Security auditing&Multi-agent collaboration |
| 26.03 |
Department of Computer Science, New York Institute of Technology, Vancouver, BC, Canada |
arxiv |
Auditing MCP Servers for Over-Privileged Tool Capabilities |
MCP servers&Capability auditing&Tool privileges |
| 26.03 |
New York Institute of Technology, Vancouver, BC, Canada |
arxiv |
Are AI-assisted Development Tools Immune to Prompt Injection? |
Prompt injection&MCP clients&Development tools |
| 26.03 |
Department of Computer Science, New York Institute of Technology, Canada |
arxiv |
Model Context Protocol Threat Modeling and Analyzing Vulnerabilities to Prompt Injection with Tool Poisoning |
MCP security&Tool poisoning&Threat modeling |
| 26.03 |
Beijing University of Posts and Telecommunications |
arxiv |
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers |
OpenClaw security&Watcher middleware&Runtime protection |
| 26.03 |
Department of Computer Science, Institute of Artificial Intelligence, University of Central Florida |
Transactions on Machine Learning Research 2026 |
AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective |
Foundation model security&Threat taxonomy&Cross-modal defense |
| 26.03 |
CSIRO Data61 |
arxiv |
Clawed and Dangerous: Can We Trust Open Agentic Systems? |
Agent security&Open agentic systems&Software engineering |
| 26.03 |
Rensselaer Polytechnic Institute |
arxiv |
SAFETYDRIFT: Predicting When AI Agents Cross the Line Before They Actually Do |
Agent safety&Trajectory prediction&Runtime monitoring |
| 26.03 |
SUCCESS Lab, Texas A&M University |
arxiv |
A Systematic Taxonomy of Security Vulnerabilities in the OpenClaw AI Agent Framework |
Agent vulnerabilities&Security taxonomy&OpenClaw |
| 26.03 |
The Hong Kong University of Science and Technology (Guangzhou), China |
arxiv |
“What Did It Actually Do?”: Understanding Risk Awareness and Traceability for Computer-Use Agents |
Risk awareness&Agent traceability&Computer-use agents |
| 26.03 |
Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, USA |
arxiv |
Safeguarding LLMs Against Misuse and AI-Driven Malware Using Steganographic Canaries |
Malware defense&Steganographic canaries&LLM misuse |
| 26.03 |
Singapore Management University, Singapore |
arxiv |
SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants |
Multi-agent assistants&Runtime enforcement&Skill security |
| 26.03 |
Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy |
arxiv |
Security in LLM-as-a-Judge: A Comprehensive SoK |
LLM-as-a-Judge&Security survey&Evaluation robustness |
| 26.04 |
Mattersec Labs |
arxiv |
SecLens: Role-Specific Evaluation of LLMs for Security Vulnerability Detection |
Vulnerability detection&Stakeholder evaluation&Role-specific scoring |
| 26.04 |
University of Melbourne |
arxiv |
Combating Data Laundering in LLM Training |
Data laundering&Training data detection&Synthesis data reversion |
| 26.04 |
Fudan University, China |
arxiv |
From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers |
MCP security&Malicious servers&Behavioral deviation detection |
| 26.04 |
University of California, Los Angeles |
EACL 2026 |
Open-Domain Safety Policy Construction |
Safety policy construction&Agentic research&Content moderation |
| 26.04 |
Constellation |
arxiv |
An Independent Safety Evaluation of Kimi K2.5 |
Bias Reduction&Safety Evaluation&Open-Weight Models |
| 26.04 |
UC Santa Cruz |
arxiv |
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw |
Personal AI&Threat Taxonomy&Safety Evaluation |
| 26.04 |
KASTEL Security Research Labs, Karlsruhe Institute of Technology, Karlsruhe, Germany |
arxiv |
Foundations for Agentic AI Investigations from the Forensic Analysis of OpenClaw |
Threat Taxonomy&AI Forensics&OpenClaw |
| 26.04 |
University of the Cumberlands |
arxiv |
A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms |
Threat Taxonomy&Benchmarking&MCP Security |
| 26.04 |
Computer Science & Engineering, Mississippi State University |
arxiv |
Attribution-Driven Explainable Intrusion Detection with Encoder-Based Large Language Models |
Security Analysis&Intrusion Detection&Explainability |
| 26.04 |
Research Institute of Trustworthy Autonomous Systems |
arxiv |
ClawLess: A Security Model of AI Agents |
ClawLess&AI Agents&Security Model |
| 26.04 |
Department of Earth Science and Engineering, Imperial College London, London, United Kingdom |
arxiv |
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills |
Agent Skills&OpenClaw&Benchmarking |
| 26.04 |
The Pennsylvania State University |
arxiv |
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation |
Tool Poisoning&Trusted Description Generation&LLM Applications |
| 26.04 |
The Hong Kong Polytechnic University |
arxiv |
Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions |
Threat Taxonomy&Benchmarking&RAG Security |
| 26.04 |
Arizona State University |
arxiv |
Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit |
Large Language Models&Security Operations Center&LLM Tools |
| 26.04 |
University of Delaware |
arxiv |
Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities |
Secure Programming Education&Personalization&LLM-Injected Vulnerabilities |
| 26.04 |
DEXAI – Icaro Lab |
arxiv |
Agentic Microphysics: A Manifesto for Generative AI Safety |
Agentic Microphysics&Generative AI Safety&Manifesto |
| 26.04 |
Chongqing University |
ACL 2026 |
DEEPGUARD: Secure Code Generation via Multi-Layer Semantic Aggregation |
DEEPGUARD&Secure Code Generation&Multi-Layer Semantic Aggregation |
| 26.04 |
New York University |
FORC 2026 |
Can We Watermark Low-Entropy LLM Outputs? |
Watermarking&Low-Entropy LLM Outputs |
| 26.04 |
MemTensor, China |
arxiv |
A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty |
Agent Memory&Memory Security&Governance |
| 26.04 |
Tandon School of Engineering, New York University |
arxiv |
Surgical Repair of Insecure Code Generation in LLMs |
Code Security&Model Repair&Mechanistic Diagnosis |
| 26.04 |
ETH Zurich |
arxiv |
Using large language models for embodied planning introduces systematic safety risks |
Embodied Planning&Robotics Safety&LLM Agents |
| 26.04 |
BlueFocus Communication Group |
arxiv |
Owner-Harm: A Missing Threat Model for AI Agent Safety |
Agent Safety&Threat Model&Owner Harm |
| 26.04 |
University of Oslo |
arxiv |
Towards Agentic Investigation of Security Alerts |
Security Alerts&Agentic Investigation&Cybersecurity |
| 26.05 |
Fudan University |
arxiv |
Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses |
Embodied AI&Safety Survey&Attacks |
| 26.05 |
Shanghai Jiao Tong University |
arxiv |
ClawGuard: Out-of-Band Detection of LLM Agent Workflow Hijacking via EM Side Channel |
Workflow Hijacking&Side Channel&Agent Security |