| Date | Affiliation | Venue | Title | Keywords |
| --- | --- | --- | --- | --- |
| 20.10 | Facebook AI Research | arXiv | Recipes for Safety in Open-domain Chatbots | Toxic Behavior&Open-domain |
| 22.02 | DeepMind | EMNLP 2022 | Red Teaming Language Models with Language Models | Red Teaming&Harm Test |
| 22.03 | OpenAI | NeurIPS 2022 | Training language models to follow instructions with human feedback | InstructGPT&RLHF&Harmless |
| 22.04 | Anthropic | arXiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Helpful&Harmless |
| 22.05 | UCSD | EMNLP 2022 | An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models | Privacy Risks&Memorization |
| 22.09 | Anthropic | arXiv | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Red Teaming&Harmless&Helpful |
| 22.12 | Anthropic | arXiv | Constitutional AI: Harmlessness from AI Feedback | Harmless&Self-improvement&RLAIF |
| 23.07 | UC Berkeley | NeurIPS 2023 | Jailbroken: How Does LLM Safety Training Fail? | Jailbreak&Competing Objectives&Mismatched Generalization |
| 23.08 | The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong | arXiv | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Safety Alignment&Adversarial Attack |
| 23.08 | University College London, Tilburg University | arXiv | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | Security&AI Alignment |
| 23.09 | Peking University | arXiv | RAIN: Your Language Models Can Align Themselves without Finetuning | Self-boosting&Rewind Mechanisms |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arXiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | UC Riverside | arXiv | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Adversarial Attacks&Vulnerabilities&Model Security |
| 23.10 | Rice University | NAACL 2024 (Findings) | Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models | Key Prompt Protection&Large Language Models&Unauthorized Access Prevention |
| 23.11 | KAIST AI | arXiv | HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning | Hate Speech&Detection |
| 23.11 | CMU | AACL 2023 (ART of Safety Workshop) | Measuring Adversarial Datasets | Adversarial Robustness&AI Safety&Adversarial Datasets |
| 23.11 | UIUC | arXiv | Removing RLHF Protections in GPT-4 via Fine-Tuning | Remove Protection&Fine-Tuning |
| 23.11 | IT University of Copenhagen, University of Washington | arXiv | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | Red Teaming |
| 23.11 | Fudan University, Shanghai AI Lab | arXiv | Fake Alignment: Are LLMs Really Aligned Well? | Alignment Failure&Safety Evaluation |
| 23.11 | University of Southern California | arXiv | SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data | RLHF&Safety |
| 23.11 | Google Research | arXiv | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Adversarial Testing&AI-Assisted Red Teaming&Application Safety |
| 23.11 | Tencent AI Lab | arXiv | Adversarial Preference Optimization | Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction |
| 23.11 | Docta.ai | arXiv | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Data Credibility&Safety Alignment |
| 23.11 | CIIRC CTU in Prague | arXiv | A Security Risk Taxonomy for Large Language Models | Security Risks&Taxonomy&Prompt-based Attacks |
| 23.11 | Meta, University of Illinois Urbana-Champaign | NAACL 2024 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing |
| 23.11 | The Ohio State University, University of California, Davis | NAACL 2024 | How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Open-Source LLMs&Malicious Demonstrations&Trustworthiness |
| 23.12 | Drexel University | arXiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Tenyx | arXiv | Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation | Geometric Interpretation&Intrinsic Dimension&Toxicity Detection |
| 23.12 | Independent (now at Google DeepMind) | arXiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | University of Liechtenstein, University of Duesseldorf | arXiv | Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits | Negotiation&Reasoning&Prompt Hacking |
| 23.12 | University of Wisconsin-Madison, University of Michigan Ann Arbor, ASU, Washington University | arXiv | Exploring the Limits of ChatGPT in Software Security Applications | Software Security |
| 23.12 | GenAI at Meta | arXiv | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Human-AI Conversation&Safety Risk Taxonomy |
| 23.12 | University of California Riverside, Microsoft | arXiv | Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Safety Alignment&Summarization&Vulnerability |
| 23.12 | MIT, Harvard | NeurIPS 2023 (Workshop) | Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | Competing Objectives&Forbidden Fact Task&Model Decomposition |
| 23.12 | University of Science and Technology of China | arXiv | Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models | Text Protection&Silent Guardian |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems | Agentic AI Systems&LM-based Agent |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arXiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | Tencent AI Lab, The Chinese University of Hong Kong | arXiv | Aligning Language Models with Judgments | Judgment Alignment&Contrastive Unlikelihood Training |
| 24.01 | Delft University of Technology | arXiv | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Red Teaming&Hallucinations&Mathematics Tasks |
| 24.01 | Apart Research, University of Edinburgh, Imperial College London, University of Oxford | arXiv | Large Language Models Relearn Removed Concepts | Neuroplasticity&Concept Redistribution |
| 24.01 | Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University | arXiv | Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security | Intelligent Personal Assistant&LLM Agent&Security and Privacy |
| 24.01 | Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group | arXiv | Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems | Safety&Risk Taxonomy&Mitigation Strategies |
| 24.01 | Google Research | arXiv | Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | Interpretability |
| 24.01 | Ben-Gurion University of the Negev, Israel | arXiv | GPT in Sheep’s Clothing: The Risk of Customized GPTs | GPTs&Cybersecurity&ChatGPT |
| 24.01 | Shanghai Jiao Tong University | arXiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.01 | Ant Group | arXiv | A Fast, Performant, Secure Distributed Training Framework for LLM | Distributed LLM&Security |
| 24.01 | Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China | arXiv | PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety | Multi-agent Systems&Agent Psychology&Safety |
| 24.01 | Rochester Institute of Technology | arXiv | Mitigating Security Threats in LLMs | Security Threats&Prompt Injection&Jailbreaking |
| 24.01 | Johns Hopkins University, University of Pennsylvania, Ohio State University | arXiv | The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts | Multilingualism&Safety&Resource Disparity |
| 24.01 | University of Florida | arXiv | Adaptive Text Watermark for Large Language Models | Text Watermarking&Robustness&Security |
| 24.01 | The Hebrew University | arXiv | Tradeoffs Between Alignment and Helpfulness in Language Models | Language Model Alignment&AI Safety&Representation Engineering |
| 24.01 | Google Research, Anthropic | arXiv | Gradient-Based Language Model Red Teaming | Red Teaming&Safety&Prompt Learning |
| 24.01 | National University of Singapore, Pennsylvania State University | arXiv | Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code | Watermarking&Error Correction Code&AI Ethics |
| 24.01 | Tsinghua University, University of California Los Angeles, WeChat AI, Tencent Inc. | arXiv | Prompt-Driven LLM Safeguarding via Directed Representation Optimization | Safety Prompts&Representation Optimization |
| 24.02 | Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research | arXiv | Adaptive Primal-Dual Method for Safe Reinforcement Learning | Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates |
| 24.02 | Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute | arXiv | No More Trade-Offs: GPT and Fully Informative Privacy Policies | ChatGPT&Privacy Policies&Legal Requirements |
| 24.02 | Florida International University | arXiv | Security and Privacy Challenges of Large Language Models: A Survey | Security&Privacy Challenges&Survey |
| 24.02 | Rutgers University, University of California, Santa Barbara, NEC Labs America | arXiv | TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | LLM-based Agents&Safety&Trustworthiness |
| 24.02 | University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research | arXiv | Shadowcast: Stealthy Data Poisoning Attacks against VLMs | Vision-Language Models&Data Poisoning&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arXiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Fudan University | arXiv | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Tool Learning&Large Language Models (LLMs)&Safety Issues&ToolSword |
| 24.02 | Paul G. Allen School of Computer Science & Engineering, University of Washington | arXiv | SPML: A DSL for Defending Language Models Against Prompt Attacks | Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML) |
| 24.02 | Tsinghua University | arXiv | ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors | Safety Detectors&Customizable&Explainable |
| 24.02 | Dalhousie University | arXiv | Immunization Against Harmful Fine-tuning Attacks | Fine-tuning Attacks&Immunization |
| 24.02 | Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group | arXiv | SoFA: Shielded On-the-fly Alignment via Priority Rule Following | Priority Rule Following&Alignment |
| 24.02 | Universidade Federal de Santa Catarina | arXiv | A Survey of Large Language Models in Cybersecurity | Cybersecurity&Vulnerability Assessment |
| 24.02 | Zhejiang University | arXiv | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Prompt Reverse Stealing Attacks&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory | NAACL 2024 | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | Large Language Models&Conversation Safety&Survey |
| 24.03 | Tulane University | arXiv | Enhancing LLM Safety via Constrained Direct Preference Optimization | Reinforcement Learning&Human Feedback&Safety Constraints |
| 24.03 | University of Illinois Urbana-Champaign | arXiv | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Tool Integration&Security&Indirect Prompt Injection |
| 24.03 | Harvard University | arXiv | Towards Safe and Aligned Large Language Models for Medicine | Medical Safety&Alignment&Ethical Principles |
| 24.03 | Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab | arXiv | Aligners: Decoupling LLMs and Alignment | Alignment&Synthetic Data |
| 24.03 | MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, EleutherAI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC | arXiv | A Safe Harbor for AI Evaluation and Red Teaming | AI Evaluation&Red Teaming&Safe Harbor |
| 24.03 | University of Southern California | arXiv | Logits of API-Protected LLMs Leak Proprietary Information | API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection |
| 24.03 | University of Bristol | arXiv | Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention | Safety&Prompt Engineering |
| 24.03 | Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia | arXiv | Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models | Safety&Guidelines&Alignment |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arXiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.03 | Center for Cybersecurity Systems and Networks, AIShield, Bosch Global Software Technologies, Bengaluru, India | arXiv | Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal | LLM Security&Threat Modeling&Risk Assessment |
| 24.03 | Queen’s University Belfast | arXiv | AI Safety: Necessary but insufficient and possibly problematic | AI Safety&Transparency&Structural Harm |
| 24.04 | Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology | arXiv | Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs | Dialectical Alignment&3H Principle&Security Threats |
| 24.04 | LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI | arXiv | Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models | Red Teaming&Safety |
| 24.04 | University of California, Santa Barbara, Meta AI | arXiv | Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models | Safety&Helpfulness&Controllability |
| 24.04 | School of Information and Software Engineering, University of Electronic Science and Technology of China | arXiv | Exploring Backdoor Vulnerabilities of Chat Models | Backdoor Attacks&Chat Models&Security |
| 24.04 | Enkrypt AI | arXiv | Increased LLM Vulnerabilities from Fine-tuning and Quantization | Fine-tuning&Quantization&LLM Vulnerabilities |
| 24.04 | Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory | arXiv | Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Multimodal Large Language Models&Security Vulnerabilities&Image Inputs |
| 24.04 | University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI | arXiv | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge | AI-Assisted Red-Teaming&Multicultural Knowledge |
| 24.04 | Nanjing University | DLSP 2024 | Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts | Jailbreak&Subtoxic Questions&GAC Model |
| 24.04 | Innodata | arXiv | Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations | Evaluation&Safety |
| 24.04 | University of Cambridge, New York University, ETH Zurich | arXiv | Foundational Challenges in Assuring Alignment and Safety of Large Language Models | Alignment&Safety |
| 24.04 | Zhejiang University | arXiv | TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment | Intellectual Property Protection&Edge-deployed Transformer Model |
| 24.04 | Harvard University | arXiv | More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness | Reinforcement Learning from Human Feedback&Trustworthiness |
| 24.05 | University of Maryland | arXiv | Constrained Decoding for Secure Code Generation | Code Generation&Code LLM&Secure Code Generation&AI Safety |
| 24.05 | Huazhong University of Science and Technology | arXiv | Large Language Models for Cyber Security: A Systematic Literature Review | Cybersecurity&Systematic Review |
| 24.04 | CSIRO’s Data61 | ACM International Conference on AI-powered Software | An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping | AI Safety&Evaluation Framework&AI Lifecycle Mapping |
| 24.05 | CSAIL and CBMM, MIT | arXiv | SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data | SecureLLM&Compositionality |
| 24.05 | Carnegie Mellon University | arXiv | Human–AI Safety: A Descendant of Generative AI and Control Systems Safety | Human–AI Safety&Generative AI |
| 24.05 | University of York | arXiv | Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding | Safe Reinforcement Learning&Black-Box Environments&Adaptive Shielding |
| 24.05 | Princeton University | arXiv | AI Risk Management Should Incorporate Both Safety and Security | AI Safety&AI Security&Risk Management |
| 24.05 | University of Oslo | arXiv | AI Safety: A Climb to Armageddon? | AI Safety&Existential Risk&AI Governance |
| 24.06 | Zscaler, Inc. | arXiv | Exploring Vulnerabilities and Protections in Large Language Models: A Survey | Prompt Hacking&Adversarial Attacks&Survey |
| 24.06 | Texas A&M University-San Antonio | arXiv | Transforming Computer Security and Public Trust Through the Exploration of Fine-Tuning Large Language Models | Fine-Tuning&Cyber Security |
| 24.06 | Alibaba Group | arXiv | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | LLM Safety&Alignment&Jailbreak |
| 24.06 | UC Davis | arXiv | Security of AI Agents | Security&AI Agents&Vulnerabilities |
| 24.06 | University of Connecticut | USENIX Security ‘24 | An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection | Backdoor Attack&Code Completion Models&Vulnerability Detection |
| 24.06 | University of California, Irvine | arXiv | TorchOpera: A Compound AI System for LLM Safety | TorchOpera&LLM Safety&Compound AI System |
| 24.06 | NVIDIA Corporation | arXiv | garak: A Framework for Security Probing Large Language Models | garak&Security Probing |
| 24.06 | Carnegie Mellon University | arXiv | Current State of LLM Risks and AI Guardrails | LLM Risks&AI Guardrails |
| 24.06 | Johns Hopkins University | arXiv | Every Language Counts: Learn and Unlearn in Multilingual LLMs | Multilingual LLMs&Fake Information&Unlearning |
| 24.06 | Tsinghua University | arXiv | Finding Safety Neurons in Large Language Models | Safety Neurons&Mechanistic Interpretability&AI Safety |
| 24.06 | Center for AI Safety and Governance, Institute for AI, Peking University | arXiv | SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Safety Alignment&Text2Video Generation |
| 24.06 | Samsung R&D Institute UK, KAUST, University of Oxford | arXiv | Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Model Merging&Safety Alignment |
| 24.06 | Hofstra University | arXiv | Analyzing Multi-Head Attention on Trojan BERT Models | Trojan Attack&BERT Models&Multi-Head Attention |
| 24.06 | Fudan University | arXiv | SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Safety Alignment&Jailbreak Attacks&Response Disparity |
| 24.06 | Stony Brook University | NAACL 2024 Workshop | Automated Adversarial Discovery for Safety Classifiers | Safety Classifiers&Adversarial Attacks&Toxicity |
| 24.07 | University of Utah | arXiv | Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression | Model Compression&Safety Evaluation |
| 24.07 | University of Alberta | arXiv | Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | Multilingual Blending&LLM Safety Alignment&Language Mixture |
| 24.07 | Singapore National Eye Centre | arXiv | A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models – Safety, Consensus, Objectivity, Reproducibility and Explainability | Evaluation Framework |
| 24.07 | Microsoft | arXiv | SLIP: Securing LLM’s IP Using Weights Decomposition | Hybrid Inference&Model Security&Weights Decomposition |
| 24.07 | Microsoft | arXiv | Phi-3 Safety Post-Training: Aligning Language Models with a “Break-Fix” Cycle | Phi-3&Safety Post-Training |
| 24.07 | Tsinghua University | arXiv | Course-Correction: Safety Alignment Using Synthetic Preferences | Course-Correction&Safety Alignment&Synthetic Preferences |
| 24.07 | Northwestern University | arXiv | From Sands to Mansions: Enabling Automatic Full-Life-Cycle Cyberattack Construction with LLM | Cyberattack Construction&Full-Life-Cycle |
| 24.07 | Singapore University of Technology and Design | arXiv | AI Safety in Generative AI Large Language Models: A Survey | Generative AI&AI Safety |
| 24.07 | Lehigh University | arXiv | Blockchain for Large Language Model Security and Safety: A Holistic Survey | Blockchain&Security&Safety |
| 24.08 | OpenAI | OpenAI | Rule-Based Rewards for Language Model Safety | Reinforcement Learning&Safety&Rule-Based Rewards |
| 24.08 | University of Texas at Austin | arXiv | Hide and Seek: Fingerprinting Large Language Models with Evolutionary Learning | Model Fingerprinting&In-context Learning |
| 24.08 | Technical University of Munich | arXiv | Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study | Secure Code Assessment&Vulnerability Detection |
| 24.08 | Offenburg University of Applied Sciences | arXiv | "You still have to study" - On the Security of LLM generated code | Code Security&Prompting Techniques |
| 24.08 | University of Connecticut | arXiv | Clip2Safety: A Vision Language Model for Interpretable and Fine-Grained Detection of Safety Compliance in Diverse Workplaces | Vision Language Model&Safety Compliance&Personal Protective Equipment Detection |
| 24.08 | Pabna University of Science and Technology | arXiv | Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey | Privacy&Bias&Interpretability |
| 24.08 | Quinnipiac University | arXiv | Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks | Generative AI&Cybersecurity&Cyber Attacks |
| 24.08 | Nanyang Technological University | arXiv | Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations | AI Safety&Trustworthy&Responsible |
| 24.08 | King Abdullah University of Science and Technology | arXiv | Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models | Safety&Helpfulness&LLM Alignment |
| 24.08 | University of Calgary | arXiv | Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems | Trustworthy AI&Algorithmic Bias&Responsible AI |
| 24.08 | University of Oxford | arXiv | AI Security Audits: Challenges and Innovations in Assessing Large Language Models | AI Security Audits&Vulnerability Assessment&AI Ethics |
| 24.08 | University of Science and Technology of China | arXiv | Safety Layers of Aligned Large Language Models: The Key to LLM Security | Aligned LLM&Safety Layers&Security Degradation |
| 24.09 | University of Texas at San Antonio | arXiv | Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs | Source Code Security&LLMs&Reinforcement Learning |
| 24.09 | The Hong Kong Polytechnic University | arXiv | Alignment-Aware Model Extraction Attacks on Large Language Models | Model Extraction Attacks&LLM Alignment&Watermark Resistance |
| 24.09 | University of Oxford, Redwood Research | arXiv | Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | AI Control&Safety Protocols&Game Theory |
| 24.09 | University of Galway | ECAI AIEB Workshop | Ethical AI Governance: Methods for Evaluating Trustworthy AI | Trustworthy AI&Ethics&AI Evaluation |
| 24.09 | University of Texas at San Antonio | arXiv | AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing | Multi-Agent Systems&Code Security&Fuzz Testing&Static Analysis |
| 24.09 | Tsinghua University | arXiv | Language Models Learn to Mislead Humans via RLHF | Reinforcement Learning from Human Feedback (RLHF)&U-SOPHISTRY&Misleading AI |
| 24.09 | Stevens Institute of Technology | arXiv | Measuring Copyright Risks of Large Language Model via Partial Information Probing | Copyright&Partial Information Probing |
| 24.09 | IBM Research | arXiv | Attack Atlas: A Practitioner’s Perspective on Challenges and Pitfalls in Red Teaming GenAI | Red Teaming&LLM Security&Adversarial Attacks |
| 24.09 | Pengcheng Laboratory | arXiv | Multi-Designated Detector Watermarking for Language Models | Watermarking&Claimability&Multi-designated Verifier Signature |
| 24.09 | ETH Zurich | arXiv | An Adversarial Perspective on Machine Unlearning for AI Safety | Machine Unlearning&Adversarial Attacks&Unlearning Robustness |
| 24.10 | Google DeepMind | arXiv | A Watermark for Black-Box Language Models | Watermarking&Black-Box Models&LLM Detection |
| 24.10 | Mohamed Bin Zayed University of Artificial Intelligence | arXiv | Optimizing Adaptive Attacks Against Content Watermarks for Language Models | Watermarking&Adaptive Attacks&LLM Security |
| 24.10 | Rice University, Rutgers University | arXiv | Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion | Taylor Expansion&Model Security |
| 24.10 | PeopleTec | arXiv | Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders | Cybersecurity&Hallucinations |
| 24.10 | Fondazione Bruno Kessler, Université Côte d’Azur | EMNLP 2024 | Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering | Counterspeech&Safety Guardrails |
| 24.10 | University of California, Davis, AWS AI Labs | arXiv | Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Safety Alignment&Vision-Language Models&Cross-modality Representation Manipulation |
| 24.10 | North Carolina State University | arXiv | Superficial Safety Alignment Hypothesis: The Need for Efficient and Robust Safety Mechanisms in LLMs | Superficial Safety Alignment&Safety Mechanisms&Safety-critical Components |
| 24.10 | Shanghai Jiao Tong University, Chinese University of Hong Kong (Shenzhen), Tsinghua University | arXiv | Achilles’ Heel in Semi-Open LLMs: Hiding Bottom against Recovery Attacks | Semi-open LLMs&Recovery Attacks&Model Resilience |
| 24.10 | University of Tulsa | arXiv | Weak-to-Strong Generalization beyond Accuracy: A Pilot Study in Safety, Toxicity, and Legal Reasoning | Weak-to-Strong Generalization&Safety&Toxicity |
| 24.10 | Aalborg University | arXiv | Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis | Language Confusion&Multilingual LLMs&Security Vulnerabilities |
| 24.10 | Carnegie Mellon University | arXiv | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | LLM Safety&Browser Agents&Red Teaming |
| 24.10 | Palisade Research | arXiv | LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild | LLM Agents&Honeypots&Cybersecurity |
| 24.10 | University of Pittsburgh | arXiv | Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents | Embodied Agents&Multimodal Safety&Active Learning |
| 24.10 | CSIRO’s Data61 | arXiv | From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting | Secure Code Generation&Encouragement Prompting |
| 24.10 | AppCubic | arXiv | Jailbreaking and Mitigation of Vulnerabilities in Large Language Models | Prompt Injection&Jailbreaking&AI Security |
| 24.10 | UC Berkeley | arXiv | SafetyAnalyst: Interpretable, Transparent, and Steerable LLM Safety Moderation | LLM Safety&Interpretability&Content Moderation |
| 24.10 | ShanghaiTech University | arXiv | Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Safety Alignment&Reinforcement Learning&Policy Optimization |
| 24.11 | Zhejiang University | arXiv | Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control | Trustworthiness&Sparse Activation Control&Representation Control |
| 24.11 | University of California, Riverside | arXiv | Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | Vision-Language Models&Safety Alignment&Cross-Layer Vulnerability |
| 24.11 | National University of Singapore | EMNLP 2024 | Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models | Multi-expert Prompting&LLM Safety&Reliability&Usefulness |
| 24.11 | OpenAI | NeurIPS 2024 | Rule-Based Rewards for Language Model Safety | Rule-Based Rewards&Safety Alignment&AI Feedback |
| 24.11 | Center for Automation and Robotics, Spanish National Research Council | arXiv | Can Adversarial Attacks by Large Language Models Be Attributed? | Adversarial Attribution&LLM Security&Formal Language Theory |
| 24.11 | McGill University | arXiv | Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset | Helpful and Harmless Dataset&Safety Trade-offs&Bias Analysis |
| 24.11 | Fudan University | arXiv | Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding | Text-to-Image Generation&Safety&Prompt Embedding Sanitization |
| 24.11 | Meta | arXiv | Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations | Multimodal LLM&Content Moderation&Adversarial Robustness |
| 24.11 | Columbia University | arXiv | When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations | Backdoor Attacks&Explainability |
| 24.11 | Ben-Gurion University of the Negev | arXiv | The Information Security Awareness of Large Language Models | Information Security Awareness&Benchmarking |
| 24.11 | Fordham University | arXiv | Next-Generation Phishing: How LLM Agents Empower Cyber Attackers | Phishing Detection&Cybersecurity |
| 24.12 | UC Berkeley | arXiv | Trust & Safety of LLMs and LLMs in Trust & Safety | Trust and Safety&Prompt Injection |
| 24.12 | Harvard Kennedy School, Avant Research Group | arXiv | Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects | Phishing Attacks&Human-in-the-loop |
| 24.12 | University of Massachusetts | arXiv | Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness | Instruction Tuning&Safety&Helpfulness |
| 24.11 | University of Pennsylvania, IBM T.J. Watson Research Center | arXiv | Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models | Cyber-Attack Classification&Two-Stage Training |
| 24.12 | University of New South Wales | arXiv | How Can LLMs and Knowledge Graphs Contribute to Robot Safety? A Few-Shot Learning Approach | Robot Safety&Few-Shot Learning&Knowledge Graph Prompting |
| 24.12 | Örebro University | arXiv | Large Language Models and Code Security: A Systematic Literature Review | LLM-Generated Code&Vulnerability Detection&Data Poisoning Attacks |
| 24.12 | Algiers Research Institute | arXiv | On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs | Adversarial Attacks&Vulnerability Metrics&Risk Assessment |
| 24.12 | Alan Turing Institute | arXiv | SoK: Mind the Gap—On Closing the Applicability Gap in Automated Vulnerability Detection | Automated Vulnerability Detection&Applicability Gap&Software Security |
| 25.01 | Meta | arXiv | MLLM-as-a-Judge for Image Safety without Human Labeling | Image Safety&Zero-Shot Judgment&Multimodal Large Language Models |
| 25.01 | FAU Erlangen-Nürnberg | arXiv | Refusal Behavior in Large Language Models: A Nonlinear Perspective | Refusal Behavior&Mechanistic Interpretability&AI Alignment |
| 25.01 | University of Waterloo | arXiv | Advanced Real-Time Fraud Detection Using RAG-Based LLMs | Fraud Detection&Retrieval-Augmented Generation&Real-Time AI Security |
| 25.01 | Mondragon University, University of Seville | arXiv | Early External Safety Testing of OpenAI’s O3-Mini: Insights from Pre-Deployment Evaluation | LLM Safety Testing&OpenAI O3-Mini |
| 25.02 | Nanyang Technological University | arXiv | Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning | LLM Alignment&Instruction Tuning&Reward Models |
| 25.02 | University of Bristol | arXiv | The Dark Deep Side of DeepSeek: Fine-Tuning Attacks Against the Safety Alignment of CoT-Enabled Models | Chain of Thought&Fine-Tuning Attack&LLM Safety |
| 25.02 | Marburg University | arXiv | Editing Large Language Models Poses Serious Safety Risks | Knowledge Editing&LLM Security Risks&Adversarial Manipulation |
| 25.02 | Technical University of Munich | AAAI 2025 | Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment | Medical Multimodal Models&Model Stealing&Adversarial Domain Alignment |
| 25.02 | Georgia Institute of Technology | arXiv | Enhancing Phishing Email Identification with Large Language Models | Phishing Detection&Cybersecurity |
| 25.02 | Fudan University | arXiv | Safety at Scale: A Comprehensive Survey of Large Model Safety | Large Model Safety&AI Security&Adversarial Attacks |
| 25.02 | University of Maryland | arXiv | Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Model Tampering Attacks&LLM Security&Adversarial Robustness |
| 25.02 | Penn State University | arXiv | Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet | Harmfulness Ranking&LLM Evaluation&AI Safety |
| 25.02 | Peking University | arXiv | Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning | LLM Safety&Reasoning Trade-off&Fine-Tuning |
| 25.02 | City University of Hong Kong | arXiv | The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | LLM Alignment&Safety Fine-Tuning&Jailbreak Attacks |
| 25.02 | Tsinghua University | arXiv | “Nuclear Deployed!”: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents | Autonomous LLM Agents&Catastrophic Risks&Decision-making |
| 25.02 | University of Washington | arXiv | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | LLM Safety&Chain-of-Thought Reasoning&Model Alignment |
| 25.02 | University of California, Santa Cruz | arXiv | The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | Large Reasoning Models&Safety Assessment&Adversarial Attacks |
| 25.02 | 34 Affiliates | arXiv | On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective | Safety Assessment&Guideline Paper |
| 25.02 | University of Cambridge | arXiv | Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection | Hateful Meme Detection&Multimodal Models&Contrastive Learning |
| 25.02 | Cooperative AI Foundation | arXiv | Multi-Agent Risks from Advanced AI | Multi-Agent Systems&AI Risk&AI Governance |
| 25.02 | Apart Research, University of Science and Technology of Hanoi | AAAI 2025 Workshop on Theory of Mind | A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks | Theory of Mind&AI Safety |
| 25.02 | Truthful AI, University College London | arXiv | Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | LLM Alignment&Fine-tuning Risks&Emergent Misalignment |
| 25.02 | Wuhan University | arXiv | A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations | LVLM Safety&Adversarial Attacks&Defense Mechanisms |
| 25.02 | Clark Atlanta University | arXiv | SoK: Exploring Hallucinations and Security Risks in AI-Assisted Software Development with Insights for LLM Deployment | Hallucinations&Security Risks&AI-Assisted Software Development |
| 25.02 | Stony Brook University, Michigan State University | arXiv | Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots | Cyber Threat Intelligence&Large Language Models&Threat Detection |
| 25.03 | HydroX AI | arXiv | Output Length Effect on DeepSeek-R1’s Safety in Forced Thinking | Output Length&LLM Safety&Forced Thinking |
| 25.03 | Tampere University | arXiv | Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to Practice | Trustworthiness&AI Ethics |
| 25.03 | University of California, Santa Barbara | arXiv | Graphormer-Guided Task Planning: Beyond Static Rules with LLM Safety Perception | LLM Planning&Graphormer&Risk-Aware Robotics |
| 25.03 | University of Pennsylvania | arXiv | Safety Guardrails for LLM-Enabled Robots | LLM-enabled Robotics&Jailbreaking Defense&Formal Safety Guarantees |
| 25.03 | Peking University | arXiv | Life-Cycle Routing Vulnerabilities of LLM Router | LLM Router&Adversarial Attack&Backdoor Attack |
| 25.03 | Squirrel AI Learning | arXiv | A Survey on Trustworthy LLM Agents: Threats and Countermeasures | Trustworthy Agent&LLM-based Agents&Multi-Agent System |
| 25.03 | Cornell Tech | arXiv | Multi-Agent Systems Execute Arbitrary Malicious Code | Multi-Agent Systems&Control-Flow Hijacking&Arbitrary Code Execution |
| 25.03 | University of Utah | arXiv | A Comprehensive Study of LLM Secure Code Generation | Secure Code Generation&Vulnerability Scanning&Functionality Evaluation |
| 25.03 | University of Minnesota | arXiv | Safety Aware Task Planning via Large Language Models in Robotics | LLM Robotics Planning&Safety-Aware Framework&Control Barrier Functions |
| 25.03 | Peking University, Zhongguancun Lab, Tsinghua University | arXiv | Large Language Models powered Network Attack Detection: Architecture, Opportunities and Case Study | Network Security&LLM for Security&Anomaly Detection |
| 25.03 | Aim Intelligence, Yonsei University, Seoul National University | arXiv | sudo rm -rf agentic_security | Agent Security&Multimodal Jailbreak&LLM Agent Exploitation |
| 25.03 | Georgia Institute of Technology, IMT Mines Albi | arXiv | Leveraging Large Language Models for Risk Assessment in Hyperconnected Logistic Hub Network Deployment | Risk Assessment&LLMs for Logistics&Supply Chain Resilience |
| 25.04 | University of Twente | arXiv | Safety and Security Risk Mitigation in Satellite Missions via Attack-Fault-Defense Trees | Cyber-Physical Systems&Attack-Fault-Defense Trees&Satellite Ground Segment |
| 25.04 | Google DeepMind | arXiv | An Approach to Technical AGI Safety and Security | AGI Safety&Misalignment Mitigation&Capability Control |
| 25.04 | Earlham College | ISDFS 2025 | Debate-Driven Multi-Agent LLMs for Phishing Email Detection | Phishing Detection&Multi-Agent LLMs&Debate Framework |
| 25.04 | Indian Institute of Technology Kanpur | MSR 2025 | MaLAware: Automating the Comprehension of Malicious Software Behaviours using Large Language Models (LLMs) | Malware Analysis&Behavior Explanation&LLMs for Cybersecurity |
| 25.04 | Leidos | arXiv | MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits | Model Context Protocol&Security Audit&Agentic LLM Exploits |
| 25.04 | Norwegian University of Science and Technology | arXiv | An LLM Framework For Cryptography Over Chat Channels | LLMs&Cryptography&Steganography |
| 25.04 | Peking University | arXiv | SaRO: Enhancing LLM Safety through Reasoning-based Alignment | Safety Alignment&Reasoning-based Alignment&LLMs |
| 25.04 | Johns Hopkins University | arXiv | An Investigation of Large Language Models and Their Vulnerabilities in Spam Detection | Spam Detection&Adversarial Attack&Data Poisoning |
| 25.04 | TU Wien | arXiv | Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design | Offensive Security&Benchmarking&LLM Penetration Testing |
| 25.04 | Fraunhofer Institute for Cognitive Systems IKS | arXiv | Towards Automated Safety Requirements Derivation Using Agent-based RAG | Agent-based RAG&Safety Requirements Derivation&Autonomous Driving |
| 25.04 | Nanjing University | arXiv | Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask | LLM-based Vulnerability Detection&Contextual Reasoning&Benchmark Evaluation |
| 25.04 | Nanyang Technological University | arXiv | A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment | LLM Safety&LLM Lifecycle&Agent Alignment |
| 25.04 | Arab American University | arXiv | Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends | Traffic Safety&Multimodal Large Language Models&ADAS |
| 25.04 | National University of Singapore | arXiv | Safety in Large Reasoning Models: A Survey | Large Reasoning Models&Safety Taxonomy&Adversarial Attacks |
| 25.04 | Amazon Web Services | arXiv | Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents | Agentic AI Security&Threat Modeling&Mitigation Framework |
| 25.04 | Alibaba Group | NAACL 2025 | DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Multimodal LLM&Safety Alignment&Risk Disentanglement |
| 25.04 | University of Maryland | NAACL 2025 | RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models | RAG&Safety Alignment&Red Teaming |
| 25.05 | University of Granada | arXiv | LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures | Large Language Models&Security&Defense Mechanisms |
| 25.05 | University of North Carolina at Chapel Hill | Transactions on Machine Learning Research | Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation | Multimodal LLMs&Information Unlearning&Security Evaluation |
| 25.05 | University of Oxford | arXiv | Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents | Multi-Agent Systems&AI Security&Emergent Threats |
| 25.05 | Huazhong University of Science and Technology | arXiv | Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study | LLM Deployment&Security Analysis&Empirical Study |
| 25.05 | Rutgers University | arXiv | Aligning Large Language Models with Healthcare Stakeholders: A Pathway to Trustworthy AI Integration | Healthcare&Alignment&Large Language Models |
| 25.05 | Metropolia University of Applied Sciences | arXiv | A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient | LLM Safety&Ethical Evaluation&Danger Coefficient |
| 25.05 | University of Kent | arXiv | Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | Safety Alignment&Pseudo-Malicious Data&Cybersecurity LLMs |
| 25.05 | Carnegie Mellon University | arXiv | A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron? | Computer-Using Agents&Security Threats&Safety Benchmarks |
| 25.05 | NYU Tandon | arXiv | MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models | RTL Security&Multi-Agent Systems&LLM for Hardware Verification |
| 25.05 | Jerusalem College of Technology | arXiv | Proposal for Improving Google A2A Protocol: Safeguarding Sensitive Data in Multi-Agent Systems | A2A Protocol&Sensitive Data Protection&Multi-Agent Security |
| 25.05 | Huazhong University of Science and Technology | arXiv | From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents | Mobile LLM Agents&Security Risks&AgentScan |
| 25.05 | Mohamed bin Zayed University of Artificial Intelligence | arXiv | Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | Safety Alignment&Subspace Geometry&Fine-Tuning Vulnerability |
| 25.05 | Amazon Web Services | arXiv | From nuclear safety to LLM security: Applying non-probabilistic risk management strategies to build safe and secure LLM-powered systems | Risk Management&LLM Security&Non-Probabilistic Strategies |
| 25.05 | Infinite Optimization AI Lab | arXiv | Security Concerns for Large Language Models: A Survey | LLM Security&Prompt Injection&Autonomous Agents |
| 25.05 | Nanyang Technological University | arXiv | Understanding Refusal in Language Models with Sparse Autoencoders | Refusal&Sparse Autoencoder&LLM Safety |
| 25.05 | Seoul National University | arXiv | Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems | Multi-agent LLM&Cross-domain Security&Threat Modeling |
| 25.05 | University of Washington | arXiv | OmniGuard: An Efficient Approach for AI Safety Moderation Across Modalities | AI Safety&Multimodal Moderation&Universal Representation |
| 25.06 | Tsinghua University | arXiv | The Security Threat of Compressed Projectors in Large Vision-Language Models | Vision-Language Model&Compressed Projector&Adversarial Attack |
| 25.06 | Michigan State University | arXiv | Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS | LLM-MAS&Vulnerability Analysis&Trustworthy AI |
| 25.06 | Singapore Management University | arXiv | Which Factors Make Code LLMs More Vulnerable to Backdoor Attacks? A Systematic Study | Code LLM&Backdoor Attack&Adversarial Robustness |
| 25.06 | University of Science and Technology of China | arXiv | SecNeuron: Reliable and Flexible Abuse Control in Local LLMs via Hybrid Neuron Encryption | Local LLM&Abuse Control&Neuron Encryption |
| 25.06 | Dartmouth College | arXiv | Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | LLM Safety&Alignment Robustness&Representation Similarity |
| 25.06 | Georgia Tech | arXiv | Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety | Interpretation&LLM Safety&Survey |
| 25.06 | George Mason University | ICML 2025 | StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models | Watermark&LLM&Stealthy&Multi-bit |
| 25.06 | SUNY-Albany, New Jersey Institute of Technology, Microsoft, Kent State University, University of Florida | arXiv | SoK: Are Watermarks in LLMs Ready for Deployment? | Watermark&LLM&Model Stealing&IP Protection |
| 25.06 | Tsinghua University, Apple, Beijing University of Posts and Telecommunications | arXiv | Enhancing Watermarking Quality for LLMs via Contextual Generation States Awareness | Watermarking&LLM&Generation Quality&Context Awareness |
| 25.06 | ShanghaiTech University, Sun Yat-sen University | arXiv | Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment | Multimodal RAG&Copyright Protection&Watermarking&Image Knowledge&Retrieval-Augmented Generation |
| 25.06 | University of Applied Sciences Northwestern Switzerland | arXiv | Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications | Multi-Agent System&LLM&Reasoning&Cybersecurity&Specification |
| 25.06 | Sungkyunkwan University, Microsoft Research Asia | ACL 2025 | Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights | Value Alignment&LLM Safety&Personalization&Harmful Behavior&Psychological Analysis |
| 25.06 | Virelya Intelligence Research Labs | arXiv | Risks & Benefits of LLMs & GenAI for Platform Integrity, Healthcare Diagnostics, Cybersecurity, Privacy & AI Safety: A Comprehensive Survey, Roadmap & Implementation Blueprint for Automated Review, Compliance Assurance, Moderation, Abuse & Fraud Detection, App Security, and Trust in Digital Ecosystems | Large Language Models&Generative AI&Platform Integrity&Cybersecurity&Compliance |
| 25.06 | University of Pennsylvania | arXiv | A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions | Programming Language&LLM Agents&Code Actions&Security&Parallelization |
| 25.06 | Universitas Muhammadiyah Surakarta | arXiv | Using LLMs for Security Advisory Investigations: How Far Are We? | Security Advisory&CVE ID&LLMs&Hallucination&Reliability |
| 25.06 | University of Texas at El Paso | arXiv | Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability | Phishing Detection&Large Language Models&Explainability&Self-Consistency&Fine-Tuning |
| 25.06 | Universiti Sains Malaysia | arXiv | PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection | Phishing Website Detection&Large Language Model&Multi-Agent System&Debate Framework&Explainability |
| 25.06 | NTT | arXiv | Towards Safety Evaluations of Theory of Mind in Large Language Models | Theory of Mind&LLM Safety&Evaluation |
| 25.06 | The Ohio State University | arXiv | AI Safety vs. AI Security: Demystifying the Distinction and Boundaries | AI Safety&AI Security&Risk Management |
| 25.06 | Zhejiang University | arXiv | A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures | LLM-Driven Agents&Agent Communication&Security Risks |
| 25.07 | University of Science and Technology of China, Douyin Co., Ltd. | arXiv | SAFER: Probing Safety in Reward Models with Sparse Autoencoder | Reward Model&Interpretability&Safety |
| 25.07 | Princeton University | arXiv | Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs | Alignment Erosion&Fine-Tuning&Safety |
| 25.07 | AI Risk and Vulnerability Alliance | arXiv | Red Teaming AI Red Teaming | Red Teaming&AI Security&Sociotechnical |
| 25.07 | Shandong University | arXiv | We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems | MCP Security&API Measurement&Privilege |
| 25.07 | University College London | arXiv | Emergent Misalignment as Prompt Sensitivity: A Research Note | Misalignment&Prompt Sensitivity&Finetuning |
| 25.07 | Ludwig-Maximilians-Universität München | arXiv | On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment | Alignment&Filtering&Intractability |
| 25.07 | University of Wisconsin-Madison | arXiv | Prompt-level Watermarking is Provably Impossible | Watermarking&Prompt Injection&Impossibility |
| 25.07 | UK AI Security Institute | arXiv | Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety | Chain of Thought&Monitorability&Safety |
| 25.07 | Northeastern University | arXiv | LLMs Encode Harmfulness and Refusal Separately | Harmfulness&Refusal&Safety |
| 25.07 | Aymara | arXiv | Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix | Safety Evaluation&LLM&Benchmark |
| 25.07 | Shanghai Artificial Intelligence Laboratory | arXiv | Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report | AI Risk&Safety&Benchmark |
| 25.07 | Shanghai Artificial Intelligence Laboratory | arXiv | SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law | Safety&Reinforcement Learning&Multimodal |
| 25.07 | University of Illinois Urbana-Champaign | arXiv | PurpCode: Reasoning for Safer Code Generation | SecureCode&Reasoning&Alignment |
| 25.07 | IBM Research | arXiv | OneShield - the Next Generation of LLM Guardrails | Guardrails&Safety&Compliance |
| 25.08 |
University of Maryland, College Park |
arxiv |
Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation |
Predictive Auditing&LLM APIs&Reasoning Token Count&Token Inflation |
| 25.08 |
Independent Researcher, Arizona State University, University of California, Berkeley |
arxiv |
Measuring Harmfulness of Computer-Using Agents |
Computer-Using Agents&Safety Risks&CUAHarm Benchmark&Language Models |
| 25.08 |
Jimei University, Wenzhou-Kean University, The Hong Kong University of Science and Technology (Guangzhou), New York University, Xiamen University |
arxiv |
A Survey on Data Security in Large Language Models |
Large language model (LLM)&Data security&LLM vulnerabilities&Prompt injection |
| 25.08 |
University of South Florida |
arxiv |
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM |
Harmful Content&LLM Safety&Jailbreak Mitigation |
| 25.08 |
Monash University |
ACM CCS 2025 |
Robust Anomaly Detection in O-RAN: Leveraging LLMs against Data Manipulation Attacks |
O-RAN Security&Anomaly Detection&Data Manipulation Attacks |
| 25.08 |
Zhejiang University |
arxiv |
Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends |
Copyright Protection&Model Fingerprinting&Text Watermarking |
| 25.08 |
Global Center on AI Governance |
arxiv |
Toward an African Agenda for AI Safety |
AI Safety in Africa&Governance&Socio-Technical Risks |
| 25.08 |
PeopleTec, Inc. |
arxiv |
SERVANT, STALKER, PREDATOR: How an Honest, Helpful, and Harmless (3H) Agent Unlocks Adversarial Skills |
Multi-Agent Systems&Service Orchestration&Composite Threats |
| 25.08 |
Nanyang Technological University |
EMNLP 2025 Findings |
Improving Alignment in LVLMs with Debiased Self-Judgment |
LVLM Alignment&Debiased Self-Judgment&Hallucination Mitigation |
| 25.09 |
Zhejiang University |
arxiv |
Web Fraud Attacks Against LLM-Driven Multi-Agent Systems |
Multi-Agent Systems&Web Fraud Attack&Security |
| 25.09 |
Alibaba AAIG |
arxiv |
Oyster-I: Beyond Refusal — Constructive Safety Alignment for Responsible Language Models |
Constructive Safety Alignment&Safety Benchmark&Game-Theoretic Modeling |
| 25.09 |
Kennesaw State University |
IEEE Internet of Things Journal |
A Survey: Towards Privacy and Security in Mobile Large Language Models |
Mobile LLMs&Privacy&Security |
| 25.09 |
University of Wisconsin-Madison |
arxiv |
Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs |
Prompt Injection&Threat Model&LLM Security |
| 25.09 |
Tennessee Tech University, University of Nebraska at Omaha |
arxiv |
Safety and Security Analysis of Large Language Models: Risk Profile and Harm Potential |
Safety and Security&Risk Profiling&Adversarial Prompts |
| 25.09 |
Instituto de Pesquisas Eldorado, SRI International |
arxiv |
LLM in the Middle: A Systematic Review of Threats and Mitigations to Real-World LLM-based Systems |
LLM Security&Threat Modeling&Systematic Review |
| 25.09 |
Alibaba Group, Zhejiang University |
arxiv |
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework |
Sparse Autoencoder&Safety Interpretation&LLM Interpretability |
| 25.09 |
Argonne National Laboratory |
arxiv |
Evaluating the Safety and Skill Reasoning of Large Reasoning Models Under Compute Constraints |
Reasoning Models&Safety Evaluation&Compute Constraints |
| 25.09 |
Chinese Academy of Sciences, Wuhan University, Renmin University of China, Macquarie University, Griffith University, Xiaomi Inc. |
arxiv |
LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions |
LLM-based Agents&Hallucinations&Trustworthiness |
| 25.09 |
Binghamton University, Duke University, University of Alabama at Birmingham |
arxiv |
Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting |
LLM Benchmark&Cybersecurity&Blue Teaming |
| 25.09 |
Universitat Pompeu Fabra |
INLG 2025 |
Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs |
Lexical Simplification&Small LLMs&Safety&Knowledge Distillation |
| 25.10 |
City University of Hong Kong, Johns Hopkins University, George Mason University |
arxiv |
Towards Human-Centered RegTech: Unpacking Professionals' Strategies and Needs for Using LLMs Safely |
RegTech&Human-Centered NLP&Compliance Risk&LLM Safety |
| 25.10 |
University of Mannheim |
arxiv |
A Granular Study of Safety Pretraining under Model Abliteration |
Safety Pretraining&Model Abliteration&Refusal Robustness |
| 25.10 |
University of California, Riverside |
arxiv |
Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection? |
Jailbreak Detection&Internal Representations&Tensor Decomposition |
| 25.10 |
OpenAI & Anthropic & Google DeepMind & ETH Zürich & Northeastern University & HackAPrompt & AI Security Company |
arxiv |
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections |
Adaptive Attacks&LLM Jailbreak Defense&Prompt Injection Robustness |
| 25.10 |
Ruhr-Universität Bochum & Universität Bonn & Lamarr Institute for Machine Learning and Artificial Intelligence |
arxiv |
AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? |
AI Alignment&Failure Modes&Risk Analysis |
| 25.10 |
Harbin Institute of Technology (Shenzhen) & Pengcheng Lab |
arxiv |
GRIDAI: Generating and Repairing Intrusion Detection Rules via Collaboration among Multiple LLM-based Agents |
Intrusion Detection&Rule Generation&Multi-Agent LLM System |
| 25.10 |
University of Massachusetts Amherst & ELLIS Institute Tübingen |
arxiv |
Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies |
Multi-Agent Systems&Security Evaluation&Blackboard Architecture |
| 25.10 |
Shanghai Jiao Tong University |
NeurIPS |
Stop DDoS Attacking the Research Community with AI-Generated Survey Papers |
AI-Generated Surveys&Research Integrity&Scholarly Oversight |
| 25.10 |
LMU Munich & TUM & Oxford & HKU |
NeurIPS Workshop |
Deep Research Brings Deeper Harm |
Deep Research Agents&LLM Safety Alignment&Biosecurity Risks |
| 25.10 |
University of Connecticut, University of Alabama at Birmingham |
arxiv |
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models |
Prompt Security&Jailbreak Taxonomy&Defense Evaluation |
| 25.10 |
École Normale Supérieure (ENS) - Université Paris Sciences et Lettres (PSL), CNRS, Université Sorbonne Nouvelle |
arxiv |
On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text? |
AI-Generated Text Detection&Watermarking&Ethical AI Evaluation |
| 25.11 |
University of Pennsylvania |
arxiv |
Watermarking Discrete Diffusion Language Models |
Discrete Diffusion&Watermarking&Generative Model Security |
| 25.11 |
CEA Paris-Saclay |
arxiv |
Watermarking Large Language Models in Europe: Interpreting the AI Act in Light of Technology |
Watermarking&AI Act&Compliance Evaluation |
| 25.11 |
Shandong University |
AAAI 2026 |
HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection |
Human Language Preference Optimization&Machine-Revised Text Detection&Adversarial Multi-Task Detection |
| 25.11 |
Massachusetts Institute of Technology |
arxiv |
Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming |
Model Context Protocol&Agentic Red Teaming&Command and Control |
| 25.11 |
Zhejiang University |
AAAI 2026 |
Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging |
Model Merging Stealing&LLM IP Protection&Proactive Defense |
| 25.11 |
Beijing Institute of Technology |
arxiv |
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion |
Existential Risk&Prefix Completion&LLM Safety Evaluation |
| 25.11 |
ETH Zurich, Huawei Technologies Switzerland AG |
arxiv |
Can LLMs Make (Personalized) Access Control Decisions? |
Access Control&Personalization&LLM Security |
| 25.11 |
Purdue University, Perplexity AI |
arxiv |
BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents |
Prompt Injection&AI Browser Agents&Benchmarking |
| 25.11 |
University of Tennessee, Sungkyunkwan University |
arxiv |
Supporting Students in Navigating LLM-Generated Insecure Code |
Insecure Code Generation&Cybersecurity Education&Bifröst Framework |
| 25.11 |
Vanta, MintMCP, Darktrace |
arxiv |
Securing the Model Context Protocol (MCP): Risks, Controls, and Governance |
Model Context Protocol&AI Governance&Agent Security |
| 25.11 |
Renmin University of China, Ant Group |
AAAI 2026 |
Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems |
LLM Multi-Agent Systems&Security Risks&Adversarial Defense |
| 25.11 |
Independent Researcher |
arxiv |
Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs |
Healthcare AI Safety&Adversarial Dataset&Context-Aware Alignment |
| 25.11 |
NVIDIA, Lakera AI |
arxiv |
A Safety and Security Framework for Real-World Agentic Systems |
Agentic Systems&Safety and Security Framework&AI Risk Taxonomy |
| 25.12 |
Sun Yat-sen University |
arxiv |
An Empirical Study on the Security Vulnerabilities of GPTs |
GPT Security&Prompt Injection&Tool Misuse |
| 25.12 |
Hiroshima University, The University of Tokyo, National Institute of Informatics |
IEEE ISPA 2025 |
Decentralized Multi-Agent System with Trust-Aware Communication |
Decentralized Multi-Agent Systems&Blockchain Communication&Trust-Aware Protocols |
| 25.12 |
China Telecom (TeleAI), Sichuan University, Peking University |
arxiv |
Aetheria: A Multimodal Interpretable Content Safety Framework Based on Multi-Agent Debate and Collaboration |
Content Safety&Multi-Agent Systems&Interpretable AI&Multimodal Analysis |
| 25.12 |
DEXAI – Icaro Lab, Sapienza University of Rome, Sant’Anna School of Advanced Studies |
arxiv |
Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions |
Multi-Agent Safety&Systemic Risk&Institutional AI |
| 25.12 |
Shandong University, Nanjing University |
arxiv |
“MCP Does Not Stand for Misuse Cryptography Protocol”: Uncovering Cryptographic Misuse in Model Context Protocol at Scale |
Model Context Protocol (MCP)&Cryptographic Misuse Detection&Program Analysis |
| 25.12 |
University of Pennsylvania, Carnegie Mellon University, Columbia University |
arxiv |
MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking |
LLM Watermarking&Model Fine-Tuning&Open-Weight Models |
| 25.12 |
University of Maryland, Oracle Labs, Oracle Health AI |
ML4H 2025 |
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment |
Healthcare AI Assistants&Iterative Alignment&Safety vs Helpfulness |
| 25.12 |
Old Dominion University |
arxiv |
ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications |
Threat Modeling&Agentic AI&Vision-Language Models |
| 25.12 |
Singapore Management University |
arxiv |
SoK: A Comprehensive Causality Analysis Framework for Large Language Model Security |
Causality Analysis&LLM Security&Jailbreak Detection |
| 25.12 |
FAIR at Meta |
arxiv |
Self-Improving AI & Human Co-Improvement for Safer Co-Superintelligence |
Self-Improving AI&Human-AI Collaboration&Co-Superintelligence |
| 25.12 |
University of North Carolina Wilmington |
arxiv |
From Description to Score: Can LLMs Quantify Vulnerabilities? |
Vulnerability Scoring&CVSS&Large Language Models |
| 25.12 |
Beihang University |
arxiv |
SoK: Trust-Authorization Mismatch in LLM Agent Interactions |
LLM Agents&Trust and Authorization&Agent Security |
| 25.12 |
Tribhuvan University |
arxiv |
Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem |
Model Context Protocol&LLM Security&Agentic AI Safety |
| 25.12 |
CISPA Helmholtz Center for Information Security |
NDSS 2026 |
Chasing Shadows: Pitfalls in LLM Security Research |
LLM Security Research&Reproducibility Pitfalls&Evaluation Methodology |
| 25.12 |
Cisco AI Threat and Security Research |
arxiv |
Cisco Integrated AI Security and Safety Framework Report |
AI Security&Threat Taxonomy&Governance |
| 25.12 |
The Beacom College of Computer & Cyber Sciences, Dakota State University |
arxiv |
Quantifying Return on Security Controls in LLM Systems |
Risk Modeling&Security Controls&LLM Safety |
| 25.12 |
National University of Defense Technology |
arxiv |
Exploring the Security Threats of Retriever Backdoors in Retrieval-Augmented Code Generation |
Retriever Backdoors&RAG Security&Code Generation |
| 25.12 |
BITS Pilani |
arxiv |
Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection |
Multi-Vulnerability&LLM Benchmarking&Code Security |
| 26.01 |
Zhejiang University |
arxiv |
RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation |
Domain-Specific Safety&Harmful Prompt Synthesis&Knowledge Graph |
| 26.01 |
Stanford University |
arxiv |
Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness |
Responsible AI&Agent Governance&Transparency |
| 26.01 |
Chinese Academy of Sciences |
arxiv |
Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs |
Secure Scripting&PowerShell&Lightweight Models |
| 26.01 |
Unknown |
arxiv |
SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations |
Prompt Injection&Constitutional AI&Cybersecurity |
| 26.01 |
Xi’an Jiaotong University |
arxiv |
Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models |
Emoticon Confusion&LLM Safety&Robustness |
| 26.01 |
Zhejiang University |
arxiv |
ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models |
Model Fingerprinting&Unlearning&Copyright |
| 26.01 |
Zhejiang University |
arxiv |
DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection |
Model Fingerprinting&IP Protection&Backdoor |
| 26.01 |
Peking University |
arxiv |
ToolSafe: Enhancing Tool Invocation Safety of LLM-based Agents via Proactive Step-level Guardrail and Feedback |
Tool Safety&Agent Guardrails&Prompt Injection |
| 26.01 |
Nanyang Technological University |
arxiv |
Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale |
Agent Security&Supply Chain Risk&Vulnerability Analysis |
| 26.01 |
Ben-Gurion University of the Negev |
arxiv |
AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior |
Agent Governance&Access Control&Execution Flow |
| 26.01 |
Fudan University |
arxiv |
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 |
Safety Evaluation&Multimodal LLM&Adversarial Testing |
| 26.01 |
Fujitsu Research of Europe |
arxiv |
AgenTRIM: Tool Risk Mitigation for Agentic AI |
Agentic AI&Tool Security&Least Privilege |
| 26.01 |
Nanjing University of Aeronautics and Astronautics |
arxiv |
SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment |
LLM Safety&Jailbreak Defense&Adaptive Alignment |
| 26.01 |
Unknown |
arxiv |
Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents |
Model Context Protocol&Prompt Injection&Agent Security |
| 26.01 |
School of Automation, Northwestern Polytechnical University, Xi'an, China |
arxiv |
FNF: Functional Network Fingerprint for Large Language Models |
Model Fingerprinting&Intellectual Property&Functional Networks |
| 26.01 |
University of Science and Technology of China |
arxiv |
Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures |
Misalignment&Persona&Safety |
| 26.02 |
eBay Inc |
arxiv |
ZERO-TRUST RUNTIME VERIFICATION FOR AGENTIC PAYMENT PROTOCOLS: MITIGATING REPLAY AND CONTEXT-BINDING FAILURES IN AP2 |
Agentic Payments&Runtime Verification&Replay Attacks |
| 26.02 |
Huazhong University of Science and Technology |
arxiv |
Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models |
Vulnerability Reasoning&Benchmarking&RLVR |
| 26.02 |
Technical University of Darmstadt |
arxiv |
GoodVibe: Security-by-Vibe for LLM-Based Code Generation |
Code Security&Neuron-Level Tuning&Code Generation |
| 26.02 |
Canadian Institute for Cybersecurity (CIC), University of New Brunswick, New Brunswick, Canada |
arxiv |
Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP |
AI-Agent Protocols&Threat Modeling&MCP |
| 26.02 |
DiSTA, University of Insubria, Italy |
arxiv |
LoRA-based Parameter-Efficient LLMs for Continuous Learning in Edge-based Malware Detection |
Malware Detection&Edge Computing&Continuous Learning |
| 26.02 |
Unknown |
arxiv |
Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy |
Cybersecurity&Agentic AI&Governable Autonomy |
| 26.02 |
University of Luxembourg, Interdisciplinary Center for Security, Reliability, and Trust (SnT), Trustworthy Software Engineering Group (TruX), Luxembourg |
arxiv |
Assessing Spear-Phishing Website Generation in Large Language Model Coding Agents |
Spear-Phishing&Coding Agents&Cyber Misuse |
| 26.02 |
ShanghaiTech University, Shanghai, China |
arxiv |
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) |
Agent Safety Audit&Trajectory Analysis&OpenClaw |
| 26.02 |
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China |
arxiv |
Intellicise Wireless Networks Meet Agentic AI: A Security and Privacy Perspective |
Agentic AI&Wireless Security&Privacy |
| 26.02 |
Applied Machine Learning Research |
arxiv |
Intent Laundering: AI Safety Datasets Are Not What They Seem |
Safety Datasets&Intent Laundering&Evaluation Robustness |
| 26.02 |
Fraunhofer ISST |
arxiv |
DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing |
Data Sharing&Policy Enforcement&LLM Spokesperson |
| 26.02 |
Department of Computer Science, National University of Singapore |
arxiv |
LLM-enabled Applications Require System-Level Threat Monitoring |
threat monitoring&incident response&LLM systems |
| 26.02 |
University of Technology Sydney |
arxiv |
SoK: Agentic Skills — Beyond Tool Use in LLM Agents |
agentic skills&supply chain&survey |
| 26.02 |
Amazon Web Services |
arxiv |
Manifold of Failure: Behavioral Attraction Basins in Language Models |
failure manifold&MAP-Elites&alignment deviation |
| 26.02 |
National University of Singapore |
arxiv |
IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation |
auditing&verifiable computation&API integrity |
| 26.03 |
Shanghai Innovation Institute |
arxiv |
From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions |
agentic AI&web security&survey |
| 26.03 |
Sahara AI |
arxiv |
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It |
agent guardrails&TEE attestation&verifiable safety |
| 26.03 |
Beihang University |
arxiv |
Evolving Deception: When Agents Evolve, Deception Wins |
deceptive agents&self-evolution&alignment drift |
| 26.03 |
Communication and Distributed Systems, RWTH Aachen University, Germany |
arxiv |
Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers |
artifact evaluation&reproducibility&cybersecurity |
| 26.03 |
Shandong University, Qingdao, Shandong, China |
arxiv |
Give Them an Inch and They Will Take a Mile: Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems |
MCP security&caller identity&authorization |
| 26.03 |
Crew Scaler |
arxiv |
Security Considerations for Multi-agent Systems |
multi-agent systems&security frameworks&threat taxonomy |
| 26.03 |
Institute of Information Engineering, Chinese Academy of Sciences |
arxiv |
ProvAgent: Threat Detection Based on Identity-Behavior Binding and Multi-Agent Collaborative Attack Investigation |
threat detection&provenance graphs&multi-agent investigation |
| 26.03 |
Shandong University |
arxiv |
Don’t Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw |
code agents&OpenClaw&human-in-the-loop defense |
| 26.03 |
School of Interactive Computing, Georgia Institute of Technology |
arxiv |
Safe and Scalable Web Agent Learning via Recreated Websites |
web agents&synthetic environments&self-evolution |
| 26.03 |
Ant Group & Tsinghua University, China |
arxiv |
Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats |
autonomous agents&OpenClaw&lifecycle security |
| 26.03 |
College of Intelligent Science and Engineering, Jinan University |
arxiv |
Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework |
agent memory&governance&semantic drift |
| 26.03 |
State Key Laboratory of Complex & Critical Software Environment, Beihang University |
arxiv |
Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw |
Autonomous Agents&Threat Modeling&Defense Architecture |
| 26.03 |
Unknown |
arxiv |
Evaluation of Audio Language Models for Fairness, Safety, and Security |
Audio LLMs&Safety Evaluation&Structural Taxonomy |
| 26.03 |
Centre for Philosophy and AI Research, Friedrich-Alexander-University Erlangen-Nuremberg |
arxiv |
Questionnaire Responses Do Not Capture the Safety of AI Agents |
AI Agents&Safety Assessment&Construct Validity |
| 26.03 |
University of Connecticut |
arxiv |
Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning |
Code LLMs&Data Poisoning&Vulnerability Scanning |
| 26.03 |
Dartmouth College |
arxiv |
Retrieval-Augmented LLMs for Security Incident Analysis |
Security Incident Analysis&RAG&MITRE ATTACK |
| 26.03 |
Purdue University |
arxiv |
Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety |
Agent Safety&Benchmark Auditing&Tool Calls |
| 26.03 |
University of Electronic Science and Technology of China, Chengdu, China |
arxiv |
Functional Subspace Watermarking for Large Language Models |
Model Watermarking&Functional Subspace&Ownership Verification |
| 26.03 |
UC Santa Cruz |
arxiv |
A Framework for Formalizing LLM Agent Security |
Agent security&Contextual security&Authorization |
| 26.03 |
City University of Hong Kong |
arxiv |
PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management |
Mobile power management&LLM agents&Personalization |
| 26.03 |
Shanghai Jiao Tong University, Shanghai, China |
arxiv |
Trojan’s Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance |
OpenClaw&Guidance injection&Autonomous coding agents |
| 26.03 |
BigCommerce |
arxiv |
An Agentic Multi-Agent Architecture for Cybersecurity Risk Management |
Cybersecurity risk&Multi-agent systems&Risk assessment |
| 26.03 |
Luleå University of Technology, Sweden |
arxiv |
Agentproof: Static Verification of Agent Workflow Graphs |
Static verification&Workflow graphs&Temporal safety |
| 26.03 |
Department of Computing Science, Umeå University, Umeå, Sweden |
arxiv |
Memory poisoning and secure multi-agent systems |
Memory poisoning&Multi-agent systems&Cryptographic mitigation |
| 26.03 |
University of Washington, USA |
arxiv |
AC4A: Access Control for Agents |
Access control&Agent permissions&API security |
| 26.03 |
Sun Yat-sen University |
arxiv |
SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration |
Skill marketplaces&Security auditing&Multi-agent collaboration |
| 26.03 |
Department of Computer Science, New York Institute of Technology, Vancouver, BC, Canada |
arxiv |
Auditing MCP Servers for Over-Privileged Tool Capabilities |
MCP servers&Capability auditing&Tool privileges |
| 26.03 |
New York Institute of Technology, Vancouver, BC, Canada |
arxiv |
Are AI-assisted Development Tools Immune to Prompt Injection? |
Prompt injection&MCP clients&Development tools |
| 26.03 |
Department of Computer Science, New York Institute of Technology, Canada |
arxiv |
Model Context Protocol Threat Modeling and Analyzing Vulnerabilities to Prompt Injection with Tool Poisoning |
MCP security&Tool poisoning&Threat modeling |
| 26.03 |
Beijing University of Posts and Telecommunications |
arxiv |
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers |
OpenClaw security&Watcher middleware&Runtime protection |
| 26.03 |
Department of Computer Science, Institute of Artificial Intelligence, University of Central Florida |
Transactions on Machine Learning Research 2026 |
AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective |
Foundation model security&Threat taxonomy&Cross-modal defense |
| 26.03 |
CSIRO Data61 |
arxiv |
Clawed and Dangerous: Can We Trust Open Agentic Systems? |
Agent security&Open agentic systems&Software engineering |
| 26.03 |
Rensselaer Polytechnic Institute |
arxiv |
SAFETYDRIFT: Predicting When AI Agents Cross the Line Before They Actually Do |
Agent safety&Trajectory prediction&Runtime monitoring |
| 26.03 |
SUCCESS Lab, Texas A&M University |
arxiv |
A Systematic Taxonomy of Security Vulnerabilities in the OpenClaw AI Agent Framework |
Agent vulnerabilities&Security taxonomy&OpenClaw |
| 26.03 |
The Hong Kong University of Science and Technology (Guangzhou), China |
arxiv |
“What Did It Actually Do?”: Understanding Risk Awareness and Traceability for Computer-Use Agents |
Risk awareness&Agent traceability&Computer-use agents |
| 26.03 |
Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, USA |
arxiv |
Safeguarding LLMs Against Misuse and AI-Driven Malware Using Steganographic Canaries |
Malware defense&Steganographic canaries&LLM misuse |
| 26.03 |
Singapore Management University, Singapore |
arxiv |
SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants |
Multi-agent assistants&Runtime enforcement&Skill security |
| 26.03 |
Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy |
arxiv |
Security in LLM-as-a-Judge: A Comprehensive SoK |
LLM-as-a-Judge&Security survey&Evaluation robustness |
| 26.04 |
Mattersec Labs |
arxiv |
SecLens: Role-Specific Evaluation of LLMs for Security Vulnerability Detection |
Vulnerability detection&Stakeholder evaluation&Role-specific scoring |
| 26.04 |
University of Melbourne |
arxiv |
Combating Data Laundering in LLM Training |
Data laundering&Training data detection&Synthesis data reversion |
| 26.04 |
Fudan University, China |
arxiv |
From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers |
MCP security&Malicious servers&Behavioral deviation detection |
| 26.04 |
University of California, Los Angeles |
EACL 2026 |
Open-Domain Safety Policy Construction |
Safety policy construction&Agentic research&Content moderation |
| 26.04 |
Constellation |
arxiv |
An Independent Safety Evaluation of Kimi K2.5 |
Bias Reduction&Safety Evaluation&Open-Weight Models |
| 26.04 |
UC Santa Cruz |
arxiv |
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw |
Personal AI&Threat Taxonomy&Safety Evaluation |
| 26.04 |
KASTEL Security Research Labs, Karlsruhe Institute of Technology, Karlsruhe, Germany |
arxiv |
Foundations for Agentic AI Investigations from the Forensic Analysis of OpenClaw |
Threat Taxonomy&AI Forensics&OpenClaw |
| 26.04 |
University of the Cumberlands |
arxiv |
A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms |
Threat Taxonomy&Benchmarking&MCP Security |
| 26.04 |
Computer Science & Engineering, Mississippi State University |
arxiv |
Attribution-Driven Explainable Intrusion Detection with Encoder-Based Large Language Models |
Security Analysis&Intrusion Detection&Explainability |
| 26.04 |
Research Institute of Trustworthy Autonomous Systems |
arxiv |
ClawLess: A Security Model of AI Agents |
ClawLess&AI Agents&Security Model |
| 26.04 |
Department of Earth Science and Engineering, Imperial College London, London, United Kingdom |
arxiv |
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills |
Agent Skills&OpenClaw&Benchmarking |
| 26.04 |
The Pennsylvania State University |
arxiv |
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation |
Tool Poisoning&Trusted Description Generation&LLM Applications |
| 26.04 |
The Hong Kong Polytechnic University |
arxiv |
Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions |
Threat Taxonomy&Benchmarking&RAG Security |
| 26.04 |
Arizona State University |
arxiv |
Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit |
Large Language Models&Security Operations Center&LLM Tools |
| 26.04 |
University of Delaware |
arxiv |
Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities |
Secure Programming Education&Personalization&LLM-Injected Vulnerabilities |
| 26.04 |
DEXAI – Icaro Lab |
arxiv |
Agentic Microphysics: A Manifesto for Generative AI Safety |
Agentic Microphysics&Generative AI Safety&Manifesto |
| 26.04 |
Chongqing University |
ACL 2026 |
DEEPGUARD: Secure Code Generation via Multi-Layer Semantic Aggregation |
DEEPGUARD&Secure Code Generation&Multi-Layer Semantic Aggregation |
| 26.04 |
New York University |
FORC 2026 |
Can We Watermark Low-Entropy LLM Outputs? |
Watermarking&Low-Entropy LLM Outputs |
| 26.04 |
MemTensor, China |
arxiv |
A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty |
Agent Memory&Memory Security&Governance |
| 26.04 |
Tandon School of Engineering, New York University |
arxiv |
Surgical Repair of Insecure Code Generation in LLMs |
Code Security&Model Repair&Mechanistic Diagnosis |
| 26.04 |
ETH Zurich |
arxiv |
Using large language models for embodied planning introduces systematic safety risks |
Embodied Planning&Robotics Safety&LLM Agents |
| 26.04 |
BlueFocus Communication Group |
arxiv |
Owner-Harm: A Missing Threat Model for AI Agent Safety |
Agent Safety&Threat Model&Owner Harm |
| 26.04 |
University of Oslo |
arxiv |
Towards Agentic Investigation of Security Alerts |
Security Alerts&Agentic Investigation&Cybersecurity |
| 26.05 |
Fudan University |
arxiv |
Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses |
Embodied AI&Safety Survey&Attacks |
| 26.05 |
Shanghai Jiao Tong University |
arxiv |
ClawGuard: Out-of-Band Detection of LLM Agent Workflow Hijacking via EM Side Channel |
Workflow Hijacking&Side Channel&Agent Security |