Security

Different from the main README🕵️

  • Within this subtopic, we keep updating with the latest articles, helping researchers in this area quickly catch up on recent trends.
  • In addition to the most recent updates, we add keywords to each subtopic so you can find content of interest more quickly.
  • Within each subtopic, we also feature profiles of scholars in the field whom we admire and endorse. Their work is often high-quality and forward-looking!

📑 Papers

Date Institute Publication Paper Keywords
20.10 Facebook AI Research arxiv Recipes for Safety in Open-domain Chatbots Toxic Behavior&Open-domain
22.02 DeepMind EMNLP2022 Red Teaming Language Models with Language Models Red Teaming&Harm Test
22.03 OpenAI NeurIPS2022 Training language models to follow instructions with human feedback InstructGPT&RLHF&Harmless
22.04 Anthropic arxiv Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Helpful&Harmless
22.05 UCSD EMNLP2022 An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models Privacy Risks&Memorization
22.09 Anthropic arxiv Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Red Teaming&Harmless&Helpful
22.12 Anthropic arxiv Constitutional AI: Harmlessness from AI Feedback Harmless&Self-improvement&RLAIF
23.07 UC Berkeley NeurIPS2023 Jailbroken: How Does LLM Safety Training Fail? Jailbreak&Competing Objectives&Mismatched Generalization
23.08 The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong arxiv GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs Via Cipher Safety Alignment&Adversarial Attack
23.08 University College London, Tilburg University arxiv Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities Security&AI Alignment
23.09 Peking University arxiv RAIN: Your Language Models Can Align Themselves without Finetuning Self-boosting&Rewind Mechanisms
23.10 Princeton University, Virginia Tech, IBM Research, Stanford University arxiv Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! Fine-tuning&Safety Risks&Adversarial Training
23.10 UC Riverside arXiv Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks Adversarial Attacks&Vulnerabilities&Model Security
23.10 Rice University NAACL2024(findings) Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models Key Prompt Protection&Large Language Models&Unauthorized Access Prevention
23.11 KAIST AI arxiv HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning Hate Speech&Detection
23.11 CMU AACL2023 (ART of Safety workshop) Measuring Adversarial Datasets Adversarial Robustness&AI Safety&Adversarial Datasets
23.11 UIUC arxiv Removing RLHF Protections in GPT-4 via Fine-Tuning Remove Protection&Fine-Tuning
23.11 IT University of Copenhagen, University of Washington arxiv Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild Red Teaming
23.11 Fudan University, Shanghai AI Lab arxiv Fake Alignment: Are LLMs Really Aligned Well? Alignment Failure&Safety Evaluation
23.11 University of Southern California arxiv SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data RLHF&Safety
23.11 Google Research arxiv AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications Adversarial Testing&AI-Assisted Red Teaming&Application Safety
23.11 Tencent AI Lab arxiv Adversarial Preference Optimization Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction
23.11 Docta.ai arxiv Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models Data Credibility&Safety alignment
23.11 CIIRC CTU in Prague arxiv A Security Risk Taxonomy for Large Language Models Security risks&Taxonomy&Prompt-based attacks
23.11 Meta, University of Illinois Urbana-Champaign NAACL2024 MART: Improving LLM Safety with Multi-round Automatic Red-Teaming Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing
23.11 The Ohio State University, University of California, Davis NAACL2024 How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities Open-Source LLMs&Malicious Demonstrations&Trustworthiness
23.12 Drexel University arXiv A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly Security&Privacy&Attacks
23.12 Tenyx arXiv Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation Geometric Interpretation&Intrinsic Dimension&Toxicity Detection
23.12 Independent (Now at Google DeepMind) arXiv Scaling Laws for Adversarial Attacks on Language Model Activations Adversarial Attacks&Language Model Activations&Scaling Laws
23.12 University of Liechtenstein, University of Duesseldorf arxiv Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits Negotiation&Reasoning&Prompt Hacking
23.12 University of Wisconsin Madison, University of Michigan Ann Arbor, ASU, Washington University arXiv Exploring the Limits of ChatGPT in Software Security Applications Software Security
23.12 GenAI at Meta arxiv Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations Human-AI Conversation&Safety Risk taxonomy
23.12 University of California Riverside, Microsoft arxiv Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack Safety Alignment&Summarization&Vulnerability
23.12 MIT, Harvard NeurIPS2023 (Workshop) Forbidden Facts: An Investigation of Competing Objectives in Llama-2 Competing Objectives&Forbidden Fact Task&Model Decomposition
23.12 University of Science and Technology of China arxiv Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models Text Protection&Silent Guardian
23.12 OpenAI OpenAI Practices for Governing Agentic AI Systems Agentic AI Systems&LM-based Agent
23.12 University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University arxiv Learning and Forgetting Unsafe Examples in Large Language Models Safety Issues&ForgetFilter Algorithm&Unsafe Content
23.12 Tencent AI Lab, The Chinese University of Hong Kong arxiv Aligning Language Models with Judgments Judgment Alignment&Contrastive Unlikelihood Training
24.01 Delft University of Technology arxiv Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks Red Teaming&Hallucinations&Mathematics Tasks
24.01 Apart Research, University of Edinburgh, Imperial College London, University of Oxford arxiv Large Language Models Relearn Removed Concepts Neuroplasticity&Concept Redistribution
24.01 Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University arxiv Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security Intelligent Personal Assistant&LLM Agent&Security and Privacy
24.01 Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group arxiv Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems Safety&Risk Taxonomy&Mitigation Strategies
24.01 Google Research arxiv Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Interpretability
24.01 Ben-Gurion University of the Negev Israel arxiv GPT in Sheep’s Clothing: The Risk of Customized GPTs GPTs&Cybersecurity&ChatGPT
24.01 Shanghai Jiao Tong University arxiv R-Judge: Benchmarking Safety Risk Awareness for LLM Agents LLM Agents&Safety Risk Awareness&Benchmark
24.01 Ant Group arxiv A Fast, Performant, Secure Distributed Training Framework for LLM Distributed LLM&Security
24.01 Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China arxiv PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety Multi-agent Systems&Agent Psychology&Safety
24.01 Rochester Institute of Technology arxiv Mitigating Security Threats in LLMs Security Threats&Prompt Injection&Jailbreaking
24.01 Johns Hopkins University, University of Pennsylvania, Ohio State University arxiv The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts Multilingualism&Safety&Resource Disparity
24.01 University of Florida arxiv Adaptive Text Watermark for Large Language Models Text Watermarking&Robustness&Security
24.01 The Hebrew University arXiv Tradeoffs Between Alignment and Helpfulness in Language Models Language Model Alignment&AI Safety&Representation Engineering
24.01 Google Research, Anthropic arxiv Gradient-Based Language Model Red Teaming Red Teaming&Safety&Prompt Learning
24.01 National University of Singapore, Pennsylvania State University arxiv Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code Watermarking&Error Correction Code&AI Ethics
24.01 Tsinghua University, University of California Los Angeles, WeChat AI Tencent Inc. arxiv Prompt-Driven LLM Safeguarding via Directed Representation Optimization Safety Prompts&Representation Optimization
24.02 Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research arxiv Adaptive Primal-Dual Method for Safe Reinforcement Learning Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates
24.02 Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute arxiv No More Trade-Offs: GPT and Fully Informative Privacy Policies ChatGPT&Privacy Policies&Legal Requirements
24.02 Florida International University arxiv Security and Privacy Challenges of Large Language Models: A Survey Security&Privacy Challenges&Survey
24.02 Rutgers University, University of California, Santa Barbara, NEC Labs America arxiv TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution LLM-based Agents&Safety&Trustworthiness
24.02 University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research arxiv Shadowcast: Stealthy Data Poisoning Attacks against VLMs Vision-Language Models&Data Poisoning&Security
24.02 Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong arxiv SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy
24.02 Fudan University arxiv ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages Tool Learning&Large Language Models (LLMs)&Safety Issues&ToolSword
24.02 Paul G. Allen School of Computer Science & Engineering, University of Washington arxiv SPML: A DSL for Defending Language Models Against Prompt Attacks Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML)
24.02 Tsinghua University arxiv ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors Safety Detectors&Customizable&Explainable
24.02 Dalhousie University arxiv Immunization Against Harmful Fine-tuning Attacks Fine-tuning Attacks&Immunization
24.02 Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group arxiv SoFA: Shielded On-the-fly Alignment via Priority Rule Following Priority Rule Following&Alignment
24.02 Universidade Federal de Santa Catarina arxiv A Survey of Large Language Models in Cybersecurity Cybersecurity&Vulnerability Assessment
24.02 Zhejiang University arxiv PRSA: Prompt Reverse Stealing Attacks against Large Language Models Prompt Reverse Stealing Attacks&Security
24.02 Shanghai Artificial Intelligence Laboratory NAACL2024 Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey Large Language Models&Conversation Safety&Survey
24.03 Tulane University arxiv Enhancing LLM Safety via Constrained Direct Preference Optimization Reinforcement Learning&Human Feedback&Safety Constraints
24.03 University of Illinois Urbana-Champaign arxiv INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents Tool Integration&Security&Indirect Prompt Injection
24.03 Harvard University arxiv Towards Safe and Aligned Large Language Models for Medicine Medical Safety&Alignment&Ethical Principles
24.03 Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab arxiv Aligners: Decoupling LLMs and Alignment Alignment&Synthetic Data
24.03 MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, Eleuther AI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC arxiv A Safe Harbor for AI Evaluation and Red Teaming AI Evaluation&Red Teaming&Safe Harbor
24.03 University of Southern California arxiv Logits of API-Protected LLMs Leak Proprietary Information API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection
24.03 University of Bristol arxiv Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention Safety&Prompt Engineering
24.03 Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia arxiv Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models Safety&Guidelines&Alignment
24.03 Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology arxiv OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety Chinese LLMs&Benchmarking&Safety
24.03 Center for Cybersecurity Systems and Networks, AIShield Bosch Global Software Technologies Bengaluru India arxiv Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal LLM Security&Threat modeling&Risk Assessment
24.03 Queen’s University Belfast arxiv AI Safety: Necessary but insufficient and possibly problematic AI Safety&Transparency&Structural Harm
24.04 Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology arxiv Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs Dialectical Alignment&3H Principle&Security Threats
24.04 LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI arxiv Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models Red Teaming&Safety
24.04 University of California, Santa Barbara, Meta AI arxiv Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models Safety&Helpfulness&Controllability
24.04 School of Information and Software Engineering, University of Electronic Science and Technology of China arxiv Exploring Backdoor Vulnerabilities of Chat Models Backdoor Attacks&Chat Models&Security
24.04 Enkrypt AI arxiv Increased LLM Vulnerabilities from Fine-tuning and Quantization Fine-tuning&Quantization&LLM Vulnerabilities
24.04 Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory arxiv Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security Multimodal Large Language Models&Security Vulnerabilities&Image Inputs
24.04 University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI arxiv CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge AI-Assisted Red-Teaming&Multicultural Knowledge
24.04 Nanjing University DLSP 2024 Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts Jailbreak&Subtoxic Questions&GAC Model
24.04 Innodata arxiv Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations Evaluation&Safety
24.04 University of Cambridge, New York University, ETH Zurich arxiv Foundational Challenges in Assuring Alignment and Safety of Large Language Models Alignment&Safety
24.04 Zhejiang University arxiv TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment Intellectual Property Protection&Edge-deployed Transformer Model
24.04 Harvard University arxiv More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness Reinforcement Learning from Human Feedback&Trustworthiness
24.04 CSIRO’s Data61 ACM International Conference on AI-powered Software An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping AI Safety&Evaluation Framework&AI Lifecycle Mapping
24.05 University of Maryland arxiv Constrained Decoding for Secure Code Generation Code Generation&Code LLM&Secure Code Generation&AI Safety
24.05 Huazhong University of Science and Technology arxiv Large Language Models for Cyber Security: A Systematic Literature Review Cybersecurity&Systematic Review
24.05 CSAIL and CBMM, MIT arxiv SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data SecureLLM&Compositionality
24.05 Carnegie Mellon University arxiv Human–AI Safety: A Descendant of Generative AI and Control Systems Safety Human–AI Safety&Generative AI
24.05 University of York arxiv Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding Safe Reinforcement Learning&Black-Box Environments&Adaptive Shielding
24.05 Princeton University arxiv AI Risk Management Should Incorporate Both Safety and Security AI Safety&AI Security&Risk Management
24.05 University of Oslo arxiv AI Safety: A Climb to Armageddon? AI Safety&Existential Risk&AI Governance
24.06 Zscaler, Inc. arxiv Exploring Vulnerabilities and Protections in Large Language Models: A Survey Prompt Hacking&Adversarial Attacks&Survey
24.06 Texas A & M University - San Antonio arxiv Transforming Computer Security and Public Trust Through the Exploration of Fine-Tuning Large Language Models Fine-Tuning&Cyber Security
24.06 Alibaba Group arxiv How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States LLM Safety&Alignment&Jailbreak
24.06 UC Davis arxiv Security of AI Agents Security&AI Agents&Vulnerabilities
24.06 University of Connecticut USENIX Security ‘24 An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection Backdoor Attack&Code Completion Models&Vulnerability Detection
24.06 University of California, Irvine arxiv TorchOpera: A Compound AI System for LLM Safety TorchOpera&LLM Safety&Compound AI System
24.06 NVIDIA Corporation arxiv garak: A Framework for Security Probing Large Language Models garak&Security Probing
24.06 Carnegie Mellon University arxiv Current State of LLM Risks and AI Guardrails LLM Risks&AI Guardrails
24.06 Johns Hopkins University arxiv Every Language Counts: Learn and Unlearn in Multilingual LLMs Multilingual LLMs&Fake Information&Unlearning
24.06 Tsinghua University arxiv Finding Safety Neurons in Large Language Models Safety Neurons&Mechanistic Interpretability&AI Safety
24.06 Center for AI Safety and Governance, Institute for AI, Peking University arxiv SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset Safety Alignment&Text2Video Generation
24.06 Samsung R&D Institute UK, KAUST, University of Oxford arxiv Model Merging and Safety Alignment: One Bad Model Spoils the Bunch Model Merging&Safety Alignment
24.06 Hofstra University arxiv Analyzing Multi-Head Attention on Trojan BERT Models Trojan Attack&BERT Models&Multi-Head Attention
24.06 Fudan University arxiv SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance Safety Alignment&Jailbreak Attacks&Response Disparity
24.06 Stony Brook University NAACL 2024 Workshop Automated Adversarial Discovery for Safety Classifiers Safety Classifiers&Adversarial Attacks&Toxicity
24.07 University of Utah arxiv Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression Model Compression&Safety Evaluation
24.07 University of Alberta arxiv Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture Multilingual Blending&LLM Safety Alignment&Language Mixture
24.07 Singapore National Eye Centre arxiv A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models – Safety, Consensus, Objectivity, Reproducibility and Explainability Evaluation Framework
24.07 Microsoft arxiv SLIP: Securing LLM’s IP Using Weights Decomposition Hybrid Inference&Model Security&Weights Decomposition
24.07 Microsoft arxiv Phi-3 Safety Post-Training: Aligning Language Models with a “Break-Fix” Cycle Phi-3&Safety Post-Training
24.07 Tsinghua University arxiv Course-Correction: Safety Alignment Using Synthetic Preferences Course-Correction&Safety Alignment&Synthetic Preferences
24.07 Northwestern University arxiv From Sands to Mansions: Enabling Automatic Full-Life-Cycle Cyberattack Construction with LLM Cyberattack Construction&Full-Life-Cycle
24.07 Singapore University of Technology and Design arxiv AI Safety in Generative AI Large Language Models: A Survey Generative AI&AI Safety
24.07 Lehigh University arxiv Blockchain for Large Language Model Security and Safety: A Holistic Survey Blockchain&Security&Safety
24.08 University of Texas at Austin arxiv HIDE AND SEEK: Fingerprinting Large Language Models with Evolutionary Learning Model Fingerprinting&In-context Learning
24.08 Technical University of Munich arxiv Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study Secure Code Assessment&Vulnerability Detection
24.08 Offenburg University of Applied Sciences arxiv "You still have to study" - On the Security of LLM generated code Code Security&Prompting Techniques
24.08 University of Connecticut arxiv Clip2Safety: A Vision Language Model for Interpretable and Fine-Grained Detection of Safety Compliance in Diverse Workplaces Vision Language Model&Safety Compliance&Personal Protective Equipment Detection
24.08 Pabna University of Science and Technology arxiv Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey Privacy&Bias&Interpretability
24.08 Quinnipiac University arxiv Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks Generative AI&Cybersecurity&Cyber Attacks
24.08 Nanyang Technological University arxiv Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations AI Safety&Trustworthy&Responsible
24.08 King Abdullah University of Science and Technology arxiv Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models Safety&Helpfulness&LLM Alignment
24.08 University of Calgary arxiv Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems Trustworthy AI&Algorithmic Bias&Responsible AI
24.08 University of Oxford arxiv AI Security Audits: Challenges and Innovations in Assessing Large Language Models AI Security Audits&Vulnerability Assessment&AI Ethics
24.08 University of Science and Technology of China arxiv Safety Layers of Aligned Large Language Models: The Key to LLM Security Aligned LLM&Safety Layers&Security Degradation
24.09 University of Texas at San Antonio arxiv Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs Source Code Security&LLMs&Reinforcement Learning
24.09 The Hong Kong Polytechnic University arxiv Alignment-Aware Model Extraction Attacks on Large Language Models Model Extraction Attacks&LLM Alignment&Watermark Resistance
24.09 University of Oxford, Redwood Research arxiv Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols AI Control&Safety Protocols&Game Theory
24.09 University of Galway ECAI AIEB Workshop Ethical AI Governance: Methods for Evaluating Trustworthy AI Trustworthy AI&Ethics&AI Evaluation
24.09 University of Texas at San Antonio arxiv AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing Multi-Agent Systems&Code Security&Fuzz Testing&Static Analysis
24.09 Tsinghua University arxiv Language Models Learn to Mislead Humans via RLHF Reinforcement Learning from Human Feedback (RLHF)&U-SOPHISTRY&Misleading AI
24.09 Stevens Institute of Technology arxiv Measuring Copyright Risks of Large Language Model via Partial Information Probing Copyright&Partial Information Probing
24.09 IBM Research arxiv Attack Atlas: A Practitioner’s Perspective on Challenges and Pitfalls in Red Teaming GenAI Red Teaming&LLM Security&Adversarial Attacks
24.09 Pengcheng Laboratory arxiv Multi-Designated Detector Watermarking for Language Models Watermarking&Claimability&Multi-designated Verifier Signature
24.09 ETH Zurich arxiv An Adversarial Perspective on Machine Unlearning for AI Safety Machine Unlearning&Adversarial Attacks&Unlearning Robustness
24.10 Google DeepMind arxiv A Watermark for Black-Box Language Models Watermarking&Black-Box Models&LLM Detection
24.10 Mohamed Bin Zayed University of Artificial Intelligence arxiv Optimizing Adaptive Attacks Against Content Watermarks for Language Models Watermarking&Adaptive Attacks&LLM Security
24.10 Rice University, Rutgers University arxiv Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion Taylor Expansion&Model Security
24.10 PeopleTec arxiv Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders Cybersecurity&Hallucinations
24.10 Fondazione Bruno Kessler, Université Côte d’Azur EMNLP 2024 Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering Counterspeech&Safety Guardrails
24.10 University of California, Davis, AWS AI Labs arxiv Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models Safety alignment&Vision-Language models&Cross-modality representation manipulation
24.10 North Carolina State University arxiv Superficial Safety Alignment Hypothesis: The Need for Efficient and Robust Safety Mechanisms in LLMs Superficial safety alignment&Safety mechanisms&Safety-critical components
24.10 Shanghai Jiao Tong University, Chinese University of Hong Kong (Shenzhen), Tsinghua University arxiv Archilles’ Heel in Semi-Open LLMs: Hiding Bottom Against Recovery Attacks Semi-open LLMs&Recovery attacks&Model resilience
24.10 University of Tulsa arxiv Weak-to-Strong Generalization beyond Accuracy: A Pilot Study in Safety, Toxicity, and Legal Reasoning Weak-to-Strong Generalization&Safety&Toxicity
24.10 Aalborg University arxiv Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis Language confusion&Multilingual LLMs&Security vulnerabilities
24.10 Carnegie Mellon University arxiv Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents LLM Safety&Browser Agents&Red Teaming
24.10 Palisade Research arxiv LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild LLM Agents&Honeypots&Cybersecurity
24.10 University of Pittsburgh arxiv Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents Embodied Agents&Multimodal Safety&Active Learning
24.10 CSIRO’s Data61 arxiv From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting Secure Code Generation&Encouragement Prompting
24.10 AppCubic arxiv Jailbreaking and Mitigation of Vulnerabilities in Large Language Models Prompt Injection&Jailbreaking&AI Security
24.10 UC Berkeley arxiv SAFETYANALYST: Interpretable, Transparent, and Steerable LLM Safety Moderation LLM Safety&Interpretability&Content Moderation
24.10 ShanghaiTech University arxiv Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization Safety Alignment&Reinforcement Learning&Policy Optimization
24.11 Zhejiang University arxiv Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control Trustworthiness&Sparse Activation Control&Representation Control
24.11 University of California, Riverside arxiv Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models Vision-Language Models&Safety Alignment&Cross-Layer Vulnerability
24.11 National University of Singapore EMNLP 2024 Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models Multi-expert Prompting&LLM Safety&Reliability&Usefulness
24.11 OpenAI NeurIPS 2024 Rule Based Rewards for Language Model Safety Rule Based Rewards&Safety Alignment&AI Feedback
24.11 Center for Automation and Robotics, Spanish National Research Council arXiv Can Adversarial Attacks by Large Language Models Be Attributed? Adversarial Attribution&LLM Security&Formal Language Theory
24.11 McGill University arXiv Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset Helpful and Harmless Dataset&Safety Trade-offs&Bias Analysis
24.11 Fudan University arxiv Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding Text-to-Image Generation&Safety&Prompt Embedding Sanitization
24.11 Meta arxiv Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations Multimodal LLM&Content Moderation&Adversarial Robustness
24.11 Columbia University arxiv When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations Backdoor Attacks&Explainability
24.11 Ben-Gurion University of the Negev arxiv The Information Security Awareness of Large Language Models Information Security Awareness&Benchmarking
24.11 Fordham University arxiv Next-Generation Phishing: How LLM Agents Empower Cyber Attackers Phishing Detection&Cybersecurity
24.11 University of Pennsylvania, IBM T.J. Watson Research Center arxiv Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models Cyber-Attack Classification&Two-Stage Training
24.12 UC Berkeley arxiv Trust & Safety of LLMs and LLMs in Trust & Safety Trust and Safety&Prompt Injection
24.12 Harvard Kennedy School, Avant Research Group arxiv Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects Phishing Attacks&Human-in-the-loop
24.12 University of Massachusetts arxiv Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness Instruction Tuning&Safety&Helpfulness
24.12 University of New South Wales arxiv How Can LLMs and Knowledge Graphs Contribute to Robot Safety? A Few-Shot Learning Approach Robot Safety&Few-Shot Learning&Knowledge Graph Prompting
24.12 Örebro University arxiv Large Language Models and Code Security: A Systematic Literature Review LLM-Generated Code&Vulnerability Detection&Data Poisoning Attacks
24.12 Algiers Research Institute arxiv On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs Adversarial Attacks&Vulnerability Metrics&Risk Assessment
24.12 Alan Turing Institute arxiv SoK: Mind the Gap—On Closing the Applicability Gap in Automated Vulnerability Detection Automated Vulnerability Detection&Applicability Gap&Software Security
25.01 Meta arxiv MLLM-as-a-Judge for Image Safety without Human Labeling Image Safety&Zero-Shot Judgment&Multimodal Large Language Models
25.01 FAU Erlangen-Nürnberg arxiv Refusal Behavior in Large Language Models: A Nonlinear Perspective Refusal Behavior&Mechanistic Interpretability&AI Alignment
25.01 University of Waterloo arxiv Advanced Real-Time Fraud Detection Using RAG-Based LLMs Fraud Detection&Retrieval-Augmented Generation&Real-Time AI Security
25.01 Mondragon University, University of Seville arxiv Early External Safety Testing of OpenAI’s o3-mini: Insights from Pre-Deployment Evaluation LLM Safety Testing&OpenAI o3-mini
25.02 Nanyang Technological University arxiv Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning LLM Alignment&Instruction Tuning&Reward Models
25.02 University of Bristol arxiv The Dark Deep Side of DeepSeek: Fine-Tuning Attacks Against the Safety Alignment of CoT-Enabled Models Chain of Thought&Fine-Tuning Attack&LLM Safety
25.02 Marburg University arxiv Editing Large Language Models Poses Serious Safety Risks Knowledge Editing&LLM Security Risks&Adversarial Manipulation
25.02 Technical University of Munich AAAI 2025 Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment Medical Multimodal Models&Model Stealing&Adversarial Domain Alignment
25.02 Georgia Institute of Technology arxiv Enhancing Phishing Email Identification with Large Language Models Phishing Detection&Cybersecurity
25.02 Fudan University arxiv Safety at Scale: A Comprehensive Survey of Large Model Safety Large Model Safety&AI Security&Adversarial Attacks
25.02 University of Maryland arxiv Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities Model Tampering Attacks&LLM Security&Adversarial Robustness
25.02 Penn State University arxiv Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet Harmfulness Ranking&LLM Evaluation&AI Safety
25.02 Peking University arxiv Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning LLM Safety&Reasoning Trade-off&Fine-Tuning
25.02 City University of Hong Kong arxiv The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis LLM Alignment&Safety Fine-Tuning&Jailbreak Attacks
25.02 Tsinghua University arxiv “Nuclear Deployed!”: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents Autonomous LLM Agents&Catastrophic Risks&Decision-making
25.02 University of Washington arxiv SAFECHAIN: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities LLM Safety&Chain-of-Thought Reasoning&Model Alignment
25.02 University of California, Santa Cruz arxiv The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 Large Reasoning Models&Safety Assessment&Adversarial Attacks
25.02 34 Affiliates arxiv On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective Safety Assessment&Guideline Paper
25.02 University of Cambridge arxiv Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection Hateful Meme Detection&Multimodal Models&Contrastive Learning
25.02 Cooperative AI Foundation arXiv Multi-Agent Risks from Advanced AI Multi-Agent Systems&AI Risk&AI Governance
25.02 Apart Research, University of Science and Technology of Hanoi AAAI 2025 Workshop on Theory of Mind A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks Theory of Mind&AI Safety
25.02 Truthful AI, University College London arxiv Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs LLM Alignment&Fine-tuning Risks&Emergent Misalignment
25.02 Wuhan University arxiv A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations LVLM Safety&Adversarial Attacks&Defense Mechanisms
25.02 Clark Atlanta University arxiv SoK: Exploring Hallucinations and Security Risks in AI-Assisted Software Development with Insights for LLM Deployment Hallucinations&Security Risks&AI-Assisted Software Development
25.02 Stony Brook University, Michigan State University arXiv Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots Cyber Threat Intelligence&Large Language Models&Threat Detection
25.03 HydroX AI arXiv Output Length Effect on DeepSeek-R1’s Safety in Forced Thinking Output Length&LLM Safety&Forced Thinking
25.03 Tampere University arxiv Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to Practice Trustworthiness&AI Ethics
25.03 University of California, Santa Barbara arxiv Graphormer-Guided Task Planning: Beyond Static Rules with LLM Safety Perception LLM Planning&Graphormer&Risk-Aware Robotics
25.03 University of Pennsylvania arxiv Safety Guardrails for LLM-Enabled Robots LLM-enabled Robotics&Jailbreaking Defense&Formal Safety Guarantees
25.03 Peking University arxiv Life-Cycle Routing Vulnerabilities of LLM Router LLM Router&Adversarial Attack&Backdoor Attack
25.03 Squirrel AI Learning arxiv A Survey on Trustworthy LLM Agents: Threats and Countermeasures Trustworthy Agent&LLM-based Agents&Multi-Agent System
25.03 Cornell Tech arxiv Multi-Agent Systems Execute Arbitrary Malicious Code Multi-Agent Systems&Control-Flow Hijacking&Arbitrary Code Execution
25.03 University of Utah arxiv A Comprehensive Study of LLM Secure Code Generation Secure Code Generation&Vulnerability Scanning&Functionality Evaluation
25.03 University of Minnesota arxiv Safety Aware Task Planning via Large Language Models in Robotics LLM Robotics Planning&Safety-Aware Framework&Control Barrier Functions
25.03 Peking University, Zhongguancun Lab, Tsinghua University arxiv Large Language Models powered Network Attack Detection: Architecture, Opportunities and Case Study Network Security&LLM for Security&Anomaly Detection
25.03 Aim Intelligence, Yonsei University, Seoul National University arxiv sudo rm -rf agentic_security Agent Security&Multimodal Jailbreak&LLM Agent Exploitation
25.03 Georgia Institute of Technology, IMT Mines Albi arxiv Leveraging Large Language Models for Risk Assessment in Hyperconnected Logistic Hub Network Deployment Risk Assessment&LLMs for Logistics&Supply Chain Resilience
25.04 University of Twente arxiv Safety and Security Risk Mitigation in Satellite Missions via Attack-Fault-Defense Trees Cyber-Physical Systems&Attack-Fault-Defense Trees&Satellite Ground Segment
25.04 Google DeepMind arxiv An Approach to Technical AGI Safety and Security AGI Safety&Misalignment Mitigation&Capability Control
25.04 Earlham College ISDFS 2025 Debate-Driven Multi-Agent LLMs for Phishing Email Detection Phishing Detection&Multi-Agent LLMs&Debate Framework
25.04 Indian Institute of Technology Kanpur MSR 2025 MaLAware: Automating the Comprehension of Malicious Software Behaviours using Large Language Models (LLMs) Malware Analysis&Behavior Explanation&LLMs for Cybersecurity
25.04 Leidos arxiv MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits Model Context Protocol&Security Audit&Agentic LLM Exploits
25.04 Norwegian University of Science and Technology arxiv An LLM Framework For Cryptography Over Chat Channels LLMs&Cryptography&Steganography
25.04 Peking University arxiv SaRO: Enhancing LLM Safety through Reasoning-based Alignment Safety Alignment&Reasoning-based Alignment&LLMs
25.04 Johns Hopkins University arxiv An Investigation of Large Language Models and Their Vulnerabilities in Spam Detection Spam Detection&Adversarial Attack&Data Poisoning
25.04 TU Wien arxiv Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design Offensive Security&Benchmarking&LLM Penetration Testing
25.04 Fraunhofer Institute for Cognitive Systems IKS arxiv Towards Automated Safety Requirements Derivation Using Agent-based RAG Agent-based RAG&Safety Requirements Derivation&Autonomous Driving
25.04 Nanjing University arxiv Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask LLM-based Vulnerability Detection&Contextual Reasoning&Benchmark Evaluation
25.04 Nanyang Technological University arxiv A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment LLM Safety&LLM Lifecycle&Agent Alignment
25.04 Arab American University arxiv Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends Traffic Safety&Multimodal Large Language Models&ADAS
25.04 National University of Singapore arxiv Safety in Large Reasoning Models: A Survey Large Reasoning Models&Safety Taxonomy&Adversarial Attacks
25.04 Amazon Web Services arxiv Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents Agentic AI Security&Threat Modeling&Mitigation Framework
25.04 Alibaba Group NAACL 2025 DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models Multimodal LLM&Safety Alignment&Risk Disentanglement
25.04 University of Maryland NAACL 2025 RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models RAG&Safety Alignment&Red Teaming
25.05 University of Granada arxiv LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures Large Language Models&Security&Defense Mechanisms
25.05 University of North Carolina at Chapel Hill Transactions on Machine Learning Research Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation Multimodal LLMs&Information Unlearning&Security Evaluation
25.05 University of Oxford arxiv Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents Multi-Agent Systems&AI Security&Emergent Threats
25.05 Huazhong University of Science and Technology arxiv Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study LLM Deployment&Security Analysis&Empirical Study
25.05 Rutgers University arxiv Aligning Large Language Models with Healthcare Stakeholders: A Pathway to Trustworthy AI Integration Healthcare&Alignment&Large Language Models
25.05 Metropolia University of Applied Sciences arxiv A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient LLM Safety&Ethical Evaluation&Danger Coefficient
25.05 University of Kent arxiv Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data Safety Alignment&Pseudo-Malicious Data&Cybersecurity LLMs
25.05 Carnegie Mellon University arxiv A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron? Computer-Using Agents&Security Threats&Safety Benchmarks
25.05 NYU Tandon arxiv MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models RTL Security&Multi-Agent Systems&LLM for Hardware Verification
25.05 Jerusalem College of Technology arxiv Proposal for Improving Google A2A Protocol: Safeguarding Sensitive Data in Multi-Agent Systems A2A Protocol&Sensitive Data Protection&Multi-Agent Security
25.05 Huazhong University of Science and Technology arxiv From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents Mobile LLM Agents&Security Risks&AgentScan
25.05 Mohamed bin Zayed University of Artificial Intelligence arxiv Safety Subspaces are Not Distinct: A Fine-Tuning Case Study Safety Alignment&Subspace Geometry&Fine-Tuning Vulnerability
25.05 Amazon Web Services arxiv From nuclear safety to LLM security: Applying non-probabilistic risk management strategies to build safe and secure LLM-powered systems Risk Management&LLM Security&Non-Probabilistic Strategies
25.05 Infinite Optimization AI Lab arxiv Security Concerns for Large Language Models: A Survey LLM Security&Prompt Injection&Autonomous Agents
25.05 Nanyang Technological University arxiv Understanding Refusal in Language Models with Sparse Autoencoders Refusal&Sparse Autoencoder&LLM Safety
25.05 Seoul National University arxiv Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems Multi-agent LLM&Cross-domain Security&Threat Modeling
25.05 University of Washington arxiv OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities AI Safety&Multimodal Moderation&Universal Representation
25.06 Tsinghua University arxiv The Security Threat of Compressed Projectors in Large Vision-Language Models Vision-Language Model&Compressed Projector&Adversarial Attack
25.06 Michigan State University arxiv Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS LLM-MAS&Vulnerability Analysis&Trustworthy AI
25.06 Singapore Management University arxiv Which Factors Make Code LLMs More Vulnerable to Backdoor Attacks? A Systematic Study Code LLM&Backdoor Attack&Adversarial Robustness
25.06 University of Science and Technology of China arxiv SECNEURON: Reliable and Flexible Abuse Control in Local LLMs via Hybrid Neuron Encryption Local LLM&Abuse Control&Neuron Encryption
25.06 Dartmouth College arxiv Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets LLM Safety&Alignment Robustness&Representation Similarity
25.06 Georgia Tech arxiv Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety Interpretation&LLM Safety&Survey
25.06 George Mason University ICML 2025 StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models Watermark&LLM&Stealthy&Multi-bit
25.06 SUNY-Albany, New Jersey Institute of Technology, Microsoft, Kent State University, University of Florida arxiv SoK: Are Watermarks in LLMs Ready for Deployment? Watermark&LLM&Model Stealing&IP Protection
25.06 Tsinghua University, Apple, Beijing University of Posts and Telecommunications arxiv Enhancing Watermarking Quality for LLMs via Contextual Generation States Awareness Watermarking&LLM&Generation Quality&Context Awareness
25.06 Shanghaitech University, Sun Yat-sen University arxiv Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment Multimodal RAG&Copyright Protection&Watermarking&Image Knowledge&Retrieval-Augmented Generation
25.06 University of Applied Sciences Northwestern Switzerland arxiv Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications Multi-Agent System&LLM&Reasoning&Cybersecurity&Specification
25.06 Sungkyunkwan University, Microsoft Research Asia ACL 2025 Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights Value Alignment&LLM Safety&Personalization&Harmful Behavior&Psychological Analysis
25.06 Virelya Intelligence Research Labs arxiv Risks & Benefits of LLMs & GenAI for Platform Integrity, Healthcare Diagnostics, Cybersecurity, Privacy & AI Safety: A Comprehensive Survey, Roadmap & Implementation Blueprint for Automated Review, Compliance Assurance, Moderation, Abuse & Fraud Detection, App Security, and Trust in Digital Ecosystems Large Language Models&Generative AI&Platform Integrity&Cybersecurity&Compliance
25.06 University of Pennsylvania arxiv A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions Programming Language&LLM Agents&Code Actions&Security&Parallelization
25.06 Universitas Muhammadiyah Surakarta arxiv Using LLMs for Security Advisory Investigations: How Far Are We? Security Advisory&CVE ID&LLMs&Hallucination&Reliability
25.06 University of Texas at El Paso arxiv Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability Phishing Detection&Large Language Models&Explainability&Self-Consistency&Fine-Tuning
25.06 Universiti Sains Malaysia arxiv PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection Phishing Website Detection&Large Language Model&Multi-Agent System&Debate Framework&Explainability
25.06 NTT arxiv Towards Safety Evaluations of Theory of Mind in Large Language Models Theory of Mind&LLM Safety&Evaluation
25.06 The Ohio State University arxiv AI Safety vs. AI Security: Demystifying the Distinction and Boundaries AI Safety&AI Security&Risk Management
25.06 Zhejiang University arxiv A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures LLM-Driven Agents&Agent Communication&Security Risks
25.07 University of Science and Technology of China, Douyin Co., Ltd. arxiv SAFER: Probing Safety in Reward Models with Sparse Autoencoder Reward Model&Interpretability&Safety
25.07 Princeton University arxiv Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs Alignment Erosion&Fine-Tuning&Safety
25.07 AI Risk and Vulnerability Alliance arxiv Red Teaming AI Red Teaming Red Teaming&AI Security&Sociotechnical
25.07 Shandong University arxiv We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems MCP Security&API Measurement&Privilege
25.07 University College London arxiv Emergent Misalignment as Prompt Sensitivity: A Research Note Misalignment&Prompt Sensitivity&Finetuning
25.07 Ludwig-Maximilians-Universität in Munich arxiv On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment Alignment&Filtering&Intractability
25.07 University of Wisconsin-Madison arxiv Prompt-level Watermarking is Provably Impossible Watermarking&Prompt Injection&Impossibility
25.07 UK AI Security Institute arxiv Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety Chain of Thought&Monitorability&Safety
25.07 Northeastern University arxiv LLMs Encode Harmfulness and Refusal Separately Harmfulness&Refusal&Safety
25.07 Aymara arxiv Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix Safety Evaluation&LLM&Benchmark
25.07 Shanghai Artificial Intelligence Laboratory arxiv Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report AI Risk&Safety&Benchmark
25.07 Shanghai Artificial Intelligence Laboratory arxiv SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law Safety&Reinforcement Learning&Multimodal
25.07 University of Illinois Urbana-Champaign arxiv PurpCode: Reasoning for Safer Code Generation SecureCode&Reasoning&Alignment
25.07 IBM Research arxiv OneShield - the Next Generation of LLM Guardrails Guardrails&Safety&Compliance
25.08 University of Maryland, College Park arxiv Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation Predictive Auditing&LLM APIs&Reasoning Token Count&Token Inflation
25.08 Independent Researcher, Arizona State University, University of California, Berkeley arxiv Measuring Harmfulness of Computer-Using Agents Computer-Using Agents&Safety Risks&CUAHarm Benchmark&Language Models
25.08 Jimei University, Wenzhou-Kean University, The Hong Kong University of Science and Technology (Guangzhou), New York University, Xiamen University arxiv A Survey on Data Security in Large Language Models Large language model (LLM)&Data security&LLM vulnerabilities&Prompt injection
25.08 University of South Florida arxiv Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM Harmful Content&LLM Safety&Jailbreak Mitigation
25.08 Monash University ACM CCS 2025 Robust Anomaly Detection in O-RAN: Leveraging LLMs against Data Manipulation Attacks O-RAN Security&Anomaly Detection&Data Manipulation Attacks
25.08 Zhejiang University arxiv Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends Copyright Protection&Model Fingerprinting&Text Watermarking
25.08 Global Center on AI Governance arxiv Toward an African Agenda for AI Safety AI Safety in Africa&Governance&Socio-Technical Risks
25.08 PeopleTec, Inc. arxiv SERVANT, STALKER, PREDATOR: How an Honest, Helpful, and Harmless (3H) Agent Unlocks Adversarial Skills Multi-Agent Systems&Service Orchestration&Composite Threats
25.08 Nanyang Technological University EMNLP 2025 Findings Improving Alignment in LVLMs with Debiased Self-Judgment LVLM Alignment&Debiased Self-Judgment&Hallucination Mitigation
25.09 Zhejiang University arxiv Web Fraud Attacks Against LLM-Driven Multi-Agent Systems Multi-Agent Systems&Web Fraud Attack&Security
25.09 Alibaba AAIG arxiv Oyster-I: Beyond Refusal — Constructive Safety Alignment for Responsible Language Models Constructive Safety Alignment&Safety Benchmark&Game-Theoretic Modeling
25.09 Kennesaw State University IEEE Internet of Things Journal A Survey: Towards Privacy and Security in Mobile Large Language Models Mobile LLMs&Privacy&Security
25.09 University of Wisconsin-Madison arxiv Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs Prompt Injection&Threat Model&LLM Security
25.09 Tennessee Tech University, University of Nebraska at Omaha arxiv Safety and Security Analysis of Large Language Models: Risk Profile and Harm Potential Safety and Security&Risk Profiling&Adversarial Prompts
25.09 Instituto de Pesquisas Eldorado, SRI International arxiv LLM in the Middle: A Systematic Review of Threats and Mitigations to Real-World LLM-based Systems LLM Security&Threat Modeling&Systematic Review
25.09 Alibaba Group, Zhejiang University arxiv Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework Sparse Autoencoder&Safety Interpretation&LLM Interpretability
25.09 Argonne National Laboratory arxiv Evaluating the Safety and Skill Reasoning of Large Reasoning Models Under Compute Constraints Reasoning Models&Safety Evaluation&Compute Constraints
25.09 Chinese Academy of Sciences, Wuhan University, Renmin University of China, Macquarie University, Griffith University, Xiaomi Inc. arxiv LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions LLM-based Agents&Hallucinations&Trustworthiness
25.09 Binghamton University, Duke University, University of Alabama at Birmingham arxiv Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting LLM Benchmark&Cybersecurity&Blue Teaming
25.09 Universitat Pompeu Fabra INLG 2025 (accepted), arxiv Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs Lexical Simplification&Small LLMs&Safety&Knowledge Distillation
25.10 City University of Hong Kong, Johns Hopkins University, George Mason University arxiv Towards Human-Centered RegTech: Unpacking Professionals' Strategies and Needs for Using LLMs Safely RegTech&Human-Centered NLP&Compliance Risk&LLM Safety
25.10 University of Mannheim arxiv A Granular Study of Safety Pretraining under Model Abliteration Safety Pretraining&Model Abliteration&Refusal Robustness
25.10 University of California, Riverside arxiv Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection? Jailbreak Detection&Internal Representations&Tensor Decomposition
25.10 OpenAI, Anthropic, Google DeepMind, ETH Zürich, Northeastern University, HackAPrompt, AI Security Company arxiv The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections Adaptive Attacks&LLM Jailbreak Defense&Prompt Injection Robustness
25.10 Ruhr-Universität Bochum, Universität Bonn, Lamarr Institute for Machine Learning and Artificial Intelligence arxiv AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? AI Alignment&Failure Modes&Risk Analysis
25.10 Harbin Institute of Technology (Shenzhen), Pengcheng Lab arxiv GRIDAI: Generating and Repairing Intrusion Detection Rules via Collaboration among Multiple LLM-based Agents Intrusion Detection&Rule Generation&Multi-Agent LLM System
25.10 University of Massachusetts Amherst, ELLIS Institute Tübingen arxiv Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies Multi-Agent Systems&Security Evaluation&Blackboard Architecture
25.10 Shanghai Jiao Tong University NeurIPS Stop DDoS Attacking the Research Community with AI-Generated Survey Papers AI-Generated Surveys&Research Integrity&Scholarly Oversight
25.10 LMU Munich, TUM, Oxford, HKU NeurIPS Workshop Deep Research Brings Deeper Harm Deep Research Agents&LLM Safety Alignment&Biosecurity Risks
25.10 University of Connecticut, University of Alabama at Birmingham arxiv SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models Prompt Security&Jailbreak Taxonomy&Defense Evaluation
25.10 École Normale Supérieure (ENS) - Université Paris Sciences et Lettres (PSL), CNRS, Université Sorbonne Nouvelle arxiv On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text? AI-Generated Text Detection&Watermarking&Ethical AI Evaluation
25.11 University of Pennsylvania arxiv Watermarking Discrete Diffusion Language Models Discrete Diffusion&Watermarking&Generative Model Security
25.11 CEA Paris-Saclay arxiv Watermarking Large Language Models in Europe: Interpreting the AI Act in Light of Technology Watermarking&AI Act&Compliance Evaluation
25.11 Shandong University AAAI 2026 HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection Human Language Preference Optimization&Machine-Revised Text Detection&Adversarial Multi-Task Detection
25.11 Massachusetts Institute of Technology arxiv Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming Model Context Protocol&Agentic Red Teaming&Command and Control
25.11 Zhejiang University AAAI 2026 Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging Model Merging Stealing&LLM IP Protection&Proactive Defense
25.11 Beijing Institute of Technology arxiv Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion Existential Risk&Prefix Completion&LLM Safety Evaluation
25.11 ETH Zurich, Huawei Technologies Switzerland AG arxiv Can LLMs Make (Personalized) Access Control Decisions? Access Control&Personalization&LLM Security
25.11 Purdue University, Perplexity AI arxiv BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents Prompt Injection&AI Browser Agents&Benchmarking
25.11 University of Tennessee, Sungkyunkwan University arxiv Supporting Students in Navigating LLM-Generated Insecure Code Insecure Code Generation&Cybersecurity Education&Bifröst Framework
25.11 Vanta, MintMCP, Darktrace arxiv Securing the Model Context Protocol (MCP): Risks, Controls, and Governance Model Context Protocol&AI Governance&Agent Security
25.11 Renmin University of China, Ant Group AAAI 2026 Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems LLM Multi-Agent Systems&Security Risks&Adversarial Defense
25.11 Independent Researcher arxiv Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs Healthcare AI Safety&Adversarial Dataset&Context-Aware Alignment
25.11 NVIDIA, Lakera AI arxiv A Safety and Security Framework for Real-World Agentic Systems Agentic Systems&Safety and Security Framework&AI Risk Taxonomy
25.12 Sun Yat-sen University arxiv An Empirical Study on the Security Vulnerabilities of GPTs GPT Security&Prompt Injection&Tool Misuse
25.12 Hiroshima University, The University of Tokyo, National Institute of Informatics IEEE ISPA 2025 Decentralized Multi-Agent System with Trust-Aware Communication Decentralized Multi-Agent Systems&Blockchain Communication&Trust-Aware Protocols
25.12 China Telecom (TeleAI), Sichuan University, Peking University arxiv Aetheria: A Multimodal Interpretable Content Safety Framework Based on Multi-Agent Debate and Collaboration Content Safety&Multi-Agent Systems&Interpretable AI&Multimodal Analysis
25.12 DEXAI – Icaro Lab, Sapienza University of Rome, Sant’Anna School of Advanced Studies arxiv Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions Multi-Agent Safety&Systemic Risk&Institutional AI
25.12 Shandong University, Nanjing University arxiv “MCP Does Not Stand for Misuse Cryptography Protocol”: Uncovering Cryptographic Misuse in Model Context Protocol at Scale Model Context Protocol (MCP)&Cryptographic Misuse Detection&Program Analysis
25.12 University of Pennsylvania, Carnegie Mellon University, Columbia University arxiv MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking LLM Watermarking&Model Fine-Tuning&Open-Weight Models
25.12 University of Maryland, Oracle Labs, Oracle Health AI ML4H 2025 Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment Healthcare AI Assistants&Iterative Alignment&Safety vs Helpfulness
25.12 Old Dominion University arxiv ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications Threat Modeling&Agentic AI&Vision-Language Models
25.12 Singapore Management University arxiv SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security Causality Analysis&LLM Security&Jailbreak Detection
25.12 FAIR at Meta arxiv Self-Improving AI & Human Co-Improvement for Safer Co-Superintelligence Self-Improving AI&Human-AI Collaboration&Co-Superintelligence
25.12 University of North Carolina Wilmington arxiv From Description to Score: Can LLMs Quantify Vulnerabilities? Vulnerability Scoring&CVSS&Large Language Models
25.12 Beihang University arxiv SoK: Trust-Authorization Mismatch in LLM Agent Interactions LLM Agents&Trust and Authorization&Agent Security
25.12 Tribhuvan University arxiv Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem Model Context Protocol&LLM Security&Agentic AI Safety
25.12 CISPA Helmholtz Center for Information Security NDSS 2026 Chasing Shadows: Pitfalls in LLM Security Research LLM Security Research&Reproducibility Pitfalls&Evaluation Methodology
25.12 Cisco AI Threat and Security Research arxiv Cisco Integrated AI Security and Safety Framework Report AI Security&Threat Taxonomy&Governance
25.12 The Beacom College of Computer & Cyber Sciences, Dakota State University arxiv Quantifying Return on Security Controls in LLM Systems Risk Modeling&Security Controls&LLM Safety
25.12 National University of Defense Technology arxiv Exploring the Security Threats of Retriever Backdoors in Retrieval-Augmented Code Generation Retriever Backdoors&RAG Security&Code Generation
25.12 BITS Pilani arxiv Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection Multi-Vulnerability&LLM Benchmarking&Code Security
26.01 Zhejiang University arxiv RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation Domain-Specific Safety&Harmful Prompt Synthesis&Knowledge Graph
26.01 Stanford University arxiv Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness Responsible AI&Agent Governance&Transparency
26.01 Chinese Academy of Sciences arxiv Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs Secure Scripting&PowerShell&Lightweight Models
26.01 Unknown arxiv SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations Prompt Injection&Constitutional AI&Cybersecurity
26.01 Xi’an Jiaotong University arxiv Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models Emoticon Confusion&LLM Safety&Robustness
26.01 Zhejiang University arxiv ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models Model Fingerprinting&Unlearning&Copyright
26.01 Zhejiang University arxiv DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection Model Fingerprinting&IP Protection&Backdoor
26.01 Peking University arxiv ToolSafe: Enhancing Tool Invocation Safety of LLM-based Agents via Proactive Step-level Guardrail and Feedback Tool Safety&Agent Guardrails&Prompt Injection
26.01 Nanyang Technological University arxiv Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale Agent Security&Supply Chain Risk&Vulnerability Analysis
26.01 Ben-Gurion University of the Negev arxiv AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior Agent Governance&Access Control&Execution Flow
26.01 Fudan University arxiv A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 Safety Evaluation&Multimodal LLM&Adversarial Testing
26.01 Fujitsu Research of Europe arxiv AgenTRIM: Tool Risk Mitigation for Agentic AI Agentic AI&Tool Security&Least Privilege
26.01 Nanjing University of Aeronautics and Astronautics arxiv SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment LLM Safety&Jailbreak Defense&Adaptive Alignment
26.01 Unknown arxiv Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents Model Context Protocol&Prompt Injection&Agent Security
26.01 School of Automation, Northwestern Polytechnical University, Xi'an, China arxiv FNF: Functional Network Fingerprint for Large Language Models Model Fingerprinting&Intellectual Property&Functional Networks
26.01 University of Science and Technology of China arxiv Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures Misalignment&Persona&Safety
26.02 eBay Inc arxiv ZERO-TRUST RUNTIME VERIFICATION FOR AGENTIC PAYMENT PROTOCOLS: MITIGATING REPLAY AND CONTEXT-BINDING FAILURES IN AP2 Agentic Payments&Runtime Verification&Replay Attacks
26.02 Huazhong University of Science and Technology arxiv Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models Vulnerability Reasoning&Benchmarking&RLVR
26.02 Technical University of Darmstadt arxiv GoodVibe: Security-by-Vibe for LLM-Based Code Generation Code Security&Neuron-Level Tuning&Code Generation
26.02 Canadian Institute for Cybersecurity (CIC), University of New Brunswick, New Brunswick, Canada arxiv Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP AI-Agent Protocols&Threat Modeling&MCP
26.02 DiSTA, University of Insubria, Italy arxiv LoRA-based Parameter-Efficient LLMs for Continuous Learning in Edge-based Malware Detection Malware Detection&Edge Computing&Continuous Learning
26.02 Unknown arxiv Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy Cybersecurity&Agentic AI&Governable Autonomy
26.02 University of Luxembourg, Interdisciplinary Center for Security, Reliability, and Trust (SnT), Trustworthy Software Engineering Group (TruX), Luxembourg arxiv Assessing Spear-Phishing Website Generation in Large Language Model Coding Agents Spear-Phishing&Coding Agents&Cyber Misuse
26.02 ShanghaiTech University, Shanghai, China arxiv A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) Agent Safety Audit&Trajectory Analysis&OpenClaw
26.02 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China arxiv Intellicise Wireless Networks Meet Agentic AI: A Security and Privacy Perspective Agentic AI&Wireless Security&Privacy
26.02 Applied Machine Learning Research arxiv Intent Laundering: AI Safety Datasets Are Not What They Seem Safety Datasets&Intent Laundering&Evaluation Robustness
26.02 Fraunhofer ISST arxiv DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing Data Sharing&Policy Enforcement&LLM Spokesperson
26.02 Department of Computer Science, National University of Singapore arxiv LLM-enabled Applications Require System-Level Threat Monitoring threat monitoring&incident response&LLM systems
26.02 University of Technology Sydney arxiv SoK: Agentic Skills — Beyond Tool Use in LLM Agents agentic skills&supply chain&survey
26.02 Amazon Web Services arxiv Manifold of Failure: Behavioral Attraction Basins in Language Models failure manifold&MAP-Elites&alignment deviation
26.02 National University of Singapore arxiv IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation auditing&verifiable computation&API integrity
26.03 Shanghai Innovation Institute arxiv From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions agentic AI&web security&survey
26.03 Sahara AI arxiv Proof-of-Guardrail in AI Agents and What (Not) to Trust from It agent guardrails&TEE attestation&verifiable safety
26.03 Beihang University arxiv Evolving Deception: When Agents Evolve, Deception Wins deceptive agents&self-evolution&alignment drift
26.03 Communication and Distributed Systems, RWTH Aachen University, Germany arxiv Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers artifact evaluation&reproducibility&cybersecurity
26.03 Shandong University, Qingdao, Shandong, China arxiv Give Them an Inch and They Will Take a Mile: Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems MCP security&caller identity&authorization
26.03 Crew Scaler arxiv Security Considerations for Multi-agent Systems multi-agent systems&security frameworks&threat taxonomy
26.03 Institute of Information Engineering, Chinese Academy of Sciences arxiv ProvAgent: Threat Detection Based on Identity-Behavior Binding and Multi-Agent Collaborative Attack Investigation threat detection&provenance graphs&multi-agent investigation
26.03 Shandong University arxiv Don’t Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw code agents&OpenClaw&human-in-the-loop defense
26.03 School of Interactive Computing, Georgia Institute of Technology arxiv Safe and Scalable Web Agent Learning via Recreated Websites web agents&synthetic environments&self-evolution
26.03 Ant Group & Tsinghua University, China arxiv Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats autonomous agents&OpenClaw&lifecycle security
26.03 College of Intelligent Science and Engineering, Jinan University arxiv Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework agent memory&governance&semantic drift
26.03 State Key Laboratory of Complex & Critical Software Environment, Beihang University arxiv Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw Autonomous Agents&Threat Modeling&Defense Architecture
26.03 Unknown arxiv Evaluation of Audio Language Models for Fairness, Safety, and Security Audio LLMs&Safety Evaluation&Structural Taxonomy
26.03 Centre for Philosophy and AI Research, Friedrich-Alexander-University Erlangen-Nuremberg arxiv Questionnaire Responses Do Not Capture the Safety of AI Agents AI Agents&Safety Assessment&Construct Validity
26.03 University of Connecticut arxiv Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning Code LLMs&Data Poisoning&Vulnerability Scanning
26.03 Dartmouth College arxiv Retrieval-Augmented LLMs for Security Incident Analysis Security Incident Analysis&RAG&MITRE ATTACK
26.03 Purdue University arxiv Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety Agent Safety&Benchmark Auditing&Tool Calls
26.03 University of Electronic Science and Technology of China, Chengdu, China arxiv Functional Subspace Watermarking for Large Language Models Model Watermarking&Functional Subspace&Ownership Verification
26.03 UC Santa Cruz arxiv A Framework for Formalizing LLM Agent Security Agent security&Contextual security&Authorization
26.03 City University of Hong Kong arxiv PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management Mobile power management&LLM agents&Personalization
26.03 Shanghai Jiao Tong University, Shanghai, China arxiv Trojan’s Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance OpenClaw&Guidance injection&Autonomous coding agents
26.03 BigCommerce arxiv An Agentic Multi-Agent Architecture for Cybersecurity Risk Management Cybersecurity risk&Multi-agent systems&Risk assessment
26.03 Luleå tekniska universitet, Sweden arxiv Agentproof: Static Verification of Agent Workflow Graphs Static verification&Workflow graphs&Temporal safety
26.03 Department of Computing Science, Umeå University, Umeå, Sweden arxiv Memory poisoning and secure multi-agent systems Memory poisoning&Multi-agent systems&Cryptographic mitigation
26.03 University of Washington, USA arxiv AC4A: Access Control for Agents Access control&Agent permissions&API security
26.03 Sun Yat-sen University arxiv SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration Skill marketplaces&Security auditing&Multi-agent collaboration
26.03 Department of Computer Science, New York Institute of Technology, Vancouver, BC, Canada arxiv Auditing MCP Servers for Over-Privileged Tool Capabilities MCP servers&Capability auditing&Tool privileges
26.03 New York Institute of Technology, Vancouver, BC, Canada arxiv Are AI-assisted Development Tools Immune to Prompt Injection? Prompt injection&MCP clients&Development tools
26.03 Department of Computer Science, New York Institute of Technology, Canada arxiv Model Context Protocol Threat Modeling and Analyzing Vulnerabilities to Prompt Injection with Tool Poisoning MCP security&Tool poisoning&Threat modeling
26.03 Beijing University of Posts and Telecommunications arxiv ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers OpenClaw security&Watcher middleware&Runtime protection
26.03 Department of Computer Science, Institute of Artificial Intelligence, University of Central Florida Transactions on Machine Learning Research 2026 AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective Foundation model security&Threat taxonomy&Cross-modal defense
26.03 CSIRO Data61 arxiv Clawed and Dangerous: Can We Trust Open Agentic Systems? Agent security&Open agentic systems&Software engineering
26.03 Rensselaer Polytechnic Institute arxiv SAFETYDRIFT: Predicting When AI Agents Cross the Line Before They Actually Do Agent safety&Trajectory prediction&Runtime monitoring
26.03 SUCCESS Lab, Texas A&M University arxiv A Systematic Taxonomy of Security Vulnerabilities in the OpenClaw AI Agent Framework Agent vulnerabilities&Security taxonomy&OpenClaw
26.03 The Hong Kong University of Science and Technology (Guangzhou), China arxiv “What Did It Actually Do?”: Understanding Risk Awareness and Traceability for Computer-Use Agents Risk awareness&Agent traceability&Computer-use agents
26.03 Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, USA arxiv Safeguarding LLMs Against Misuse and AI-Driven Malware Using Steganographic Canaries Malware defense&Steganographic canaries&LLM misuse
26.03 Singapore Management University, Singapore arxiv SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants Multi-agent assistants&Runtime enforcement&Skill security
26.03 Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy arxiv Security in LLM-as-a-Judge: A Comprehensive SoK LLM-as-a-Judge&Security survey&Evaluation robustness
26.04 Mattersec Labs arxiv SecLens: Role-Specific Evaluation of LLMs for Security Vulnerability Detection Vulnerability detection&Stakeholder evaluation&Role-specific scoring
26.04 University of Melbourne arxiv Combating Data Laundering in LLM Training Data laundering&Training data detection&Synthesis data reversion
26.04 Fudan University, China arxiv From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers MCP security&Malicious servers&Behavioral deviation detection
26.04 University of California, Los Angeles EACL 2026 Open-Domain Safety Policy Construction Safety policy construction&Agentic research&Content moderation
26.04 Constellation arxiv An Independent Safety Evaluation of Kimi K2.5 Bias Reduction&Safety Evaluation&Open-Weight Models
26.04 UC Santa Cruz arxiv Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw Personal AI&Threat Taxonomy&Safety Evaluation
26.04 KASTEL Security Research Labs, Karlsruhe Institute of Technology, Karlsruhe, Germany arxiv Foundations for Agentic AI Investigations from the Forensic Analysis of OpenClaw Threat Taxonomy&AI Forensics&OpenClaw
26.04 University of the Cumberlands arxiv A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms Threat Taxonomy&Benchmarking&MCP Security
26.04 Computer Science & Engineering, Mississippi State University arxiv Attribution-Driven Explainable Intrusion Detection with Encoder-Based Large Language Models Security Analysis&Intrusion Detection&Explainability
26.04 Research Institute of Trustworthy Autonomous Systems arxiv ClawLess: A Security Model of AI Agents ClawLess&Security Model&AI Agents
26.04 Department of Earth Science and Engineering, Imperial College London, London, United Kingdom arxiv SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills Agent Skills&OpenClaw&Benchmarking
26.04 The Pennsylvania State University arxiv TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation TRUSTDESC&Tool Poisoning&Trusted Description Generation
26.04 The Hong Kong Polytechnic University arxiv Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions Threat Taxonomy&Benchmarking&RAG Security
26.04 Arizona State University arxiv Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit Large Language Models&Security Operations Center&LLM Tools
26.04 University of Delaware arxiv Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities Secure Programming Education&LLM-Injected Vulnerabilities&Personalization
26.04 DEXAI - Icaro Lab arxiv Agentic Microphysics: A Manifesto for Generative AI Safety Agentic Microphysics&Generative AI Safety&Manifesto
26.04 Chongqing University ACL 2026 DEEPGUARD: Secure Code Generation via Multi-Layer Semantic Aggregation DEEPGUARD&Secure Code Generation&Multi-Layer Semantic Aggregation
26.04 New York University FORC 2026 Can we Watermark Low-Entropy LLM Outputs? Watermarking&Low-Entropy LLM Outputs&Detectability
26.04 MemTensor, China arxiv A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty Agent Memory&Memory Security&Governance
26.04 Tandon School of Engineering, New York University arxiv Surgical Repair of Insecure Code Generation in LLMs Code Security&Model Repair&Mechanistic Diagnosis
26.04 ETH Zurich arxiv Using large language models for embodied planning introduces systematic safety risks Embodied Planning&Robotics Safety&LLM Agents
26.04 BlueFocus Communication Group arxiv Owner-Harm: A Missing Threat Model for AI Agent Safety Agent Safety&Threat Model&Owner Harm
26.04 University of Oslo arxiv Towards Agentic Investigation of Security Alerts Security Alerts&Agentic Investigation&Cybersecurity
26.05 Fudan University arxiv Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses Embodied AI&Safety Survey&Attacks
26.05 Shanghai Jiao Tong University arxiv ClawGuard: Out-of-Band Detection of LLM Agent Workflow Hijacking via EM Side Channel Workflow Hijacking&Side Channel&Agent Security

💻Presentations & Talks

📖Tutorials & Workshops

Date Type Title URL
23.10 Tutorials Awesome-LLM-Safety link

📰News & Articles

Date Type Title URL
23.01 video ChatGPT and InstructGPT: Aligning Language Models to Human Intention link
23.06 Report “Dual-use dilemma” for GenAI Workshop Summarization link
23.10 News Joint Statement on AI Safety and Openness link

🧑‍🏫Scholars