Red Teaming AI Customer Service Agents for a SaaS Company
A B2B SaaS company had deployed GPT-4-powered customer service agents across their platform, handling 50K+ conversations daily. They needed to understand the real-world attack surface before expanding the AI agents to handle sensitive account operations.
The Challenge
The client's AI agents were built on OpenAI's GPT-4 API with function calling, connected to internal systems for order lookups, account modifications, and refund processing. The agents had access to customer PII, billing data, and could execute actions like issuing credits and modifying subscriptions.
While the engineering team had implemented basic input filtering and system prompt protections, they had no structured methodology for testing LLM-specific attack vectors. With plans to expand the AI agents into sales and onboarding, they needed a security baseline before granting the agents more capabilities.
Our Approach
We designed a 2-week AI red teaming engagement based on the OWASP Top 10 for LLM Applications and our proprietary attack taxonomy:
- 1. Prompt Injection Testing — Direct injection, indirect injection via user-controlled content fields (order notes, profile bios), and multi-turn conversation manipulation.
- 2. Data Exfiltration — Attempting to extract other customers' PII through conversation manipulation, system prompt leakage, and function calling abuse.
- 3. Privilege Escalation — Testing whether the agent could be manipulated into performing actions beyond its intended scope, such as bulk refunds or admin-level account changes.
- 4. Tool/Function Abuse — Analyzing the function calling schema for over-permissioned tools and testing for parameter injection in API calls the agent makes.
Key Findings
We identified 12 distinct attack vectors, 3 of which could lead to customer PII exfiltration:
The most significant finding was an indirect prompt injection path: attackers could inject instructions into the "order notes" field of an e-commerce order. When a customer service agent later retrieved that order via function calling, the injected instructions were executed in the agent's context, allowing the attacker to redirect the conversation, extract the current customer's email and billing info, and instruct the agent to summarize it in a follow-up message.
We also discovered that the agent's system prompt was extractable through a multi-turn jailbreak technique, revealing internal API endpoints, function schemas, and business logic rules. The refund function had no server-side cap, meaning a manipulated agent could issue refunds exceeding the order value.
Additionally, the agent's conversation memory had no proper isolation between sessions — by referencing conversation IDs, it was possible to access context from other users' interactions in certain edge cases.
Outcome
We delivered a comprehensive hardening playbook covering input sanitization, output filtering, function calling constraints, and conversation isolation. Key recommendations implemented:
- Server-side guardrails on all agent-callable functions (amount caps, rate limits, scope restrictions)
- Content Security Policy for LLM inputs — sanitizing user-controlled fields before they enter the agent's context
- Conversation memory isolation with cryptographic session binding
- Automated prompt injection detection layer using a classifier model
- Red team testing integrated into the CI/CD pipeline for prompt template changes
The playbook was adopted across all 3 of the client's AI-powered product lines. They now run quarterly AI red teaming exercises as part of their security program.
Services Used
Deploying AI agents?
Make sure they can't be turned against your customers.
Get an AI Security Assessment