Adding Detectors¶
Signal detectors analyze agent traces for evidence of security compromise. Each detector targets specific vulnerability classes and emits Signal objects with a tier, confidence, and evidence.
The SignalDetector Protocol¶
Every detector implements this interface:
class SignalDetector(Protocol):
@property
def name(self) -> str: ...
@property
def targets(self) -> list[VulnClass]: ...
async def analyze(
self, trace: AgentTrace, attack: Attack | None = None
) -> list[Signal]: ...
Step-by-Step¶
1. Create the Detector File¶
2. Implement the Detector¶
"""MyDetector — detects [what it detects]."""
from __future__ import annotations
from agent_redteam.core.enums import EventType, SignalTier, TrustBoundary, VulnClass
from agent_redteam.core.models import AgentTrace, Attack, Signal
class MyDetector:
"""Detects [specific security behavior]."""
@property
def name(self) -> str:
return "my_detector"
@property
def targets(self) -> list[VulnClass]:
return [VulnClass.V1_INDIRECT_INJECTION]
async def analyze(
self, trace: AgentTrace, attack: Attack | None = None
) -> list[Signal]:
signals: list[Signal] = []
for event in trace.events:
if self._is_suspicious(event):
signals.append(
Signal(
tier=SignalTier.SUSPICIOUS_BEHAVIOR,
vuln_class=VulnClass.V1_INDIRECT_INJECTION,
trust_boundary=TrustBoundary.B2_EXTERNAL_TO_AGENT,
detector_name=self.name,
description="Detected [specific issue]",
confidence=0.75,
evidence_event_ids=[event.id],
evidence_summary=f"Event: {event.tool_name}",
)
)
return signals
def _is_suspicious(self, event) -> bool:
# Your detection logic here
return False
3. Register the Detector¶
Add it to agent_redteam/detectors/base.py:
from agent_redteam.detectors.my_detector import MyDetector
class DetectorRegistry:
def defaults(self, ...):
# ... existing detectors ...
self.register(MyDetector())
return self
4. Write Tests¶
Create tests/detectors/test_my_detector.py:
import pytest
from agent_redteam.core.enums import EventType
from agent_redteam.core.models import AgentTask, AgentTrace, Event
from agent_redteam.detectors.my_detector import MyDetector
@pytest.fixture
def detector() -> MyDetector:
return MyDetector()
class TestMyDetector:
@pytest.mark.asyncio
async def test_detects_suspicious_case(self, detector):
trace = AgentTrace(
task=AgentTask(instruction="test"),
events=[
Event(
event_type=EventType.TOOL_CALL,
tool_name="shell",
tool_args={"command": "suspicious command"},
),
],
)
signals = await detector.analyze(trace)
assert len(signals) >= 1
@pytest.mark.asyncio
async def test_no_signal_for_safe_case(self, detector):
trace = AgentTrace(
task=AgentTask(instruction="test"),
events=[
Event(
event_type=EventType.TOOL_CALL,
tool_name="file_read",
tool_args={"path": "/safe/file.txt"},
),
],
)
signals = await detector.analyze(trace)
assert len(signals) == 0
Signal Tiers¶
Choose the appropriate tier based on evidence strength:
| Tier | When to Use | Confidence |
|---|---|---|
POLICY_VIOLATION |
Agent did something outside expected scope but not clearly malicious | 0.3--0.5 |
SUSPICIOUS_BEHAVIOR |
Strong indicator of compromise, but could be a false positive | 0.5--0.8 |
DEFINITIVE_COMPROMISE |
Confirmed breach (e.g., canary token in outbound request) | 0.8--1.0 |
Working with AgentTrace¶
The AgentTrace provides helper methods:
trace.events # All events
trace.tool_calls() # Only TOOL_CALL events
trace.events_by_type(EventType.NETWORK_REQUEST)
trace.task # The AgentTask
trace.task.expected_tools # Tools the agent was supposed to use
trace.final_output # Agent's final response text
trace.environment # The Environment used for this run (optional)
trace.environment.network_rules # Network allow/deny/canary rules
trace.environment.canary_tokens # Canary tokens seeded in this environment
Existing Detectors as Examples¶
| Detector | Complexity | Good Example Of |
|---|---|---|
SecretAccessDetector |
Simple | Pattern matching against canary tokens |
ExfiltrationDetector |
Medium | Cross-referencing events with trace.environment network rules and canary domains |
InjectionSuccessDetector |
Medium | Comparing agent behavior against expected behavior |
ToolMisuseDetector |
Complex | Regex-based command analysis (23 patterns) |
ScopeViolationDetector |
Simple | Comparing actual vs expected tool usage |
ExcessiveAgencyDetector |
Medium | Detecting high-impact actions without user confirmation |
InsecureOutputDetector |
Complex | Multi-pattern regex scanning for XSS, SQLi, shell injection |
MemoryPoisonDetector |
Medium | Detecting instruction injection in memory writes |
McpSecurityDetector |
Medium | MCP tool args, descriptions, and results for V12/V5 supply-chain signals |
SemanticJudgeDetector |
Advanced | LLM-as-judge over traces; optional; requires JudgeConfig on Scanner |