
How to Write an Effective AI Agent Skill: The Four-Layer Architecture
Most people treat skills like a folder of scripts. Drop in some Python, write a SKILL.md that lists the commands, and call it done. The agent runs the tools, but it does not understand what it is doing. It picks the wrong script. It skips steps. It does not know when to stop.
The difference between a mediocre skill and a great one is not the code. The real difference is the methodology baked into the SKILL.md. A skill without a good SKILL.md is just a toolbox with no instructions.
At Strobes, we use skills as a core architectural building block for executing cybersecurity tasks. Skills are how our AI agents go from “general-purpose assistant” to “domain expert that follows a proven methodology.” Every security workflow, from web penetration testing to cloud security audits, is encoded as a skill that an agent can pick up and execute autonomously. Here is how to build skills that actually work.
The Four-Layer Architecture
Every effective skill has four layers, each with a distinct job:
┌─────────────────────────────────────────────┐
│ SKILL.md │ Methodology Layer
│ (phases, playbooks, decision trees, rules) │ “What to do and when”
├─────────────────────────────────────────────┤
│ scripts/ │ Scripts Layer
│ (CLI tools the agent executes) │ “How to do it”
├─────────────────────────────────────────────┤
│ scripts/lib/ │ Shared Library Layer
│ (db, output, parsing, utilities) │ “Reusable foundations”
├─────────────────────────────────────────────┤
│ project.db │ Data Layer
│ (SQLite: state, results, evidence) │ “Persistent memory”
└─────────────────────────────────────────────┘
The methodology layer is the brain. The scripts are the hands. The library keeps the hands consistent. And the database is the shared memory that ties it all together.
Most skill authors spend 90% of their time on the scripts and 10% on the SKILL.md. Flip that ratio.
The web penetration testing use case shows exactly why this matters. A web pentest is not one task. It is hundreds of coordinated decisions.
The agent must track which endpoints have been discovered, which ones have been tested, and which still need attention. It manages a sitemap that grows as it crawls, runs testcases from the OWASP Top 10 against each endpoint, and marks them as pass, fail, or skipped. It fuzzes parameters with payloads, records which payloads triggered interesting responses, and links those back to formal findings with evidence. It also has to know that /api/users/{id} was tested for IDOR but not yet for SQL injection, and that the authentication token it is using expires in 20 minutes.
Without structure, the agent drowns. It retests endpoints it already covered. It skips entire vulnerability categories. It loses track of what it found. It generates a report that is missing half the work.
This is fundamentally a planning and execution problem. The skill needs to handle four things:
- Plan what to test by scoping the target, enumerating the attack surface, and importing the relevant checklist
- Track execution and know which testcases are pending, in-progress, passed, or failed
- Maintain state across endpoints discovered, request and response history, and findings with linked evidence
- Measure completeness by surfacing coverage percentages, untested areas, and gaps in the assessment
The four-layer architecture solves this. The methodology layer defines the phases and the order of operations. The scripts layer provides tools for each action, covering scoping, discovering, fuzzing, and recording findings. The shared library keeps database access and output formatting consistent. The data layer, a SQLite database, is the single source of truth that holds everything together across phases.
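To make the tracking problem concrete, here is a minimal sketch of what “measure completeness” can look like against the data layer. The testcases table and its columns are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# An in-memory database stands in for the skill's project.db.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS testcases (
        id       INTEGER PRIMARY KEY,
        endpoint TEXT NOT NULL,
        category TEXT NOT NULL,   -- e.g. 'idor', 'injection'
        status   TEXT NOT NULL DEFAULT 'pending'
                 CHECK (status IN ('pending', 'in-progress', 'passed', 'failed', 'skipped'))
    )
""")
conn.executemany(
    "INSERT INTO testcases (endpoint, category, status) VALUES (?, ?, ?)",
    [
        ("/api/users/{id}", "idor",      "passed"),
        ("/api/users/{id}", "injection", "pending"),  # tested for IDOR, not yet for SQLi
        ("/api/orders",     "injection", "failed"),
        ("/api/orders",     "idor",      "pending"),
    ],
)

# Coverage: the share of testcases that have reached a terminal state.
done, total = conn.execute(
    "SELECT SUM(status IN ('passed', 'failed', 'skipped')), COUNT(*) FROM testcases"
).fetchone()
print(f"coverage: {done}/{total} ({100 * done // total}%)")  # coverage: 2/4 (50%)

# Gaps: what still needs attention, grouped by endpoint.
for endpoint, categories in conn.execute(
    "SELECT endpoint, GROUP_CONCAT(category) FROM testcases "
    "WHERE status = 'pending' GROUP BY endpoint ORDER BY endpoint"
):
    print(f"untested on {endpoint}: {categories}")
```

Two small queries answer “how done are we?” and “what did we miss?”, which is exactly the information the agent loses when state lives only in its context window.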
Orchestrator and Subagents
The skill does not do everything itself, and that is by design. In the Strobes AI architecture, the orchestrator agent uses the skill to plan, track, and manage the overall assessment. It scopes the target, imports the OWASP checklist, discovers endpoints, and builds the test plan. It knows which testcases exist, which are pending, and which have results.
When it is time to actually test something, say, run all the injection testcases against a payment endpoint or deep-dive into an authentication bypass, the orchestrator spins up a subagent. That subagent is not constrained by the skill’s methodology. It operates with full autonomy, using whatever tools and techniques it needs to thoroughly test the specific task it was given. It can craft custom payloads, chain multiple requests, explore unexpected behavior, and pivot based on what it finds. Essentially, it can deep-dive pentest the way an experienced human tester would, following the thread wherever it leads.
The orchestrator does not micromanage. It delegates a task (“test this endpoint for SQL injection”) and waits for the result. When the subagent finishes, the orchestrator takes back control. It records the outcome, updates the testcase status, and links any findings to evidence. If a testcase came back inconclusive or the subagent hit a dead end, the orchestrator can retry it with different parameters or a different approach. If a finding needs deeper validation, it can spin up another subagent to confirm.
This separation matters because it draws a clean line of responsibility.
The orchestrator owns the plan. It uses the skill’s scripts and methodology to manage scope, track progress, measure coverage, and generate reports.
The subagents own the execution. They are unconstrained specialists that go deep on a specific task, free to use their full capabilities without being boxed in by the skill’s workflow.
The skill architecture makes this possible because it provides the shared state layer. The orchestrator and subagents do not need to pass context back and forth through conversation. The database holds the endpoints, testcases, findings, and evidence. A subagent writes its results to the database and the orchestrator reads them, then decides what to do next.
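A hypothetical orchestrator loop illustrates that handoff. State lives in the database, and delegation is just “claim a pending testcase, hand it to a subagent, write the verdict back.” The schema and the run_subagent stub are stand-ins, not the real implementation:

```python
import sqlite3

# Shared state: a throwaway schema standing in for the skill's project.db.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE testcases (
        id INTEGER PRIMARY KEY,
        endpoint TEXT NOT NULL,
        category TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'
    )
""")
conn.execute("INSERT INTO testcases (endpoint, category) VALUES ('/api/payments', 'injection')")

def run_subagent(endpoint, category):
    """Stand-in for delegating to an unconstrained subagent.

    A real subagent would go deep on the task and write its evidence to the
    database itself; here it just returns a verdict for the orchestrator."""
    return "failed"  # 'failed' meaning the testcase exposed a vulnerability

# The orchestrator's loop: claim pending work, delegate, record the outcome.
while True:
    row = conn.execute(
        "SELECT id, endpoint, category FROM testcases WHERE status = 'pending' LIMIT 1"
    ).fetchone()
    if row is None:
        break  # nothing pending: the plan is fully executed
    tc_id, endpoint, category = row
    conn.execute("UPDATE testcases SET status = 'in-progress' WHERE id = ?", (tc_id,))
    verdict = run_subagent(endpoint, category)
    conn.execute("UPDATE testcases SET status = ? WHERE id = ?", (verdict, tc_id))
```

Nothing is passed through conversation: the orchestrator decides what to do next purely by reading testcase statuses back out of the database.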
This pattern applies far beyond pentesting. Infrastructure audits, compliance assessments, incident response, and code review all share the same planning and execution problem. The skill architecture is how you tame it, and the orchestrator-subagent split is how you get both structure and depth.
Start with the Methodology, Not the Code
Before you write a single line of Python, answer five questions.
- What are the phases of this activity? Every skill follows an ordered workflow, usually 4 to 6 phases from setup to output. Define them.
- What are the key actions in each phase? What does the agent actually do at each step?
- What data needs to persist between phases? Phase 2’s output is Phase 3’s input. What is the handoff?
- What are the decision points? Where does the agent need to choose between approaches?
- What are the non-negotiable rules? What must always happen? What must never happen?
Write the phase table first. Everything else follows from it.
Phase 1: Setup & Scope → Initialize project, define boundaries
Phase 2: Discovery → Enumerate and catalog targets
Phase 3: Analysis → Understand structure, identify areas of interest
Phase 4: Execution → Run the core task
Phase 5: Documentation → Record results with evidence
Phase 6: Reporting → Generate output, verify completeness
Your domain will shape the specifics. A web pentest skill, a source code review skill, and a cloud security assessment skill all follow this same arc. They just fill in different details for their domain.
The SKILL.md is the Most Important File
The SKILL.md is not documentation for humans. It is a set of instructions that get loaded into the agent’s context when the skill activates. Every section serves a purpose.
Skill identity
Keep it under 10 lines. Name, purpose, how to invoke scripts, where the database lives.
Quick reference table
A single lookup table the agent scans to find the right tool.
| Script | Subcommands | Purpose | Phase |
|---------------|--------------------------|--------------------------|-------|
| scope.py | init, add, list, check | Define project boundaries| 1 |
| discover.py | scan, results, export | Find and catalog targets | 2 |
| analyze.py | run, compare, report | Core analysis logic | 3-4 |
This table is load-bearing. If it’s wrong, the agent picks the wrong tool.
Phase methodology
Each phase gets a one-sentence goal, numbered steps with exact commands, decision points with an if-this-then-that structure, and a clear statement of what to feed into the next phase. Do not be vague. “Analyze the results” is useless. “Run python3 skill/scripts/analyze.py run --input results.json --threshold 0.8 and check if any items have a score above the threshold” is useful.
Playbooks
Playbooks are end-to-end recipes for specific scenarios. Each one walks through identifying targets, executing with realistic inputs, verifying the result, and recording the output. The agent does not improvise well under ambiguity. Playbooks eliminate ambiguity.
Decision trees
When the agent has to choose, give it a map.
What type of input are we working with?
├── Structured data (JSON, CSV) → Use parser.py with --format flag
├── Unstructured text → Use analyzer.py with --mode text
│ ├── Short (<1000 chars) → Use --batch single
│ └── Long (>1000 chars) → Use --batch chunked
└── Binary files → Use extractor.py first, then analyzer.py
Without decision trees, the agent guesses. With them, it reasons.
Rules and Constraints
Non-negotiable guardrails. Things like:
- Always verify scope before executing
- Never overwrite existing results without confirmation
- Always record evidence alongside findings
- Run the coverage check before generating the final report
Rules are the agent’s conscience. Be explicit.
Script Design: One Script Per Domain
Don’t create a script for every action. Create one script per domain with subcommands:
# Good: one script, multiple subcommands
python3 skill/scripts/scope.py init --name "my-project"
python3 skill/scripts/scope.py add --type url --value "https://example.com"
python3 skill/scripts/scope.py list
python3 skill/scripts/scope.py check --value "https://example.com"
# Bad: separate scripts for each action
python3 skill/scripts/init_scope.py --name "my-project"
python3 skill/scripts/add_scope.py --type url --value "https://example.com"
Every script should follow the same structure:
- Use argparse with subcommands for a consistent CLI interface
- Write a handler function per subcommand for clean separation of concerns
- Route all output through a shared formatting library: never bare print(), always success(), error(), info(), or table()
- Route all database access through a shared db module, never inline sqlite3.connect()
- Avoid external dependencies so the script runs anywhere on the Python stdlib
- Add a --json flag on every subcommand for machine-readable output when the agent needs to parse results
Keep scripts idempotent where possible. Running the same command twice should not corrupt state.
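A minimal sketch of that structure, assuming a scope.py-style script. The subcommands and flags here are illustrative, and emit() is a stand-in for the shared output library:

```python
#!/usr/bin/env python3
"""scope.py -- sketch of the one-script-per-domain pattern (illustrative, not a real tool)."""
import argparse
import json

def cmd_init(args):
    emit({"project": args.name, "status": "initialized"}, args)

def cmd_add(args):
    emit({"type": args.type, "value": args.value, "status": "added"}, args)

def emit(result, args):
    # In a real skill this routes through the shared output library, never
    # bare print(); the --json flag keeps output machine-readable.
    if args.json:
        print(json.dumps(result))
    else:
        print(f"[+] {result}")

def build_parser():
    parser = argparse.ArgumentParser(prog="scope.py")
    sub = parser.add_subparsers(dest="command", required=True)

    p_init = sub.add_parser("init", help="Initialize project boundaries")
    p_init.add_argument("--name", required=True)
    p_init.set_defaults(func=cmd_init)

    p_add = sub.add_parser("add", help="Add an item to scope")
    p_add.add_argument("--type", choices=["url", "ip", "domain"], required=True)
    p_add.add_argument("--value", required=True)
    p_add.set_defaults(func=cmd_add)

    # Every subcommand gets --json for machine-readable output.
    for p in (p_init, p_add):
        p.add_argument("--json", action="store_true", help="Emit JSON")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    args.func(args)  # dispatch to the handler registered via set_defaults
```

The set_defaults(func=...) pattern keeps dispatch trivial: the parser decides which handler runs, and each handler stays a small, testable function.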
Shared State via SQLite
Scripts need to share state. Phase 2 discovers things that Phase 4 acts on. The database is the contract between them. Design your schema around three categories.
- Core tables that every skill needs: project metadata (name, owner, start date), a scope table for what is in and out of bounds, and a findings or results table for the output of the skill’s work
- Tracking tables: testcases or checklists showing what has been done and what remains
- Domain-specific tables: whatever entities your skill operates on, whether those are files, endpoints, resources, or records
Use SQLite with WAL mode and foreign keys. Keep the schema in lib/db.py as CREATE TABLE IF NOT EXISTS statements so initialization is idempotent.
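A sketch of what that lib/db.py might look like. The schema and the git-style upward search for project.db are illustrative assumptions:

```python
"""lib/db.py -- sketch of the shared database module (names are illustrative)."""
import sqlite3
from pathlib import Path

# Idempotent schema: CREATE TABLE IF NOT EXISTS is safe to run on every connect.
SCHEMA = """
CREATE TABLE IF NOT EXISTS project (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS findings (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    title TEXT NOT NULL,
    severity TEXT CHECK (severity IN ('info', 'low', 'medium', 'high', 'critical'))
);
"""

def find_db(start=None):
    """Walk up the directory tree looking for project.db, the way git finds .git."""
    current = (Path(start) if start else Path.cwd()).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / "project.db").exists():
            return candidate / "project.db"
    return current / "project.db"  # default: create alongside the caller

def connect(path=None):
    conn = sqlite3.connect(path or find_db())
    conn.execute("PRAGMA journal_mode=WAL")  # readers don't block the writer
    conn.execute("PRAGMA foreign_keys=ON")   # enforce finding -> project links
    conn.executescript(SCHEMA)               # idempotent initialization
    return conn
```

Because every script calls connect() instead of sqlite3.connect(), the pragmas and schema are applied uniformly no matter which script touches the database first.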
The Shared Library Layer
Two modules are non-negotiable. The first is db.py, which handles database connection management, schema initialization, and path resolution by walking up the directory tree to find project.db. The second is output.py, which provides consistent formatting with color-coded status indicators including success(), error(), warn(), info(), header(), table(), and json_out(). Consistent output is not cosmetic. It helps the agent parse results reliably.
Beyond these two, add domain-specific modules as needed, covering HTTP clients, parsers, payload generators, and format converters. The rule is simple. If two scripts would duplicate the same logic, extract it into lib/.
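Here is a minimal sketch of an output.py along those lines. The exact color codes and layout are assumptions; what matters is that every script formats status lines and tables identically:

```python
"""lib/output.py -- sketch of the shared formatting module (illustrative)."""
import json
import sys

GREEN, RED, YELLOW, BLUE, RESET = "\033[32m", "\033[31m", "\033[33m", "\033[34m", "\033[0m"

def success(msg): print(f"{GREEN}[+]{RESET} {msg}")
def error(msg):   print(f"{RED}[-]{RESET} {msg}", file=sys.stderr)
def warn(msg):    print(f"{YELLOW}[!]{RESET} {msg}")
def info(msg):    print(f"{BLUE}[*]{RESET} {msg}")

def header(title):
    print(f"\n{title}\n{'=' * len(title)}")

def table(rows, headers):
    """Fixed-width columns so output stays parseable line by line."""
    widths = [max(len(str(cell)) for cell in column) for column in zip(headers, *rows)]
    for row in [headers, *rows]:
        print("  ".join(str(cell).ljust(w) for cell, w in zip(row, widths)))

def json_out(data):
    print(json.dumps(data, indent=2, default=str))
```

The [+]/[-]/[!] prefixes double as machine-readable status markers: the agent can grep for them instead of guessing whether a command succeeded.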
Registration and Deployment
Place your skill where the agent framework can find it. The exact location depends on your setup, but the pattern is consistent. For project-level skills tied to a specific engagement, use your-project/skills/my-skill/SKILL.md. For shared skills available across all projects, use ~/.config/agent/skills/my-skill/SKILL.md.
Add YAML frontmatter to register it.
---
name: my-skill
description: >
  One-paragraph description of what the skill does AND when to use it.
  Be specific — this is how the agent decides whether to load the skill.
allowed-tools: Bash(python3 skill/scripts/*)
---
The description field is critical. It is not a docstring for humans. It is the trigger condition for the agent. Write it like an if-statement: “Use when X, Y, or Z.” In the Strobes AI platform, this is how the orchestration layer decides which skill to activate for a given cybersecurity task. A well-written description means the right skill fires at the right time. The allowed-tools field pre-approves script execution so the agent does not need permission for every command.
The Development Workflow
Build in this order:
1. Define the methodology: phases, actions, decision points, and rules.
2. Design the database schema: entities, relationships, and state.
3. Build the library layer: db.py, output.py, and any domain modules.
4. Build scripts in phase order: scope, then discovery, then analysis, then reporting.
5. Write the SKILL.md once you know what the tools actually do.
6. Verify that everything matches: every script referenced, every flag name correct, every example runnable.
7. Test with the agent: run end-to-end, note where the agent gets confused, add clarity to the SKILL.md, and repeat.
That last step is where most of the real work happens. The agent is your QA team for the methodology. If it gets confused, your SKILL.md is not clear enough.
Common Mistakes
Writing the SKILL.md as documentation is the most common error. It is not a README. It is a set of instructions optimized for an AI agent to execute. Be precise. Be imperative. Include exact commands.
Vague phase descriptions are just as damaging. “Analyze the data” tells the agent nothing. “Run analyze.py scan --target X and check if any results have severity above high” tells it everything.
Missing decision trees leave the agent guessing at forks. If you do not tell it how to choose, it will guess and often guess wrong.
Skipping a rules section means the agent will take shortcuts. If something must always happen or must never happen, write it down.
Script flags that do not match the SKILL.md will cause failures. The agent reads the SKILL.md and runs the commands it finds there. If --filter in the docs is actually --query in the code, the command fails. Verify every flag name.
Skipping the test-with-agent step will surprise you. Run the full workflow end to end at least once, and ideally several times.
The Takeaway
A great skill teaches the agent a methodology, not just a set of tools. The SKILL.md is the brain. Phases tell it what order to work in. Playbooks give it recipes for specific scenarios. Decision trees help it choose between approaches. Rules keep it on track.
The scripts are important, but they are the easy part. Encoding your expertise into a format that an agent can follow autonomously is the hard part, and it is where the real difference is made.
At Strobes AI, every cybersecurity workflow ships as a skill with this exact methodology-first design. The result is measurable. Agents run complete security assessments, including web penetration testing, API security assessment, cloud configuration audits, and source code review, with the rigor and structure of an experienced practitioner, without hand-holding.
Build the methodology first. The code follows.
Want to see the full architecture in a working codebase? The open-source web pentest skill on GitHub is a complete Burp Suite-inspired CLI toolkit with 12 scripts, built-in wordlists, SQLite state management, and a six-phase methodology that an AI agent can pick up and run end-to-end. It is the fastest way to understand how all four layers fit together in practice. View it on GitHub
See a real AI agent run a full web pentest using this exact architecture