Create new eval suites for the deepagentsjs monorepo. Handles dataset design, test case scaffolding, scoring logic, vitest configuration, and LangSmith integration. Use when the user asks to: (1) create an eval, (2) write an evaluation, (3) add a benchmark, (4) build an eval suite, (5) evaluate agent behaviour, (6) add test cases for a capability, or (7) implement an existing benchmark (e.g. oolong, AgentBench, SWE-bench). Trigger on phrases like 'create eval', 'new eval', 'add eval', 'benchmark', 'evaluate', 'eval suite', 'write evals for'.
Add assessment annotations to a Semiont resource — flag scheduling risks, dangers, inaccuracies, logical gaps, or other evaluative concerns using AI-assisted or manual assessment.
- 📁 examples/
- 📄 README.md
- 📄 SKILL.md
Qualify trade show leads from badge scans, booth notes, or voice memos into scored CRM-ready cards. "Score my booth leads" / "给展会线索打分" / "Leads qualifizieren" / "リードを評価する" / "calificar leads de feria". 展会线索/资质审核/线索分级 Leadqualifizierung Messeleads 展示会リード評価 calificación de leads
Create a trigger evaluation setup for a toolkit skill. Use when the user wants to test whether a skill's description triggers correctly, set up eval workspaces, or generate trigger test queries for a skill. Use when the user says 'create eval', 'test triggers', 'eval skill', or wants to measure skill triggering accuracy.
Classify and score business risks so agents produce consistent, comparable assessments.
Deep product audit. Brutally honest assessment of what you're building, who it's for, the biggest strategic gap, and the question you're avoiding. Produces AUDIT.md.
Diagnose and test Claude Code skills against Anthropic's 7 principles. Scans SKILL.md files, checks 8 rules (gotchas, description, allowed-tools, file-size, structure, frontmatter, conflicts, usage-hooks), classifies skill types, generates prescriptions, and runs eval tests. Use when checking skill quality, auditing skills, testing skills, or before publishing skills. Triggers on "스킬 점검", "스킬 진단", "스킬 테스트", "check skills", "audit skills", "test skills", "skill health", "pulser", "pulser eval".
- 📁 references/
- 📄 .gitkeep
- 📄 SKILL.md
Playbook for authoring, running, evaluating, and improving Gina sandbox workflows with safe defaults and repeatable operations.
- 📁 assets/
- 📁 references/
- 📄 SKILL.md
Conducts a structured gap assessment of an organization's readiness against ISO 42001:2023 (AI Management System standard). Runs an interview-style evaluation across all mandatory clauses (4-10) and applicable Annex A controls. Produces a scored gap assessment report saved to the vault, a draft Statement of Applicability, and a prioritized list of gaps to address before certification. Requires a vault created by /setup-iso42001-vault.
- 📄 SKILL.md
- 📄 SKILL.md.tmpl
HIPAA compliance interview. Processes one NIST 800-53 control at a time — reads the official NIST assessment method and asks relevant questions. Covers vendors (SA-9), risk (RA-3), training (AT-2), and all other interview-only controls.