Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.

---

# Auto Arena Skill

End-to-end automated model comparison using the OpenJudge `AutoArenaPipeline`:

1. **Generate queries** — LLM creates diverse test queries from the task description
2. **Collect responses** — query all target endpoints concurrently
3. **Generate rubrics** — LLM produces evaluation criteria from the task + sample queries
4. **Pairwise evaluation** — judge model compares every model pair (with position-bias swap)
5. **Analyze & rank** — compute win rates, win matrix, and rankings
6. **Report & charts** — Markdown report + win-rate bar chart + optional matrix heatmap

## Prerequisites

```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| Number of queries | No | Default: `20` |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: `./evaluation_results` |
| Report language | No | `"zh"` (default) or `"en"` |

## Quick start

### CLI
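Whichever way the pipeline is launched, the API keys from the table above must be in the environment first. A minimal setup sketch (the variable names are the ones listed in the table; the values are placeholders):

```bash
# Keys for the target endpoints and the judge model; values are placeholders.
export OPENAI_API_KEY="sk-..."       # OpenAI-compatible target endpoints
export DASHSCOPE_API_KEY="sk-..."    # e.g. a qwen-max judge endpoint

# Optional: create the default output location for reports and charts.
mkdir -p ./evaluation_results
```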
Skill files are scattered across GitHub and communities, difficult to search, and hard to evaluate. SkillWink organizes open-source skills into a searchable, filterable library you can directly download and use.
We provide keyword search, version updates, multi-metric ranking (downloads / likes / comments / updates), and open SKILL.md standards. You can also discuss usage and improvements on skill detail pages.
Sort by downloads, likes, comments, or last updated to find higher-quality skills.
4. Which import methods are supported?
- Upload an archive: `.zip` / `.skill` (recommended)
- Upload a skills folder
- Import from a GitHub repository
Note: whichever method you use, the upload must be 10 MB or smaller.
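For the archive route, a quick way to package a skill and confirm it fits the size limit (the folder name `auto-arena` is just an example):

```bash
# Zip the skill folder and check that the archive stays under 10 MB.
zip -r auto-arena.zip auto-arena/
du -h auto-arena.zip
```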
5. How do I use a skill in Claude / Codex?
Typical paths (may vary by local setup):
- Claude Code: `~/.claude/skills/`
- Codex CLI: `~/.codex/skills/`
One SKILL.md can usually be reused across tools.
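For example, a downloaded skill can be installed for Claude Code by copying its folder into the path above (the folder name `auto-arena` is illustrative; adjust the path for your setup or for Codex CLI):

```bash
# Install an unpacked skill for Claude Code by copying it into the skills directory.
mkdir -p ~/.claude/skills
cp -r auto-arena ~/.claude/skills/

# Confirm SKILL.md sits at the top level of the copied folder.
ls ~/.claude/skills/auto-arena/SKILL.md
```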
6. Can one skill be shared across tools?
Yes. Most skills are standardized docs plus assets, so they can be reused wherever the format is supported.
Example: a single workflow combining retrieval, writing, and automation scripts.
7. Are these skills safe to use?
Some skills come from public GitHub repositories and some are uploaded by SkillWink creators. Always review the code before installing, and make your own security decisions.
8. Why doesn't a skill work after import?
The most common reasons (a quick shell check follows the list):
- Wrong folder path, or the skill nested one level too deep
- Invalid or incomplete SKILL.md fields or format
- Missing dependencies (Python / Node / CLI tools)
- The tool has not reloaded its skills yet
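A sketch of that check, using the Claude Code path from above and the Auto Arena skill's dependencies as a stand-in example (names and paths may differ on your machine):

```bash
# 1) Path and nesting: SKILL.md should sit directly inside the skill folder.
ls ~/.claude/skills/auto-arena/SKILL.md

# 2) SKILL.md content: eyeball the fields for anything missing or malformed.
head ~/.claude/skills/auto-arena/SKILL.md

# 3) Dependencies: confirm the packages the skill requires are installed.
pip show py-openjudge matplotlib
```

The fourth cause is not scriptable: restart the tool or reload its skills and try again.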
9. Does SkillWink include duplicate or low-quality skills?
We try to avoid that. Use ranking + comments to surface better skills: