---
name: test-flakiness
description: "Detect non-deterministic (flaky) tests by reading CI run logs or test result history. Aggregates pass rates per test, identifies intermittent failures, recommends quarantine or fix, and maintains a flaky test registry. Best run during Polish phase or after multiple CI runs."
argument-hint: "[ci-log-path | scan | registry]"
user-invocable: true
allowed-tools: Read, Glob, Grep, Write, Edit, Bash
model: sonnet
---

# Test Flakiness Detection

A flaky test is one that sometimes passes and sometimes fails without any code
change. In some ways flaky tests are worse than no tests: they train the team
to ignore red CI runs, which masks genuine failures. This skill identifies them,
explains likely causes, and recommends whether to quarantine or fix each one.

**Output:** Updated `tests/regression-suite.md` quarantine section + optional
`production/qa/flakiness-report-[date].md`

**When to run:**
- Polish phase (tests have had many runs; statistical signal is reliable)
- When developers start dismissing CI failures as "probably flaky"
- After `/regression-suite` identifies quarantined tests that need diagnosis

---

## 1. Parse Arguments

**Modes:**
- `/test-flakiness [ci-log-path]`: analyse a specific CI run log file
- `/test-flakiness scan`: scan all available CI logs in `.github/`,
  `test-results/`, or standard log output directories
- `/test-flakiness registry`: read the existing `tests/regression-suite.md`
  quarantine section and provide remediation guidance for already-known
  flaky tests
- No argument: auto-detect. Run `scan` if CI logs are accessible, else
  `registry`

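The mode selection above can be sketched as a small dispatcher. This is only an illustration of the decision order, not part of the skill itself: `resolve_mode` and the `ci_logs_accessible` flag are hypothetical names, and treating any argument other than `scan` or `registry` as a log path is an assumption drawn from the mode list.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_mode(argument: Optional[str], ci_logs_accessible: bool) -> Tuple[str, Optional[Path]]:
    """Return (mode, log_path) following the mode list above (hypothetical helper)."""
    arg = (argument or "").strip()
    if arg == "scan":
        return "scan", None
    if arg == "registry":
        return "registry", None
    if arg:
        # Assumption: any other argument is a path to one CI log file.
        return "log", Path(arg)
    # No argument: auto-detect based on whether CI logs can be found.
    return ("scan" if ci_logs_accessible else "registry"), None


print(resolve_mode(None, ci_logs_accessible=False))  # ('registry', None)
```
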

---

## 2. Locate CI Log Data

### Option A: GitHub Actions (preferred)

Check for test result artifacts:
```bash
# List workflow files and any collected test results, newest first.
# Errors are suppressed so a missing directory is not fatal.
ls -t .github/ 2>/dev/null
ls -t test-results/ 2>/dev/null
```

For Godot projects: GdUnit4 outputs XML results compatible with JUnit format.
Check `test-results/` for `.xml` files.

For Unity projects: game-ci test runner outputs NUnit XML to `test-results/`
by default.

For Unreal projects: automation logs go to `Saved/Logs/`. Grep for
`Result: Success` and `Result: Fail` patterns.

### Option B: Local log files

If a path argument is provided, read that file directly.

### Option C: No log data available

If no logs are found:
> "No CI log data found. To detect flaky tests, this skill needs test result
> history from multiple runs. Options:
> 1. Run the test suite at least 3 times and collect the output logs
> 2. Check CI pipeline output and save a log to `test-results/`
> 3. Run `/test-flakiness registry` to review tests already flagged as flaky
>    in `tests/regression-suite.md`"

Stop and ask the user which option to pursue.

---

## 3. Parse Test Results

For each CI log or result file found, parse:

**XML result files** (GdUnit4 JUnit XML / Unity NUnit XML):
- Grep for `<testcase name=` to get test names
- Grep for `<failure` or `<error` to identify failures
- Parse `classname` and `name` attributes for full test identifiers
- Unity's NUnit XML uses `<test-case>` elements with a `result` attribute
  rather than `<testcase>`; match on those for Unity result files

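As a rough sketch of the JUnit-style parsing described above, the following uses Python's standard `xml.etree.ElementTree`. The function name `parse_junit_xml`, the `PASS` / `FAIL` / `SKIP` labels, and the sample test names are all hypothetical; real GdUnit4 output may carry extra attributes that this ignores.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def parse_junit_xml(xml_text: str) -> Dict[str, str]:
    """Map 'classname.name' to 'PASS', 'FAIL', or 'SKIP' for one result file."""
    root = ET.fromstring(xml_text)
    results: Dict[str, str] = {}
    for case in root.iter("testcase"):
        # Build the full identifier from classname + name, as the list above suggests.
        test_id = f"{case.get('classname', '')}.{case.get('name', '')}".lstrip(".")
        if case.find("failure") is not None or case.find("error") is not None:
            results[test_id] = "FAIL"
        elif case.find("skipped") is not None:
            results[test_id] = "SKIP"
        else:
            results[test_id] = "PASS"
    return results


# Made-up sample data for illustration only.
SAMPLE = """<testsuite name="combat" tests="3">
  <testcase classname="CombatTest" name="test_damage"/>
  <testcase classname="CombatTest" name="test_crit"><failure message="expected 2.0"/></testcase>
  <testcase classname="CombatTest" name="test_heal"><skipped/></testcase>
</testsuite>"""

print(parse_junit_xml(SAMPLE))
```

Parsing the XML tree rather than grepping raw lines avoids depending on attribute order inside the `<testcase>` tag.
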

**Plain text logs**:
- Grep for pass/fail patterns:
  - Godot: `PASSED` / `FAILED` adjacent to test names
  - Unreal: `Result: Success` / `Result: Fail`
  - Unity: `Test passed` / `Test failed`

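For plain-text logs, a regular expression over each line is usually enough. The sketch below handles only the Godot-style `PASSED` / `FAILED` markers and assumes a line layout of `<test name> PASSED`; that layout, the `parse_plain_log` name, and the sample log are assumptions, so the pattern would need adjusting to the actual runner output.

```python
import re
from typing import Dict

# Assumed layout: a test name, whitespace, then PASSED or FAILED.
LINE_PATTERN = re.compile(r"^\s*(?P<name>\S+)\s+(?P<status>PASSED|FAILED)\b")


def parse_plain_log(log_text: str) -> Dict[str, str]:
    """Map test name to 'PASS' or 'FAIL' for one plain-text log."""
    results: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            results[match.group("name")] = "PASS" if match.group("status") == "PASSED" else "FAIL"
    return results


# Made-up log excerpt for illustration only.
SAMPLE_LOG = """Running suite: inventory
test_add_item PASSED
test_stack_limit FAILED (expected 99, got 100)
Suite finished
"""

print(parse_plain_log(SAMPLE_LOG))
```
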

Build a table: `test_id → [run1_result, run2_result, run3_result, ...]`

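One way to build that table is to merge the per-run result maps into a single history keyed by test id. In this sketch `build_history` is a hypothetical helper and the run data is invented; a test that is missing from a run simply ends up with a shorter list.

```python
from collections import defaultdict
from typing import Dict, List


def build_history(runs: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Merge per-run {test_id: result} maps (oldest first) into test_id -> [results]."""
    history: Dict[str, List[str]] = defaultdict(list)
    for run in runs:
        for test_id, result in run.items():
            history[test_id].append(result)
    return dict(history)


# Invented results from three runs of the same commit.
runs = [
    {"test_jump": "PASS", "test_dash": "PASS"},
    {"test_jump": "FAIL", "test_dash": "PASS"},
    {"test_jump": "PASS", "test_dash": "PASS"},
]
print(build_history(runs))
# {'test_jump': ['PASS', 'FAIL', 'PASS'], 'test_dash': ['PASS', 'PASS', 'PASS']}
```
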

---

## 4. Identify Flaky Tests

A test is **flaky** if it appears in the result history with both PASS and
FAIL outcomes across runs with no code changes between them. A test that
fails in every run is a consistent failure, not a flaky one.

Flakiness thresholds:
- **High flakiness**: Fails in >25% of runs; quarantine immediately
- **Moderate flakiness**: Fails in 5–25% of runs; investigate and fix soon
- **Low/suspected flakiness**: Fails in fewer than 5% of runs (roughly 1–5%);
  monitor, since this may be a genuinely rare failure

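The thresholds above can be expressed as a small classifier. This is a sketch under a few assumptions: `classify_flakiness` is a hypothetical name, skipped results are ignored, a rate of exactly 5% counts as moderate, and exactly 25% also counts as moderate because the high band is stated as "more than 25%". Integer arithmetic is used so the band edges do not depend on floating-point rounding.

```python
from typing import List


def classify_flakiness(results: List[str]) -> str:
    """Return 'high', 'moderate', 'low', or 'not flaky' for one test's history."""
    counted = [r for r in results if r in ("PASS", "FAIL")]  # ignore skips
    total = len(counted)
    fails = counted.count("FAIL")
    if fails == 0 or fails == total:
        # Always passing or always failing is consistent, not flaky.
        return "not flaky"
    if fails * 4 > total:    # fail rate > 25%
        return "high"
    if fails * 20 >= total:  # fail rate >= 5%
        return "moderate"
    return "low"


print(classify_flakiness(["PASS", "FAIL", "PASS", "FAIL"]))  # high
print(classify_flakiness(["PASS"] * 9 + ["FAIL"]))           # moderate
```
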

For each flaky test, classify the likely cause:

### Cause classification

| Cause | Symptoms | Fix direction |
|-------|----------|---------------|
| **Timing / async** | Fails after awaiting signals or timers; pass rate correlates with system load | Add explicit await/synchronisation; avoid time-based delays |
| **Order dependency** | Fails when run after specific other tests; passes in isolation | Add proper setup/teardown; ensure test isolation |
| **Random seed** | Fails intermittently with no pattern; involves RNG | Pass explicit seed; don't use `randf()` in tests |
| **Resource leak** | Fails more often later in a test run | Fix cleanup in teardown; check orphan nodes (Godot) or object disposal (Unity) |
| **External state** | Fails when a file, scene, or global exists from a prior test | Isolate test from file system; use in-memory mocks |
| **Floating point** | Fails on comparisons like `== 0.5` | Use epsilon comparison (`is_equal_approx`, `Assert.AreApproximatelyEqual`) |
| **Scene/prefab load race** | Fails when scenes are not yet ready | Await one frame after instantiation; use `await get_tree().process_frame` |

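Two of the fix directions in the table, epsilon comparison and explicit seeding, are easy to demonstrate in isolation. The snippet below uses Python purely as a neutral illustration; `math.isclose` and `random.Random` stand in for the engine-specific calls and are not the Godot or Unity APIs.

```python
import math
import random

# Floating point: exact equality fails on values that look equal.
total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True

# Random seed: two generators with the same explicit seed produce the
# same sequence, so a test that seeds its RNG is reproducible.
rng_a = random.Random(1234)
rng_b = random.Random(1234)
rolls_a = [rng_a.randint(1, 6) for _ in range(5)]
rolls_b = [rng_b.randint(1, 6) for _ in range(5)]
print(rolls_a == rolls_b)        # True
```
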

Use Grep to check the test file for timing calls, `randf`, global state access,
or equality comparisons on floats to narrow down the cause.

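The source scan described above could look roughly like this. The patterns are illustrative heuristics only, chosen to match a few common GDScript and Unity calls; `CAUSE_PATTERNS`, `suggest_causes`, and the sample test body are hypothetical, and a match is a hint for further reading rather than a diagnosis.

```python
import re
from typing import List

# Heuristic patterns (assumptions, not an exhaustive list).
CAUSE_PATTERNS = {
    "Timing / async": re.compile(r"create_timer|WaitForSeconds|\bawait\b"),
    "Random seed": re.compile(r"\brand[fi]?\w*\(|Random\.Range"),
    "Floating point": re.compile(r"==\s*-?\d+\.\d+"),
}


def suggest_causes(test_source: str) -> List[str]:
    """Return the cause labels whose pattern appears in the test source."""
    return [cause for cause, pattern in CAUSE_PATTERNS.items() if pattern.search(test_source)]


# Invented GDScript-like test body for illustration only.
SAMPLE_TEST = """func test_spawn():
    var x = randf()
    await get_tree().create_timer(0.5).timeout
    assert(x == 0.5)
"""

print(suggest_causes(SAMPLE_TEST))
# ['Timing / async', 'Random seed', 'Floating point']
```
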

---

## 5. Recommend Action

For each flaky test:

**Quarantine (High flakiness):**
> "Quarantine this test immediately. Disable it in CI by adding
> `@pytest.mark.skip` / `[Ignore]` / `GdUnitSkip` annotation. Log it in
> `tests/regression-suite.md` quarantine section. The test is now opt-in only.
> Fix the root cause before removing quarantine."

**Investigate and fix soon (Moderate):**
> "This test is intermittently unreliable. Root cause appears to be [cause].
> Suggested fix: [specific fix based on cause classification]. Do not quarantine
> yet; fix the test directly."

**Monitor (Low/suspected):**
> "This test shows suspected flakiness. Collect more run data before
> quarantining. Note it as 'suspected' in the regression suite."

---

## 6. Generate Reports

### In-conversation summary

```
## Flakiness Detection Results

**Runs analysed**: [N]
**Tests tracked**: [N]

### Flaky Tests Found

| Test | System | Fail Rate | Likely Cause | Recommendation |
|------|--------|-----------|--------------|----------------|
| [test_name] | [system] | [N]% | Timing | Quarantine + fix async |
| [test_name] | [system] | [N]% | Float comparison | Fix: use epsilon compare |
| [test_name] | [system] | [N]% | Order dependency | Investigate teardown |

### Clean Tests (no flakiness detected)

[N] tests ran across [N] runs with consistent results; no flakiness detected.

### Data Limitations

[Note if fewer than 5 runs were available; fewer runs = less statistical confidence]
```

---

## 7. Update Regression Suite + Optional Report File

Ask: "May I update the quarantine section of `tests/regression-suite.md`
with the flaky tests found?"

If yes: use `Edit` to append entries to the Quarantined Tests table.
Never remove existing quarantine entries; only add new ones.

Ask (separately): "May I write a full flakiness report to
`production/qa/flakiness-report-[date].md`?"

The full report includes per-test analysis with cause details and
engine-specific fix snippets.

After writing:

- For each quarantined test: "Add the engine-specific skip annotation to
  disable this test in CI. Re-enable after the root cause is fixed."
- For fix-eligible tests: "The fix for [test] is straightforward:
  change the equality comparison on line [N] to use `is_equal_approx`."
- Summary: "Once all quarantine annotations are applied, CI should run green.
  Schedule fix work for the [N] quarantined tests before the release gate."

---

## Collaborative Protocol

- **Never delete test files**: quarantine means annotate + list, not remove
- **Statistical confidence matters**: with < 3 runs, flag findings as
  "suspected" not "confirmed"; ask if more run data is available
- **Fix is always the goal**: quarantine is temporary; surface the fix
  direction even when recommending quarantine
- **Ask before writing**: both the regression-suite update and the report
  file require explicit approval. On write: Verdict: **COMPLETE** (flakiness
  report written). On decline: Verdict: **BLOCKED** (user declined write).
- **Flakiness in CI is a team problem**: surface the list and recommended
  actions clearly; do not silently quarantine without the team knowing