Claude-Code-Game-Studios/.claude/skills/test-flakiness/SKILL.md
Donchitos 984023ddac Release v1.0.0 — concept-prototype/vertical-slice split, workflow restructure, polish (#50)
2026-05-13 20:15:08 +10:00


name: test-flakiness
description: "Detect non-deterministic (flaky) tests by reading CI run logs or test result history. Aggregates pass rates per test, identifies intermittent failures, recommends quarantine or fix, and maintains a flaky test registry. Best run during Polish phase or after multiple CI runs."
argument-hint: "[ci-log-path | scan | registry]"
user-invocable: true
allowed-tools: Read, Glob, Grep, Write, Edit, Bash
model: sonnet

Test Flakiness Detection

A flaky test is one that sometimes passes and sometimes fails without any code change. Flaky tests are worse than no tests in some ways — they train the team to ignore red CI runs, masking genuine failures. This skill identifies them, explains likely causes, and recommends whether to quarantine or fix each one.

Output: Updated tests/regression-suite.md quarantine section + optional production/qa/flakiness-report-[date].md

When to run:

  • Polish phase (tests have had many runs; statistical signal is reliable)
  • When developers start dismissing CI failures as "probably flaky"
  • After /regression-suite identifies quarantined tests that need diagnosis

1. Parse Arguments

Modes:

  • /test-flakiness [ci-log-path] — analyse a specific CI run log file
  • /test-flakiness scan — scan all available CI logs in .github/ or standard log output directories
  • /test-flakiness registry — read existing regression-suite.md quarantine section and provide remediation guidance for already-known flaky tests
  • No argument — auto-detect: run scan if CI logs are accessible, else registry
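The mode selection above can be sketched as a small dispatch function. This is only an illustration: `resolve_mode` and the `"log"` mode label are hypothetical names, and `logs_available` stands in for whatever check decides that CI logs are accessible.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_mode(argument: Optional[str], logs_available: bool) -> Tuple[str, Optional[Path]]:
    """Map the skill argument to (mode, log_path). Hypothetical helper."""
    arg = (argument or "").strip()
    if arg == "scan":
        return ("scan", None)
    if arg == "registry":
        return ("registry", None)
    if arg:
        # Anything else is treated as a path to a single CI log file.
        return ("log", Path(arg))
    # No argument: scan if CI logs are accessible, else fall back to registry.
    return ("scan" if logs_available else "registry", None)
```

Treating every unrecognised argument as a path keeps the dispatch simple. A stricter version could check that the file exists before choosing the `"log"` mode.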

2. Locate CI Log Data

Option A — GitHub Actions (preferred)

Check for test result artifacts:

ls -t .github/ 2>/dev/null
ls -t test-results/ 2>/dev/null

For Godot projects: GdUnit4 outputs XML results compatible with JUnit format. Check test-results/ for .xml files.

For Unity projects: game-ci test runner outputs NUnit XML to test-results/ by default.

For Unreal projects: automation logs go to Saved/Logs/. Grep for Result: Success and Result: Fail patterns.
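As a rough sketch of the XML lookup for the Godot and Unity cases, the helper below lists result files newest first, mirroring `ls -t`. The function name `find_result_files` is hypothetical, and it assumes results live under `test-results/` relative to the project root.

```python
from pathlib import Path
from typing import List


def find_result_files(project_root: str) -> List[Path]:
    """Return .xml files under <project_root>/test-results, newest first."""
    results_dir = Path(project_root) / "test-results"
    if not results_dir.is_dir():
        return []  # nothing to scan; the caller falls through to Option C
    files = [p for p in results_dir.rglob("*.xml") if p.is_file()]
    # Sort by modification time, most recent run first (like `ls -t`).
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

An empty list is the signal to stop and ask the user how to proceed, as described under Option C.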

Option B — Local log files

If a path argument is provided, read that file directly.

Option C — No log data available

If no logs found:

"No CI log data found. To detect flaky tests, this skill needs test result history from multiple runs. Options:

  1. Run the test suite at least 3 times and collect the output logs
  2. Check CI pipeline output and save a log to test-results/
  3. Run /test-flakiness registry to review tests already flagged as flaky in tests/regression-suite.md"

Stop and ask the user which option to pursue.


3. Parse Test Results

For each CI log or result file found, parse:

JUnit XML format (GdUnit4 / Unity):

  • Grep for <testcase name= to get test names
  • Grep for <failure or <error to identify failures
  • Parse classname and name attributes for full test identifiers

Plain text logs:

  • Grep for pass/fail patterns:
    • Godot: PASSED / FAILED adjacent to test names
    • Unreal: Result: Success / Result: Fail
    • Unity: Test passed / Test failed

Build a table: test_id → [run1_result, run2_result, run3_result, ...]
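A minimal sketch of building that table from JUnit-style XML, using Python's standard `xml.etree.ElementTree`. The helper names `parse_junit` and `build_history` are hypothetical. The sketch assumes each run is one XML document with `<testcase>` elements, and it ignores skipped tests because they carry no pass/fail signal.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_junit(xml_text: str) -> Dict[str, str]:
    """Return {test_id: "PASS" | "FAIL"} for one JUnit-style result file."""
    results: Dict[str, str] = {}
    for case in ET.fromstring(xml_text).iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped tests say nothing about flakiness
        test_id = f"{case.get('classname', '')}.{case.get('name', '')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        results[test_id] = "FAIL" if failed else "PASS"
    return results


def build_history(runs: Iterable[str]) -> Dict[str, List[str]]:
    """Combine several runs into test_id -> [run1_result, run2_result, ...]."""
    history: Dict[str, List[str]] = defaultdict(list)
    for xml_text in runs:
        for test_id, outcome in parse_junit(xml_text).items():
            history[test_id].append(outcome)
    return dict(history)
```

Runs should be passed in chronological order so the result lists read left to right as run 1, run 2, and so on.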


4. Identify Flaky Tests

A test is flaky if it appears in the result history with both PASS and FAIL outcomes across runs with no code changes between them.

Flakiness thresholds:

  • High flakiness: Fails in >25% of runs; quarantine immediately
  • Moderate flakiness: Fails in 5–25% of runs; investigate and fix soon
  • Low/suspected flakiness: Fails in 1–5% of runs; monitor, since this may be a genuinely rare failure
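The banding can be sketched as a function over one test's result history. The boundary handling here is an assumption, not something the skill specifies: a fail rate of exactly 25% counts as moderate, exactly 5% counts as moderate, and anything below 5% that still mixes passes and failures counts as low. The name `flakiness_level` is hypothetical.

```python
from typing import List


def flakiness_level(results: List[str]) -> str:
    """Classify one test's history of "PASS"/"FAIL" outcomes."""
    # A test that always passes or always fails is consistent, not flaky.
    if "PASS" not in results or "FAIL" not in results:
        return "not flaky"
    rate = results.count("FAIL") / len(results)
    if rate > 0.25:
        return "high"      # quarantine immediately
    if rate >= 0.05:
        return "moderate"  # investigate and fix soon
    return "low"           # monitor; may be a genuinely rare failure
```

Note that a test failing in every run is reported as "not flaky": it is a consistent failure and belongs in normal bug triage instead of the quarantine list.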

For each flaky test, classify the likely cause:

Cause classification

| Cause | Symptoms | Fix direction |
|-------|----------|---------------|
| Timing / async | Fails after awaiting signals or timers; pass rate correlates with system load | Add explicit await/synchronisation; avoid time-based delays |
| Order dependency | Fails when run after specific other tests; passes in isolation | Add proper setup/teardown; ensure test isolation |
| Random seed | Fails intermittently with no pattern; involves RNG | Pass explicit seed; don't use randf() in tests |
| Resource leak | Fails more often later in a test run | Fix cleanup in teardown; check orphan nodes (Godot) or object disposal (Unity) |
| External state | Fails when a file, scene, or global exists from a prior test | Isolate test from file system; use in-memory mocks |
| Floating point | Fails on comparisons like == 0.5 | Use epsilon comparison (is_equal_approx, Assert.AreApproximatelyEqual) |
| Scene/prefab load race | Fails when scenes are not yet ready | Await one frame after instantiation; use await get_tree().process_frame |

Use Grep to check the test file for timing calls, randf, global state access, or equality comparisons on floats to narrow down the cause.
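The same narrowing step can be approximated with a few regular expressions over the test source. The patterns below are illustrative assumptions, not an exhaustive or authoritative list, and `suspect_causes` is a hypothetical helper. A match only suggests where to look; it does not prove the cause.

```python
import re
from typing import List

# Illustrative patterns only; extend per engine and project conventions.
CAUSE_PATTERNS = {
    "Timing / async": re.compile(r"\bawait\b|create_timer|WaitForSeconds"),
    "Random seed": re.compile(r"\brand[fi]?(_range)?\s*\(|\brandomize\s*\(|Random\.Range"),
    "Floating point": re.compile(r"==\s*-?\d+\.\d+"),
    "External state": re.compile(r"FileAccess|user://|PlayerPrefs"),
}


def suspect_causes(source: str) -> List[str]:
    """Return the cause labels whose pattern appears in the test source."""
    return [cause for cause, pattern in CAUSE_PATTERNS.items() if pattern.search(source)]
```

Order dependency and resource leaks rarely show up in a single file's text, so they are left out here; those are better diagnosed from run order and position in the test run.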


5. Recommend Action

For each flaky test:

Quarantine (High flakiness):

"Quarantine this test immediately. Disable it in CI by adding @pytest.mark.skip / [Ignore] / GdUnitSkip annotation. Log it in tests/regression-suite.md quarantine section. The test is now opt-in only. Fix the root cause before removing quarantine."

Investigate and fix soon (Moderate):

"This test is intermittently unreliable. Root cause appears to be [cause]. Suggested fix: [specific fix based on cause classification]. Do not quarantine yet — fix the test directly."

Monitor (Low/suspected):

"This test shows suspected flakiness. Collect more run data before quarantining. Note it as 'suspected' in the regression suite."


6. Generate Reports

In-conversation summary

## Flakiness Detection Results

**Runs analysed**: [N]
**Tests tracked**: [N]

### Flaky Tests Found

| Test | System | Fail Rate | Likely Cause | Recommendation |
|------|--------|-----------|--------------|----------------|
| [test_name] | [system] | [N]% | Timing | Quarantine + fix async |
| [test_name] | [system] | [N]% | Float comparison | Fix: use epsilon compare |
| [test_name] | [system] | [N]% | Order dependency | Investigate teardown |

### Clean Tests (no flakiness detected)

[N] tests ran across [N] runs with consistent results — no flakiness detected.

### Data Limitations

[Note if fewer than 5 runs were available — fewer runs = less statistical confidence]
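Filling the "Flaky Tests Found" table from collected findings might look like the sketch below. The dictionary keys (`test`, `system`, `fail_rate`, `cause`, `action`) and the function name `render_flaky_table` are hypothetical. Rows are sorted by fail rate so the worst offenders appear first, which is a presentation choice of this sketch.

```python
from typing import Dict, List


def render_flaky_table(findings: List[Dict]) -> str:
    """Render findings as the markdown table used in the summary."""
    lines = [
        "| Test | System | Fail Rate | Likely Cause | Recommendation |",
        "|------|--------|-----------|--------------|----------------|",
    ]
    for row in sorted(findings, key=lambda r: r["fail_rate"], reverse=True):
        lines.append(
            f"| {row['test']} | {row['system']} | {row['fail_rate']:.0%} "
            f"| {row['cause']} | {row['action']} |"
        )
    return "\n".join(lines)
```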

7. Update Regression Suite + Optional Report File

Ask: "May I update the quarantine section of tests/regression-suite.md with the flaky tests found?"

If yes: use Edit to append entries to the Quarantined Tests table. Never remove existing quarantine entries — only add new ones.
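The append-only rule can be sketched as follows. The two-column row layout (`| test | cause |`) is an assumption for illustration, since the actual columns of the Quarantined Tests table are defined in tests/regression-suite.md, and `append_quarantine_rows` is a hypothetical helper.

```python
from typing import List, Tuple


def append_quarantine_rows(section_lines: List[str], new_rows: List[Tuple[str, str]]) -> List[str]:
    """Return the table lines with new rows appended; existing rows are never touched."""
    # Collect the first cell of every data row, skipping separator lines like |---|---|.
    existing_ids = {
        line.split("|")[1].strip()
        for line in section_lines
        if line.startswith("|") and not set(line) <= set("|-: ")
    }
    out = list(section_lines)  # copy, so nothing already present is removed or reordered
    for test_id, cause in new_rows:
        if test_id not in existing_ids:
            out.append(f"| {test_id} | {cause} |")
            existing_ids.add(test_id)
    return out
```

A test that is already quarantined is left as it was, even if the new analysis suggests a different cause. Changing an existing entry is a separate decision for the user.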

Ask (separately): "May I write a full flakiness report to production/qa/flakiness-report-[date].md?"

The full report includes per-test analysis with cause details and engine-specific fix snippets.

After writing:

  • For each quarantined test: "Add the engine-specific skip annotation to disable this test in CI. Re-enable after the root cause is fixed."
  • For fix-eligible tests: "The fix for [test] is straightforward — change the equality comparison on line [N] to use is_equal_approx."
  • Summary: "Once all quarantine annotations are applied, CI should run green. Schedule fix work for the [N] quarantined tests before the release gate."

Collaborative Protocol

  • Never delete test files — quarantine means annotate + list, not remove
  • Statistical confidence matters — with < 3 runs, flag findings as "suspected" not "confirmed"; ask if more run data is available
  • Fix is always the goal — quarantine is temporary; surface the fix direction even when recommending quarantine
  • Ask before writing — both the regression-suite update and the report file require explicit approval. On write: Verdict: COMPLETE — flakiness report written. On decline: Verdict: BLOCKED — user declined write.
  • Flakiness in CI is a team problem — surface the list and recommended actions clearly; do not just silently quarantine without the team knowing