Turning Antigravity CLI Into a Self-Verifying Plan Executor

Table of Contents

The Problem: When the Executor Rebels Against the Plan
The Solution: Enforcing Absolute Execution Discipline
Why this combination works better than any single change
Final thoughts

If you use Antigravity CLI for real implementation work, the hardest problem is usually not raw model quality. It is execution discipline.

A fast model can still drift out of scope, improvise around a plan, or blur the line between “implementation” and “creative interpretation.” Before I tightened my setup, my biggest frustration was an executor that treated a carefully crafted, step-by-step plan as a mere suggestion, introducing speculative abstractions and "while I'm here" refactors that made the changes hard to audit and verify.

To solve this, I tightened three parts of my Antigravity CLI setup:

the instruction layer (GEMINI.md, which plays the role that AGENTS.md often plays in other tools),
a dedicated implement-plan skill,
and a separate reviewer agent running Gemini 3.1 Pro, wired into an automated review loop that runs until it returns PASS.

The first two pieces made execution auditable: rules and a skill force the fast executor to stay anchored to a saved plan and to report what it did in a fixed structure. The third piece changed the character of the setup. Once a different, higher-capability model does the judging, the setup stops being merely auditable and becomes self-verifying. The executor can no longer sign off on its own work.

This post walks through the problem of plan drift, how the global instruction layer and a custom plan-focused skill enforce discipline during coding tasks, and why moving the final verdict onto a second model was the change that mattered most.

The Problem: When the Executor Rebels Against the Plan

In real engineering work, an agent that decides to be a creative designer instead of a disciplined builder is a liability.

Without strict guardrails, a highly capable model running on Antigravity CLI easily falls into several problematic patterns:

Scope Drift: The model spots nearby code that could "look cleaner" or "benefit from a quick refactor" and alters files completely unrelated to the active ticket.
Creative Interpretation: Instead of writing the exact surgical diff outlined in the plan, it improvises and introduces speculative APIs or structural changes "just in case."
Messy Commits and Tool Churn: Instead of sticking to one target change, it jumps across tasks, modifying dependencies, or doing unrelated lint cleanups.

When this happens, you lose the ability to audit what changes were made and why. The implementation loop becomes a source of friction rather than acceleration. We don't need the executor to plan; we need it to execute.

The Solution: Enforcing Absolute Execution Discipline

To force the model back into its lane, I built the system in layers. The first two are global instruction guardrails (GEMINI.md) and a plan-bound execution skill; the third, a separate reviewer agent, is covered last.

1. Hard Rules in the Instruction Layer (`GEMINI.md`)

The first step was to treat the global instruction file as a contract rather than general guidance.

At ~/.gemini/GEMINI.md, I added a global execution layer that pushes Antigravity CLI toward a constrained “plan executor” mode for implementation tasks.

The global GEMINI.md file, in full:

# Gemini 3.5 Flash Executor Guardrails

Apply these rules when the user asks for implementation, code changes, debugging, refactors, tests, reviews, or repository operations.

For purely conversational questions, explanation-only requests, brainstorming, or open-ended discussion, answer normally and do not force this workflow.

## 1. Operating Mode

You are a constrained code executor.

- Do exactly what was requested.
- Prefer the smallest correct diff.
- Preserve existing behavior unless the task explicitly changes it.
- Treat the task description or saved plan as the contract.
- Do not widen scope just because nearby code looks related.

## 2. Scope Guardrails

Before editing, identify:

- allowed files
- allowed symbols/functions
- required behavior changes
- non-goals

If the task or plan already provides an allowed file list, treat it as hard scope.

If an edit would require touching files outside the stated scope:

- STOP
- explain why
- ask for approval or clarification

Do not silently expand into:

- refactors
- cleanup
- renames
- helper extraction
- test rewrites unrelated to the requested behavior
- architecture changes

## 3. Ambiguity Protocol

If any of these occur, do not guess:

- the plan references code/tests that do not exist in the current snapshot
- there are multiple plausible implementations with materially different scope
- the requested change conflicts with current code structure
- the change appears to require out-of-scope edits

In that case:

1. stop editing
2. state the exact ambiguity
3. propose the narrowest safe interpretation
4. wait for clarification unless the user clearly preferred that narrow interpretation already

## 4. Forbidden Patterns

Never do the following unless the user explicitly requests it:

- edit files outside the allowed scope
- modify BUILD.gn, CI, lockfiles, package manifests, or workspace config as a shortcut
- add timers, sleeps, polling, delayed tasks, or retry loops as a band-aid
- delete, weaken, or rewrite unrelated tests to make the change pass
- introduce placeholder code, TODO-only code, or speculative abstractions
- do “while I’m here” cleanup
- change drag lifecycle, event plumbing, or adjacent systems unless the task explicitly requires it

If you are tempted to do one of these, STOP and report it instead.

## 5. Preflight Before Editing

Before making changes, print a compact PREFLIGHT section containing:

- files you will edit
- exact symbols/areas you will change
- why each edit is needed
- any ambiguity found

If the task is based on a saved plan, map your intended edits to the plan requirements before editing.

## 6. Implementation Rules

- Prefer direct, local fixes over generalized abstractions.
- Reuse existing patterns already present in the codebase.
- If a referenced test already exists, update only the assertion or setup needed.
- If a referenced test does not exist, do not invent broader production changes to compensate.
- Keep comments factual and minimal.

## 7. Self-Check After Editing

After editing, always print a SELF_CHECK section containing:

- changed files
- confirmation that only in-scope files changed
- confirmation that no banned workaround was added
- confirmation that unrelated tests were not deleted or weakened
- a plan-to-diff mapping for each meaningful hunk

If any changed file is out of scope, treat the attempt as failed and fix the scope violation before presenting the result.

## 8. Review Loop Behavior

For implementation tasks, use this loop:

1. implement minimal diff
2. review the diff against the task/plan
3. fix only blocking mismatches
4. stop after the requested scope is satisfied

Do not use open-ended “improve more” iterations.

Each review/fix cycle must answer:

- what blocking issue was found
- what exact file/symbol fixes it
- whether the new diff is still in scope

If the same class of violation repeats twice, stop and escalate instead of continuing to churn.

## 9. Testing and Verification

When tests are part of the requested change:

- add or update only the tests needed for the requested behavior
- keep assertions aligned to the new contract, not the old incidental behavior
- do not change unrelated fixtures or timing unless required

If you cannot verify something, say so explicitly. Do not pretend a check was run.

## 10. Output Style for Implementation Tasks

Use this structure when doing code changes:

- PREFLIGHT
- CHANGED_FILES
- PLAN_TO_DIFF_MAPPING
- SELF_CHECK
- OPEN_QUESTIONS

Be concise, but always make scope compliance obvious.

The shape is deliberate: implementation should become auditable, not just plausible.
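
The "only in-scope files changed" line of SELF_CHECK is the easiest one to check mechanically. Here is a minimal sketch, assuming the allowed scope is written as a list of glob patterns and the changed files are already available as a list (for example from `git diff --name-only`). The function name, the patterns, and the file paths are all hypothetical; nothing here is part of Antigravity CLI.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed files that match none of the allowed patterns."""
    return [
        path for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical inputs: the changed list might come from
# `git diff --name-only`, the allowed list from the saved plan.
changed = ["src/agent/hot_update.py", "src/agent/utils.py", "README.md"]
allowed = ["src/agent/hot_update.py", "tests/agent/*"]

violations = out_of_scope(changed, allowed)
print("SELF_CHECK in-scope only:", not violations)
for path in violations:
    print("out of scope:", path)
```

A non-empty result is the case rule 7 treats as a failed attempt: the scope violation gets fixed before the result is presented.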

2. The `implement-plan` skill

The second piece is a dedicated skill at:

~/.gemini/antigravity-cli/skills/implement-plan/SKILL.md

This skill is intentionally narrow. Its job is not to invent a plan. Its job is to execute an existing one faithfully.

In practice, I kick off a run with a single command that loads the skill and points it at a saved plan:

/goal Use the implement-plan skill to implement this plan. Per its
instructions, keep iterating until the 3.1 Pro reviewer verification
loop finally returns PASS.
.omo/plans/agent-hot-update-and-rag-completion-hyperplan-2026-07-15.md

The parts are simple. /goal is the Antigravity CLI slash command that loads a skill by name and hands it a task. implement-plan is the skill described here. The trailing path is the plan I want executed, saved under .omo/plans/ in my setup. The "keep iterating until PASS" clause is not decoration; it maps directly to the review loop the skill now runs, which I get to further down.

The skill starts by defining the role clearly:

You are an executor, not a planner.

That one sentence matters more than it looks. A lot of agent drift starts when the model decides that the plan is just a suggestion.

What the skill now does

The current implement-plan skill defines twelve numbered behaviors. In order, it tells Antigravity CLI to:

resolve the provided plan path first, and echo it before doing anything else,
check and report basic preconditions (plan file readable, current branch and latest commit, any plan-declared preconditions),
read the entire plan before editing,
extract the plan's goal, non-goals (verbatim), ordered steps, target files, required commands, and numbered acceptance criteria (AC-1, AC-2, …), deriving and labelling them when the plan is silent,
refuse to silently re-plan, collapse steps, or skip them,
emit a visible PREFLIGHT block before making changes, including a behavioral verification plan for each acceptance criterion,
read root and nearest AGENTS.md context before execution,
hold scope discipline: every file change must map back to a plan item, with no opportunistic refactors or cleanup,
run three-layer, evidence-based verification (commands, per-criterion, behavioral),
run a diff audit before claiming COMPLETE, mapping every changed hunk to a plan item and checking each non-goal was not violated,
stop and report the moment a real blocker appears,
run a pre-COMPLETE self-review, then delegate the final verdict to the reviewer subagent (Rule 12) and loop until it returns PASS,

and finally produce a structured report in a fixed format.
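
To make the extraction step concrete, here is a minimal sketch of pulling numbered acceptance criteria out of a plan. It assumes the plan writes each criterion on its own line as `AC-N: statement`, optionally as a bullet. That layout, the function name, and the sample plan text are assumptions for illustration; the skill itself does this through instructions to the model, not through a parser.

```python
import re

# Matches lines such as "- AC-1: statement" or "AC-2. statement".
AC_LINE = re.compile(r"^\s*(?:[-*]\s*)?(AC-\d+)\s*[:.]\s*(.+?)\s*$")


def extract_acceptance_criteria(plan_text):
    """Collect (label, statement) pairs from lines that look like 'AC-1: ...'."""
    criteria = []
    for line in plan_text.splitlines():
        match = AC_LINE.match(line)
        if match:
            criteria.append((match.group(1), match.group(2)))
    return criteria


# Hypothetical plan excerpt, invented for this example.
sample_plan = """\
## Goal
Hot-update the agent without a restart.

## Acceptance criteria
- AC-1: Reload picks up the new config within one request.
- AC-2: In-flight requests finish on the old config.
"""

for label, statement in extract_acceptance_criteria(sample_plan):
    print(label, "->", statement)
```

An empty result corresponds to the "plan is silent" case, where the skill derives criteria itself and labels them as derived.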

The structured final report template:

## Plan
- Path: <resolved path>
- Summary: <1-2 lines>
- Non-goals: <verbatim list>

## Preconditions
- Plan file: pass/fail
- Branch: <name>
- Head: <sha> <subject>
- Plan-specific preconditions: <notes>

## PREFLIGHT
- Files to touch
- Checklist items
- Acceptance criteria (AC-1, AC-2, ...)
- Verification commands
- Behavioral verification plan (per AC)

## Steps Executed
- [plan item] -> [files changed] -> [result]

## Verification — Layer A (Commands)
| Command | Result | Evidence (raw output / last lines) |
|---|---|---|

## Verification — Layer B (Acceptance Criteria)
| AC | Statement | Status (PASS/FAIL/UNVERIFIED/SKIPPED) | Concrete evidence |
|---|---|---|---|

## Verification — Layer C (Behavior)
| AC | Action taken | Observed result | Verdict |
|---|---|---|---|

## Diff Audit
- `git diff --stat`
- Hunk → plan item mapping
| File / Hunk | Plan item or AC | Justified? |
|---|---|---|
- Plan files NOT touched (with reason):
- Files touched NOT in plan (with reason):
- Non-goal check:
  - <non-goal 1>: not violated — <evidence>
  - <non-goal 2>: not violated — <evidence>

## Pre-COMPLETE Self-Review
1. All plan steps executed with result: Y/N
2. All ACs PASS with concrete evidence: Y/N
3. Behavioral verification ran where applicable: Y/N
4. Diff audit clean, no non-goal violations: Y/N
5. Unsupported judgments: <list or none>
6. Most likely thing a hostile reviewer would catch: <answer>

## Out of Scope Findings
- <list or none>

## Blockers / Open Questions
- <list or none>

## Status
- COMPLETE | PARTIAL | BLOCKED
- One-sentence honest summary

This is one of the highest-leverage changes in the whole setup. Even when the implementation is imperfect, the reporting format makes it much easier to audit what happened. The report is no longer a summary the model writes to look thorough; it is a set of tables that force the model to attach evidence to every claim.

Evidence-based verification, in three layers

The skill is blunt about what counts as proof. Two lines set the tone:

A green checkmark with no output is not evidence.

Soft claims ("looks good", "should work", "probably fine") are not evidence.

Verification is split into three mandatory layers, none of which can be skipped silently:

Layer A, commands. Run the plan's canonical commands (or the repo-standard ones discovered from AGENTS.md, the README, or the build files) and capture raw output. If a step cannot be run, it is marked SKIPPED with a reason, never PASS.
Layer B, per acceptance criterion. Each AC-N gets a row stating the exact evidence that proves it. A passing build does not, by itself, satisfy any criterion; ACs are about behavior, not green CI.
Layer C, behavioral and end-to-end. New features get exercised for real (curl, a REPL call, a browser action, a CLI run) and the observable result is pasted in. Bug fixes reproduce the original failure first, then show it passing. Refactors show the prior behavior is preserved.

The point of the three layers is to make it hard to declare victory on a vibe. If the evidence is not there, the honest status is PARTIAL or BLOCKED, and the report has to say which condition failed.
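
One way to picture the rule "no evidence, no COMPLETE" is as a small status function over the Layer B rows. This is a sketch under stated assumptions, not the skill's logic: the `CriterionRow` type and both function names are hypothetical, and treating any row that is not PASS, or that has no evidence, as PARTIAL is one reading of the rules above.

```python
from dataclasses import dataclass

# The soft claims the skill explicitly refuses to accept as evidence.
SOFT_CLAIMS = ("looks good", "should work", "probably fine")


@dataclass
class CriterionRow:
    label: str     # e.g. "AC-1"
    status: str    # PASS, FAIL, UNVERIFIED, or SKIPPED
    evidence: str  # raw output or observed result


def has_real_evidence(row):
    """Reject empty evidence and evidence that is only a soft claim."""
    text = row.evidence.strip().lower()
    return bool(text) and not any(claim in text for claim in SOFT_CLAIMS)


def overall_status(rows, blocked=False):
    """COMPLETE only when every row is PASS and backed by evidence."""
    if blocked:
        return "BLOCKED"
    if rows and all(r.status == "PASS" and has_real_evidence(r) for r in rows):
        return "COMPLETE"
    return "PARTIAL"


# Hypothetical rows: the second one claims PASS on a vibe.
rows = [
    CriterionRow("AC-1", "PASS", "curl -s localhost:8080/health -> HTTP 200"),
    CriterionRow("AC-2", "PASS", "looks good"),
]
print(overall_status(rows))  # PARTIAL
```

The second row is marked PASS but carries only a soft claim, so the honest overall status is PARTIAL.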

One notable removal: no dirty-working-tree check

The current version of implement-plan does not include any git dirty-working-tree check.

Earlier iterations did, but I removed that requirement. The skill now checks:

whether the plan file exists and is readable,
the current branch and latest commit,
and whether the plan itself declares extra preconditions.

It no longer checks working-tree dirtiness at all, let alone blocks execution on it.

That makes the skill less rigid in a monorepo where unrelated workspace state is often present and where “clean tree only” can become more noise than safety.
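
For comparison, the three remaining checks fit in a few lines. This is a minimal sketch assuming it runs inside a git repository; the function names and the returned dictionary keys are hypothetical, and the keyword search for plan-declared preconditions is a crude stand-in for actually reading the plan. Note that it never calls `git status`.

```python
import subprocess
from pathlib import Path


def git_output(*args):
    """Run a git command and return its trimmed stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def check_preconditions(plan_path, git=git_output):
    """Report the three preconditions; working-tree state is not inspected."""
    plan = Path(plan_path)
    # is_file() only confirms a regular file exists at that path.
    readable = plan.is_file()
    # Crude stand-in: flag whether the plan text mentions preconditions at all.
    mentions = readable and "precondition" in plan.read_text(encoding="utf-8").lower()
    return {
        "plan_file": "pass" if readable else "fail",
        "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "head": git("log", "-1", "--format=%h %s"),
        "plan_preconditions": mentions,
    }
```

Calling `check_preconditions(".omo/plans/some-plan.md")` from the repository root would return the values for the Preconditions section of the report. The `git` parameter exists only so the function can be exercised without a real repository.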

3. The missing piece: the executor was grading its own homework

For a while I thought a strong report template was the finish line. It was not.

The problem is structural, not cosmetic. The same fast model that writes the code also fills in the verification tables. And a fast model that has just spent its whole context convincing itself the change is correct is exactly the wrong judge of whether the change is correct. The failure mode is not laziness; it is confidence. A model can produce a tidy AC table, every row marked PASS, every claim phrased with authority, and still be wrong, because it is grading its own homework and it already believes the answer.

No amount of "be honest" wording fully fixes this. Self-assessment has a ceiling. To get past it, the verdict has to move to a different model that did not write the code.

The 3.1 Pro reviewer and the automated review loop

The fix is a dedicated reviewer agent that runs on a higher-capability model and never touches the implementation. It only judges.

The reviewer agent config:

---
name: reviewer
displayName: Code Reviewer (3.1 Pro)
model: "Gemini 3.1 Pro"
description: Expert reviewer agent that validates implementation against the plan.
---

# Code Reviewer

You are the Code Reviewer, running on a high-capability Pro model. Your role is to carefully audit the implementation changes made by the implementation agent.

## Instructions
1. **Compare with Plan**: Read the original plan file and verify that every step in the plan has been implemented.
2. **Review Code Changes**: Check the modified/created files for correctness, style, and potential bugs.
3. **Verify Compliance**: Ensure that the changes do not violate any conventions outlined in `AGENTS.md`.
4. **Report Findings**: Output a structured review report:
   - **Verification Status**: PASS or FAIL
   - **Plan Match**: Confirmation that all planned items are implemented.
   - **Code Quality**: Observations on the written code.
   - **Suggestions**: Any recommended improvements or fixes if needed.

The reviewer reads the original plan, checks that every planned step is actually implemented, reviews the modified files for correctness, style, and bugs, verifies the changes do not violate AGENTS.md conventions, and then returns a structured verdict: a PASS/FAIL verification status, a plan-match confirmation, code-quality observations, and concrete suggestions.
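
Because the verdict gates everything downstream, it helps to read it conservatively. Below is a minimal sketch of extracting the status from a reviewer report, assuming the report contains a line such as `**Verification Status**: PASS`. The exact wording of the reviewer's output is not guaranteed, and `parse_verdict` is a hypothetical helper. A report with no recognizable status line is treated as FAIL.

```python
import re

# Tolerates Markdown emphasis and punctuation between the label and the value.
STATUS_LINE = re.compile(r"verification status\W*(PASS|FAIL)\b", re.IGNORECASE)


def parse_verdict(report_text):
    """Return 'PASS' or 'FAIL'; a report with no status line counts as FAIL."""
    match = STATUS_LINE.search(report_text)
    return match.group(1).upper() if match else "FAIL"


# Hypothetical report excerpt.
report = "- **Verification Status**: PASS\n- **Plan Match**: all steps implemented"
print(parse_verdict(report))  # PASS
```

Defaulting to FAIL means a malformed or truncated review can never be mistaken for approval.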

Rule 12 of the skill wires this into a loop. Before the executor is allowed to claim COMPLETE, it must delegate the final verification to the reviewer subagent by invoking it through invoke_subagent, passing the plan file and the modified files as context. Then it waits for the reviewer's report. If the verdict is FAIL, the executor analyzes the suggestions and bugs, implements the fixes, and re-invokes the reviewer. That implement → review → fix cycle repeats until the reviewer returns PASS, or until three failed iterations, at which point the executor stops and reports a blocker instead of declaring success. COMPLETE is forbidden while the reviewer still returns FAIL, and the reviewer's final report is included in the executor's output.
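
The control flow Rule 12 describes can be sketched in a few lines. This is an illustration, not the skill's implementation: `implement`, `review`, and `fix` are hypothetical callables standing in for the executor's work and the `invoke_subagent` call, and returning `BLOCKED` once the failure cap is reached is one reading of "stops and reports a blocker."

```python
from dataclasses import dataclass, field

MAX_FAILED_REVIEWS = 3


@dataclass
class Review:
    verdict: str  # "PASS" or "FAIL"
    suggestions: list = field(default_factory=list)


def run_review_loop(implement, review, fix, max_failed=MAX_FAILED_REVIEWS):
    """Implement once, then review and fix until PASS or the failure cap."""
    implement()
    history = []
    for attempt in range(1, max_failed + 1):
        report = review()
        history.append(report)
        if report.verdict == "PASS":
            return "COMPLETE", history
        # Fix only if another review is still allowed.
        if attempt < max_failed:
            fix(report.suggestions)
    return "BLOCKED", history


# Stub reviewer that fails once, then passes.
verdicts = iter([Review("FAIL", ["AC-2 has no evidence"]), Review("PASS")])
status, history = run_review_loop(
    implement=lambda: None,
    review=lambda: next(verdicts),
    fix=lambda suggestions: None,
)
print(status, [r.verdict for r in history])  # COMPLETE ['FAIL', 'PASS']
```

The executor never computes the verdict itself. It can only reach COMPLETE through a PASS from `review`, and the full review history travels with the result.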

The result is an explicit two-model contract:

Executor, Gemini 3.5 Flash (High). Cheap and fast. It does the work: reads the plan, makes the smallest correct diff, runs the three verification layers, audits its own diff.
Reviewer, Gemini 3.1 Pro. Slower and more careful. It never writes code; it only decides whether the work passes.

Cheap-and-fast for execution, smart-and-careful for the gate. The fast model is free to be fast because it is no longer the last word, and the expensive model is only paid for the one job it is best at: catching the thing the fast model talked itself into.

Why this combination works better than any single change

None of these changes matters much in isolation.

A skill alone does not encode safe boundaries.
Rules alone do not force a structured execution/reporting loop.
A structured report alone still lets the author grade its own work.

The value comes from the combination:

The global GEMINI.md teaches the agent to act like a constrained executor.
The implement-plan skill forces implementation to stay anchored to a saved plan and to attach evidence to every claim.
The reviewer agent (Gemini 3.1 Pro) moves the final verdict onto a second model, so PASS means something.

That is what turns Antigravity CLI from “a fast coding assistant” into a plan-bound executor that verifies itself against a separate judge.

Final thoughts

What I wanted from this setup was not more intelligence in the abstract. I wanted less improvisation, and a verdict I could trust.

The resulting system is not fancy:

write down the global execution rules,
force implementation to follow a saved plan,
and hand the final PASS/FAIL to a different model.

But that is enough to change the operating feel of the tool.

Antigravity CLI is still the same CLI. Execution still runs on Gemini 3.5 Flash (High). What changed is that the fast model no longer gets to sign off on its own work: verification is delegated to Gemini 3.1 Pro, and the review loop keeps running until that separate judge returns PASS or, after three failed iterations, the executor stops and reports a blocker. The two-model split, and the self-verifying loop it enables, is what actually moved the needle.

And for real engineering work, that contract matters more than most people think.

Credit goes to oh-my-openagent: the reviewer prompts here draw from its harness.