# 🧭 Agents Stability & Friendliness Guide

This guide focuses on how to improve **reliability** and **user experience** of agents in `github_copilot_sdk.py`.

---

## 1) Goals

- Reduce avoidable failures (timeouts, tool-call dead ends, invalid outputs).
- Keep responses predictable under stress (large context, unstable upstream, partial tool failures).
- Make interaction friendly (clear progress, clarification before risky actions, graceful fallback).
- Preserve backwards compatibility while introducing stronger defaults.

---

## 2) Stability model (4 layers)

## Layer A — Input safety

- Validate essential runtime context early (user/chat/model/tool availability).
- Use strict parsing for JSON-like user/task config (never trust raw free text).
- Add guardrails for unsupported mode combinations (e.g., no tools + tool-required task).

**Implementation hints**
- Add preflight validator before `create_session`.
- Return fast-fail structured errors with recovery actions.

## Layer B — Session safety

- Use profile-driven defaults (`model`, `reasoning_effort`, `infinite_sessions` thresholds).
- Auto-fallback to safe profile when unknown profile is requested.
- Isolate each chat in a deterministic workspace path.

**Implementation hints**
- Add `AGENT_PROFILE` + fallback to `default`.
- Keep `infinite_sessions` enabled by default for long tasks.

## Layer C — Tool-call safety

- Add `on_pre_tool_use` to validate and sanitize args.
- Add denylist/allowlist checks for dangerous operations.
- Add timeout budget per tool class (file/network/shell).

**Implementation hints**
- Keep current `on_post_tool_use` behavior.
- Extend hooks gradually: `on_pre_tool_use` first, then `on_error_occurred`.

## Layer D — Recovery safety

- Retry only idempotent operations with capped attempts.
- Distinguish recoverable vs non-recoverable failures.
- Add deterministic fallback path (summary answer + explicit limitation).

**Implementation hints**
- Retry policy table by event type.
- Emit "what succeeded / what failed / what to do next" blocks.

---

## 3) Friendliness model (UX contract)

## A. Clarification first for ambiguity

Use `on_user_input_request` for:
- Missing constraints (scope, target path, output format)
- High-risk actions (delete/migrate/overwrite)
- Contradictory instructions

**Rule**: ask once with concise choices; avoid repeated back-and-forth.

## B. Progress visibility

Emit status in major phases:
1. Context check
2. Planning/analysis
3. Tool execution
4. Verification
5. Final result

**Rule**: no silent waits > 8 seconds.

## C. Friendly failure style

Every failure should include:
- what failed
- why (short)
- what was already done
- next recommended action

## D. Output readability

Standardize final response blocks:
- `Outcome`
- `Changes`
- `Validation`
- `Limitations`
- `Next Step`

---

## 4) High-value features to add (priority)

## P0 (immediate)

1. `on_user_input_request` handler with default answer strategy
2. `on_pre_tool_use` for argument checks + risk gates
3. Structured progress events (phase-based)

## P1 (short-term)

4. Error taxonomy + retry policy (`network`, `provider`, `tool`, `validation`)
5. Profile-based session factory with safe fallback
6. Auto quality gate for final output sections

## P2 (mid-term)

7. Transport flexibility (`cli_url`, `use_stdio`, `port`) for deployment resilience
8. Azure provider path completion
9. Foreground session lifecycle support for advanced multi-session control

---

## 5) Suggested valves for stability/friendliness

- `AGENT_PROFILE`: `default | builder | analyst | reviewer`
- `ENABLE_USER_INPUT_REQUEST`: `bool`
- `DEFAULT_USER_INPUT_ANSWER`: `str`
- `TOOL_CALL_TIMEOUT_SECONDS`: `int`
- `MAX_RETRY_ATTEMPTS`: `int`
- `ENABLE_SAFE_TOOL_GUARD`: `bool`
- `ENABLE_PHASE_STATUS_EVENTS`: `bool`
- `ENABLE_FRIENDLY_FAILURE_TEMPLATE`: `bool`

---

## 6) Failure playbooks (practical)

## Playbook A — Provider timeout

- Retry once if request is idempotent.
- Downgrade reasoning effort if timeout persists.
- Return concise fallback and preserve partial result.

## Playbook B — Tool argument mismatch

- Block execution in `on_pre_tool_use`.
- Ask user one clarification question if recoverable.
- Otherwise skip tool and explain impact.

## Playbook C — Large output overflow

- Save large output to workspace file.
- Return file path + short summary.
- Avoid flooding chat with huge payload.

## Playbook D — Conflicting user instructions

- Surface conflict explicitly.
- Offer 2-3 fixed choices.
- Continue only after user selection.

---

## 7) Metrics to track

- Session success rate
- Tool-call success rate
- Average recovery rate after first failure
- Clarification rate vs hallucination rate
- Mean time to first useful output
- User follow-up dissatisfaction signals (e.g., “not what I asked”)

---

## 8) Minimal rollout plan

1. Add `on_user_input_request` + `on_pre_tool_use` (feature-gated).
2. Add phase status events and friendly failure template.
3. Add retry policy + error taxonomy.
4. Add profile fallback and deployment transport options.
5. Observe metrics for 1-2 weeks, then tighten defaults.

---

## 9) Quick acceptance checklist

- Agent asks clarification only when necessary.
- No long silent period without status updates.
- Failures always include next actionable step.
- Unknown profile/provider config does not crash session.
- Large outputs are safely redirected to file.
- Final response follows a stable structure.