fix: Optimize Stale Agent with GraphQL and Search API to resolve 429 Quota errors

Merge https://github.com/google/adk-python/pull/3700

### Description
This PR refactors the `adk_stale_agent` to address `429 RESOURCE_EXHAUSTED` errors encountered during workflow execution. The previous implementation fetched issue history inefficiently (paginating over the REST API) and lacked server-side filtering, causing excessive API calls and token consumption that breached Gemini API quotas.

The new implementation switches to a **GraphQL-first approach**, implements server-side filtering via the Search API, adds robust concurrency controls, and significantly improves code maintainability through modular refactoring.

### Root Cause of Failure
The previous workflow failed with the following error because it passed too much context to the LLM and processed too many irrelevant issues:
```text
google.genai.errors.ClientError: 429 RESOURCE_EXHAUSTED.
Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_paid_tier_input_token_count
```
### Key Changes

#### 1. Optimization: REST → GraphQL (`agent.py`)
*   **Old:** Fetched issue comments and timeline events using multiple paginated REST API calls (`/timeline`).
*   **New:** Implemented `get_issue_state` using a single **GraphQL** query. This fetches comments, `userContentEdits`, and specific timeline events (Labels, Renames) in one network request.
*   **Refactoring:** The complex analysis logic has been decomposed into focused helper functions (`_fetch_graphql_data`, `_build_history_timeline`, `_replay_history_to_find_state`) for better readability and testing.
*   **Configurable:** Added `GRAPHQL_COMMENT_LIMIT` and `GRAPHQL_TIMELINE_LIMIT` settings to tune context depth.
*   **Impact:** Drastically reduces the data payload size and eliminates multiple API round-trips, significantly lowering the token count sent to the LLM.
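
For illustration, a minimal sketch of the single-query shape (field selections and limits here are indicative only; the actual query lives in `agent.py`, whose diff is suppressed below):

```python
import os

import requests

# Illustrative limits; the PR drives these from GRAPHQL_COMMENT_LIMIT,
# GRAPHQL_EDIT_LIMIT, and GRAPHQL_TIMELINE_LIMIT in settings.py.
ISSUE_STATE_QUERY = """
query($owner: String!, $repo: String!, $number: Int!) {
  repository(owner: $owner, name: $repo) {
    issue(number: $number) {
      comments(last: 30) { nodes { author { login } body createdAt } }
      userContentEdits(last: 10) { nodes { editedAt editor { login } } }
      timelineItems(last: 20, itemTypes: [LABELED_EVENT, RENAMED_TITLE_EVENT]) {
        nodes {
          __typename
          ... on LabeledEvent { label { name } createdAt }
          ... on RenamedTitleEvent { createdAt }
        }
      }
    }
  }
}
"""

def fetch_issue_state(owner: str, repo: str, number: int) -> dict:
    """Fetches comments, body edits, and timeline events in one round-trip."""
    response = requests.post(
        "https://api.github.com/graphql",
        json={
            "query": ISSUE_STATE_QUERY,
            "variables": {"owner": owner, "repo": repo, "number": number},
        },
        headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]["repository"]["issue"]
```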

#### 2. Optimization: Server-Side Filtering (`utils.py`)
*   **Old:** Fetched *all* open issues via REST and filtered them in Python memory.
*   **New:** Uses the GitHub Search API (`get_old_open_issue_numbers`) with `created:<DATE` syntax.
*   **Impact:** Only fetches issue numbers that actually meet the age threshold, preventing the agent from wasting cycles and tokens on brand-new issues.
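
The shape of that call, condensed from the `utils.py` diff further down (pagination omitted for brevity):

```python
from datetime import datetime, timedelta, timezone

import requests

def old_open_issue_numbers(owner: str, repo: str, days_old: float, token: str) -> list[int]:
    """Server-side filter: only issues created before the cutoff are returned."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_old)
    query = f"repo:{owner}/{repo} is:issue state:open created:<{cutoff:%Y-%m-%dT%H:%M:%SZ}"
    response = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 100},
        headers={"Authorization": f"token {token}"},
        timeout=60,
    )
    response.raise_for_status()
    # The Search API also returns pull requests; keep only true issues.
    return [item["number"] for item in response.json()["items"] if "pull_request" not in item]
```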

#### 3. Concurrency & Rate Limiting (`main.py` & `settings.py`)
*   **Old:** Sequential execution loop.
*   **New:** Implemented `asyncio.gather` with a configurable `CONCURRENCY_LIMIT` (set to 3).
*   **New:** Added `urllib3` retry strategies (exponential backoff) in `utils.py` to handle GitHub API rate limits (HTTP 429) gracefully.
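
Condensed from the `main.py` diff below, the chunked-gather pattern looks roughly like this:

```python
import asyncio

CONCURRENCY_LIMIT = 3        # issues processed in parallel per chunk
SLEEP_BETWEEN_CHUNKS = 1.5   # seconds to pause between chunks

async def process_single_issue(issue_number: int) -> None:
    ...  # run the agent against one issue (see the main.py diff below)

async def run_all(issue_numbers: list[int]) -> None:
    # At most CONCURRENCY_LIMIT issues hit the GitHub/Gemini APIs at once,
    # with a short sleep between chunks to smooth out the request rate.
    for i in range(0, len(issue_numbers), CONCURRENCY_LIMIT):
        chunk = issue_numbers[i : i + CONCURRENCY_LIMIT]
        await asyncio.gather(*(process_single_issue(n) for n in chunk))
        if i + CONCURRENCY_LIMIT < len(issue_numbers):
            await asyncio.sleep(SLEEP_BETWEEN_CHUNKS)
```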

#### 4. Logic Improvements ("Ghost Edits")
*   **New Feature:** The agent now detects "Ghost Edits" (where an author updates the issue description without posting a new comment).
*   **Action:** If a silent edit is detected on a stale candidate, the agent now alerts maintainers instead of marking it stale, preventing false positives.
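
A hypothetical sketch of the detection rule (the real implementation is in `agent.py`, whose diff is suppressed below): an edit counts as "silent" when the author updated the body but never commented afterwards.

```python
from datetime import datetime

def needs_edit_alert(
    body_edited_at: datetime | None,
    last_author_comment_at: datetime | None,
    already_alerted: bool,
) -> bool:
    """Hypothetical helper: True when the author's latest description edit
    is newer than their latest comment and no alert has been posted yet."""
    if body_edited_at is None or already_alerted:
        return False
    if last_author_comment_at is None:
        return True  # an edit with no comment at all is silent by definition
    return body_edited_at > last_author_comment_at
```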

### File Comparison Summary

| File | Change |
| :--- | :--- |
| `main.py` | Switched from `InMemoryRunner` loop to `asyncio` chunked processing. Added execution timing and API usage logging. |
| `agent.py` | Replaced REST logic with GraphQL query. Added logic to handle silent body edits. Decomposed the giant `get_issue_state` into helper functions with docstrings. Added `_format_days` helper. |
| `utils.py` | Added `HTTPAdapter` with Retries. Added `get_old_open_issue_numbers` using Search API. |
| `settings.py` | Removed `ISSUES_PER_RUN`; added configuration for `CONCURRENCY_LIMIT`, `SLEEP_BETWEEN_CHUNKS`, and GraphQL limits. |
| `PROMPT_INSTRUCTIONS.txt` | Simplified decision tree; removed date calculation responsibility from LLM. |

### Verification
The new logic minimizes token usage by offloading date calculations to Python and strictly limiting the context passed to the LLM to semantic intent analysis (e.g., "Is this a question?").

*   **Metric Check:** The workflow now tracks API calls per issue to ensure we stay within limits.
*   **Safety:** Silent edits by users now correctly reset the "Stale" timer.
*   **Maintainability:** All complex logic is now isolated in typed helper functions with comprehensive docstrings.
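
As a minimal sketch of the date-offload idea (helper name illustrative; assumes GitHub's timezone-aware ISO-8601 timestamps), all the prompt ever sees is the resulting number:

```python
from datetime import datetime, timezone

import dateutil.parser

def days_since(iso_timestamp: str) -> float:
    """Python, not the LLM, does the date arithmetic; the prompt receives
    only the computed value (e.g. days_since_activity)."""
    then = dateutil.parser.isoparse(iso_timestamp)
    return (datetime.now(timezone.utc) - then).total_seconds() / 86400.0
```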

Co-authored-by: Xuan Yang <xygoogle@google.com>
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/adk-python/pull/3700 from ryanaiagent:feat/improve-stale-agent 888064eff125ae74f7c3a9ad6c74f98de80243a2
PiperOrigin-RevId: 838885530
Commit: cb19d0714c
Parent: 2a1a41d3ec
Author: Rohit Yanamadala
Date: 2025-12-01 12:25:22 -08:00
Committed by: Copybara-Service
7 changed files with 944 additions and 407 deletions
.github/workflows/stale-issue-auditor.yml (+6 -20)
@@ -1,57 +1,43 @@
# .github/workflows/stale-issue-auditor.yml
# Best Practice: Always have a 'name' field at the top.
name: ADK Stale Issue Auditor
# The 'on' block defines the triggers.
on:
# The 'workflow_dispatch' trigger allows manual runs.
workflow_dispatch:
# The 'schedule' trigger runs the bot on a timer.
schedule:
# This runs at 6:00 AM UTC (e.g., 10 PM PST).
# This runs at 6:00 AM UTC (10 PM PST)
- cron: '0 6 * * *'
# The 'jobs' block contains the work to be done.
jobs:
# A unique ID for the job.
audit-stale-issues:
# The runner environment.
runs-on: ubuntu-latest
timeout-minutes: 60
# Permissions for the job's temporary GITHUB_TOKEN.
# These are standard and syntactically correct.
permissions:
issues: write
contents: read
# The sequence of steps for the job.
steps:
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@v5
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: '3.11'
- name: Install dependencies
# The '|' character allows for multi-line shell commands.
run: |
python -m pip install --upgrade pip
pip install requests google-adk
- name: Run Auditor Agent Script
# The 'env' block for setting environment variables.
env:
GITHUB_TOKEN: ${{ secrets.ADK_TRIAGE_AGENT }}
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
OWNER: google
OWNER: ${{ github.repository_owner }}
REPO: adk-python
ISSUES_PER_RUN: 100
CONCURRENCY_LIMIT: 3
LLM_MODEL_NAME: "gemini-2.5-flash"
PYTHONPATH: contributing/samples
# The final 'run' command.
run: python -m adk_stale_agent.main
PROMPT_INSTRUCTIONS.txt
@@ -1,40 +1,68 @@
You are a highly intelligent and transparent repository auditor for '{OWNER}/{REPO}'.
Your job is to analyze all open issues and report on your findings before taking any action.
You are a highly intelligent repository auditor for '{OWNER}/{REPO}'.
Your job is to analyze a specific issue and report findings before taking action.
**Primary Directive:** Ignore any events from users ending in `[bot]`.
**Reporting Directive:** For EVERY issue you analyze, you MUST output a concise, human-readable summary, starting with "Analysis for Issue #[number]:".
**Reporting Directive:** Output a concise summary starting with "Analysis for Issue #[number]:".
**THRESHOLDS:**
- Stale Threshold: {stale_threshold_days} days.
- Close Threshold: {close_threshold_days} days.
**WORKFLOW:**
1. **Context Gathering**: Call `get_repository_maintainers` and `get_all_open_issues`.
2. **Per-Issue Analysis**: For each issue, call `get_issue_state`, passing in the maintainers list.
3. **Decision & Reporting**: Based on the summary from `get_issue_state`, follow this strict decision tree in order.
1. **Context Gathering**: Call `get_issue_state`.
2. **Decision**: Follow this strict decision tree using the data returned by the tool.
--- **DECISION TREE & REPORTING TEMPLATES** ---
--- **DECISION TREE** ---
**STEP 1: CHECK FOR ACTIVITY (IS THE ISSUE ACTIVE?)**
- **Condition**: Was the last human action NOT from a maintainer? (i.e., `last_human_commenter_is_maintainer` is `False`).
- **Action**: The author or a third party has acted. The issue is ACTIVE.
- **Report and Action**: If '{STALE_LABEL_NAME}' is present, report: "Analysis for Issue #[number]: Issue is ACTIVE. The last action was a [action type] by a non-maintainer. To get the [action type], you MUST use the value from the 'last_human_action_type' field in the summary you received from the tool." Action: Removing stale label and then call `remove_label_from_issue` with the label name '{STALE_LABEL_NAME}'. Otherwise, report: "Analysis for Issue #[number]: Issue is ACTIVE. No stale label to remove. Action: None."
- **If this condition is met, stop processing this issue.**
**STEP 1: CHECK IF ALREADY STALE**
- **Condition**: Is `is_stale` (from tool) **True**?
- **Action**:
- **Check Role**: Look at `last_action_role`.
**STEP 2: IF PENDING, MANAGE THE STALE LIFECYCLE.**
- **Condition**: The last human action WAS from a maintainer (`last_human_commenter_is_maintainer` is `True`). The issue is PENDING.
- **Action**: You must now determine the correct state.
- **IF 'author' OR 'other_user'**:
- **Context**: The user has responded. The issue is now ACTIVE.
- **Action 1**: Call `remove_label_from_issue` with '{STALE_LABEL_NAME}'.
- **Action 2 (ALERT CHECK)**: Look at `maintainer_alert_needed`.
- **IF True**: User edited description silently.
-> **Action**: Call `alert_maintainer_of_edit`.
- **IF False**: User commented normally. No alert needed.
- **Report**: "Analysis for Issue #[number]: ACTIVE. User activity detected. Removed stale label."
- **First, check if the issue is already STALE.**
- **Condition**: Is the `'{STALE_LABEL_NAME}'` label present in `current_labels`?
- **Action**: The issue is STALE. Your only job is to check if it should be closed.
- **Get Time Difference**: Call `calculate_time_difference` with the `stale_label_applied_at` timestamp.
- **Decision & Report**: If `hours_passed` > **{CLOSE_HOURS_AFTER_STALE_THRESHOLD}**: Report "Analysis for Issue #[number]: STALE. Close threshold met ({CLOSE_HOURS_AFTER_STALE_THRESHOLD} hours) with no author activity." Action: Closing issue and then call `close_as_stale`. Otherwise, report "Analysis for Issue #[number]: STALE. Close threshold not yet met. Action: None."
- **IF 'maintainer'**:
- **Check Time**: Check `days_since_stale_label`.
- **If `days_since_stale_label` > {close_threshold_days}**:
- **Action**: Call `close_as_stale`.
- **Report**: "Analysis for Issue #[number]: STALE. Close threshold met. Closing."
- **Else**:
- **Report**: "Analysis for Issue #[number]: STALE. Waiting for close threshold. No action."
- **ELSE (the issue is PENDING but not yet stale):**
- **Analyze Intent**: Semantically analyze the `last_maintainer_comment_text`. Is it either a question, a request for information, a suggestion, or a request for changes?
- **If YES (it is either a question, a request for information, a suggestion, or a request for changes)**:
- **CRITICAL CHECK**: Now, you must verify the author has not already responded. Compare the `last_author_event_time` and the `last_maintainer_comment_time`.
- **IF the author has NOT responded** (i.e., `last_author_event_time` is older than `last_maintainer_comment_time` or is null):
- **Get Time Difference**: Call `calculate_time_difference` with the `last_maintainer_comment_time`.
- **Decision & Report**: If `hours_passed` > **{STALE_HOURS_THRESHOLD}**: Report "Analysis for Issue #[number]: PENDING. Stale threshold met ({STALE_HOURS_THRESHOLD} hours)." Action: Marking as stale and then call `add_stale_label_and_comment` and if label name '{REQUEST_CLARIFICATION_LABEL}' is missing then call `add_label_to_issue` with the label name '{REQUEST_CLARIFICATION_LABEL}'. Otherwise, report: "Analysis for Issue #[number]: PENDING. Stale threshold not met. Action: None."
- **ELSE (the author HAS responded)**:
- **Report**: "Analysis for Issue #[number]: PENDING, but author has already responded to the last maintainer request. Action: None."
- **If NO (it is not a request):**
- **Report**: "Analysis for Issue #[number]: PENDING. Maintainer's last comment was not a request. Action: None."
**STEP 2: CHECK IF ACTIVE (NOT STALE)**
- **Condition**: `is_stale` is **False**.
- **Action**:
- **Check Role**: If `last_action_role` is 'author' or 'other_user':
- **Context**: The issue is Active.
- **Action (ALERT CHECK)**: Look at `maintainer_alert_needed`.
- **IF True**: The user edited the description silently, and we haven't alerted yet.
-> **Action**: Call `alert_maintainer_of_edit`.
-> **Report**: "Analysis for Issue #[number]: ACTIVE. Silent update detected (Description Edit). Alerted maintainer."
- **IF False**:
-> **Report**: "Analysis for Issue #[number]: ACTIVE. Last action was by user. No action."
- **Check Role**: If `last_action_role` is 'maintainer':
- **Proceed to STEP 3.**
**STEP 3: ANALYZE MAINTAINER INTENT**
- **Context**: The last person to act was a Maintainer.
- **Action**: Read the text in `last_comment_text`.
- **Question Check**: Does the text ask a question, request clarification, ask for logs, or suggest trying a fix?
- **Time Check**: Is `days_since_activity` > {stale_threshold_days}?
- **DECISION**:
- **IF (Question == YES) AND (Time == YES)**:
- **Action**: Call `add_stale_label_and_comment`.
- **Check**: If '{REQUEST_CLARIFICATION_LABEL}' is not in `current_labels`, call `add_label_to_issue` for it.
- **Report**: "Analysis for Issue #[number]: STALE. Maintainer asked question [days_since_activity] days ago. Marking stale."
- **IF (Question == YES) BUT (Time == NO)**:
- **Report**: "Analysis for Issue #[number]: PENDING. Maintainer asked question, but threshold not met yet. No action."
- **IF (Question == NO)** (e.g., "I am working on this"):
- **Report**: "Analysis for Issue #[number]: ACTIVE. Maintainer gave status update (not a question). No action."
README.md (+55 -31)
@@ -1,65 +1,89 @@
# ADK Stale Issue Auditor Agent
This directory contains an autonomous agent designed to audit a GitHub repository for stale issues, helping to maintain repository hygiene and ensure that all open items are actionable.
This directory contains an autonomous, **GraphQL-powered** agent designed to audit a GitHub repository for stale issues. It maintains repository hygiene by ensuring all open items are actionable and responsive.
The agent operates as a "Repository Auditor," proactively scanning all open issues rather than waiting for a specific trigger. It uses a combination of deterministic Python tools and the semantic understanding of a Large Language Model (LLM) to make intelligent decisions about the state of a conversation.
Unlike traditional "Stale Bots" that only look at timestamps, this agent uses a **Unified History Trace** and an **LLM (Large Language Model)** to understand the *context* of a conversation. It distinguishes between a maintainer asking a question (stale candidate) vs. a maintainer providing a status update (active).
---
## Core Logic & Features
The agent's primary goal is to identify issues where a maintainer has requested information from the author, and to manage the lifecycle of that issue based on the author's response (or lack thereof).
The agent operates as a "Repository Auditor," proactively scanning open issues using a high-efficiency decision tree.
**The agent follows a precise decision tree:**
### 1. Smart State Verification (GraphQL)
Instead of making multiple expensive API calls, the agent uses a single **GraphQL** query per issue to reconstruct the entire history of the conversation. It combines:
* **Comments**
* **Description/Body Edits** ("Ghost Edits")
* **Title Renames**
* **State Changes** (Reopens)
1. **Audits All Open Issues:** On each run, the agent fetches a batch of the oldest open issues in the repository.
2. **Identifies Pending Issues:** It analyzes the full timeline of each issue to see if the last human action was a comment from a repository maintainer.
3. **Semantic Intent Analysis:** If the last comment was from a maintainer, the agent uses the LLM to determine if the comment was a **question or a request for clarification**.
4. **Marks as Stale:** If the maintainer's question has gone unanswered by the author for a configurable period (e.g., 7 days), the agent will:
* Apply a `stale` label to the issue.
* Post a comment notifying the author that the issue is now considered stale and will be closed if no further action is taken.
* Proactively add a `request clarification` label if it's missing, to make the issue's state clear.
5. **Handles Activity:** If any non-maintainer (the author or a third party) comments on an issue, the agent will automatically remove the `stale` label, marking the issue as active again.
6. **Closes Stale Issues:** If an issue remains in the `stale` state for another configurable period (e.g., 7 days) with no new activity, the agent will post a final comment and close the issue.
It sorts these events chronologically to determine the **Last Active Actor**.
### Self-Configuration
### 2. The "Last Actor" Rule
The agent follows a precise logic flow based on who acted last:
A key feature of this agent is its ability to self-configure. It does not require a hard-coded list of maintainer usernames. On each run, it uses the GitHub API to dynamically fetch the list of users with write access to the repository, ensuring its logic is always based on the current team.
* **If Author/User acted last:** The issue is **ACTIVE**.
* This includes comments, title changes, and *silent* description edits.
* **Action:** The agent immediately removes the `stale` label.
* **Silent Update Alert:** If the user edited the description but *did not* comment, the agent posts a specific alert: *"Notification: The author has updated the issue description..."* to ensure maintainers are notified (since GitHub does not trigger notifications for body edits).
* **Spam Prevention:** The agent checks if it has already alerted about a specific silent edit to avoid spamming the thread.
* **If Maintainer acted last:** The issue is **POTENTIALLY STALE**.
* The agent passes the text of the maintainer's last comment to the LLM.
### 3. Semantic Intent Analysis (LLM)
If the maintainer was the last person to speak, the LLM analyzes the comment text to determine intent:
* **Question/Request:** "Can you provide logs?" / "Please try v2.0."
* **Verdict:** **STALE** (Waiting on Author).
* **Action:** If the time threshold is met, the agent adds the `stale` label. It also checks for the `request clarification` label and adds it if missing.
* **Status Update:** "We are working on a fix." / "Added to backlog."
* **Verdict:** **ACTIVE** (Waiting on Maintainer).
* **Action:** No action taken. The issue remains open without stale labels.
### 4. Lifecycle Management
* **Marking Stale:** After `STALE_HOURS_THRESHOLD` (default: 7 days) of inactivity following a maintainer's question.
* **Closing:** After `CLOSE_HOURS_AFTER_STALE_THRESHOLD` (default: 7 days) of continued inactivity while marked stale.
---
## Performance & Safety
* **GraphQL Optimized:** Fetches comments, edits, labels, and timeline events in a single network request to minimize latency and API quota usage.
* **Search API Filtering:** Uses the GitHub Search API to pre-filter issues created recently, ensuring the bot doesn't waste cycles analyzing brand-new issues.
* **Rate Limit Aware:** Includes intelligent sleeping and retry logic (exponential backoff) to handle GitHub API rate limits (HTTP 429) gracefully.
* **Execution Metrics:** Logs the time taken and API calls consumed for every issue processed.
---
## Configuration
The agent is configured entirely via environment variables, which should be set as secrets in the GitHub Actions workflow environment.
The agent is configured via environment variables, typically set as secrets in GitHub Actions.
### Required Secrets
| Secret Name | Description |
| :--- | :--- |
| `GITHUB_TOKEN` | A GitHub Personal Access Token (PAT) with the required permissions. It's recommended to use a PAT from a dedicated "bot" account. |
| `GOOGLE_API_KEY` | An API key for the Google AI (Gemini) model used for the agent's reasoning. |
### Required PAT Permissions
The `GITHUB_TOKEN` requires the following **Repository Permissions**:
* **Issues**: `Read & write` (to read issues, add labels, comment, and close)
* **Administration**: `Read-only` (to read the list of repository collaborators/maintainers)
| `GITHUB_TOKEN` | A GitHub Personal Access Token (PAT) or Service Account Token with `repo` scope. |
| `GOOGLE_API_KEY` | An API key for the Google AI (Gemini) model used for reasoning. |
### Optional Configuration
These environment variables can be set in the workflow file to override the defaults in `settings.py`.
These variables control the timing thresholds and model selection.
| Variable Name | Description | Default |
| :--- | :--- | :--- |
| `STALE_HOURS_THRESHOLD` | The number of hours of inactivity after a maintainer's question before an issue is marked as `stale`. | `168` (7 days) |
| `CLOSE_HOURS_AFTER_STALE_THRESHOLD` | The number of hours after being marked `stale` before an issue is closed. | `168` (7 days) |
| `ISSUES_PER_RUN` | The maximum number of oldest open issues to process in a single workflow run. | `100` |
| `LLM_MODEL_NAME`| LLM model to use. | `gemini-2.5-flash` |
| `STALE_HOURS_THRESHOLD` | Hours of inactivity after a maintainer's question before marking as `stale`. | `168` (7 days) |
| `CLOSE_HOURS_AFTER_STALE_THRESHOLD` | Hours after being marked `stale` before the issue is closed. | `168` (7 days) |
| `LLM_MODEL_NAME`| The specific Gemini model version to use. | `gemini-2.5-flash` |
| `OWNER` | Repository owner (auto-detected in Actions). | (Environment dependent) |
| `REPO` | Repository name (auto-detected in Actions). | (Environment dependent) |
---
## Deployment
To deploy this agent, a GitHub Actions workflow file (`.github/workflows/stale-bot.yml`) is included. This workflow runs on a daily schedule and executes the agent's main script.
To deploy this agent, a GitHub Actions workflow file (`.github/workflows/stale-bot.yml`) is recommended.
### Directory Structure Note
Because this agent resides within the `adk-python` package structure, the workflow must ensure the script is executed correctly to handle imports.
Ensure the necessary repository secrets are configured and the `stale` and `request clarification` labels exist in the repository.
agent.py: diff suppressed (file too large to display)
main.py (+155 -34)
@@ -15,10 +15,17 @@
import asyncio
import logging
import time
from typing import Tuple
from adk_stale_agent.agent import root_agent
from adk_stale_agent.settings import CONCURRENCY_LIMIT
from adk_stale_agent.settings import OWNER
from adk_stale_agent.settings import REPO
from adk_stale_agent.settings import SLEEP_BETWEEN_CHUNKS
from adk_stale_agent.settings import STALE_HOURS_THRESHOLD
from adk_stale_agent.utils import get_api_call_count
from adk_stale_agent.utils import get_old_open_issue_numbers
from adk_stale_agent.utils import reset_api_call_count
from google.adk.cli.utils import logs
from google.adk.runners import InMemoryRunner
from google.genai import types
@@ -26,49 +33,163 @@ from google.genai import types
logs.setup_adk_logger(level=logging.INFO)
logger = logging.getLogger("google_adk." + __name__)
APP_NAME = "adk_stale_agent_app"
USER_ID = "adk_stale_agent_user"
APP_NAME = "stale_bot_app"
USER_ID = "stale_bot_user"
async def process_single_issue(issue_number: int) -> Tuple[float, int]:
"""
Processes a single GitHub issue using the AI agent and logs execution metrics.
Args:
issue_number (int): The GitHub issue number to audit.
Returns:
Tuple[float, int]: A tuple containing:
- duration (float): Time taken to process the issue in seconds.
- api_calls (int): The number of API calls made during this specific execution.
Raises:
Exception: catches generic exceptions to prevent one failure from stopping the batch.
"""
start_time = time.perf_counter()
start_api_calls = get_api_call_count()
logger.info(f"Processing Issue #{issue_number}...")
logger.debug(f"#{issue_number}: Initializing runner and session.")
try:
runner = InMemoryRunner(agent=root_agent, app_name=APP_NAME)
session = await runner.session_service.create_session(
user_id=USER_ID, app_name=APP_NAME
)
prompt_text = f"Audit Issue #{issue_number}."
prompt_message = types.Content(
role="user", parts=[types.Part(text=prompt_text)]
)
logger.debug(f"#{issue_number}: Sending prompt to agent.")
async for event in runner.run_async(
user_id=USER_ID, session_id=session.id, new_message=prompt_message
):
if (
event.content
and event.content.parts
and hasattr(event.content.parts[0], "text")
):
text = event.content.parts[0].text
if text:
clean_text = text[:150].replace("\n", " ")
logger.info(f"#{issue_number} Decision: {clean_text}...")
except Exception as e:
logger.error(f"Error processing issue #{issue_number}: {e}", exc_info=True)
duration = time.perf_counter() - start_time
end_api_calls = get_api_call_count()
issue_api_calls = end_api_calls - start_api_calls
logger.info(
f"Issue #{issue_number} finished in {duration:.2f}s "
f"with ~{issue_api_calls} API calls."
)
return duration, issue_api_calls
async def main():
"""Initializes and runs the stale issue agent."""
logger.info("--- Starting Stale Agent Run ---")
runner = InMemoryRunner(agent=root_agent, app_name=APP_NAME)
session = await runner.session_service.create_session(
user_id=USER_ID, app_name=APP_NAME
"""
Main entry point to run the stale issue bot concurrently.
Fetches old issues and processes them in batches to respect API rate limits
and concurrency constraints.
"""
logger.info(f"--- Starting Stale Bot for {OWNER}/{REPO} ---")
logger.info(f"Concurrency level set to {CONCURRENCY_LIMIT}")
reset_api_call_count()
filter_days = STALE_HOURS_THRESHOLD / 24
logger.debug(f"Fetching issues older than {filter_days:.2f} days...")
try:
all_issues = get_old_open_issue_numbers(OWNER, REPO, days_old=filter_days)
except Exception as e:
logger.critical(f"Failed to fetch issue list: {e}", exc_info=True)
return
total_count = len(all_issues)
search_api_calls = get_api_call_count()
if total_count == 0:
logger.info("No issues matched the criteria. Run finished.")
return
logger.info(
f"Found {total_count} issues to process. "
f"(Initial search used {search_api_calls} API calls)."
)
prompt_text = (
"Find and process all open issues to manage staleness according to your"
" rules."
)
logger.info(f"Initial Agent Prompt: {prompt_text}\n")
prompt_message = types.Content(
role="user", parts=[types.Part(text=prompt_text)]
total_processing_time = 0.0
total_issue_api_calls = 0
processed_count = 0
# Process the list in chunks of size CONCURRENCY_LIMIT
for i in range(0, total_count, CONCURRENCY_LIMIT):
chunk = all_issues[i : i + CONCURRENCY_LIMIT]
current_chunk_num = i // CONCURRENCY_LIMIT + 1
logger.info(
f"--- Starting chunk {current_chunk_num}: Processing issues {chunk} ---"
)
tasks = [process_single_issue(issue_num) for issue_num in chunk]
results = await asyncio.gather(*tasks)
for duration, api_calls in results:
total_processing_time += duration
total_issue_api_calls += api_calls
processed_count += len(chunk)
logger.info(
f"--- Finished chunk {current_chunk_num}. Progress:"
f" {processed_count}/{total_count} ---"
)
if (i + CONCURRENCY_LIMIT) < total_count:
logger.debug(
f"Sleeping for {SLEEP_BETWEEN_CHUNKS}s to respect rate limits..."
)
await asyncio.sleep(SLEEP_BETWEEN_CHUNKS)
total_api_calls_for_run = search_api_calls + total_issue_api_calls
avg_time_per_issue = (
total_processing_time / total_count if total_count > 0 else 0
)
async for event in runner.run_async(
user_id=USER_ID, session_id=session.id, new_message=prompt_message
):
if (
event.content
and event.content.parts
and hasattr(event.content.parts[0], "text")
):
# Print the agent's "thoughts" and actions for logging purposes
logger.debug(f"** {event.author} (ADK): {event.content.parts[0].text}")
logger.info(f"--- Stale Agent Run Finished---")
logger.info("--- Stale Agent Run Finished ---")
logger.info(f"Successfully processed {processed_count} issues.")
logger.info(f"Total API calls made this run: {total_api_calls_for_run}")
logger.info(
f"Average processing time per issue: {avg_time_per_issue:.2f} seconds."
)
if __name__ == "__main__":
start_time = time.time()
logger.info(f"Initializing stale agent for repository: {OWNER}/{REPO}")
logger.info("-" * 80)
start_time = time.perf_counter()
asyncio.run(main())
try:
asyncio.run(main())
except KeyboardInterrupt:
logger.warning("Bot execution interrupted manually.")
except Exception as e:
logger.critical(f"Unexpected fatal error: {e}", exc_info=True)
logger.info("-" * 80)
end_time = time.time()
duration = end_time - start_time
logger.info(f"Script finished in {duration:.2f} seconds.")
duration = time.perf_counter() - start_time
logger.info(f"Full audit finished in {duration/60:.2f} minutes.")
settings.py
@@ -33,7 +33,6 @@ STALE_LABEL_NAME = "stale"
REQUEST_CLARIFICATION_LABEL = "request clarification"
# --- THRESHOLDS IN HOURS ---
# These values can be overridden in a .env file for rapid testing (e.g., STALE_HOURS_THRESHOLD=1)
# Default: 168 hours (7 days)
# The number of hours of inactivity after a maintainer comment before an issue is marked as stale.
STALE_HOURS_THRESHOLD = float(os.getenv("STALE_HOURS_THRESHOLD", 168))
@@ -44,6 +43,21 @@ CLOSE_HOURS_AFTER_STALE_THRESHOLD = float(
os.getenv("CLOSE_HOURS_AFTER_STALE_THRESHOLD", 168)
)
# --- BATCH SIZE CONFIGURATION ---
# The maximum number of oldest open issues to process in a single run of the bot.
ISSUES_PER_RUN = int(os.getenv("ISSUES_PER_RUN", 100))
# --- Performance Configuration ---
# The number of issues to process concurrently.
# Higher values are faster but increase the immediate rate of API calls
CONCURRENCY_LIMIT = int(os.getenv("CONCURRENCY_LIMIT", 3))
# --- GraphQL Query Limits ---
# The number of most recent comments to fetch for context analysis.
GRAPHQL_COMMENT_LIMIT = int(os.getenv("GRAPHQL_COMMENT_LIMIT", 30))
# The number of most recent description edits to fetch.
GRAPHQL_EDIT_LIMIT = int(os.getenv("GRAPHQL_EDIT_LIMIT", 10))
# The number of most recent timeline events (labels, renames, reopens) to fetch.
GRAPHQL_TIMELINE_LIMIT = int(os.getenv("GRAPHQL_TIMELINE_LIMIT", 20))
# --- Rate Limiting ---
# Time in seconds to wait between processing chunks.
SLEEP_BETWEEN_CHUNKS = float(os.getenv("SLEEP_BETWEEN_CHUNKS", 1.5))
utils.py (+222 -21)
@@ -12,48 +12,249 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from datetime import datetime
from datetime import timedelta
from datetime import timezone
import logging
import threading
from typing import Any
from typing import Dict
from typing import List
from typing import Optional
from adk_stale_agent.settings import GITHUB_TOKEN
from adk_stale_agent.settings import STALE_HOURS_THRESHOLD
import dateutil.parser
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
logger = logging.getLogger("google_adk." + __name__)
# --- API Call Counter for Monitoring ---
_api_call_count = 0
_counter_lock = threading.Lock()
def get_api_call_count() -> int:
"""
Returns the total number of API calls made since the last reset.
Returns:
int: The global count of API calls.
"""
with _counter_lock:
return _api_call_count
def reset_api_call_count() -> None:
"""Resets the global API call counter to zero."""
global _api_call_count
with _counter_lock:
_api_call_count = 0
def _increment_api_call_count() -> None:
"""
Atomically increments the global API call counter.
Required because the agent may run tools in parallel threads.
"""
global _api_call_count
with _counter_lock:
_api_call_count += 1
# --- Production-Ready HTTP Session with Exponential Backoff ---
# Configure the retry strategy:
retry_strategy = Retry(
total=6,
backoff_factor=2,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=[
"HEAD",
"GET",
"POST",
"PUT",
"DELETE",
"OPTIONS",
"TRACE",
"PATCH",
],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
# Create a single, reusable Session object for connection pooling
_session = requests.Session()
_session.mount("https://", adapter)
_session.mount("http://", adapter)
_session.headers.update({
"Authorization": f"token {GITHUB_TOKEN}",
"Accept": "application/vnd.github.v3+json",
})
def get_request(url: str, params: dict[str, Any] | None = None) -> Any:
"""Sends a GET request to the GitHub API."""
response = _session.get(url, params=params or {}, timeout=60)
response.raise_for_status()
return response.json()
def get_request(url: str, params: Optional[Dict[str, Any]] = None) -> Any:
"""
Sends a GET request to the GitHub API with automatic retries.
Args:
url (str): The URL endpoint.
params (Optional[Dict[str, Any]]): Query parameters.
Returns:
Any: The JSON response parsed into a dict or list.
Raises:
requests.exceptions.RequestException: If retries are exhausted.
"""
_increment_api_call_count()
try:
response = _session.get(url, params=params or {}, timeout=60)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
logger.error(f"GET request failed for {url}: {e}")
raise
def post_request(url: str, payload: Any) -> Any:
"""Sends a POST request to the GitHub API."""
response = _session.post(url, json=payload, timeout=60)
response.raise_for_status()
return response.json()
"""
Sends a POST request to the GitHub API with automatic retries.
Args:
url (str): The URL endpoint.
payload (Any): The JSON payload.
Returns:
Any: The JSON response.
"""
_increment_api_call_count()
try:
response = _session.post(url, json=payload, timeout=60)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
logger.error(f"POST request failed for {url}: {e}")
raise
def patch_request(url: str, payload: Any) -> Any:
"""Sends a PATCH request to the GitHub API."""
response = _session.patch(url, json=payload, timeout=60)
response.raise_for_status()
return response.json()
"""
Sends a PATCH request to the GitHub API with automatic retries.
Args:
url (str): The URL endpoint.
payload (Any): The JSON payload.
Returns:
Any: The JSON response.
"""
_increment_api_call_count()
try:
response = _session.patch(url, json=payload, timeout=60)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
logger.error(f"PATCH request failed for {url}: {e}")
raise
def delete_request(url: str) -> Any:
"""Sends a DELETE request to the GitHub API."""
response = _session.delete(url, timeout=60)
response.raise_for_status()
if response.status_code == 204:
return {"status": "success"}
return response.json()
"""
Sends a DELETE request to the GitHub API with automatic retries.
Args:
url (str): The URL endpoint.
Returns:
Any: A success dict if 204, else the JSON response.
"""
_increment_api_call_count()
try:
response = _session.delete(url, timeout=60)
response.raise_for_status()
if response.status_code == 204:
return {"status": "success", "message": "Deletion successful."}
return response.json()
except requests.exceptions.RequestException as e:
logger.error(f"DELETE request failed for {url}: {e}")
raise
def error_response(error_message: str) -> dict[str, Any]:
"""Creates a standardized error dictionary for the agent."""
def error_response(error_message: str) -> Dict[str, Any]:
"""
Creates a standardized error response dictionary for tool outputs.
Args:
error_message (str): The error details.
Returns:
Dict[str, Any]: Standardized error object.
"""
return {"status": "error", "message": error_message}
def get_old_open_issue_numbers(
owner: str, repo: str, days_old: Optional[float] = None
) -> List[int]:
"""
Finds open issues older than the specified threshold using server-side filtering.
OPTIMIZATION:
Instead of fetching ALL issues and filtering in Python (which wastes API calls),
this uses the GitHub Search API `created:<DATE` syntax.
Args:
owner (str): Repository owner.
repo (str): Repository name.
days_old (Optional[float]): Filter issues older than this many days.
Defaults to STALE_HOURS_THRESHOLD / 24.
Returns:
List[int]: A list of issue numbers matching the criteria.
"""
if days_old is None:
days_old = STALE_HOURS_THRESHOLD / 24
now_utc = datetime.now(timezone.utc)
cutoff_dt = now_utc - timedelta(days=days_old)
cutoff_str = cutoff_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
query = f"repo:{owner}/{repo} is:issue state:open created:<{cutoff_str}"
logger.info(
f"Searching for issues in '{owner}/{repo}' created before {cutoff_str}..."
)
issue_numbers = []
page = 1
url = "https://api.github.com/search/issues"
while True:
params = {"q": query, "per_page": 100, "page": page}
try:
data = get_request(url, params=params)
items = data.get("items", [])
if not items:
break
for item in items:
if "pull_request" not in item:
issue_numbers.append(item["number"])
if len(items) < 100:
break
page += 1
except requests.exceptions.RequestException as e:
logger.error(f"GitHub search failed on page {page}: {e}")
break
logger.info(f"Found {len(issue_numbers)} stale issues.")
return issue_numbers