Amazon Ad Spend Pipeline Timeout Fix

Summary

The eComHD nightly Amazon pipeline itself runs fine, but the advertising-spend feed had gone silently stale for 9 days (last good data 2026-05-26). Root cause: Amazon Ads reports sat in PENDING past the script's hardcoded 900s poll cap, so the pull timed out. Fix: raised the cap to 1800s, parallelized the three ad pulls (SP, SB, SD), and backfilled the gap. Verified fixed across two live nightly runs. Two hardening items (alerting, settlement false alarm) remain open pending Ace's input.

Verified (cite the source before reusing any of this)

  • Nightly job is com.stickymetrics.daily launchd plist, runs ~/stickymetrics/scripts/nightly_refresh.sh at 06:00 daily (source: ~/Library/LaunchAgents/com.stickymetrics.daily.plist, read this session).
  • ~/stickymetrics is NOT a git repo, so edits to the scripts go live directly with no commit/CI step (source: git rev-parse returned exit 128 this session).
  • Amazon Ads async report latency measured at ~850s end-to-end: PENDING until 830s, PROCESSING at 830s, COMPLETED at 850s (source: manual SP test run, /tmp/ad_sp_test.log and task bbz2en5i3 output, 2026-06-04). The failed 2026-06-04 nightly report was still PENDING at 907s and then timed out (source: ~/.stickymetrics/logs/nightly_2026-06-04.log).
  • The 900s cap was hardcoded in pull_ad_spend.py:146 (poll_ads_report default max_wait=900); main() called it without overriding (source: read pull_ad_spend.py this session, pre-edit).
  • Nightly re-pulls --days 7 (last 7 days ending yesterday), which is the only self-heal mechanism; gaps older than 7 days age out of the window permanently (source: pull_ad_spend.py:231-233).
  • No active alerting exists on this pipeline. freshness_check.py only calls sys.exit(1) and prints to the log (source: grep of freshness_check.py, only hit was line 108 sys.exit(1)). nightly_refresh.sh runs the check via an EXIT trap and prints to the log, with no webhook (source: nightly_refresh.sh lines 30-40). w4-health-reporter.sh does not reference stickymetrics (source: grep returned nothing). last_pipeline_status.json is written, but nothing consumes it.
  • Backfill loaded for the gap window 2026-05-27 -> 2026-06-03: SP 3,547 rows, SD 404 rows, SB 20 rows (source: psql COUNT by ad_product this session, 2026-06-04).
  • Fix verified in two live nightly runs after the edit: 2026-06-05 loaded SP 3,144 / SD 378 / SB 15; 2026-06-06 loaded SP 3,125 / SD 378 / SB 19; zero timeouts in either (source: ~/.stickymetrics/logs/nightly_2026-06-05.log and nightly_2026-06-06.log, read 2026-06-06).
  • All three ad products now show latest = 2026-06-05; freshness check reports ad_spend_daily OK age=1d (source: freshness_check.py run this session, 2026-06-06).
  • A separate change to the bottom of nightly_refresh.sh (baseline self-heal via tag_baseline.py, lines 147-154) was made by someone else, not part of the ad fix (source: nightly_refresh.sh current contents).
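
The poll-cap behaviour described above can be illustrated with a minimal sketch. This is not the code in pull_ad_spend.py: the function name poll_until_done, the ReportTimeout exception, the fetch_status callable, and the 10s default interval are all hypothetical, chosen only to show how a max_wait / poll_interval pair bounds the wait. The status names PENDING, PROCESSING, and COMPLETED are the ones seen in the logs.

```python
import time
from typing import Callable


class ReportTimeout(Exception):
    """Raised when the report has not reached COMPLETED by the cap (hypothetical name)."""


def poll_until_done(
    fetch_status: Callable[[], str],   # hypothetical: returns the report's current status
    max_wait: float = 1800.0,          # cap in seconds (the note raised this from 900 to 1800)
    poll_interval: float = 10.0,       # assumed interval; the real default is not in the note
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the status is COMPLETED and return the elapsed seconds."""
    start = clock()
    while True:
        status = fetch_status()
        elapsed = clock() - start
        if status == "COMPLETED":
            return elapsed
        if elapsed >= max_wait:
            # With a 900s cap, a report that completes at ~850-910s lands on
            # the wrong side of this check on a slow night.
            raise ReportTimeout(
                f"still {status} after {elapsed:.0f}s (cap {max_wait:.0f}s)"
            )
        sleep(poll_interval)
```

Passing clock and sleep in as parameters is only there so the loop can be exercised without waiting; a report that completes at 850s returns normally under either cap, while one still PENDING at the cap raises.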

Assumptions - DO NOT cite as fact

  • The persistent every-night failure since June 1 is Amazon-side queue latency hovering at the 900s boundary, not a code/auth regression. (Inferred from report creation succeeding + 429 handling present; not independently confirmed against Amazon Ads API status docs.)
  • settlement_events showing 2 days stale is normal Amazon settlement finalization lag, not a break. (Inferred from continuous daily data through 6/4 with declining recent-day counts; not confirmed against Amazon settlement posting cadence docs.)
  • 1800s gives "comfortable headroom" over the ~850-910s observed latency. (Holds for the latency range measured over a few days; not a guarantee against future Amazon queue degradation.)

Open verifications (next session must close these)

  • [ ] Confirm 1800s cap holds over a longer window (watch the next ~5-7 nightly logs for any report that does not complete within 1800s).
  • [ ] Confirm whether settlement_events ever lands within the 26h threshold, or is structurally always 1-2 days late (query distinct posted dates vs run dates over a month before tuning the threshold).
  • [ ] Identify a LIVE alert channel + working webhook before wiring detection (memory note flags the SIFT Discord token as 401/dead; do not assume any existing webhook works).

Decisions

  • Raise poll cap to 1800s rather than redesign to request-now/fetch-later: smallest change that clears the measured latency with margin; async refactor was higher risk for an eComHD-side script.
  • Run SP/SB/SD pulls in parallel (bash & + wait) instead of serial: the reports are independent and write distinct ad_product rows, so there is no DB conflict; cuts wall time from ~45 min to ~15 min so the 6 AM batch is not blocked.
  • Did NOT change alerting thresholds or wire alerts without Ace's call: avoids silently altering what pings him and picking a channel he did not choose.
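
The wall-time claim in the parallelization decision follows from simple arithmetic: serial pulls cost the sum of the report latencies, parallel pulls cost the slowest one. A rough sketch, assuming each of the three reports takes about the ~850s measured for SP (the SB and SD figures are an assumption, not measured in this note):

```python
from typing import Mapping


def wall_time_seconds(latencies: Mapping[str, float], parallel: bool) -> float:
    """Serial runs add up; parallel runs finish when the slowest report does."""
    values = list(latencies.values())
    return max(values) if parallel else sum(values)


# Assumption: SB and SD behave like the one measured SP run (~850s).
latencies = {"SP": 850.0, "SB": 850.0, "SD": 850.0}

serial = wall_time_seconds(latencies, parallel=False)
parallel = wall_time_seconds(latencies, parallel=True)
print(f"serial ~{serial / 60:.1f} min, parallel ~{parallel / 60:.1f} min")
# prints: serial ~42.5 min, parallel ~14.2 min
```

The same arithmetic gives the worst case under the new cap: three reports that each run to 1800s would take 90 min serially but 30 min in parallel.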

Remaining work

  1. (Needs no input) Add per-source freshness thresholds so settlement_events gets ~72h and stops printing "PIPELINE NOT HEALTHY" daily (cry-wolf kills the signal).
  2. (Needs Ace input: which channel) Wire a real stale-feed alert to a working webhook so a multi-day outage pings within ~2 days instead of running dark for 9 days.
  3. (Optional) Widen the self-heal window from --days 7 to --days 14 (or 30 for ad spend) so a longer outage still auto-backfills before any gap goes permanent.
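
Item 1 could look roughly like the sketch below. It is not the logic in freshness_check.py: the names DEFAULT_MAX_AGE_HOURS, MAX_AGE_HOURS_BY_SOURCE, and is_stale are hypothetical, and treating 26h as the default for every other source is an assumption based on the settlement threshold mentioned above. The 72h value for settlement_events is the one proposed in item 1 and still depends on the open settlement-cadence check.

```python
# Assumption: 26h is the current global threshold; the note only confirms it
# for settlement_events.
DEFAULT_MAX_AGE_HOURS = 26.0

# Per-source overrides (hypothetical structure). 72h is the value proposed in
# item 1, pending the month-long settlement cadence check.
MAX_AGE_HOURS_BY_SOURCE = {
    "settlement_events": 72.0,
}


def is_stale(source: str, age_hours: float) -> bool:
    """Return True when a source's newest data is older than its own limit."""
    limit = MAX_AGE_HOURS_BY_SOURCE.get(source, DEFAULT_MAX_AGE_HOURS)
    return age_hours > limit


# settlement_events at 2 days old no longer trips the check ...
print(is_stale("settlement_events", 48.0))   # False
# ... while any source without an override still does at the same age.
print(is_stale("ad_spend_daily", 48.0))      # True
```

Keeping the overrides in one mapping means a source only gets a looser limit when someone adds it deliberately, so the default stays strict for feeds like ad spend.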

Source artifacts

  • ~/stickymetrics/scripts/pull_ad_spend.py (edited: added --max-wait / --poll-interval, wired into main())
  • ~/stickymetrics/scripts/nightly_refresh.sh (edited: AD_MAX_WAIT=1800, parallel SP/SB/SD pulls, lines 129-145)
  • ~/stickymetrics/scripts/freshness_check.py (read only)
  • ~/Library/LaunchAgents/com.stickymetrics.daily.plist
  • ~/.stickymetrics/logs/nightly_2026-06-04.log, nightly_2026-06-05.log, nightly_2026-06-06.log
  • ~/.stickymetrics/last_pipeline_status.json
  • DB: stickymetrics Postgres, raw.ad_spend_daily
  • [[2026-05-25 SM Dev Handoff And Supabase Question]]
  • Memory: project_sift_sync_outage_may2026 (same class of silent-failure gap, dead Discord alert token)