Amazon Ad Spend Pipeline Timeout Fix
Summary
The eComHD nightly Amazon pipeline runs fine, but the advertising-spend feed had gone silently stale for 9 days (last good data 2026-05-26). Root cause: Amazon's Ads report queue sits in PENDING past the script's hardcoded 900s poll cap. Raised the cap to 1800s, parallelized the three ad pulls, backfilled the gap. Verified fixed across two live nightly runs. Two hardening items (alerting, settlement false alarm) remain open pending Ace's input.
Verified (cite the source before reusing any of this)
- Nightly job is `com.stickymetrics.daily` launchd plist, runs `~/stickymetrics/scripts/nightly_refresh.sh` at 06:00 daily (source: `~/Library/LaunchAgents/com.stickymetrics.daily.plist`, read this session).
- `~/stickymetrics` is NOT a git repo, so edits to the scripts go live directly with no commit/CI step (source: `git rev-parse` returned exit 128 this session).
- Amazon Ads async report latency measured at ~850s end-to-end: PENDING until 830s, PROCESSING at 830s, COMPLETED at 850s (source: manual SP test run `/tmp/ad_sp_test.log` / task bbz2en5i3 output, 2026-06-04). Today's failed nightly report crossed 907s still PENDING then timed out (source: `~/.stickymetrics/logs/nightly_2026-06-04.log`).
- The 900s cap was hardcoded in `pull_ad_spend.py:146` (`poll_ads_report` default `max_wait=900`); `main()` called it without overriding (source: read `pull_ad_spend.py` this session, pre-edit).
- Nightly re-pulls `--days 7` (last 7 days ending yesterday), which is the only self-heal mechanism; gaps older than 7 days age out of the window permanently (source: `pull_ad_spend.py:231-233`).
- No active alerting exists on this pipeline: `freshness_check.py` only `sys.exit(1)` + prints to log (source: grep of `freshness_check.py`, only hit was line 108 `sys.exit(1)`). `nightly_refresh.sh` runs the check via an EXIT trap and prints to log, no webhook (source: `nightly_refresh.sh` lines 30-40). `w4-health-reporter.sh` does not reference stickymetrics (source: grep returned nothing). `last_pipeline_status.json` is written and consumed by nothing.
- Backfill loaded for the gap window 2026-05-27 -> 2026-06-03: SP 3,547 rows, SD 404 rows, SB 20 rows (source: `psql` COUNT by ad_product this session, 2026-06-04).
- Fix verified in two live nightly runs after the edit: 2026-06-05 loaded SP 3,144 / SD 378 / SB 15; 2026-06-06 loaded SP 3,125 / SD 378 / SB 19; zero timeouts in either (source: `~/.stickymetrics/logs/nightly_2026-06-05.log` and `nightly_2026-06-06.log`, read 2026-06-06).
- All three ad products now latest = 2026-06-05; freshness check reports `ad_spend_daily OK age=1d` (source: `freshness_check.py` run this session, 2026-06-06).
- A separate change to the bottom of `nightly_refresh.sh` (baseline self-heal via `tag_baseline.py`, lines 147-154) was made by someone else, not part of the ad fix (source: nightly_refresh.sh current contents).
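The capped polling behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the contents of `pull_ad_spend.py`: the function name `poll_report`, the `get_status` callable, and the injectable `clock`/`sleep` parameters are assumptions made so the sketch can be exercised without waiting; only the `max_wait` and poll-interval ideas and the PENDING/PROCESSING/COMPLETED statuses come from the notes.

```python
import time


def poll_report(get_status, max_wait=1800, poll_interval=30,
                clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns "COMPLETED" or max_wait seconds pass.

    get_status is a hypothetical zero-argument callable returning the report
    status string. Returns the elapsed seconds on success and raises
    TimeoutError once the cap is reached.
    """
    start = clock()
    while True:
        status = get_status()
        elapsed = clock() - start
        if status == "COMPLETED":
            return elapsed
        if elapsed >= max_wait:
            # Message wording is an assumption modelled on the log phrase
            # the notes say to watch for.
            raise TimeoutError(
                f"report not done in {max_wait}s (last status: {status})"
            )
        # Never sleep past the cap.
        sleep(min(poll_interval, max_wait - elapsed))
```

With the measured timeline (PENDING until 830s, COMPLETED at 850s), a cap of 900 leaves about 50s of slack, while 1800 leaves about 950s; a report still PENDING at 907s fails under the first cap and passes under the second.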
Assumptions - DO NOT cite as fact
- The persistent every-night failure since June 1 is Amazon-side queue latency hovering at the 900s boundary, not a code/auth regression. (Inferred from report creation succeeding + 429 handling present; not independently confirmed against Amazon Ads API status docs.)
- `settlement_events` showing 2 days stale is normal Amazon settlement finalization lag, not a break. (Inferred from continuous daily data through 6/4 with declining recent-day counts; not confirmed against Amazon settlement posting cadence docs.)
- 1800s gives "comfortable headroom" over the ~850-910s observed latency. (Holds for the latency range measured over a few days; not a guarantee against future Amazon queue degradation.)
Open verifications (next session must close these)
- [ ] Confirm 1800s cap holds over a longer window (watch the next ~5-7 nightly logs for any `not done in 1800s`).
- [ ] Confirm whether `settlement_events` ever lands within the 26h threshold, or is structurally always 1-2 days late (query distinct posted dates vs run dates over a month before tuning the threshold).
- [ ] Identify a LIVE alert channel + working webhook before wiring detection (memory note flags the SIFT Discord token as 401/dead; do not assume any existing webhook works).
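For the `settlement_events` check, one possible way to summarize a month of observations is sketched below. The input shape is an assumption: a list of `(run_at, latest_posted_at)` datetime pairs, one per nightly run, which would have to be pulled from the database first. The function name `summarize_lag` is hypothetical.

```python
from datetime import datetime
from statistics import median


def summarize_lag(observations, threshold_hours=26.0):
    """Summarize how stale the newest settlement data was at each run.

    observations: iterable of (run_at, latest_posted_at) datetime pairs
    (assumed shape). Returns min/median/max lag in hours plus the share
    of runs that fell within threshold_hours.
    """
    lags = [
        (run_at - latest_posted_at).total_seconds() / 3600.0
        for run_at, latest_posted_at in observations
    ]
    if not lags:
        raise ValueError("no observations to summarize")
    within = sum(1 for hours in lags if hours <= threshold_hours)
    return {
        "min_h": min(lags),
        "median_h": median(lags),
        "max_h": max(lags),
        "share_within": within / len(lags),
    }
```

If `share_within` comes back at or near zero across the month, the feed is structurally late and the threshold, not the pipeline, is what needs changing.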
Decisions
- Raise poll cap to 1800s rather than redesign to request-now/fetch-later: smallest change that clears the measured latency with margin; async refactor was higher risk for an eComHD-side script.
- Run SP/SB/SD pulls in parallel (bash `&` + `wait`) instead of serial: independent reports + distinct ad_product rows means no DB conflict; cuts wall time from ~45 min to ~15 min so the 6 AM batch is not blocked.
- Did NOT change alerting thresholds or wire alerts without Ace's call: avoids silently altering what pings him and picking a channel he did not choose.
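The nightly script does the fan-out in bash with `&` and `wait`. The same start-all-then-join shape, written in Python purely for illustration, might look like the sketch below. It is not what `nightly_refresh.sh` runs: `run_pulls_in_parallel` and the `pull_one` callable are hypothetical names, and threads stand in for the background shell jobs.

```python
from concurrent.futures import ThreadPoolExecutor

AD_PRODUCTS = ("SP", "SB", "SD")


def run_pulls_in_parallel(pull_one, products=AD_PRODUCTS):
    """Start one pull per ad product, wait for all, and collect outcomes.

    pull_one is a hypothetical callable taking a product code. A failure
    in one product is recorded as the exception object instead of
    aborting the others, mirroring independent background jobs.
    """
    outcomes = {}
    with ThreadPoolExecutor(max_workers=len(products)) as pool:
        futures = {product: pool.submit(pull_one, product) for product in products}
        for product, future in futures.items():
            try:
                outcomes[product] = future.result()
            except Exception as exc:  # keep going; report per product
                outcomes[product] = exc
    return outcomes
```

Because each pull waits on its own report, total wall time is roughly the slowest single pull rather than the sum of all three, which is where the ~45 min to ~15 min reduction comes from.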
Remaining work
- (Needs no input) Add per-source freshness thresholds so `settlement_events` gets ~72h and stops printing "PIPELINE NOT HEALTHY" daily (cry-wolf kills the signal).
- (Needs Ace input: which channel) Wire a real stale-feed alert to a working webhook so a multi-day outage pings within ~2 days instead of running dark for 9.
- (Optional) Widen the self-heal window from `--days 7` to `--days 14` (or 30 for ad spend) so a longer outage still auto-backfills before any gap goes permanent.
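The per-source threshold item could be approached along these lines. This is a sketch under assumptions, not the current `freshness_check.py`: the names `MAX_AGE_HOURS`, `DEFAULT_MAX_AGE_HOURS`, and `stale_sources` are hypothetical, and the input is assumed to be a mapping of source name to data age in hours. The 26h default and the ~72h settlement value are the figures mentioned in these notes.

```python
DEFAULT_MAX_AGE_HOURS = 26.0

# Per-source overrides; anything not listed uses the default.
MAX_AGE_HOURS = {
    "settlement_events": 72.0,
}


def stale_sources(ages_hours, overrides=MAX_AGE_HOURS,
                  default=DEFAULT_MAX_AGE_HOURS):
    """Return the sorted names of sources older than their own threshold.

    ages_hours: mapping of source name -> age of newest data in hours
    (assumed input shape).
    """
    return sorted(
        name
        for name, age in ages_hours.items()
        if age > overrides.get(name, default)
    )
```

With this shape, a 48h-old `settlement_events` no longer trips the check, while a 9-day-old `ad_spend_daily` still does.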
Source artifacts
- `~/stickymetrics/scripts/pull_ad_spend.py` (edited: added `--max-wait`/`--poll-interval`, wired into `main()`)
- `~/stickymetrics/scripts/nightly_refresh.sh` (edited: `AD_MAX_WAIT=1800`, parallel SP/SB/SD pulls, lines 129-145)
- `~/stickymetrics/scripts/freshness_check.py` (read only)
- `~/Library/LaunchAgents/com.stickymetrics.daily.plist`
- `~/.stickymetrics/logs/nightly_2026-06-04.log`, `nightly_2026-06-05.log`, `nightly_2026-06-06.log`
- `~/.stickymetrics/last_pipeline_status.json`
- DB: `stickymetrics` Postgres, `raw.ad_spend_daily`
Related
- [[2026-05-25 SM Dev Handoff And Supabase Question]]
- Memory: project_sift_sync_outage_may2026 (same class of silent-failure gap, dead Discord alert token)