Skip to content

Chaos — OBP Session Loss Mid-Operation

Failure injection runbook: simulate operator notebook WiFi drop mid-CLONER refresh + validate graceful error handling.

Goal

Confirm that an OBP session loss during a CLONER fn fetch (Internet egress) surfaces a clean error to the operator without corrupting bench state.

Prerequisites

  • Test bench (not production)
  • Operator notebook with OBP installed + authorized
  • An active CLONER fn running (e.g. catalog refresh fn #4)
  • Operator role with chaos permissions

Procedure

  1. Authorize an OBP session via dashboard /admin/obp/authorize
  2. Start a CLONER catalog refresh:
    curl -X POST https://dashboard.test-bench/api/cloner/refresh-catalog
    
  3. Mid-fetch (within 10s), kill the OBP daemon on the notebook:

    # macOS
    killall obp-daemon
    
    # Linux
    sudo systemctl stop obp
    

  4. Watch for bench reaction (target: < 5s detection):

    watch -n 1 'curl -fsS https://dashboard.test-bench/api/cloner/status'
    

  5. Expected behavior:

  6. CLONER egress request returns clean error (no hang)
  7. Dashboard shows "OBP session lost" with retry option
  8. No partial catalog write to local cache
  9. Audit log captures the truncated session
  10. Alert fires (OBPSessionLoss)

Rollback

  1. Restart OBP daemon:

    # macOS
    open /Applications/OBP\ Daemon.app
    
    # Linux
    sudo systemctl start obp
    

  2. Re-authorize via dashboard

  3. Retry the CLONER operation (idempotent)

Success criteria

  • OBP loss detected within 5s
  • Clean error returned to operator
  • No partial state written
  • Alert fired + audit logged
  • Retry after restart works idempotently