Quick start

The pipeline has three steps: discover, verify, and download.

1. Discover

tpwalk scrape

This runs every passive discovery source and writes one .txt file per source into a timestamped directory under data/scrapes/. To also enable the GitHub-based sources, export a token first:

export GITHUB_TOKEN=ghp_…   # enables github_search and tplink_github
tpwalk scrape

See scrape for all options.
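
To check what a scrape produced, it can help to count the URLs each source contributed. The helper below is a minimal sketch and is not part of tpwalk: the name scrape_summary is hypothetical, and it assumes that the timestamped directory names sort chronologically and that each .txt file holds one URL per line.

```shell
# Hypothetical helper (not part of tpwalk): summarize the newest scrape.
# Assumes directory names under data/scrapes/ sort chronologically
# and that each .txt file holds one URL per line.
scrape_summary() {
  root="${1:-data/scrapes}"
  # Treat the lexicographically last subdirectory as the newest run.
  latest=$(find "$root" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1)
  if [ -z "$latest" ]; then
    echo "no scrape directories under $root" >&2
    return 1
  fi
  echo "latest scrape: $latest"
  # Print "<non-empty line count> <source file>" for each source in that run.
  for f in "$latest"/*.txt; do
    [ -e "$f" ] || continue
    count=$(awk 'NF { n++ } END { print n + 0 }' "$f")
    printf '%s %s\n' "$count" "$(basename "$f")"
  done
}
```

Calling scrape_summary from the project root prints the newest scrape directory followed by one count per source file. A source that reports 0 is a reasonable first thing to look at, for example a GitHub-based source run without GITHUB_TOKEN.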

2. Verify

tpwalk verify

verify reads every .txt under data/scrapes/, normalizes and deduplicates the URLs, HEAD-checks each one against the S3 origin, and writes five files into data/:

File                   Contents
verified.json          Live archives with full S3 metadata
verified.txt           One s3:// URL per line
s5cmd_download.txt     Runnable s5cmd manifest
dead.json / dead.txt   URLs that did not resolve

See verify for details and the full output schema.
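
A quick sanity check after verify is to compare how many URLs ended up live versus dead. The following is a rough sketch rather than a tpwalk feature: verify_counts is a hypothetical name, and it assumes that dead.txt, like verified.txt, holds one URL per line.

```shell
# Hypothetical helper (not part of tpwalk): report how many URLs verify
# classified as live and as dead.
# Assumes dead.txt, like verified.txt, holds one URL per line.
verify_counts() {
  dir="${1:-data}"
  # Count non-empty lines; "n + 0" prints 0 for an empty file.
  live=$(awk 'NF { n++ } END { print n + 0 }' "$dir/verified.txt")
  dead=$(awk 'NF { n++ } END { print n + 0 }' "$dir/dead.txt")
  echo "live=$live dead=$dead"
}
```

Run from the project root, verify_counts prints a single line such as live=<n> dead=<m>. An unexpectedly high dead count may be worth checking against dead.json before downloading.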

3. Download

s5cmd --no-sign-request run data/s5cmd_download.txt

This mirrors every confirmed-live archive. The manifest uses cp --if-size-differ, so re-running it fetches only files that are missing locally or whose size differs from the remote copy.
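
Before starting a large mirror, it may be useful to see how many commands the manifest contains and to preview them. The sketch below is not part of tpwalk or s5cmd: manifest_size and preview_download are hypothetical names, the manifest is assumed to hold one command per line, and the preview assumes an s5cmd version that supports the global --dry-run flag.

```shell
# Hypothetical helpers (not part of tpwalk or s5cmd).

# Count the commands in the manifest, skipping blank lines and # comments.
# Assumes the manifest holds one command per line.
manifest_size() {
  awk 'NF && $1 !~ /^#/ { n++ } END { print n + 0 }' "${1:-data/s5cmd_download.txt}"
}

# Ask s5cmd to print the operations it would run without copying anything.
# Assumes an s5cmd version that supports the global --dry-run flag.
preview_download() {
  s5cmd --dry-run --no-sign-request run "${1:-data/s5cmd_download.txt}"
}
```

manifest_size gives a rough idea of the job's scale, and preview_download lists the operations without transferring data. Once both look reasonable, the real s5cmd run command can follow.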

Going further

If passive discovery misses files you expect to exist, run an active enumeration pass with bruteforce. Start with --dry-run to size the job.