Skip to content

tpwalk

scrape

StrongWind1/tpwalk

tpwalk

StrongWind1/tpwalk

Home
Getting started
Getting started
- Installation
- Quick start
Commands
Commands
- scrape scrape
  Table of contents
  - Options
  - Sources
  - Examples
- verify
- bruteforce
Guide
Guide
- How it works
- Responsible use

`tpwalk scrape`¶

Discover GPL archive URLs from every source and write one .txt file per source into a timestamped directory under data/scrapes/.

tpwalk scrape [OPTIONS]

Options¶

Option	Default	Description
`--data-dir`	`data`	Root data directory; the `scrapes/` subdirectory is written here.
`--model-sweep`	off	Also run the `phppage` model-wordlist sweep — heavy (thousands of requests to `www.tp-link.com`, several minutes).
`--sweep-max-models`	none	Cap the number of candidate models for `--model-sweep` (sorted; default no cap).

Sources¶

scrape runs each source concurrently under a shared async client; a failure in one source never aborts the others. Each source writes its own result file.

Source	Output file	Needs `GITHUB_TOKEN`
Regional crawl (two-pass)	`regional_crawl.txt`	—
Wayback Machine CDX	`wayback_cdx.txt`	—
Common Crawl	`common_crawl.txt`	—
GitHub code search	`github_search.txt`	✓
TP-Link GitHub	`tplink_github.txt`	✓
Mercusys regional	`mercusys_regional.txt`	—
Reddit	`reddit.txt`	—
Forums	`forums.txt`	—
Google	`google.txt`	—
Model sweep (opt-in)	`model_sweep.txt`	—

The GitHub sources are skipped with a notice unless GITHUB_TOKEN is set in the environment.

Examples¶

# Run all passive sources
tpwalk scrape

# Include the GitHub sources
GITHUB_TOKEN=ghp_… tpwalk scrape

# Add the heavy model-wordlist sweep, capped at 500 models
tpwalk scrape --model-sweep --sweep-max-models 500

# Write to a custom data directory
tpwalk scrape --data-dir /tmp/tpwalk-run

When the run finishes, tpwalk prints a summary with per-source unique/raw counts. Feed the results to verify next.