Discovery Engine v6.1 Docs
The Business Discovery Engine is an open-source Node.js tool that discovers businesses from 4 public sources, enriches them with emails, social profiles, and company intelligence, and pushes everything to Google Sheets in real time. v6.1 adds round-robin discovery, inline enrichment, and crash recovery.
Quick Start
1. Install
```bash
git clone https://github.com/itallstartedwithaidea/business-discovery-engine.git
cd business-discovery-engine
npm install
```
2. Configure
```bash
cp .env.example .env
# Edit .env — add GOOGLE_SPREADSHEET_ID and GOOGLE_CREDENTIALS_PATH
```
3. Run
```bash
# All 5 states, round-robin, 1000/category
node engine.js start --state ALL --fresh

# Background run (safe to close terminal)
nohup node engine.js start --state ALL --fresh > full-run.log 2>&1 &
tail -f full-run.log

# Single state
node engine.js start --state AZ --fresh

# Custom limits
node engine.js start --state ALL --max 500 --chunk 100 --fresh

# Specific categories
node engine.js start --state OH --categories "plumber,dentist"
```
Google Sheets Setup
The engine requires a Google Cloud service account with Sheets API access.
- Create a Google Cloud service account
- Enable the Google Sheets API in your project
- Download the JSON key file → save as google-credentials.json in the project root
- Create a new Google Sheet and copy the spreadsheet ID from the URL
- Share the Sheet with the service account email (Editor access)
- Add the spreadsheet ID and credentials path to your .env file
.gitignore excludes .env and google-credentials.json by default. Your service account key should never be in version control.
Architecture: Round-Robin + Inline Enrichment
In v6.1, the engine processes categories one at a time: it discovers a chunk of businesses, enriches them immediately, and pushes them to Sheets before rotating to the next category.
For each category (65 total):
1. DISCOVER 250 businesses (configurable --chunk)
— Shuffle: states × sources × cities
— YP in Phoenix → Yelp in Columbus → BBB in Boise → GMaps in Vegas
— Global dedup on every insert
2. FIND WEBSITES (DuckDuckGo + Google fallback)
3. ENRICH each business
— Visit website → crawl contact/about/team pages
— 9 email extraction methods
— Social media link extraction
— WHOIS → domain age → business age scoring
— Email pattern inference → MX verification
4. PUSH TO SHEETS (immediately)
— Batch of 10 rows per API call
— Each state gets its own tab
— Dashboard updates every 30 seconds
→ Save checkpoint → rotate to next category → repeat
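The rotation and batching logic above can be sketched roughly as follows. This is illustrative only: the category accounting, the `roundRobin`/`batches` helper names, and the batch size constant are assumptions, not the engine's actual internals.

```javascript
// Illustrative sketch of the v6.1 round-robin loop: hand out work for one
// category at a time, up to `chunkSize` discoveries per turn, until every
// category has reached `maxPerCategory`.
function* roundRobin(categories, chunkSize, maxPerCategory) {
  const remaining = new Map(categories.map(c => [c, maxPerCategory]));
  while (remaining.size > 0) {
    for (const [category, left] of [...remaining]) {
      const take = Math.min(chunkSize, left);
      yield { category, take };
      if (left - take <= 0) remaining.delete(category);
      else remaining.set(category, left - take);
    }
  }
}

// Split enriched rows into batches of 10 per Sheets API call,
// matching the "batch of 10 rows" push described above.
function batches(rows, size = 10) {
  const out = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}
```

With `--chunk 250` and `--max 1000`, each category would be visited four times before the run completes.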
Key differences from v5
- Inline enrichment — data flows to Sheets continuously, not after all discovery completes
- Round-robin — shuffles states and sources within each category for natural distribution
- 1,000 per category (not per state) — 65,000 max total across all states
- Chunk rotation — rotate to next category every 250 discoveries (configurable)
- Global dedup — zero duplicates across all states and sources
- --state ALL — a single command runs everything
- Per-category checkpoints — crash recovery picks up exactly where it left off
4 Discovery Sources
All sources use Puppeteer with stealth plugin for JavaScript rendering, anti-bot evasion, and human-like behavior.
| Source | Method | Coverage | Notes |
|---|---|---|---|
| Yellow Pages | Puppeteer + Stealth | 3 pages per category/city | Extracts name, phone, address, website |
| Yelp | Puppeteer + anti-detection | Auto-skips after 5 blocks | Longer delays, stealth scrolling, retry logic |
| BBB | Puppeteer (React SPA) | Top 15 categories, 3 cities | Headless Chrome renders React app |
| Google Maps | Puppeteer | Top 20 categories, 5 cities | Aria-label extraction, /maps/search/ URLs |
Anti-detection features
- Puppeteer stealth plugin (navigator.webdriver=false)
- Fingerprint randomization
- Cookie acceptance simulation
- Human-like scrolling with random delays
- Random viewport sizing
- Exponential backoff on failures
- Optional proxy rotation
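The exponential backoff behavior can be sketched as follows. The base delay mirrors the DELAY_MS setting, but the doubling multiplier, 60-second cap, and 25% jitter are assumptions, not the engine's documented values.

```javascript
// Illustrative exponential backoff with jitter: delay doubles per attempt,
// capped at maxMs, with up to 25% random jitter to avoid lockstep retries.
function backoffDelay(attempt, baseMs = 3000, maxMs = 60000) {
  const exp = Math.min(baseMs * 2 ** attempt, maxMs);
  const jitter = Math.random() * 0.25 * exp;
  return exp + jitter;
}

// Retry wrapper sketch: retry a failing async operation with backoff.
async function withRetry(fn, retries = 4, baseMs = 3000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      await new Promise(r => setTimeout(r, backoffDelay(attempt, baseMs)));
    }
  }
}
```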
Deduplication
Entity resolution runs on every insert using multiple matching strategies:
- Dice coefficient fuzzy matching on business names
- Phone normalization — strips formatting, matches on digits
- Domain comparison — matches on root domain
- Global scope — dedup runs across all states and all sources, not just within a single source
Records from multiple sources are merged into a single business entry with combined source attribution.
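A minimal sketch of the three matching strategies above. The 0.85 name-similarity threshold and the record field names (`name`, `phone`, `website`) are assumptions; the engine's actual entity resolution may differ.

```javascript
// Dice coefficient on character bigrams (multiset overlap).
function diceCoefficient(a, b) {
  const bigrams = s => {
    const out = new Map();
    const t = s.toLowerCase().replace(/\s+/g, ' ').trim();
    for (let i = 0; i < t.length - 1; i++) {
      const bg = t.slice(i, i + 2);
      out.set(bg, (out.get(bg) || 0) + 1);
    }
    return out;
  };
  const A = bigrams(a), B = bigrams(b);
  let overlap = 0;
  for (const [bg, n] of A) overlap += Math.min(n, B.get(bg) || 0);
  const total = [...A.values()].reduce((s, n) => s + n, 0) +
                [...B.values()].reduce((s, n) => s + n, 0);
  return total === 0 ? 0 : (2 * overlap) / total;
}

// Phone normalization: keep digits only.
const normalizePhone = p => (p || '').replace(/\D/g, '');

// Root-domain comparison: hostname minus a leading "www.".
const rootDomain = url => {
  try { return new URL(url).hostname.replace(/^www\./, ''); }
  catch { return ''; }
};

// Illustrative combined check, applied on every insert.
function isDuplicate(a, b) {
  if (normalizePhone(a.phone) && normalizePhone(a.phone) === normalizePhone(b.phone)) return true;
  if (rootDomain(a.website) && rootDomain(a.website) === rootDomain(b.website)) return true;
  return diceCoefficient(a.name, b.name) > 0.85; // illustrative threshold
}
```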
Enrichment Pipeline
In v6.1, enrichment runs inline after each discovery chunk rather than as a separate phase.
For each discovered business with a website:
- Visit the website — crawl homepage + up to 15 subpages (contact, about, team pages prioritized)
- Extract emails — 9 methods applied in sequence
- Extract social media — Facebook, Instagram, LinkedIn, Twitter/X links
- WHOIS lookup — domain registration date, registrant info, business age scoring
- Email pattern inference — detect patterns from known contacts, generate for staff without emails
- MX verification — DNS lookup to validate mail servers exist for every email domain
For businesses without websites, the engine searches DuckDuckGo and Google as fallback.
9 Email Extraction Methods
Applied in sequence to every crawled page:
| # | Method | Description |
|---|---|---|
| 1 | JSON-LD / Schema.org | Parses structured data for contact information |
| 2 | Mailto links | Extracts from href="mailto:" patterns in HTML |
| 3 | Regex on page text | Comprehensive email pattern matching across full page |
| 4 | Meta tags | Checks meta author, contact, and other tags |
| 5 | VCard / hCard | Parses microformat contact cards |
| 6 | Staff directory parsing | Finds name + title + email from team/about pages |
| 7 | WHOIS/RDAP | Registrant contact data from domain records |
| 8 | Email pattern inference | Detects patterns from known contacts, generates for others |
| 9 | MX verification | DNS lookup validates mail servers exist for every domain |
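Methods 2 and 3 can be approximated like this. The engine's real patterns are described as more comprehensive, so treat these regexes as simplified stand-ins rather than the actual implementation.

```javascript
// Method 2 sketch: pull addresses out of href="mailto:..." links,
// dropping any ?subject= query string.
function emailsFromMailto(html) {
  const out = new Set();
  const re = /href=["']mailto:([^"'?]+)/gi;
  let m;
  while ((m = re.exec(html)) !== null) out.add(m[1].toLowerCase());
  return [...out];
}

// Method 3 sketch: regex over the full page text, deduplicated
// and lowercased.
function emailsFromText(text) {
  const re = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;
  return [...new Set((text.match(re) || []).map(e => e.toLowerCase()))];
}
```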
Email Pattern Inference
When the engine finds at least one verified email for a domain, it detects the naming pattern and generates likely emails for any staff members found without them.
Detected patterns
Inferred emails are marked with inferred confidence and still undergo MX verification before being included.
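A simplified sketch of pattern detection and generation, covering four common naming schemes. The engine's actual pattern set is not documented here, so both the scheme list and the function names are illustrative.

```javascript
// Given one verified email and its owner's name, guess the naming scheme.
function detectPattern(email, first, last) {
  const [local] = email.toLowerCase().split('@');
  const f = first.toLowerCase(), l = last.toLowerCase();
  if (local === `${f}.${l}`) return 'first.last';
  if (local === `${f}${l}`) return 'firstlast';
  if (local === `${f[0]}${l}`) return 'flast';
  if (local === f) return 'first';
  return null;
}

// Apply a detected scheme to a staff member who has no email yet.
function applyPattern(pattern, first, last, domain) {
  const f = first.toLowerCase(), l = last.toLowerCase();
  const local = {
    'first.last': `${f}.${l}`,
    firstlast: `${f}${l}`,
    flast: `${f[0]}${l}`,
    first: f,
  }[pattern];
  return local ? `${local}@${domain}` : null;
}
```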
MX Verification
Every extracted or inferred email is validated via DNS MX record lookup. The engine uses Node.js dns.resolveMx() to check that the email domain has active mail servers.
| Confidence | Meaning |
|---|---|
| verified | Email extracted directly from website + MX records valid |
| inferred | Email generated via pattern inference + MX records valid |
| whois | Email from WHOIS/RDAP registrant data |
| no_mx | Email found but domain has no MX records — may be invalid |
Social Media Extraction
Extracts social profiles from every crawled business website:
| Platform | Detection Method | Filters |
|---|---|---|
| Facebook | href scanning + HTML regex | Filters share/login/plugin URLs |
| Instagram | URL pattern matching | Filters explore/accounts/posts |
| LinkedIn | Company and personal profile URLs | Filters share/login pages |
| Twitter/X | twitter.com and x.com patterns | Filters share/intent links |
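The detection and filter rules above can be approximated with a small classifier. The exact regexes are assumptions based on the table, not the engine's actual patterns.

```javascript
// Classify a URL as a social profile, or null if it's a share/login/utility
// link that the filters above would reject.
function classifySocial(url) {
  const u = url.toLowerCase();
  if (/facebook\.com\//.test(u) && !/\/(sharer|share|login|plugins)\b/.test(u)) return 'facebook';
  if (/instagram\.com\//.test(u) && !/\/(explore|accounts|p)\//.test(u)) return 'instagram';
  if (/linkedin\.com\/(company|in)\//.test(u) && !/\/(share|login)\b/.test(u)) return 'linkedin';
  if (/(twitter|x)\.com\//.test(u) && !/\/(intent|share)\b/.test(u)) return 'twitter';
  return null;
}
```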
WHOIS Business Age Scoring
WHOIS/RDAP lookup reveals domain registration date. The engine calculates business age and applies labels:
| Label | Age | Use Case |
|---|---|---|
| NEW | < 1 year | Recently started businesses — may need marketing services |
| NEW | 1–2 years | Early stage businesses building their presence |
| Growing | 2–5 years | Established enough to invest in growth |
| Established | 5–10 years | Stable businesses with existing operations |
| Mature | 10+ years | Long-standing businesses |
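The label mapping can be sketched as follows. Boundary handling at exactly 2, 5, and 10 years is an assumption, and the two sub-2-year rows above are collapsed into a single NEW bucket here.

```javascript
// Map a domain registration date to the business-age label used
// in the table above.
function businessAgeLabel(registeredAt, now = new Date()) {
  const years = (now - new Date(registeredAt)) / (365.25 * 24 * 3600 * 1000);
  if (years < 2) return 'NEW';
  if (years < 5) return 'Growing';
  if (years < 10) return 'Established';
  return 'Mature';
}
```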
5 States (59 Cities)
| State | Cities |
|---|---|
| AZ (15) | Phoenix, Scottsdale, Tempe, Mesa, Chandler, Gilbert, Glendale, Peoria, Surprise, Tucson, Flagstaff, Yuma, Goodyear, Buckeye, Avondale |
| NV (10) | Las Vegas, Henderson, Reno, North Las Vegas, Sparks, Carson City, Mesquite, Boulder City, Elko, Fernley |
| OH (12) | Columbus, Cleveland, Cincinnati, Toledo, Akron, Dayton, Canton, Youngstown, Dublin, Westerville, Mason, Parma |
| ID (10) | Boise, Meridian, Nampa, Caldwell, Idaho Falls, Pocatello, Twin Falls, Coeur d'Alene, Lewiston, Eagle |
| WA (12) | Seattle, Spokane, Tacoma, Vancouver, Bellevue, Kent, Everett, Renton, Kirkland, Redmond, Olympia, Bellingham |
Use --state ALL for round-robin across all states, or --state AZ for a single state. Use node engine.js states to list all states and cities.
65 Categories
Local Services (30)
plumber, electrician, dentist, restaurant, auto repair, salon, law firm, accountant, real estate agent, roofing, hvac, cleaning service, landscaping, insurance agent, veterinarian, fitness, photography, marketing agency, construction, mechanic, chiropractor, bakery, florist, pet grooming, daycare, tutoring, printing, tailor, locksmith, moving company
Retail & Ecommerce (10)
boutique, jewelry store, furniture store, sporting goods, pet store, gift shop, wine shop, supplement store, thrift store, consignment shop
Food & Beverage (5)
coffee shop, brewery, catering, food truck, juice bar
Health & Wellness (6)
med spa, dermatologist, physical therapy, optometrist, mental health counselor, massage therapist
Professional Services (6)
financial advisor, mortgage broker, staffing agency, IT services, web design, commercial cleaning
Home Services (8)
garage door, pest control, fence company, pool service, solar installer, window cleaning, tree service, pressure washing
Use node engine.js cats to list all 65 categories. Filter with --categories "plumber,dentist,hvac".
18-Column Output
Each state gets its own tab in Google Sheets with these columns:
| Column | Description | Source |
|---|---|---|
| First Name | Contact first name | Website crawl |
| Last Name | Contact last name | Website crawl |
| Email | Email address | 9 extraction methods + MX verification |
| Title | Job title (Owner, Manager, etc.) | Staff card parsing |
| Company Name | Business name | Discovery sources |
| Location | City, State | Discovery sources |
| Website | Business website URL | Discovery + DuckDuckGo fallback |
| Phone | Phone number | Discovery sources |
| Facebook | Facebook page URL | Website crawl |
| Instagram | Instagram profile URL | Website crawl |
| LinkedIn | LinkedIn profile URL | Website crawl |
| Twitter/X | X profile URL | Website crawl |
| Source | Discovery source(s) | YP / Yelp / BBB / GMaps |
| Confidence | verified, inferred, whois, no_mx | Computed |
| Biz Age | NEW, Growing, Established, Mature | WHOIS/RDAP lookup |
| Year Founded | Domain registration year | WHOIS/RDAP lookup |
| Industry | Business category | Input parameter |
| Date | Discovery timestamp | Auto-generated |
Live Dashboard
The engine auto-creates a "Dashboard" tab in your Google Sheet, updated every 30 seconds with:
- Status — running, paused, completed
- Per-source discovery counts — Yellow Pages, Yelp, BBB, Google Maps
- Enrichment progress — websites found, emails extracted, social profiles
- Contact confidence breakdown — verified, inferred, whois, no_mx
- Social media counts — Facebook, Instagram, LinkedIn, Twitter/X
- Business age breakdown — NEW, Growing, Established, Mature
- Rows pushed — total rows written to Sheets
- Recent errors (last 20) — with timestamp, source, and message
- Per-state results — discovery counts broken down by state
Confidence Scoring
Each email is assigned a confidence level based on how it was obtained:
| Level | How Assigned |
|---|---|
| verified | Directly extracted from business website + MX records confirm working mail server |
| inferred | Generated via email pattern inference from known contacts + MX verified |
| whois | Extracted from WHOIS/RDAP domain registration data |
| no_mx | Email found but domain's DNS has no MX records — delivery uncertain |
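A sketch of how these levels could be assigned. The `source` values and the decision to let whois bypass the MX check (the table does not tie whois to MX validity) are assumptions.

```javascript
// Map an email's provenance + MX result to the confidence levels above.
function confidence({ source, mxValid }) {
  if (source === 'whois') return 'whois';   // WHOIS/RDAP registrant data
  if (!mxValid) return 'no_mx';             // domain has no mail servers
  return source === 'inferred' ? 'inferred' : 'verified';
}
```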
CLI Reference
```bash
# ── RUN ──
node engine.js start --state ALL                 # All states, round-robin
node engine.js start --state AZ                  # Single state
node engine.js start --state ALL --max 500       # 500/category cap
node engine.js start --state ALL --chunk 100     # Rotate every 100
node engine.js start --state OH --categories "plumber,dentist"
node engine.js start --state ALL --fresh         # Ignore saved progress

# ── JOB CONTROL ──
node engine.js pause      # Pause after current business
node engine.js resume     # Resume from pause
node engine.js stop       # Graceful stop + checkpoint
node engine.js status     # Show current state
node engine.js reset      # Clear all saved progress

# ── INFO ──
node engine.js states     # List all states + cities
node engine.js cats       # List all 65 categories
node engine.js            # Help
```
Flags
| Flag | Default | Description |
|---|---|---|
| --state | required | State abbreviation (AZ, NV, OH, ID, WA) or ALL |
| --max | 1000 | Maximum businesses per category across all states |
| --chunk | 250 | Rotate to next category after N new discoveries |
| --categories | all 65 | Comma-separated list of categories to run |
| --fresh | false | Ignore saved progress and start from scratch |
Environment Variables
```bash
# Google Sheets — REQUIRED
GOOGLE_SPREADSHEET_ID=your_spreadsheet_id_here
GOOGLE_CREDENTIALS_PATH=google-credentials.json

# Timing
DELAY_MS=3000             # Base delay between requests (ms)
MAX_PAGES_PER_SITE=15     # Max subpages to crawl per business website

# Discovery limits
MAX_PER_CATEGORY=1000     # Max businesses per category (65 categories)
CHUNK_SIZE=250            # Rotate to next category after N discoveries

# Proxy (optional — recommended for large runs)
# PROXY_URL=http://user:pass@host:port
```
Crash Recovery
In v6.1, the engine survives laptop sleep, terminal disconnects, and crashes.
- Signal handlers — SIGTERM, SIGINT, SIGHUP trigger emergency state save
- Per-category checkpoints — progress saved after each category chunk completes
- Global dedup on resume — rebuilds the seen-set from checkpoint data, zero duplicates
- Stats recomputation — dashboard numbers rebuild from actual data, not stale counters
State is saved to .discovery-state.json. To start fresh, use --fresh or run node engine.js reset.
Proxy Setup
For large discovery runs, a proxy is recommended to avoid IP-based rate limiting.
```bash
# In .env
PROXY_URL=http://username:password@proxy-host:port
```
The proxy is passed to Puppeteer via the --proxy-server flag, so all HTTP requests made through the browser use it. Axios requests (WHOIS, DuckDuckGo) also route through the proxy when configured.
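One plausible way to wire PROXY_URL into Puppeteer is sketched below; proxyLaunchConfig is a hypothetical helper, not the engine's code. Chromium's --proxy-server flag takes host:port only, with credentials supplied separately via page.authenticate().

```javascript
// Split a PROXY_URL into Chromium launch args plus the credentials
// object that page.authenticate() expects.
function proxyLaunchConfig(proxyUrl) {
  if (!proxyUrl) return { args: [], auth: null };
  const u = new URL(proxyUrl);
  return {
    args: [`--proxy-server=${u.hostname}:${u.port}`],
    auth: u.username ? { username: u.username, password: u.password } : null,
  };
}
```

Typical use would be passing `args` to `puppeteer.launch({ args })` and, when `auth` is non-null, calling `page.authenticate(auth)` before navigating.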
Built by John Williams — Senior Paid Media Specialist at Seer Interactive.
Part of the Google Ads Agent ecosystem. Open source on GitHub.