Discovery Engine v6.1 Docs
The Business Discovery Engine is an open-source Node.js tool that discovers businesses from 4 public sources, enriches them with emails, social profiles, and company intelligence, and pushes everything to Google Sheets in real time. v6.1 adds round-robin discovery, inline enrichment, and crash recovery.
Quick Start
1. Install
```bash
git clone https://github.com/itallstartedwithaidea/business-discovery-engine.git
cd business-discovery-engine
npm install
```
2. Configure
```bash
cp .env.example .env
# Edit .env — add GOOGLE_SPREADSHEET_ID and GOOGLE_CREDENTIALS_PATH
```
3. Run
```bash
# All 5 states, round-robin, 1000/category
node engine.js start --state ALL --fresh

# Background run (safe to close terminal)
nohup node engine.js start --state ALL --fresh > full-run.log 2>&1 &
tail -f full-run.log

# Single state
node engine.js start --state AZ --fresh

# Custom limits
node engine.js start --state ALL --max 500 --chunk 100 --fresh

# Specific categories
node engine.js start --state OH --categories "plumber,dentist"
```
Google Sheets Setup
The engine requires a Google Cloud service account with Sheets API access.
- Create a Google Cloud service account
- Enable the Google Sheets API in your project
- Download the JSON key file → save as google-credentials.json in the project root
- Create a new Google Sheet and copy the spreadsheet ID from the URL
- Share the Sheet with the service account email (Editor access)
- Add the spreadsheet ID and credentials path to your .env file
.gitignore excludes .env and google-credentials.json by default. Your service account key should never be in version control.
Architecture: Round-Robin + Inline Enrichment
In v6.1, the engine processes categories one at a time: it discovers a chunk of businesses, enriches them immediately, and pushes them to Sheets before rotating to the next category.
For each category (65 total):
1. DISCOVER 250 businesses (configurable --chunk)
— Shuffle: states × sources × cities
— YP in Phoenix → Yelp in Columbus → BBB in Boise → GMaps in Vegas
— Global dedup on every insert
2. FIND WEBSITES (DuckDuckGo + Google fallback)
3. ENRICH each business
— Visit website → crawl contact/about/team pages
— 9 email extraction methods
— Social media link extraction
— WHOIS → domain age → business age scoring
— Email pattern inference → MX verification
4. PUSH TO SHEETS (immediately)
— Batch of 10 rows per API call
— Each state gets its own tab
— Dashboard updates every 30 seconds
→ Save checkpoint → rotate to next category → repeat
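The rotation and batching logic above can be sketched roughly as follows. This is illustrative only: the category accounting, the `roundRobin`/`batches` helper names, and the batch size constant are assumptions, not the engine's actual internals.

```javascript
// Illustrative sketch of the v6.1 round-robin loop: hand out work for one
// category at a time, up to `chunkSize` discoveries per turn, until every
// category has reached `maxPerCategory`.
function* roundRobin(categories, chunkSize, maxPerCategory) {
  const remaining = new Map(categories.map(c => [c, maxPerCategory]));
  while (remaining.size > 0) {
    for (const [category, left] of [...remaining]) {
      const take = Math.min(chunkSize, left);
      yield { category, take };
      if (left - take <= 0) remaining.delete(category);
      else remaining.set(category, left - take);
    }
  }
}

// Split enriched rows into batches of 10 per Sheets API call,
// matching the "batch of 10 rows" push described above.
function batches(rows, size = 10) {
  const out = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}
```

With `--chunk 250` and `--max 1000`, each category would be visited four times before the run completes.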
Key differences from v5
- Inline enrichment — data flows to Sheets continuously, not after all discovery completes
- Round-robin — shuffles states and sources within each category for natural distribution
- 1,000 per category (not per state) — 65,000 max total across all states
- Chunk rotation — rotate to next category every 250 discoveries (configurable)
- Global dedup — zero duplicates across all states and sources
- --state ALL — a single command runs everything
- Per-category checkpoints — crash recovery picks up exactly where it left off
4 Discovery Sources
All sources use Puppeteer with stealth plugin for JavaScript rendering, anti-bot evasion, and human-like behavior.
| Source | Method | Coverage | Notes |
|---|---|---|---|
| Yellow Pages | Puppeteer + Stealth | 3 pages per category/city | Extracts name, phone, address, website |
| Yelp | Puppeteer + anti-detection | Auto-skips after 5 blocks | Longer delays, stealth scrolling, retry logic |
| BBB | Puppeteer (React SPA) | Top 15 categories, 3 cities | Headless Chrome renders React app |
| Google Maps | Puppeteer | Top 20 categories, 5 cities | Aria-label extraction, /maps/search/ URLs |
Anti-detection features
- Puppeteer stealth plugin (navigator.webdriver=false)
- Fingerprint randomization
- Cookie acceptance simulation
- Human-like scrolling with random delays
- Random viewport sizing
- Exponential backoff on failures
- Optional proxy rotation
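The exponential backoff behavior can be sketched as follows. The base delay mirrors the DELAY_MS setting, but the doubling multiplier, 60-second cap, and 25% jitter are assumptions, not the engine's documented values.

```javascript
// Illustrative exponential backoff with jitter: delay doubles per attempt,
// capped at maxMs, with up to 25% random jitter to avoid lockstep retries.
function backoffDelay(attempt, baseMs = 3000, maxMs = 60000) {
  const exp = Math.min(baseMs * 2 ** attempt, maxMs);
  const jitter = Math.random() * 0.25 * exp;
  return exp + jitter;
}

// Retry wrapper sketch: retry a failing async operation with backoff.
async function withRetry(fn, retries = 4, baseMs = 3000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      await new Promise(r => setTimeout(r, backoffDelay(attempt, baseMs)));
    }
  }
}
```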
Deduplication
Entity resolution runs on every insert using multiple matching strategies:
- Dice coefficient fuzzy matching on business names
- Phone normalization — strips formatting, matches on digits
- Domain comparison — matches on root domain
- Global scope — dedup runs across all states and all sources, not just within a single source
Records from multiple sources are merged into a single business entry with combined source attribution.
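A minimal sketch of the three matching strategies above. The 0.85 name-similarity threshold and the record field names (`name`, `phone`, `website`) are assumptions; the engine's actual entity resolution may differ.

```javascript
// Dice coefficient on character bigrams (multiset overlap).
function diceCoefficient(a, b) {
  const bigrams = s => {
    const out = new Map();
    const t = s.toLowerCase().replace(/\s+/g, ' ').trim();
    for (let i = 0; i < t.length - 1; i++) {
      const bg = t.slice(i, i + 2);
      out.set(bg, (out.get(bg) || 0) + 1);
    }
    return out;
  };
  const A = bigrams(a), B = bigrams(b);
  let overlap = 0;
  for (const [bg, n] of A) overlap += Math.min(n, B.get(bg) || 0);
  const total = [...A.values()].reduce((s, n) => s + n, 0) +
                [...B.values()].reduce((s, n) => s + n, 0);
  return total === 0 ? 0 : (2 * overlap) / total;
}

// Phone normalization: keep digits only.
const normalizePhone = p => (p || '').replace(/\D/g, '');

// Root-domain comparison: hostname minus a leading "www.".
const rootDomain = url => {
  try { return new URL(url).hostname.replace(/^www\./, ''); }
  catch { return ''; }
};

// Illustrative combined check, applied on every insert.
function isDuplicate(a, b) {
  if (normalizePhone(a.phone) && normalizePhone(a.phone) === normalizePhone(b.phone)) return true;
  if (rootDomain(a.website) && rootDomain(a.website) === rootDomain(b.website)) return true;
  return diceCoefficient(a.name, b.name) > 0.85; // illustrative threshold
}
```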
Enrichment Pipeline
In v6.1, enrichment runs inline after each discovery chunk rather than as a separate phase.
For each discovered business with a website:
- Visit the website — crawl homepage + up to 15 subpages (contact, about, team pages prioritized)
- Extract emails — 9 methods applied in sequence
- Extract social media — Facebook, Instagram, LinkedIn, Twitter/X links
- WHOIS lookup — domain registration date, registrant info, business age scoring
- Email pattern inference — detect patterns from known contacts, generate for staff without emails
- MX verification — DNS lookup to validate mail servers exist for every email domain
For businesses without websites, the engine searches DuckDuckGo and Google as fallback.
9 Email Extraction Methods
Applied in sequence to every crawled page:
| # | Method | Description |
|---|---|---|
| 1 | JSON-LD / Schema.org | Parses structured data for contact information |
| 2 | Mailto links | Extracts from href="mailto:" patterns in HTML |
| 3 | Regex on page text | Comprehensive email pattern matching across full page |
| 4 | Meta tags | Checks meta author, contact, and other tags |
| 5 | VCard / hCard | Parses microformat contact cards |
| 6 | Staff directory parsing | Finds name + title + email from team/about pages |
| 7 | WHOIS/RDAP | Registrant contact data from domain records |
| 8 | Email pattern inference | Detects patterns from known contacts, generates for others |
| 9 | MX verification | DNS lookup validates mail servers exist for every domain |
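Methods 2 and 3 can be approximated like this. The engine's real patterns are described as more comprehensive, so treat these regexes as simplified stand-ins rather than the actual implementation.

```javascript
// Method 2 sketch: pull addresses out of href="mailto:..." links,
// dropping any ?subject= query string.
function emailsFromMailto(html) {
  const out = new Set();
  const re = /href=["']mailto:([^"'?]+)/gi;
  let m;
  while ((m = re.exec(html)) !== null) out.add(m[1].toLowerCase());
  return [...out];
}

// Method 3 sketch: regex over the full page text, deduplicated
// and lowercased.
function emailsFromText(text) {
  const re = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;
  return [...new Set((text.match(re) || []).map(e => e.toLowerCase()))];
}
```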
Email Pattern Inference
When the engine finds at least one verified email for a domain, it detects the naming pattern and generates likely emails for any staff members found without them.
Detected patterns
Inferred emails are marked with inferred confidence and still undergo MX verification before being included.
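A simplified sketch of pattern detection and generation, covering four common naming schemes. The engine's actual pattern set is not documented here, so both the scheme list and the function names are illustrative.

```javascript
// Given one verified email and its owner's name, guess the naming scheme.
function detectPattern(email, first, last) {
  const [local] = email.toLowerCase().split('@');
  const f = first.toLowerCase(), l = last.toLowerCase();
  if (local === `${f}.${l}`) return 'first.last';
  if (local === `${f}${l}`) return 'firstlast';
  if (local === `${f[0]}${l}`) return 'flast';
  if (local === f) return 'first';
  return null;
}

// Apply a detected scheme to a staff member who has no email yet.
function applyPattern(pattern, first, last, domain) {
  const f = first.toLowerCase(), l = last.toLowerCase();
  const local = {
    'first.last': `${f}.${l}`,
    firstlast: `${f}${l}`,
    flast: `${f[0]}${l}`,
    first: f,
  }[pattern];
  return local ? `${local}@${domain}` : null;
}
```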
MX Verification
Every extracted or inferred email is validated via DNS MX record lookup. The engine uses Node.js dns.resolveMx() to check that the email domain has active mail servers.
| Confidence | Meaning |
|---|---|
| verified | Email extracted directly from website + MX records valid |
| inferred | Email generated via pattern inference + MX records valid |
| whois | Email from WHOIS/RDAP registrant data |
| no_mx | Email found but domain has no MX records — may be invalid |
Social Media Extraction
Extracts social profiles from every crawled business website:
| Platform | Detection Method | Filters |
|---|---|---|
| Facebook | href scanning + HTML regex | Filters share/login/plugin URLs |
| Instagram | URL pattern matching | Filters explore/accounts/posts |
| LinkedIn | Company and personal profile URLs | Filters share/login pages |
| Twitter/X | twitter.com and x.com patterns | Filters share/intent links |
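The detection and filter rules above can be approximated with a small classifier. The exact regexes are assumptions based on the table, not the engine's actual patterns.

```javascript
// Classify a URL as a social profile, or null if it's a share/login/utility
// link that the filters above would reject.
function classifySocial(url) {
  const u = url.toLowerCase();
  if (/facebook\.com\//.test(u) && !/\/(sharer|share|login|plugins)\b/.test(u)) return 'facebook';
  if (/instagram\.com\//.test(u) && !/\/(explore|accounts|p)\//.test(u)) return 'instagram';
  if (/linkedin\.com\/(company|in)\//.test(u) && !/\/(share|login)\b/.test(u)) return 'linkedin';
  if (/(twitter|x)\.com\//.test(u) && !/\/(intent|share)\b/.test(u)) return 'twitter';
  return null;
}
```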
WHOIS Business Age Scoring
WHOIS/RDAP lookup reveals domain registration date. The engine calculates business age and applies labels:
| Label | Age | Use Case |
|---|---|---|
| NEW | < 1 year | Recently started businesses — may need marketing services |
| NEW | 1–2 years | Early stage businesses building their presence |
| Growing | 2–5 years | Established enough to invest in growth |
| Established | 5–10 years | Stable businesses with existing operations |
| Mature | 10+ years | Long-standing businesses |
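The label mapping can be sketched as follows. Boundary handling at exactly 2, 5, and 10 years is an assumption, and the two sub-2-year rows above are collapsed into a single NEW bucket here.

```javascript
// Map a domain registration date to the business-age label used
// in the table above.
function businessAgeLabel(registeredAt, now = new Date()) {
  const years = (now - new Date(registeredAt)) / (365.25 * 24 * 3600 * 1000);
  if (years < 2) return 'NEW';
  if (years < 5) return 'Growing';
  if (years < 10) return 'Established';
  return 'Mature';
}
```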
5 States (59 Cities)
| State | Cities |
|---|---|
| AZ (15) | Phoenix, Scottsdale, Tempe, Mesa, Chandler, Gilbert, Glendale, Peoria, Surprise, Tucson, Flagstaff, Yuma, Goodyear, Buckeye, Avondale |
| NV (10) | Las Vegas, Henderson, Reno, North Las Vegas, Sparks, Carson City, Mesquite, Boulder City, Elko, Fernley |
| OH (12) | Columbus, Cleveland, Cincinnati, Toledo, Akron, Dayton, Canton, Youngstown, Dublin, Westerville, Mason, Parma |
| ID (10) | Boise, Meridian, Nampa, Caldwell, Idaho Falls, Pocatello, Twin Falls, Coeur d'Alene, Lewiston, Eagle |
| WA (12) | Seattle, Spokane, Tacoma, Vancouver, Bellevue, Kent, Everett, Renton, Kirkland, Redmond, Olympia, Bellingham |
Use --state ALL for round-robin across all states, or --state AZ for a single state. Use node engine.js states to list all states and cities.
65 Categories
Local Services (30)
plumber, electrician, dentist, restaurant, auto repair, salon, law firm, accountant, real estate agent, roofing, hvac, cleaning service, landscaping, insurance agent, veterinarian, fitness, photography, marketing agency, construction, mechanic, chiropractor, bakery, florist, pet grooming, daycare, tutoring, printing, tailor, locksmith, moving company
Retail & Ecommerce (10)
boutique, jewelry store, furniture store, sporting goods, pet store, gift shop, wine shop, supplement store, thrift store, consignment shop
Food & Beverage (5)
coffee shop, brewery, catering, food truck, juice bar
Health & Wellness (6)
med spa, dermatologist, physical therapy, optometrist, mental health counselor, massage therapist
Professional Services (6)
financial advisor, mortgage broker, staffing agency, IT services, web design, commercial cleaning
Home Services (8)
garage door, pest control, fence company, pool service, solar installer, window cleaning, tree service, pressure washing
Use node engine.js cats to list all 65 categories. Filter with --categories "plumber,dentist,hvac".
18-Column Output
Each state gets its own tab in Google Sheets with these columns:
| Column | Description | Source |
|---|---|---|
| First Name | Contact first name | Website crawl |
| Last Name | Contact last name | Website crawl |
| Email | Email address | 9 extraction methods + MX verification |
| Title | Job title (Owner, Manager, etc.) | Staff card parsing |
| Company Name | Business name | Discovery sources |
| Location | City, State | Discovery sources |
| Website | Business website URL | Discovery + DuckDuckGo fallback |
| Phone | Phone number | Discovery sources |
| Facebook | Facebook page URL | Website crawl |
| Instagram | Instagram profile URL | Website crawl |
| LinkedIn | LinkedIn profile URL | Website crawl |
| Twitter/X | X profile URL | Website crawl |
| Source | Discovery source(s) | YP / Yelp / BBB / GMaps |
| Confidence | verified, inferred, whois, no_mx | Computed |
| Biz Age | NEW, Growing, Established, Mature | WHOIS/RDAP lookup |
| Year Founded | Domain registration year | WHOIS/RDAP lookup |
| Industry | Business category | Input parameter |
| Date | Discovery timestamp | Auto-generated |
Live Dashboard
The engine auto-creates a "Dashboard" tab in your Google Sheet, updated every 30 seconds with:
- Status — running, paused, completed
- Per-source discovery counts — Yellow Pages, Yelp, BBB, Google Maps
- Enrichment progress — websites found, emails extracted, social profiles
- Contact confidence breakdown — verified, inferred, whois, no_mx
- Social media counts — Facebook, Instagram, LinkedIn, Twitter/X
- Business age breakdown — NEW, Growing, Established, Mature
- Rows pushed — total rows written to Sheets
- Recent errors (last 20) — with timestamp, source, and message
- Per-state results — discovery counts broken down by state
Confidence Scoring
Each email is assigned a confidence level based on how it was obtained:
| Level | How Assigned |
|---|---|
| verified | Directly extracted from business website + MX records confirm working mail server |
| inferred | Generated via email pattern inference from known contacts + MX verified |
| whois | Extracted from WHOIS/RDAP domain registration data |
| no_mx | Email found but domain's DNS has no MX records — delivery uncertain |
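A sketch of how these levels could be assigned. The `source` values and the decision to let whois bypass the MX check (the table does not tie whois to MX validity) are assumptions.

```javascript
// Map an email's provenance + MX result to the confidence levels above.
function confidence({ source, mxValid }) {
  if (source === 'whois') return 'whois';   // WHOIS/RDAP registrant data
  if (!mxValid) return 'no_mx';             // domain has no mail servers
  return source === 'inferred' ? 'inferred' : 'verified';
}
```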
CLI Reference
```bash
# ── RUN ──
node engine.js start --state ALL                 # All states, round-robin
node engine.js start --state AZ                  # Single state
node engine.js start --state ALL --max 500       # 500/category cap
node engine.js start --state ALL --chunk 100     # Rotate every 100
node engine.js start --state OH --categories "plumber,dentist"
node engine.js start --state ALL --fresh         # Ignore saved progress

# ── JOB CONTROL ──
node engine.js pause      # Pause after current business
node engine.js resume     # Resume from pause
node engine.js stop       # Graceful stop + checkpoint
node engine.js status     # Show current state
node engine.js reset      # Clear all saved progress

# ── INFO ──
node engine.js states     # List all states + cities
node engine.js cats       # List all 65 categories
node engine.js            # Help
```
Flags
| Flag | Default | Description |
|---|---|---|
| --state | required | State abbreviation (AZ, NV, OH, ID, WA) or ALL |
| --max | 1000 | Maximum businesses per category across all states |
| --chunk | 250 | Rotate to next category after N new discoveries |
| --categories | all 65 | Comma-separated list of categories to run |
| --fresh | false | Ignore saved progress and start from scratch |
Environment Variables
```bash
# Google Sheets — REQUIRED
GOOGLE_SPREADSHEET_ID=your_spreadsheet_id_here
GOOGLE_CREDENTIALS_PATH=google-credentials.json

# Timing
DELAY_MS=3000             # Base delay between requests (ms)
MAX_PAGES_PER_SITE=15     # Max subpages to crawl per business website

# Discovery limits
MAX_PER_CATEGORY=1000     # Max businesses per category (65 categories)
CHUNK_SIZE=250            # Rotate to next category after N discoveries

# Proxy (optional — recommended for large runs)
# PROXY_URL=http://user:pass@host:port
```
Crash Recovery
In v6.1, the engine survives laptop sleep, terminal disconnects, and crashes.
- Signal handlers — SIGTERM, SIGINT, SIGHUP trigger emergency state save
- Per-category checkpoints — progress saved after each category chunk completes
- Global dedup on resume — rebuilds the seen-set from checkpoint data, zero duplicates
- Stats recomputation — dashboard numbers rebuild from actual data, not stale counters
State is saved to .discovery-state.json. To start fresh, use --fresh or run node engine.js reset.
Proxy Setup
For large discovery runs, a proxy is recommended to avoid IP-based rate limiting.
```bash
# In .env
PROXY_URL=http://username:password@proxy-host:port
```
The proxy is passed to Puppeteer via the --proxy-server flag, so all HTTP requests made through the browser use it. Axios requests (WHOIS, DuckDuckGo) also route through the proxy when configured.
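One plausible way to wire PROXY_URL into Puppeteer is sketched below; proxyLaunchConfig is a hypothetical helper, not the engine's code. Chromium's --proxy-server flag takes host:port only, with credentials supplied separately via page.authenticate().

```javascript
// Split a PROXY_URL into Chromium launch args plus the credentials
// object that page.authenticate() expects.
function proxyLaunchConfig(proxyUrl) {
  if (!proxyUrl) return { args: [], auth: null };
  const u = new URL(proxyUrl);
  return {
    args: [`--proxy-server=${u.hostname}:${u.port}`],
    auth: u.username ? { username: u.username, password: u.password } : null,
  };
}
```

Typical use would be passing `args` to `puppeteer.launch({ args })` and, when `auth` is non-null, calling `page.authenticate(auth)` before navigating.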
Built by John Williams — Senior Paid Media Specialist at Seer Interactive.
Part of the Google Ads Agent ecosystem. Open source on GitHub.