Discovery Engine v6.1 Docs

The Business Discovery Engine is an open-source Node.js tool that discovers businesses from 4 public sources, enriches them with emails, social profiles, and company intelligence, and pushes everything to Google Sheets in real time. v6.1 adds round-robin discovery, inline enrichment, and crash recovery.

v6.1 key change: Data flows into your Sheet continuously as each chunk completes. No more waiting hours for all discovery to finish first. Discover → Enrich → Push → Rotate.

Quick Start

1. Install

git clone https://github.com/itallstartedwithaidea/business-discovery-engine.git
cd business-discovery-engine
npm install

2. Configure

cp .env.example .env
# Edit .env — add GOOGLE_SPREADSHEET_ID and GOOGLE_CREDENTIALS_PATH

3. Run

# All 5 states, round-robin, 1000/category
node engine.js start --state ALL --fresh

# Background run (safe to close terminal)
nohup node engine.js start --state ALL --fresh > full-run.log 2>&1 &
tail -f full-run.log

# Single state
node engine.js start --state AZ --fresh

# Custom limits
node engine.js start --state ALL --max 500 --chunk 100 --fresh

# Specific categories
node engine.js start --state OH --categories "plumber,dentist"

Google Sheets Setup

The engine requires a Google Cloud service account with Sheets API access.

  1. Create a Google Cloud service account
  2. Enable the Google Sheets API in your project
  3. Download the JSON key file → save as google-credentials.json in the project root
  4. Create a new Google Sheet and copy the spreadsheet ID from the URL
  5. Share the Sheet with the service account email (Editor access)
  6. Add the spreadsheet ID and credentials path to your .env file

Never commit credentials. The .gitignore excludes .env and google-credentials.json by default, so your service account key stays out of version control.
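
Steps 3 to 6 are easy to get subtly wrong (a truncated ID, the wrong JSON file), so a quick pre-flight check can help. The sketch below is not part of engine.js; the helper names are hypothetical. It relies only on the standard Sheets URL shape (/spreadsheets/d/<ID>/) and on the type, client_email, and private_key fields that a service account key file contains.

```javascript
// Sketch only: these helpers are hypothetical and not part of engine.js.
const fs = require('node:fs');

// Pull the spreadsheet ID out of a Google Sheets URL such as
// https://docs.google.com/spreadsheets/d/<ID>/edit
function spreadsheetIdFromUrl(url) {
  const match = /\/spreadsheets\/d\/([A-Za-z0-9_-]+)/.exec(url);
  return match ? match[1] : null;
}

// List the expected fields that are missing from a parsed key file.
function missingKeyFields(key) {
  const required = ['type', 'client_email', 'private_key'];
  return required.filter(
    (field) => !key || typeof key[field] !== 'string' || key[field] === ''
  );
}

// Read the key file and return the email address to share the Sheet with.
function serviceAccountEmail(credentialsPath) {
  const key = JSON.parse(fs.readFileSync(credentialsPath, 'utf8'));
  const missing = missingKeyFields(key);
  if (missing.length > 0) {
    throw new Error(`Key file is missing: ${missing.join(', ')}`);
  }
  return key.client_email;
}
```

Calling serviceAccountEmail('google-credentials.json') returns the address that needs Editor access in step 5.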

Architecture: Round-Robin + Inline Enrichment

New in v6.1: the engine processes categories one at a time, discovering a chunk of businesses, enriching them immediately, and pushing to Sheets before rotating to the next category.

For each category (65 total):

  1. DISCOVER 250 businesses (configurable --chunk)
     — Shuffle: states × sources × cities
     — YP in Phoenix → Yelp in Columbus → BBB in Boise → GMaps in Vegas
     — Global dedup on every insert

  2. FIND WEBSITES (DuckDuckGo + Google fallback)

  3. ENRICH each business
     — Visit website → crawl contact/about/team pages
     — 9 email extraction methods
     — Social media link extraction
     — WHOIS → domain age → business age scoring
     — Email pattern inference → MX verification

  4. PUSH TO SHEETS (immediately)
     — Batch of 10 rows per API call
     — Each state gets its own tab
     — Dashboard updates every 30 seconds

  → Save checkpoint → rotate to next category → repeat
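
The rotation above can be sketched as a small scheduler. This is an illustration under assumptions, not the engine's code: runRoundRobin and the four step callbacks are hypothetical names, and the callbacks are synchronous here only to keep the example short (real discovery, enrichment, and Sheets calls would be asynchronous).

```javascript
// Sketch only: runRoundRobin and the step callbacks are hypothetical names.
function runRoundRobin(categories, { chunk = 250, max = 1000 } = {}, steps) {
  const found = new Map(categories.map((c) => [c, 0]));
  const exhausted = new Set(); // categories that returned nothing new
  let active = categories.slice();

  while (active.length > 0) {
    for (const category of active) {
      // Never ask for more than the remaining per-category budget.
      const want = Math.min(chunk, max - found.get(category));
      const businesses = steps.discover(category, want);
      if (businesses.length === 0) {
        exhausted.add(category);
        continue;
      }
      const enriched = steps.enrich(businesses); // inline, not a later phase
      steps.push(enriched); // rows reach the Sheet before rotating
      found.set(category, found.get(category) + businesses.length);
      steps.checkpoint(Object.fromEntries(found));
    }
    // Drop categories that hit the cap or ran dry, then go around again.
    active = active.filter((c) => !exhausted.has(c) && found.get(c) < max);
  }
  return Object.fromEntries(found);
}
```

With chunk 2 and max 4, two categories are visited alternately (plumber, dentist, plumber, ...) until each reaches the cap or its sources return nothing new.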

Key differences from v5

4 Discovery Sources

All sources use Puppeteer with stealth plugin for JavaScript rendering, anti-bot evasion, and human-like behavior.

| Source | Method | Coverage | Notes |
| --- | --- | --- | --- |
| Yellow Pages | Puppeteer + Stealth | 3 pages per category/city | Extracts name, phone, address, website |
| Yelp | Puppeteer + anti-detection | Auto-skips after 5 blocks | Longer delays, stealth scrolling, retry logic |
| BBB | Puppeteer (React SPA) | Top 15 categories, 3 cities | Headless Chrome renders React app |
| Google Maps | Puppeteer | Top 20 categories, 5 cities | Aria-label extraction, /maps/search/ URLs |

Anti-detection features

Deduplication

Entity resolution runs on every insert using multiple matching strategies:

Records from multiple sources are merged into a single business entry with combined source attribution.
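
To make merge-on-insert concrete, here is a minimal sketch. The matching keys it uses (normalized phone number, and normalized name plus city) are assumptions chosen for illustration, not a statement of the engine's actual strategies; BusinessIndex and the field names are hypothetical as well.

```javascript
// Sketch only: the match keys and record fields are illustrative assumptions.
function normalizePhone(phone) {
  const digits = String(phone || '').replace(/\D/g, '');
  // Drop a leading US country code so "+1 602..." matches "(602) ...".
  return digits.length === 11 && digits.startsWith('1') ? digits.slice(1) : digits;
}

function matchKeys(biz) {
  const keys = [];
  const phone = normalizePhone(biz.phone);
  if (phone.length === 10) keys.push(`phone:${phone}`);
  const name = String(biz.name || '').toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
  const city = String(biz.city || '').toLowerCase().trim();
  if (name && city) keys.push(`name:${name}|${city}`);
  return keys;
}

class BusinessIndex {
  constructor() {
    this.byKey = new Map(); // match key -> merged record
    this.records = [];
  }

  // Insert a record, merging into an existing entry when any key matches.
  insert(biz) {
    const { source, ...fields } = biz;
    const keys = matchKeys(biz);
    const hitKey = keys.find((k) => this.byKey.has(k));
    let record = hitKey ? this.byKey.get(hitKey) : null;
    if (!record) {
      record = { sources: [] };
      this.records.push(record);
    }
    // Keep existing values; only fill fields that are still blank.
    for (const [field, value] of Object.entries(fields)) {
      if (!record[field] && value) record[field] = value;
    }
    // Combined source attribution, without duplicates.
    if (source && !record.sources.includes(source)) record.sources.push(source);
    for (const k of keys) this.byKey.set(k, record);
    return record;
  }
}
```

Inserting the same plumber once from Yellow Pages and once from Yelp leaves one record whose sources list holds both, with the website filled in from whichever listing had it.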

Enrichment Pipeline

New in v6.1: enrichment runs inline after each discovery chunk, not as a separate phase.

For each discovered business with a website:

  1. Visit the website — crawl homepage + up to 15 subpages (contact, about, team pages prioritized)
  2. Extract emails — 9 methods applied in sequence
  3. Extract social media — Facebook, Instagram, LinkedIn, Twitter/X links
  4. WHOIS lookup — domain registration date, registrant info, business age scoring
  5. Email pattern inference — detect patterns from known contacts, generate for staff without emails
  6. MX verification — DNS lookup to validate mail servers exist for every email domain
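
Step 1 depends on choosing which subpages to visit. One plausible way to prioritize contact, about, and team pages under the 15-page cap is sketched below; pickSubpages and the hint list are hypothetical, and the engine's real selection logic may differ.

```javascript
// Sketch only: pickSubpages and PRIORITY_HINTS are hypothetical.
const PRIORITY_HINTS = ['contact', 'about', 'team'];

function pickSubpages(homepageUrl, hrefs, maxPages = 15) {
  const home = new URL(homepageUrl);
  const seen = new Set([home.pathname]); // the homepage is crawled anyway
  const candidates = [];
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, home); // resolves relative links
    } catch {
      continue; // skip malformed hrefs
    }
    if (!/^https?:$/.test(url.protocol)) continue; // drop mailto:, tel:, etc.
    if (url.hostname !== home.hostname || seen.has(url.pathname)) continue;
    seen.add(url.pathname);
    url.hash = '';
    candidates.push(url);
  }
  const score = (url) =>
    PRIORITY_HINTS.some((hint) => url.pathname.toLowerCase().includes(hint)) ? 0 : 1;
  // Array.prototype.sort is stable, so ties keep their page order.
  return candidates
    .sort((a, b) => score(a) - score(b))
    .slice(0, maxPages)
    .map((url) => url.href);
}
```

The default of 15 mirrors MAX_PAGES_PER_SITE; off-site links and duplicates are discarded before the cap is applied.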

For businesses with no website in their listing, the engine searches DuckDuckGo first and falls back to Google to find one.

9 Email Extraction Methods

Applied in sequence to every crawled page:

| # | Method | Description |
| --- | --- | --- |
| 1 | JSON-LD / Schema.org | Parses structured data for contact information |
| 2 | Mailto links | Extracts from href="mailto:" patterns in HTML |
| 3 | Regex on page text | Comprehensive email pattern matching across full page |
| 4 | Meta tags | Checks meta author, contact, and other tags |
| 5 | VCard / hCard | Parses microformat contact cards |
| 6 | Staff directory parsing | Finds name + title + email from team/about pages |
| 7 | WHOIS/RDAP | Registrant contact data from domain records |
| 8 | Email pattern inference | Detects patterns from known contacts, generates for others |
| 9 | MX verification | DNS lookup validates mail servers exist for every domain |
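
As a rough illustration of applying methods in sequence, the sketch below runs three of the nine (JSON-LD, mailto links, regex on text) over a raw HTML string and records which method found each address first. The function names and the regular expressions are assumptions; the engine works on pages rendered in Puppeteer and its patterns are likely more thorough.

```javascript
// Sketch only: three of the nine methods, on a raw HTML string.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function fromJsonLd(html) {
  const emails = [];
  const scriptRe = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const [, body] of html.matchAll(scriptRe)) {
    let data;
    try {
      data = JSON.parse(body);
    } catch {
      continue; // malformed JSON-LD is skipped
    }
    // Walk the whole structure and collect every "email" property.
    const stack = [data];
    while (stack.length > 0) {
      const node = stack.pop();
      if (!node || typeof node !== 'object') continue;
      for (const [key, value] of Object.entries(node)) {
        if (key === 'email' && typeof value === 'string') {
          emails.push(value.replace(/^mailto:/i, ''));
        } else {
          stack.push(value);
        }
      }
    }
  }
  return emails;
}

function fromMailto(html) {
  // Stop at "?" so query parts like ?subject=... are not captured.
  return [...html.matchAll(/href=["']mailto:([^"'?]+)/gi)].map((m) => m[1].trim());
}

function fromText(html) {
  const text = html.replace(/<script[\s\S]*?<\/script>/gi, ' ').replace(/<[^>]+>/g, ' ');
  return text.match(EMAIL_RE) || [];
}

// Returns Map<lowercased email, name of the first method that found it>.
function extractEmails(html) {
  const found = new Map();
  const methods = [['json-ld', fromJsonLd], ['mailto', fromMailto], ['regex', fromText]];
  for (const [name, method] of methods) {
    for (const email of method(html)) {
      const key = email.toLowerCase();
      if (!found.has(key)) found.set(key, name);
    }
  }
  return found;
}
```

Running structured-data methods before the broad regex means an address keeps the label of the most specific method that matched it.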

Email Pattern Inference

When the engine finds at least one verified email for a domain, it detects the naming pattern and generates likely emails for any staff members found without them.

Detected patterns

Inferred emails are marked with the inferred confidence level and still undergo MX verification before being included.
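
A minimal sketch of the idea follows. The four patterns shown (first.last, firstlast, flast, first) are common conventions picked for illustration and should not be read as the engine's actual pattern list; detectPattern and inferEmail are hypothetical names.

```javascript
// Sketch only: the pattern set is an assumption, not the engine's list.
const PATTERNS = {
  'first.last': (f, l) => `${f}.${l}`,
  firstlast: (f, l) => `${f}${l}`,
  flast: (f, l) => `${f[0]}${l}`,
  first: (f) => f,
};

// Lowercase and strip everything except letters (apostrophes, spaces, ...).
const clean = (s) => String(s || '').toLowerCase().replace(/[^a-z]/g, '');

// Return the name of the pattern that reproduces a known contact's address.
function detectPattern(known) {
  const [local] = known.email.toLowerCase().split('@');
  const f = clean(known.firstName);
  const l = clean(known.lastName);
  if (!f || !l) return null;
  const hit = Object.entries(PATTERNS).find(([, build]) => build(f, l) === local);
  return hit ? hit[0] : null;
}

// Generate a candidate address for a staff member who has no email yet.
function inferEmail(pattern, person, domain) {
  const f = clean(person.firstName);
  const l = clean(person.lastName);
  if (!PATTERNS[pattern] || !f || !l) return null;
  return { email: `${PATTERNS[pattern](f, l)}@${domain}`, confidence: 'inferred' };
}
```

Given aruiz@acme.example for Ana Ruiz, the detected pattern is flast, so Ben O'Neil becomes boneil@acme.example. A generic address such as info@ matches no pattern and yields null, and any generated address still has to pass the MX check.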

MX Verification

Every extracted or inferred email is validated via DNS MX record lookup. The engine uses Node.js dns.resolveMx() to check that the email domain has active mail servers.

| Confidence | Meaning |
| --- | --- |
| verified | Email extracted directly from website + MX records valid |
| inferred | Email generated via pattern inference + MX records valid |
| whois | Email from WHOIS/RDAP registrant data |
| no_mx | Email found but domain has no MX records; may be invalid |
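
The lookup and the labels above might be combined roughly as follows. dns.promises.resolveMx() is the standard Node.js API; the helper names, the per-domain cache, and the choice to label a WHOIS email on a domain without MX records as no_mx are assumptions of this sketch.

```javascript
// Sketch only: helper names and the caching strategy are assumptions.
const dns = require('node:dns').promises;

const mxCache = new Map(); // domain -> Promise<boolean>

// True when the domain publishes at least one MX record.
// Lookup errors (ENOTFOUND, ENODATA, timeouts) count as "no MX" here.
function hasMx(domain) {
  const key = domain.toLowerCase();
  if (!mxCache.has(key)) {
    const lookup = dns.resolveMx(key).then(
      (records) => records.length > 0,
      () => false
    );
    mxCache.set(key, lookup);
  }
  return mxCache.get(key);
}

// Map how an email was obtained, plus the MX result, to a confidence label.
function confidenceFor(origin, mxValid) {
  const labels = { website: 'verified', inferred: 'inferred', whois: 'whois' };
  if (!(origin in labels)) throw new Error(`Unknown origin: ${origin}`);
  return mxValid ? labels[origin] : 'no_mx';
}

async function labelEmail(email, origin) {
  const domain = email.slice(email.lastIndexOf('@') + 1);
  return { email, confidence: confidenceFor(origin, await hasMx(domain)) };
}
```

Caching by domain keeps a business with many staff addresses down to one DNS query. Because transient DNS failures also end up as no_mx in this sketch, a retry may be worth adding for large runs.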

Social Media Extraction

Extracts social profiles from every crawled business website:

| Platform | Detection Method | Filters |
| --- | --- | --- |
| Facebook | href scanning + HTML regex | Filters share/login/plugin URLs |
| Instagram | URL pattern matching | Filters explore/accounts/posts |
| LinkedIn | Company and personal profile URLs | Filters share/login pages |
| Twitter/X | twitter.com and x.com patterns | Filters share/intent links |
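
The filtering matters because most pages link to share and login endpoints as well as real profiles. Below is a simplified sketch of href scanning with per-platform skip lists; the skip lists are illustrative guesses based on the Filters column, and extractSocialLinks is a hypothetical name.

```javascript
// Sketch only: host and skip lists are illustrative, not the engine's.
const PLATFORMS = {
  facebook: { hosts: ['facebook.com'], skip: ['sharer', 'sharer.php', 'share', 'login', 'plugins', 'dialog'] },
  instagram: { hosts: ['instagram.com'], skip: ['explore', 'accounts', 'p'] },
  linkedin: { hosts: ['linkedin.com'], skip: ['share', 'sharing', 'shareArticle', 'login'] },
  twitter: { hosts: ['twitter.com', 'x.com'], skip: ['share', 'intent'] },
};

// Return the first profile-looking link per platform found in the HTML.
function extractSocialLinks(html) {
  const result = {};
  for (const [, href] of html.matchAll(/href=["'](https?:\/\/[^"']+)["']/gi)) {
    let url;
    try {
      url = new URL(href);
    } catch {
      continue; // skip malformed URLs
    }
    const host = url.hostname.replace(/^(www|m)\./, '');
    const firstSegment = url.pathname.split('/').filter(Boolean)[0];
    if (!firstSegment) continue; // bare domain, not a profile
    for (const [platform, { hosts, skip }] of Object.entries(PLATFORMS)) {
      if (result[platform] || !hosts.includes(host)) continue;
      if (skip.includes(firstSegment)) continue; // share/login/post URLs
      // Drop query string and trailing slash for a stable profile URL.
      result[platform] = `https://${url.hostname}${url.pathname}`.replace(/\/$/, '');
    }
  }
  return result;
}
```

A page that links to facebook.com/sharer/..., twitter.com/intent/tweet, and an Instagram post yields nothing for those URLs, while facebook.com/AcmePlumbing/ and x.com/acmeplumbing are kept.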

WHOIS Business Age Scoring

WHOIS/RDAP lookup reveals domain registration date. The engine calculates business age and applies labels:

| Label | Age | Use Case |
| --- | --- | --- |
| NEW | < 1 year | Recently started businesses; may need marketing services |
| NEW | 1–2 years | Early stage businesses building their presence |
| Growing | 2–5 years | Established enough to invest in growth |
| Established | 5–10 years | Stable businesses with existing operations |
| Mature | 10+ years | Long-standing businesses |
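
The label thresholds translate directly into code. The sketch below assumes each range includes its lower bound (exactly 2 years counts as Growing), which the table leaves open; bizAge is a hypothetical helper.

```javascript
// Sketch only: bizAge is hypothetical; lower bounds are assumed inclusive.
function bizAge(registeredAt, now = new Date()) {
  const registered = new Date(registeredAt);
  const msPerYear = 365.25 * 24 * 60 * 60 * 1000;
  const years = (now - registered) / msPerYear;
  if (Number.isNaN(years) || years < 0) return null; // bad or future date

  let label;
  if (years < 2) label = 'NEW'; // both "< 1 year" and "1–2 years" rows
  else if (years < 5) label = 'Growing';
  else if (years < 10) label = 'Established';
  else label = 'Mature';

  return {
    label, // "Biz Age" column
    years: Math.floor(years),
    yearFounded: registered.getUTCFullYear(), // "Year Founded" column
  };
}
```

Note that this measures the age of the domain registration, so a long-running business that recently changed domains would be labeled NEW.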

5 States (59 Cities)

| State | Cities |
| --- | --- |
| AZ (15) | Phoenix, Scottsdale, Tempe, Mesa, Chandler, Gilbert, Glendale, Peoria, Surprise, Tucson, Flagstaff, Yuma, Goodyear, Buckeye, Avondale |
| NV (10) | Las Vegas, Henderson, Reno, North Las Vegas, Sparks, Carson City, Mesquite, Boulder City, Elko, Fernley |
| OH (12) | Columbus, Cleveland, Cincinnati, Toledo, Akron, Dayton, Canton, Youngstown, Dublin, Westerville, Mason, Parma |
| ID (10) | Boise, Meridian, Nampa, Caldwell, Idaho Falls, Pocatello, Twin Falls, Coeur d'Alene, Lewiston, Eagle |
| WA (12) | Seattle, Spokane, Tacoma, Vancouver, Bellevue, Kent, Everett, Renton, Kirkland, Redmond, Olympia, Bellingham |

Use --state ALL for round-robin across all states, or --state AZ for a single state. Use node engine.js states to list all states and cities.

65 Categories

Local Services (30)

plumber, electrician, dentist, restaurant, auto repair, salon, law firm, accountant, real estate agent, roofing, hvac, cleaning service, landscaping, insurance agent, veterinarian, fitness, photography, marketing agency, construction, mechanic, chiropractor, bakery, florist, pet grooming, daycare, tutoring, printing, tailor, locksmith, moving company

Retail & Ecommerce (10)

boutique, jewelry store, furniture store, sporting goods, pet store, gift shop, wine shop, supplement store, thrift store, consignment shop

Food & Beverage (5)

coffee shop, brewery, catering, food truck, juice bar

Health & Wellness (6)

med spa, dermatologist, physical therapy, optometrist, mental health counselor, massage therapist

Professional Services (6)

financial advisor, mortgage broker, staffing agency, IT services, web design, commercial cleaning

Home Services (8)

garage door, pest control, fence company, pool service, solar installer, window cleaning, tree service, pressure washing

Use node engine.js cats to list all 65 categories. Filter with --categories "plumber,dentist,hvac".

18-Column Output

Each state gets its own tab in Google Sheets with these columns:

| Column | Description | Source |
| --- | --- | --- |
| First Name | Contact first name | Website crawl |
| Last Name | Contact last name | Website crawl |
| Email | Email address | 9 extraction methods + MX |
| Title | Job title (Owner, Manager, etc.) | Staff card parsing |
| Company Name | Business name | Discovery sources |
| Location | City, State | Discovery sources |
| Website | Business website URL | Discovery + DuckDuckGo fallback |
| Phone | Phone number | Discovery sources |
| Facebook | Facebook page URL | Website crawl |
| Instagram | Instagram profile URL | Website crawl |
| LinkedIn | LinkedIn profile URL | Website crawl |
| Twitter/X | X profile URL | Website crawl |
| Source | Discovery source(s) | YP / Yelp / BBB / GMaps |
| Confidence | verified, inferred, whois, no_mx | Computed |
| Biz Age | NEW, Growing, Established, Mature | WHOIS/RDAP lookup |
| Year Founded | Domain registration year | WHOIS/RDAP lookup |
| Industry | Business category | Input parameter |
| Date | Discovery timestamp | Auto-generated |
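
For anyone post-processing the Sheet or writing to it from another tool, a fixed column order is the main thing to preserve. The sketch below maps a record to an 18-cell row and groups rows into batches of 10, matching the batch size mentioned in the architecture section. Keying records by column header is an assumption made for brevity; the engine's internal field names are not documented here.

```javascript
// Sketch only: records keyed by column header is an assumption.
const COLUMNS = [
  'First Name', 'Last Name', 'Email', 'Title', 'Company Name', 'Location',
  'Website', 'Phone', 'Facebook', 'Instagram', 'LinkedIn', 'Twitter/X',
  'Source', 'Confidence', 'Biz Age', 'Year Founded', 'Industry', 'Date',
];

// Build one row in column order; missing values become empty cells.
function toRow(record) {
  return COLUMNS.map((column) => record[column] ?? '');
}

// Split rows into groups of `size` so each API call appends one group.
function batches(rows, size = 10) {
  const out = [];
  for (let i = 0; i < rows.length; i += size) {
    out.push(rows.slice(i, i + size));
  }
  return out;
}
```

With 25 rows this produces three batches of 10, 10, and 5, so a chunk of 250 businesses would take 25 append calls.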

Live Dashboard

The engine auto-creates a "Dashboard" tab in your Google Sheet, updated every 30 seconds with:

Confidence Scoring

Each email is assigned a confidence level based on how it was obtained:

| Level | How Assigned |
| --- | --- |
| verified | Directly extracted from business website + MX records confirm working mail server |
| inferred | Generated via email pattern inference from known contacts + MX verified |
| whois | Extracted from WHOIS/RDAP domain registration data |
| no_mx | Email found but domain's DNS has no MX records; delivery uncertain |

CLI Reference

# ── RUN ──
node engine.js start --state ALL                    # All states, round-robin
node engine.js start --state AZ                     # Single state
node engine.js start --state ALL --max 500          # 500/category cap
node engine.js start --state ALL --chunk 100        # Rotate every 100
node engine.js start --state OH --categories "plumber,dentist"
node engine.js start --state ALL --fresh            # Ignore saved progress

# ── JOB CONTROL ──
node engine.js pause                                # Pause after current business
node engine.js resume                               # Resume from pause
node engine.js stop                                 # Graceful stop + checkpoint
node engine.js status                               # Show current state
node engine.js reset                                # Clear all saved progress

# ── INFO ──
node engine.js states                               # List all states + cities
node engine.js cats                                 # List all 65 categories
node engine.js                                      # Help

Flags

| Flag | Default | Description |
| --- | --- | --- |
| --state | required | State abbreviation (AZ, NV, OH, ID, WA) or ALL |
| --max | 1000 | Maximum businesses per category across all states |
| --chunk | 250 | Rotate to next category after N new discoveries |
| --categories | all 65 | Comma-separated list of categories to run |
| --fresh | false | Ignore saved progress and start from scratch |
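
For wrapper scripts that need to validate arguments before invoking the engine, the flags and defaults above can be modeled with Node's built-in util.parseArgs (available in Node 18.3 and later). This is a sketch of the documented interface, not the engine's own parser; parseCli is a hypothetical name.

```javascript
// Sketch only: models the documented flags; not the engine's own parser.
const { parseArgs } = require('node:util');

function parseCli(argv) {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true, // the command (start, pause, ...) is positional
    options: {
      state: { type: 'string' },
      max: { type: 'string' },
      chunk: { type: 'string' },
      categories: { type: 'string' },
      fresh: { type: 'boolean' },
    },
  });
  const [command = 'help'] = positionals;
  if (command === 'start' && !values.state) {
    throw new Error('--state is required');
  }
  return {
    command,
    state: values.state ? values.state.toUpperCase() : undefined,
    max: Number(values.max ?? 1000),
    chunk: Number(values.chunk ?? 250),
    // null means "all 65 categories"
    categories: values.categories
      ? values.categories.split(',').map((c) => c.trim()).filter(Boolean)
      : null,
    fresh: values.fresh ?? false,
  };
}
```

In a script this would be called as parseCli(process.argv.slice(2)). Unknown flags throw, because parseArgs is strict by default.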

Environment Variables

# Google Sheets — REQUIRED
GOOGLE_SPREADSHEET_ID=your_spreadsheet_id_here
GOOGLE_CREDENTIALS_PATH=google-credentials.json

# Timing
DELAY_MS=3000                    # Base delay between requests (ms)
MAX_PAGES_PER_SITE=15            # Max subpages to crawl per business website

# Discovery limits
MAX_PER_CATEGORY=1000            # Max businesses per category (65 categories)
CHUNK_SIZE=250                   # Rotate to next category after N discoveries

# Proxy (optional — recommended for large runs)
# PROXY_URL=http://user:pass@host:port
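
A small loader can catch a missing or mistyped variable before a long run starts. The sketch below applies the defaults shown above and assumes the .env file has already been loaded into process.env by whatever mechanism the project uses; loadConfig and the returned property names are hypothetical.

```javascript
// Sketch only: loadConfig and its property names are hypothetical.
function loadConfig(env = process.env) {
  const required = ['GOOGLE_SPREADSHEET_ID', 'GOOGLE_CREDENTIALS_PATH'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }

  // Parse a positive integer, falling back to the documented default.
  const int = (name, fallback) => {
    const raw = env[name];
    if (raw === undefined || raw === '') return fallback;
    const value = Number.parseInt(raw, 10);
    if (!Number.isInteger(value) || value <= 0) {
      throw new Error(`${name} must be a positive integer, got "${raw}"`);
    }
    return value;
  };

  return {
    spreadsheetId: env.GOOGLE_SPREADSHEET_ID,
    credentialsPath: env.GOOGLE_CREDENTIALS_PATH,
    delayMs: int('DELAY_MS', 3000),
    maxPagesPerSite: int('MAX_PAGES_PER_SITE', 15),
    maxPerCategory: int('MAX_PER_CATEGORY', 1000),
    chunkSize: int('CHUNK_SIZE', 250),
    proxyUrl: env.PROXY_URL || null, // optional
  };
}
```

Passing a plain object instead of process.env makes the loader easy to exercise in isolation.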

Crash Recovery

New in v6.1: the engine recovers from laptop sleep, terminal disconnects, and crashes by resuming from its last saved checkpoint.

State is saved to .discovery-state.json. To start fresh, use --fresh or run node engine.js reset.
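
One common way to make a checkpoint file crash-safe is to write to a temporary file and rename it over the old one, so a crash mid-write never leaves a half-written state file. The sketch below illustrates that technique with the .discovery-state.json filename; the state shape and function names are hypothetical, and the engine may persist its state differently.

```javascript
// Sketch only: the state shape and helper names are hypothetical.
const fs = require('node:fs');
const path = require('node:path');

const STATE_FILE = '.discovery-state.json';

// Write to a temp file first, then rename it over the real file.
function saveCheckpoint(state, dir = process.cwd()) {
  const target = path.join(dir, STATE_FILE);
  const tmp = `${target}.tmp`;
  const payload = { ...state, savedAt: new Date().toISOString() };
  fs.writeFileSync(tmp, JSON.stringify(payload, null, 2));
  fs.renameSync(tmp, target);
}

// Return the saved state, or null when starting fresh or nothing usable exists.
function loadCheckpoint(dir = process.cwd(), { fresh = false } = {}) {
  if (fresh) return null; // mirrors --fresh: ignore saved progress
  try {
    return JSON.parse(fs.readFileSync(path.join(dir, STATE_FILE), 'utf8'));
  } catch (err) {
    // A missing or corrupt file means there is nothing to resume from.
    if (err.code === 'ENOENT' || err instanceof SyntaxError) return null;
    throw err;
  }
}
```

On startup, a null result means a full run from the first category; otherwise the saved counts indicate where to resume.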

Proxy Setup

For large discovery runs, a proxy is recommended to avoid IP-based rate limiting.

# In .env
PROXY_URL=http://username:password@proxy-host:port

The proxy is passed to Puppeteer via the --proxy-server flag, so all HTTP requests made through the browser use it. Axios requests (WHOIS, DuckDuckGo) also route through the proxy when it is configured.
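
One detail worth knowing when debugging proxy issues: Chromium's --proxy-server flag takes only the scheme, host, and port, so a username and password in PROXY_URL have to be supplied separately (in Puppeteer, typically through page.authenticate()). The sketch below shows one way to split the URL; proxyLaunchOptions is a hypothetical helper and not necessarily how the engine does it.

```javascript
// Sketch only: proxyLaunchOptions is a hypothetical helper.
function proxyLaunchOptions(proxyUrl) {
  if (!proxyUrl) return { args: [], credentials: null };
  const url = new URL(proxyUrl);
  // Chromium wants scheme://host:port here, without credentials.
  const args = [`--proxy-server=${url.protocol}//${url.host}`];
  // URL keeps credentials percent-encoded, so decode them before use.
  const credentials = url.username
    ? {
        username: decodeURIComponent(url.username),
        password: decodeURIComponent(url.password),
      }
    : null;
  return { args, credentials };
}
```

The args array would go into the browser launch options, and a non-null credentials object would be passed to page.authenticate(credentials) on each new page.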


Built by John Williams — Senior Paid Media Specialist at Seer Interactive.
Part of the Google Ads Agent ecosystem. Open source on GitHub.