OPEN SOURCE · v6.1 · MIT LICENSE · NODE.JS + PUPPETEER

Discover businesses.
From public data.

Apollo-level business intelligence from 100% public data. Round-robin discovery across 5 states, 65 categories, and 4 sources, with inline enrichment that pushes data to your Google Sheet in real time. 9 email extraction methods, WHOIS age scoring, and crash recovery are built in. The pipeline agencies charge thousands to build is yours in one command.

⚡ View on GitHub Read the Docs

4 Discovery Sources
65 Categories
9 Email Methods
18-Column Output
54 Cities
1,763 Lines of Code

What It Does

Four-phase discovery pipeline

v6.1's key innovation: inline enrichment. Data flows into your Sheet continuously as each chunk completes — no waiting hours for all discovery to finish first. Discover → enrich → push → rotate to next category.

🔍

Discover

Round-robin across 5 states × 4 sources × 54 cities. Shuffled order every run. Global dedup on every insert — zero duplicates across all states and sources.
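
As a rough illustration of the rotation and dedup described above, here is a minimal sketch. The function names, the task shape, and the dedup key (normalized name plus phone digits) are assumptions made for the example, not taken from engine.js.

```javascript
// Hypothetical sketch: names and data shapes are illustrative only.

// Fisher-Yates shuffle; `rand` is injectable so a run can be made reproducible.
function shuffle(items, rand = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Interleave per-state task lists: one task from each state per pass.
function roundRobin(queuesByState) {
  const queues = Object.values(queuesByState).map((q) => q.slice());
  const order = [];
  while (queues.some((q) => q.length > 0)) {
    for (const q of queues) {
      if (q.length > 0) order.push(q.shift());
    }
  }
  return order;
}

// Assumed dedup key: lowercased alphanumeric name plus phone digits.
function dedupKey(biz) {
  const name = String(biz.name || '').toLowerCase().replace(/[^a-z0-9]/g, '');
  const phone = String(biz.phone || '').replace(/\D/g, '');
  return `${name}|${phone}`;
}

// One set shared by every state and source.
const seen = new Set();

// Returns true when the business is new, false when it is a duplicate.
function insertUnique(biz, results) {
  const key = dedupKey(biz);
  if (seen.has(key)) return false;
  seen.add(key);
  results.push(biz);
  return true;
}
```

Because every insert is checked against a single shared set, the same business found on two sources or in two states is stored once, however the run order was shuffled.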

📧

Enrich Inline

Each chunk: find website → visit → extract contacts → 9 email methods → social profiles → WHOIS age → MX verify. Data flows immediately.
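
To make the WHOIS age step concrete, the sketch below derives a registration year from an RDAP domain response. The `events` array with `eventAction` and `eventDate` fields follows the RDAP JSON format (RFC 9083); the function names are hypothetical, and how the engine turns an age into a label is not shown here.

```javascript
// Sketch only: function names are hypothetical.

// Returns the registration year from an RDAP domain response, or null.
function registrationYear(rdap) {
  const events = Array.isArray(rdap && rdap.events) ? rdap.events : [];
  const reg = events.find((e) => e.eventAction === 'registration');
  if (!reg) return null;
  const date = new Date(reg.eventDate);
  return Number.isNaN(date.getTime()) ? null : date.getUTCFullYear();
}

// Domain age in calendar years relative to `now` (injectable for testing).
function domainAgeYears(rdap, now = new Date()) {
  const year = registrationYear(rdap);
  return year === null ? null : Math.max(0, now.getUTCFullYear() - year);
}
```

Domain age is only a proxy for business age: a long-established business may have registered its domain recently, and the reverse.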

📊

Live Dashboard

Google Sheets dashboard updated every 30 seconds. Discovery counts, enrichment progress, error rates, per-source breakdowns, and per-state results.

🔄

Crash Recovery

Per-category checkpoints, signal handlers (SIGTERM/SIGINT/SIGHUP), global dedup rebuild on resume. Just re-run the same command — picks up where it left off.
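
One way the checkpoint-and-resume behavior could be structured is sketched below. The checkpoint file layout, the `seenKeys` field, and all function names are assumptions for illustration; engine.js may store its state differently.

```javascript
const fs = require('fs');

// Hypothetical checkpoint shape: { state, category, seenKeys: [...] }.
function saveCheckpoint(file, state) {
  // Write to a temp file, then rename, so a crash mid-write cannot
  // leave a half-written checkpoint behind.
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(state));
  fs.renameSync(tmp, file);
}

// Returns the saved state, or null when there is nothing usable to resume from.
function loadCheckpoint(file) {
  try {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
  } catch (err) {
    return null;
  }
}

// Rebuild the in-memory dedup set from keys stored in the checkpoint.
function rebuildDedup(checkpoint) {
  return new Set(checkpoint ? checkpoint.seenKeys : []);
}

// Save state before exiting on the three signals listed above.
function installSignalHandlers(file, getState) {
  for (const sig of ['SIGTERM', 'SIGINT', 'SIGHUP']) {
    process.on(sig, () => {
      saveCheckpoint(file, getState());
      process.exit(0);
    });
  }
}
```

On startup, a run would call `loadCheckpoint`, rebuild the dedup set, and skip any categories already marked complete.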


Phase 1

4 discovery sources

Each source uses Puppeteer with stealth plugin for JavaScript rendering. Smart retry with exponential backoff, automatic proxy rotation, and human-like scrolling behavior.
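
Retry with exponential backoff can be sketched in a few lines. The option names, defaults, and the 25% jitter are assumptions chosen for the example, not the engine's actual settings.

```javascript
// Illustrative only: option names and defaults are assumptions.

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped,
// then stretched by up to 25% random jitter so parallel workers do not
// retry in lockstep.
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, rand = Math.random } = {}) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.round(exp * (1 + 0.25 * rand()));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `fn` until it resolves or `retries` extra attempts are used up.
async function withRetry(fn, { retries = 3, ...opts } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries) throw err;
      await sleep(backoffDelay(attempt, opts));
    }
  }
}
```

A page load would then be wrapped as `withRetry(() => page.goto(url))`, with proxy rotation handled inside the callback if needed.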

Yellow Pages

Puppeteer + stealth plugin. Extracts name, phone, address, website. Pagination support with configurable limits.

Yelp

Smart auto-skip after 5 consecutive blocks. Longer delays for anti-bot evasion. Full business profile extraction.
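
The auto-skip rule amounts to counting consecutive blocked responses. A minimal sketch, with a hypothetical `BlockTracker` class and the threshold of 5 taken from the text above:

```javascript
// Hypothetical helper: tracks consecutive blocked responses for one source.
class BlockTracker {
  constructor(limit = 5) {
    this.limit = limit;
    this.consecutive = 0;
  }

  // Record the outcome of one request; any success resets the streak.
  record(blocked) {
    this.consecutive = blocked ? this.consecutive + 1 : 0;
  }

  // True once the streak reaches the limit and the source should be skipped.
  get shouldSkip() {
    return this.consecutive >= this.limit;
  }
}
```

Resetting on success matters: occasional blocks spread across a long run should not disable the source, only an unbroken streak.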

BBB

React SPA rendered via headless Chrome. Accredited business data with ratings and complaint history markers.

Google Maps

New in v5.0. Aria-label extraction from search results. /maps/search/ URL pattern for reliable results.
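
For reference, a search URL following the /maps/search/ pattern could be built as below. The query wording ("category in City, ST") and the function name are guesses for illustration; the engine's actual query format is not documented here.

```javascript
// Sketch under assumptions: the query wording is a guess.
function mapsSearchUrl(category, city, state) {
  const query = `${category} in ${city}, ${state}`;
  return `https://www.google.com/maps/search/${encodeURIComponent(query)}`;
}
```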


Phase 4 — Enrichment

9 email extraction methods

The engine crawls every discovered business website and applies 9 extraction methods in sequence. Each email found is checked with a DNS MX record lookup on its domain. Pattern inference generates likely addresses for staff listed without one.

1. JSON-LD Schema

Parses structured data for contact info

2. Mailto Links

Extracts from href="mailto:" patterns

3. Data Attributes

Scans data-email, data-contact attrs

4. Staff Cards

Parses team/about pages for contacts

5. Full-Page Regex

Comprehensive email pattern matching

6. Obfuscated Decode

Decodes [at], (at), etc. patterns

7. Footer Parsing

Targeted extraction from page footers

8. Meta Tags

Checks meta tags for contact emails

9. Tel Link Fallback

Falls back to phone when no email found
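
Three of the methods above (mailto links, full-page regex, and obfuscated decode) are simple enough to sketch together. This is an illustrative approximation with hypothetical function names and a deliberately loose email regex, not the code in engine.js.

```javascript
// Illustrative sketch of methods 2, 5, and 6.

const EMAIL_RE = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;

// Method 6: turn "name [at] example [dot] com" style text back into an address.
function deobfuscate(text) {
  return text
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, '@')
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, '.');
}

// Method 2: pull addresses out of href="mailto:..." attributes,
// dropping any ?subject=... query part.
function mailtoEmails(html) {
  const out = [];
  const re = /href\s*=\s*["']mailto:([^"'?]+)/gi;
  let m;
  while ((m = re.exec(html)) !== null) out.push(m[1].trim());
  return out;
}

// Run the methods in sequence and return unique, lowercased addresses.
function extractEmails(html) {
  const found = [
    ...mailtoEmails(html),
    ...(deobfuscate(html).match(EMAIL_RE) || []),
  ];
  return [...new Set(found.map((e) => e.toLowerCase()))];
}
```

Running the cheap, precise methods first and the broad regex last keeps the most trustworthy address at the front of the result.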

Plus: email pattern inference from known contacts (first.last@, flast@, firstl@, etc.), which generates likely addresses for staff listed without one. Every address is checked against its domain's DNS MX records.
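
Pattern inference can be sketched as: detect which pattern a known address follows, then apply it to other staff names. The pattern table and function names below are hypothetical; `dns.promises.resolveMx` is the standard Node.js call for the MX lookup.

```javascript
const dns = require('dns').promises;

// Hypothetical pattern table mirroring the examples named above.
const PATTERNS = {
  'first.last': (f, l) => `${f}.${l}`,
  flast: (f, l) => `${f[0]}${l}`,
  firstl: (f, l) => `${f}${l[0]}`,
};

// Work out which pattern a known contact's address follows, if any.
function detectPattern(first, last, email) {
  const [local] = email.toLowerCase().split('@');
  const f = first.toLowerCase();
  const l = last.toLowerCase();
  const hit = Object.entries(PATTERNS).find(([, build]) => build(f, l) === local);
  return hit ? hit[0] : null;
}

// Apply a detected pattern to a staff member who has no listed address.
function inferEmail(first, last, domain, pattern) {
  const build = PATTERNS[pattern];
  if (!build || !first || !last) return null;
  return `${build(first.toLowerCase(), last.toLowerCase())}@${domain}`;
}

// True when the domain publishes at least one MX record.
async function hasMx(domain) {
  try {
    return (await dns.resolveMx(domain)).length > 0;
  } catch (err) {
    return false;
  }
}
```

An MX lookup confirms that the domain accepts mail, not that a particular mailbox exists, so inferred addresses are best treated as likely rather than confirmed.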


Coverage

65 business categories

Pre-configured categories spanning local services, retail, food & beverage, health, professional services, and home services. Run one category at a time or combine them.

plumber, electrician, dentist, restaurant, auto repair, salon, law firm, accountant, real estate agent, roofing, hvac, cleaning service, landscaping, insurance agent, veterinarian, fitness, photography, marketing agency, construction, chiropractor, boutique, coffee shop, brewery, med spa, financial advisor, solar installer, pest control, tree service, web design, pressure washing

Showing 30 of 65. Full list includes retail, ecommerce, food & beverage, health & wellness, professional services, and home services categories. See all 65 →


Get Started

Running in 60 seconds

# 1. Clone the repo
git clone https://github.com/itallstartedwithaidea/business-discovery-engine.git
cd business-discovery-engine

# 2. Install dependencies
npm install

# 3. Configure credentials
cp .env.example .env
# Edit .env: add Google Sheets API credentials

# 4. Run discovery (all 5 states, round-robin)
node engine.js start --state ALL --fresh

# Or run in background (safe to close terminal)
nohup node engine.js start --state ALL --fresh > full-run.log 2>&1 &
tail -f full-run.log

5 pre-configured states (54 cities): AZ (15 cities), NV (10 cities), OH (12 cities), ID (10 cities), WA (12 cities). Use --state ALL for round-robin across all states, or --state AZ for a single state. Add --max 500 and --chunk 100 to set custom limits.


Output

18-column enriched data

Every discovered business gets a complete profile pushed to Google Sheets in real-time batches of 10.

Column | Description | Source
First Name | Contact first name | Website crawl
Last Name | Contact last name | Website crawl
Email | Verified email address | 9 extraction methods + MX verify
Title | Contact job title | Staff card parsing
Company Name | Business name | Discovery sources
Location | City, State | Discovery sources
Website | Business URL | Discovery + DuckDuckGo fallback
Phone | Normalized phone number | Discovery sources
Facebook | Facebook profile URL | Website crawl
Instagram | Instagram profile URL | Website crawl
LinkedIn | LinkedIn profile URL | Website crawl
Twitter/X | X profile URL | Website crawl
Source | Which discovery source found it | Yellow Pages / Yelp / BBB / Maps
Confidence | Data quality score | Computed
Biz Age | Business age label | WHOIS/RDAP lookup
Year Founded | Domain registration year | WHOIS/RDAP lookup
Industry | Searched category | Input parameter
Date | Discovery timestamp | Auto-generated
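
A sketch of how a record might be flattened into the 18-column layout and grouped into batches of 10 before each push. The column order follows the table above; the record shape (keyed by column name) and the function names are assumptions, and the Google Sheets API call itself is left out.

```javascript
// Column order taken from the table above; record field names are assumptions.
const COLUMNS = [
  'First Name', 'Last Name', 'Email', 'Title', 'Company Name', 'Location',
  'Website', 'Phone', 'Facebook', 'Instagram', 'LinkedIn', 'Twitter/X',
  'Source', 'Confidence', 'Biz Age', 'Year Founded', 'Industry', 'Date',
];

// Turn a record keyed by column name into one 18-cell row;
// missing values become empty strings so columns never shift.
function toRow(record) {
  return COLUMNS.map((col) => (record[col] == null ? '' : String(record[col])));
}

// Split rows into batches (10 per push, per the text above).
function toBatches(rows, size = 10) {
  const batches = [];
  for (let i = 0; i < rows.length; i += size) batches.push(rows.slice(i, i + size));
  return batches;
}
```

Keeping the column list in one array means adding a column is a single-line change, and every row stays aligned with the sheet header.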

Open Source

Built in public, free forever

Same playbook as the Google Ads API Agent. The entire engine — 1,763 lines in a single file — is MIT licensed. v6.1 adds round-robin discovery, inline enrichment, crash recovery, and --state ALL. Extend sources, add categories, or plug in your own CRM.

Part of the Google Ads Agent ecosystem. Built by John Williams — Senior Paid Media Specialist at Seer Interactive with 15+ years managing $48M+ in digital advertising.
