Data Pipeline + Verification Disclosure

How we verify 100,000+ veteran resources

Most nonprofit directories say "verified" without saying what that means. We publish the actual pipeline: data sources, verification cadence, scoring formula, geocoding methodology, and quality-assurance workflows. Deterministic, reproducible, and open under CC-BY 4.0.

Sections

  1. Data sources
  2. Verification cadence
  3. Completeness-score formula
  4. Geocoding pipeline
  5. Quality assurance
  6. Correction workflows
  7. Open-data license
  8. Hash integrity + reproducibility

1. Data sources

Records are aggregated from authoritative federal, state, and nonprofit sources. We never accept payment for inclusion. Original-source attribution is preserved in each record's data_source field.

Source — Coverage
  VA Lighthouse Facilities API — VA Medical Centers, CBOCs, Vet Centers, Mobile Vet Centers, National Cemeteries (~1,500 records)
  VA Office of General Counsel Accreditation Roster — VA-accredited attorneys, claims agents, and VSO representatives (~50,000 records)
  State CVSO directories (50 states + DC) — County Veterans Service Officers (free claims-help offices)
  HRSA Federally Qualified Health Centers (FQHCs) — ~14,000 federally qualified clinics that accept veterans
  CMS Medicare-certified providers — Nursing homes, home health, hospice, dialysis
  SAMHSA-tagged behavioral health providers — Mental health + substance-use providers serving veterans
  HUD-VASH grantees — Supportive housing for veterans (joint HUD-VA program)
  SSVF grantees — Supportive Services for Veteran Families
  VSO chapter directories — American Legion, VFW, DAV, AMVETS, PVA posts
  State agency veteran-services directories — State-level benefits offices, veteran homes, etc.
  U.S. Census ACS 5-year (2022) — Tract-level veteran demographics, housing burden, income
  CDC PLACES — Tract-level health prevalences (used for context, not as resources)

2. Verification cadence

Records are continuously verified by automated pipelines plus manual editorial review. Stale data is flagged + auto-demoted; broken URLs trigger Wayback Machine snapshots; phone numbers are validated quarterly against Twilio Lookup v2.

Pipeline — Cadence — What it does
  Self-healing daemon — Every 6 hours — Evaluates health_score, auto-demotes flagged-closed rows, runs HEAD probes on URLs, runs the Census Geocoder on the next 100 unprocessed rows
  Census Geocoder — Every 6 hours — Assigns FIPS state/county/tract to records missing them (~100 rows/run, ~400/day)
  Wayback Machine snapshot backfill — Every 6 hours — Captures snapshots of broken URLs to archive.org for posterity
  Haiku enrichment pass — Weekly (Tuesday 08:00 UTC) — Claude Haiku 4.5 enriches 100 stale rows with structured eligibility tags, specialty programs, walk-in/appointment status, language, women-focused, va_accredited, and a one-line summary
  ProPublica Form 990 enrichment — Weekly (Tuesday) — EIN-matched records get the latest 990 revenue, expenses, and year of filing
  VA Lighthouse reconciler — Weekly (Sunday 06:00 UTC) — Stages differences between our records and VA's authoritative facility data in the pending_updates queue for staff review
  URL health scan — Monthly (1st, 04:00 UTC) — HEAD probe + Google Safe Browsing sweep over all 20K+ URLs
  Twilio Lookup v2 phone audit — Quarterly (Jan/Apr/Jul/Oct 1) — Line-type intelligence that flags disconnected, fax-only, or premium-rate numbers (~$90/run)
  IndexNow re-push (Bing/Yandex/Seznam/Naver) — Weekly, post-enrichment — Pushes the 200 most-recently-updated URLs for instant search-engine re-crawl

3. Completeness-score formula

The completeness_score (0-100, integer) is a deterministic function of which fields are populated + verification status + ProPublica match. No manual curation, no payment for placement. Recomputed nightly. Full formula:

completeness_score = base + bonuses

base (max 60):
  +10 if name + address populated
  +10 if phone populated
  +10 if website populated AND website_status = 'ok'
  +10 if hours populated
  +10 if services populated (>= 1 tag)
  +5  if eligibility populated
  +5  if specialty_programs populated

bonuses (max 40):
  +5  if FIPS state/county/tract assigned (Census Geocoder)
  +5  if verified_date within last 90 days
  +5  if EIN populated AND ProPublica 990 match
  +5  if phone_verified = true (Twilio Lookup v2)
  +5  if veteran_relevance != 'general' (focused on vets)
  +5  if Wayback snapshot exists for the URL
  +10 if 0 healing events triggered in last 30 days (no auto-demotions)

Score is exposed in every /resource/{slug}-{id} page header (gold pill on /best-of/* pages) and in the /api/resources/bulk JSON. Reproducible — run our SQL against your own snapshot of /api/export/snapshot to get identical scores.
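As a hedged illustration, the formula can be re-expressed in Python. Production scoring runs in SQL, so this is a reference reimplementation, not the canonical query; the dictionary keys mirror the field names in the formula above.

```python
from datetime import date, timedelta

def completeness_score(r: dict, today: date) -> int:
    """Sketch of the deterministic 0-100 completeness score.
    Field names follow the published formula; SQL is authoritative."""
    score = 0
    # base (max 60)
    if r.get("name") and r.get("address"):
        score += 10
    if r.get("phone"):
        score += 10
    if r.get("website") and r.get("website_status") == "ok":
        score += 10
    if r.get("hours"):
        score += 10
    if r.get("services"):  # >= 1 tag
        score += 10
    if r.get("eligibility"):
        score += 5
    if r.get("specialty_programs"):
        score += 5
    # bonuses (max 40)
    if r.get("fips_tract"):
        score += 5
    if r.get("verified_date") and today - r["verified_date"] <= timedelta(days=90):
        score += 5
    if r.get("ein") and r.get("propublica_990_match"):
        score += 5
    if r.get("phone_verified"):
        score += 5
    if r.get("veteran_relevance", "general") != "general":
        score += 5
    if r.get("wayback_snapshot"):
        score += 5
    if r.get("healing_events_30d", 0) == 0:  # no auto-demotions in 30 days
        score += 10
    return score
```

A fully populated, freshly verified record scores 100; a bare record with no healing events still earns the +10 stability bonus.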

4. Geocoding pipeline

Every record with a postal address is geocoded to FIPS-tract level via the US Census Geocoder (Single-Address One-Line API). Tract-level granularity (median ~4,000 residents per tract) lets us join records to ACS demographics + CDC PLACES health data + ACS housing burden — surfacing service deserts for grant proposals.

Pipeline:

  1. Address normalized (USPS-compliant case + abbreviations)
  2. POST to Census Geocoder Locations API benchmark Public_AR_Census2020
  3. If match, capture lat/lng + FIPS state (2-digit) + FIPS county (3-digit) + FIPS tract (6-digit) + FIPS block group (1-digit) + match-type confidence
  4. If no match, retry with address fuzz (drop apt#, normalize street suffixes); if still no match, mark geocode_source = 'failed' for manual review
  5. Store + index FIPS columns for fast tract-level joins
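The steps above can be sketched in Python. The endpoint, benchmark, and vintage follow the Census Geocoder's public API; the parsing helper assumes the documented geographies layout (a "Census Tracts" layer whose 11-digit GEOID concatenates state, county, and tract), and the returned field names are illustrative. Fetching and the block-group field are omitted for brevity.

```python
import urllib.parse
from typing import Optional

GEOCODER = "https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress"

def build_geocode_url(address: str) -> str:
    # Step 2: one-line address query against the Public_AR_Census2020 benchmark.
    params = {
        "address": address,
        "benchmark": "Public_AR_Census2020",
        "vintage": "Census2020_Census2020",
        "format": "json",
    }
    return GEOCODER + "?" + urllib.parse.urlencode(params)

def extract_fips(response: dict) -> Optional[dict]:
    # Steps 3-4: take FIPS components from the first match, or None on no match
    # (the caller then retries with a fuzzed address before marking 'failed').
    matches = response.get("result", {}).get("addressMatches", [])
    if not matches:
        return None
    tract = matches[0]["geographies"]["Census Tracts"][0]
    geoid = tract["GEOID"]  # 11 digits: state(2) + county(3) + tract(6)
    return {
        "fips_state": geoid[:2],
        "fips_county": geoid[2:5],
        "fips_tract": geoid[5:11],
    }
```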

Current FIPS-tract assignment: ~98% of records (live count on /dashboard). Failed-geocode rate: ~1.5%, mostly PO Box or military APO/FPO addresses.

5. Quality assurance

6. Correction workflows

Anyone — including the listed organization, a veteran, a journalist, or a researcher — can submit a correction or addition. Workflow:

  1. Submit via POST /api/feedback or email info@warriorsfund.org
  2. Feedback enters the resource_feedback review queue
  3. Staff review within 5 business days
  4. Verified corrections are committed; the original state is archived in resource_history for a full audit trail
  5. If the change came from the listed organization, reply with confirmation + commit timestamp
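A minimal sketch of step 1 in Python. The payload field names here are assumptions for illustration; the live /api/feedback schema is not published on this page.

```python
import json
import urllib.request

def build_feedback_request(resource_id: int, message: str,
                           contact: str) -> urllib.request.Request:
    # Hypothetical payload shape -- resource_id / message / contact are
    # assumed field names, not the documented schema.
    body = json.dumps({
        "resource_id": resource_id,
        "message": message,
        "contact": contact,
    }).encode()
    return urllib.request.Request(
        "https://warriorsfund.org/api/feedback",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a single `urllib.request.urlopen(build_feedback_request(...))` call.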

7. Open-data license

All Warriors Fund datasets are licensed under Creative Commons Attribution 4.0 International (CC-BY 4.0). Free to share, adapt, and build upon for any purpose (commercial included), provided you credit "Wounded Warriors / Warriors Fund (EIN 86-1336741)" and link back to https://warriorsfund.org/.

Bulk data: CSV · JSON · SHA-256 integrity-stamped snapshot · Project Open Data v1.1 / DCAT 1.1 catalog

8. Hash integrity + reproducibility

Every full-database snapshot ships with a SHA-256 hash so you can verify integrity. The snapshot is reproducible: given the same verified_date as-of time and the same source-data state, our scoring formula returns identical scores. We publish the snapshot generator + scoring SQL for full reproducibility.

How to verify:

  1. Fetch /api/export/snapshot → response includes {snapshot_hash, generated_at, ...}
  2. Compute SHA-256 of the canonicalized JSON body (sorted keys, no whitespace)
  3. Compare against snapshot_hash
  4. If match, you have a tamper-evident copy of the directory as-of generated_at
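The four steps above can be sketched in Python. One assumption is labeled in the code: that snapshot_hash covers every key in the response except itself.

```python
import hashlib
import json

def canonical_sha256(payload: dict) -> str:
    # Step 2: canonicalized JSON = sorted keys, no whitespace.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(body).hexdigest()

def verify_snapshot(snapshot: dict) -> bool:
    # Step 3: compare the recomputed hash to the embedded snapshot_hash.
    # Assumption: the hash covers everything except snapshot_hash itself.
    claimed = snapshot.get("snapshot_hash")
    rest = {k: v for k, v in snapshot.items() if k != "snapshot_hash"}
    return claimed == canonical_sha256(rest)
```

If `verify_snapshot` returns True, you hold a tamper-evident copy as-of generated_at; any altered byte flips the result to False.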

About this disclosure

This methodology page is the canonical reference for our data pipeline. It's updated whenever the pipeline changes — see the Git history at github.com/EmperorMew/WWF for the audit trail. Foundation officers, academic reviewers, and journalists doing source evaluation: this page should answer your "how do you verify?" question definitively. If anything is unclear, email info@warriorsfund.org.

Live data dashboard · Cite this methodology · API documentation