Data Sources

Every signal in the RankParse API is derived from authoritative, publicly available datasets — not scraped on demand, not estimated, not sold to us by a third party.

We aggregate, cross-reference, and process these sources into a unified index. The pipeline runs on each data release cycle. The result is a consistent, queryable snapshot of the web's link graph and domain intelligence — at a fraction of what proprietary data vendors charge.

Common Crawl

commoncrawl.org

The backbone of RankParse. A non-profit web crawl covering billions of pages, updated quarterly. We process the full link graph: every backlink, anchor text, outbound link, and page metadata signal in the API originates here.

HTTP Archive

httparchive.org

A Google-sponsored project that tracks how the web is built. It runs Wappalyzer across millions of URLs monthly. We use it to enrich technology stack detection beyond what raw HTTP headers reveal.

Chrome UX Report (CrUX)

developer.chrome.com

Real-user performance data collected by Google Chrome across millions of origins. Used to cross-reference domain popularity and performance signals.

Tranco Top 1M

tranco-list.eu

A research-grade domain popularity ranking that combines data from Cloudflare Radar, Majestic, Cisco Umbrella, and Farsight. More robust than any single list. Used for domain rank signals.
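As a rough illustration of how a rank signal can be read from a list like this, the sketch below assumes the list is distributed as a headerless CSV of `rank,domain` rows. The helper names `load_ranks` and `rank_of` and the sample rows are hypothetical, not part of the RankParse API.

```python
import csv
import io


def load_ranks(csv_text: str) -> dict[str, int]:
    """Parse headerless `rank,domain` rows into a domain -> rank mapping."""
    ranks: dict[str, int] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rank, domain = row
        ranks[domain.lower()] = int(rank)
    return ranks


def rank_of(ranks: dict[str, int], domain: str, default=None):
    """Look up a domain's rank, ignoring case and a trailing dot."""
    return ranks.get(domain.lower().rstrip("."), default)


# Made-up sample rows for illustration only, not real ranking data.
sample = "1,example.com\n2,example.org\n3,example.net\n"
ranks = load_ranks(sample)
print(rank_of(ranks, "Example.org"))
print(rank_of(ranks, "missing.example"))
```

Domains outside the top 1M simply have no entry, so a caller would need to decide how to represent "unranked" (here, the `default` argument).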

Internet Archive (CDX API)

web.archive.org

25+ years of web crawl history. We use the CDX index to surface first-seen dates and historical crawl frequency for domains and URLs.
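A first-seen lookup against the CDX index might look roughly like the sketch below. It assumes the public Wayback CDX endpoint with its `url`, `output`, `fl`, and `limit` query parameters, and that captures come back oldest first. The function names are hypothetical and this is not necessarily how the RankParse pipeline queries the index.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def first_seen_query(domain: str) -> str:
    """Build a CDX query URL asking for the earliest capture of a domain."""
    params = {
        "url": domain,
        "output": "json",   # JSON array of rows; the first row is the header
        "fl": "timestamp",  # return only the capture timestamp field
        "limit": 1,         # assumes oldest-first ordering, so 1 row = first capture
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_first_seen(body: str):
    """Turn a CDX JSON response body into a UTC datetime, or None if no captures."""
    rows = json.loads(body)
    if len(rows) < 2:  # empty list, or header row only
        return None
    timestamp = rows[1][0]  # 14-digit YYYYMMDDhhmmss string
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


# Hand-written sample body in the CDX JSON shape, not a recorded response.
sample_body = '[["timestamp"], ["20020120142510"]]'
print(first_seen_query("example.com"))
print(parse_first_seen(sample_body))
```

Fetching is left out on purpose: any HTTP client can request the URL from `first_seen_query` and pass the response text to `parse_first_seen`.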

Rapid7 Project Sonar

opendata.rapid7.com

Weekly forward and reverse DNS snapshots covering hundreds of millions of hostnames. Used to enrich hosting provider, nameserver, and subdomain data.

Certificate Transparency Logs

certificate.transparency.dev

Publicly trusted TLS (SSL) certificates are recorded in public, append-only logs. We use CT logs to surface subdomain history, certificate issuers, and organization data associated with domains.

RDAP / WHOIS Registries

rdap.org

The authoritative source for domain registration data — registration date, registrar, expiry, and privacy status. We use RDAP (the structured, modern replacement for WHOIS) to compute domain age.
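Domain age can be derived from the `events` array that RDAP domain responses carry, where the entry with `eventAction` set to `registration` holds the registration timestamp. The sketch below works on an already-fetched response. The function names and the sample payload are illustrative assumptions, not the RankParse implementation.

```python
from datetime import datetime, timezone


def registration_date(rdap: dict):
    """Return the registration event's timestamp from an RDAP domain object."""
    for event in rdap.get("events", []):
        if event.get("eventAction") == "registration":
            # eventDate is an RFC 3339 string; map a trailing "Z" to an
            # explicit offset so older Python versions can parse it.
            raw = event["eventDate"].replace("Z", "+00:00")
            parsed = datetime.fromisoformat(raw)
            if parsed.tzinfo is None:
                # Assumption: treat offset-less timestamps as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            return parsed
    return None


def domain_age_days(rdap: dict, now: datetime):
    """Whole days between registration and `now`, or None if unknown."""
    registered = registration_date(rdap)
    if registered is None:
        return None
    return (now - registered).days


# Hand-written sample in the RDAP response shape, not a real registry record.
sample = {
    "events": [
        {"eventAction": "last changed", "eventDate": "2024-08-14T07:01:34Z"},
        {"eventAction": "registration", "eventDate": "1995-08-14T04:00:00Z"},
    ]
}
now = datetime(2025, 8, 14, 4, 0, 0, tzinfo=timezone.utc)
print(domain_age_days(sample, now))
```

Passing `now` explicitly keeps the calculation reproducible for a given snapshot date. Some registries omit the registration event, so callers should expect `None`.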

Freshness

The core link graph is updated quarterly, in line with Common Crawl release cadence. Enrichment signals (domain age, popularity rank, tech stack) are refreshed on the same cycle. This covers the vast majority of use cases — link building, lead enrichment, AI agent workflows, and competitive analysis. If you need daily link monitoring, Ahrefs is the right tool. Most developers don't.