Data Sources

Every signal in the RankParse API is derived from authoritative, independently operated public datasets. Not scraped on demand. Not estimated from samples. Not purchased from data brokers. We process primary sources directly.

The datasets below are cross-referenced and aggregated into a unified index on each release cycle. The result is a consistent, queryable snapshot of the web's link graph, domain intelligence, and technology landscape — covering hundreds of millions of domains. Each source has been chosen because it is either the authoritative primary record for its data type, or the most methodologically rigorous publicly available dataset in its category.

Common Crawl

Core dataset

commoncrawl.org ↗

A non-profit organization that has been crawling the public web since 2008. Each quarterly release covers billions of pages across hundreds of millions of domains, published as WARC files (raw HTTP responses, including full HTML), WAT files (extracted metadata, including links), and CDX index files (URL inventory). It is the same raw material used by academic researchers, large language model trainers, and search engine benchmarks. The full quarterly corpus runs to hundreds of terabytes compressed. We process the complete release, not a sample.
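To make the role of WAT link metadata concrete, here is a minimal sketch of pulling outbound links from a single WAT record that has already been parsed into a dictionary. The field names follow the WAT layout as we understand it (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`); the function name and the sample record are hypothetical, and this is not a description of RankParse's actual pipeline.

```python
from urllib.parse import urljoin, urlsplit


def extract_outlinks(wat_record: dict) -> list[str]:
    """Return absolute http(s) URLs of <a href> links in one parsed WAT record."""
    envelope = wat_record.get("Envelope", {})
    # URL of the page the record describes; used to resolve relative links.
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    outlinks = []
    for link in links:
        # Keep anchor links only; skip images, scripts, stylesheets, etc.
        if link.get("path") != "A@/href" or "url" not in link:
            continue
        absolute = urljoin(page_url, link["url"])
        # Drop mailto:, javascript:, and other non-web schemes.
        if urlsplit(absolute).scheme in ("http", "https"):
            outlinks.append(absolute)
    return outlinks


# Hand-written sample record for illustration only (not real crawl data).
sample_record = {
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.com/blog/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "/about"},
                        {"path": "IMG@/src", "url": "logo.png"},
                        {"path": "A@/href", "url": "https://other.example/page"},
                        {"path": "A@/href", "url": "mailto:hi@example.com"},
                    ]
                }
            }
        },
    }
}

print(extract_outlinks(sample_record))
```

Resolving relative links against the page URL before filtering is what lets internal and external links be told apart later by comparing hostnames.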

HTTP Archive

Tech stack

httparchive.org ↗

A Google-sponsored project that has tracked how the web is built since 2010. It crawls roughly 10 million URLs monthly using real Chrome instances, recording every request, response header, and resource load. Critically, it runs Wappalyzer fingerprinting across each page — producing a structured technology detection dataset that would take years to replicate independently. Results are published as BigQuery public datasets and updated monthly. We use it to enrich technology stack detection with signal coverage far beyond what HTTP headers alone reveal.

Chrome UX Report (CrUX)

Performance

developer.chrome.com ↗

A dataset published by Google, drawn from real Chrome browser sessions with users who have opted into anonymous usage reporting. It covers the Core Web Vitals (LCP, CLS, INP) along with supporting metrics such as FCP, at both origin and URL level, broken down by device type and connection speed. Updated monthly. Unlike synthetic benchmarks, CrUX represents actual user experience across the population of Chrome users. We use it to cross-reference domain quality signals and surface performance context alongside link metrics.

Tranco Top 1M

Domain rank

tranco-list.eu ↗

A research-grade domain popularity ranking developed by academics at the imec-DistriNet research group at KU Leuven, designed specifically to address the known manipulability of single-source popularity lists. Tranco combines several independent data sources, including Cloudflare Radar (DNS resolver traffic), Majestic (referring subnets), Cisco Umbrella (DNS query volume), and Farsight DNSDB, and applies a rank aggregation algorithm to produce a stable, manipulation-resistant list. The list is updated daily. It is among the most methodologically rigorous public domain popularity datasets available and is widely used in academic security and measurement research.
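As an illustration of how rank aggregation can work, the sketch below combines several ranked lists with a Dowdall-style rule, where each domain earns 1/rank points from every list it appears in. We believe this is one of the scoring options described by the Tranco authors, but treat that as an assumption; the sketch also omits any averaging over time, and the function name and sample lists are hypothetical.

```python
from collections import defaultdict


def dowdall_aggregate(rankings: dict[str, list[str]]) -> list[str]:
    """Merge ranked lists: each list gives a domain 1/position points."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_domains in rankings.values():
        for position, domain in enumerate(ranked_domains, start=1):
            scores[domain] += 1.0 / position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up input lists for illustration only.
sample_rankings = {
    "list_a": ["a.com", "b.com", "c.com"],
    "list_b": ["b.com", "a.com", "d.com"],
    "list_c": ["b.com", "c.com", "a.com"],
}

print(dowdall_aggregate(sample_rankings))
```

Because a domain has to rank well in several independent lists to score highly, inflating its position in any single source has a limited effect on the combined ranking.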

Internet Archive (CDX API)

History

web.archive.org ↗

The Internet Archive has been crawling and archiving the public web since 1996 — over 800 billion pages saved. Its CDX (Capture Index) API exposes structured metadata about every crawl snapshot: URL, timestamp, HTTP status, MIME type, and content digest. We use it to surface first-seen dates for domains and URLs, and to cross-reference crawl frequency as a proxy for site activity. No other public dataset offers this depth of longitudinal web history.
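A first-seen date can be derived from the earliest capture the CDX API reports for a URL. The sketch below builds a query against the public CDX endpoint and parses its JSON output, in which the first row is a header and each later row is one capture. The function names and the sample rows are hypothetical, and the code does not perform the HTTP request itself.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_first_capture_query(domain: str) -> str:
    """Build a CDX query URL asking for the earliest capture of a domain."""
    params = {
        "url": domain,
        "output": "json",
        "fl": "timestamp,original,statuscode",  # fields to return
        "limit": "1",  # results are ordered oldest first, so 1 = earliest
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_first_seen(rows: list[list[str]]) -> Optional[datetime]:
    """Return the UTC time of the first capture, or None if there is none."""
    if len(rows) < 2:  # empty response, or header row only
        return None
    record = dict(zip(rows[0], rows[1]))
    # CDX timestamps are 14-digit strings: YYYYMMDDhhmmss, in UTC.
    parsed = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Hand-written sample response for illustration only.
sample_rows = [
    ["timestamp", "original", "statuscode"],
    ["20020120142510", "http://example.com:80/", "200"],
]

print(build_first_capture_query("example.com"))
print(parse_first_seen(sample_rows))
```

Returning `None` when no capture exists keeps "never archived" distinct from any real date, rather than substituting a default.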

Rapid7 Project Sonar

DNS

opendata.rapid7.com ↗

Rapid7 is a publicly traded cybersecurity company (NASDAQ: RPD). Project Sonar is their open data initiative, publishing weekly forward and reverse DNS snapshots covering hundreds of millions of hostnames globally — A, AAAA, MX, CNAME, TXT, and PTR records. It is among the most comprehensive publicly available DNS datasets, used extensively in academic internet measurement research. We use it to enrich hosting provider attribution, nameserver relationships, and subdomain structure.

Certificate Transparency Logs

Subdomains

certificate.transparency.dev ↗

A global infrastructure mandated by major browser vendors (Google, Apple, Mozilla), which require every publicly trusted SSL/TLS certificate to be logged in append-only public logs before browsers will trust it. This creates a comprehensive, tamper-evident record of the publicly trusted certificates issued since logging became mandatory, including the domain names and subdomains they cover. Aggregated via crt.sh, which indexes logged certificates from all major certificate authorities, including DigiCert, Let's Encrypt, Sectigo, and GlobalSign. We use CT logs to surface subdomain history and organizational data associated with domains.
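The step from certificate records to subdomains can be sketched as follows. This assumes entries shaped like the JSON output of crt.sh, where each entry carries a `name_value` field holding one or more newline-separated names; that field name, the function name, and the sample entries are assumptions for illustration and not a description of our production code.

```python
def subdomains_from_ct_entries(entries: list[dict], apex: str) -> list[str]:
    """Collect unique hostnames under `apex` from crt.sh-style entries."""
    apex = apex.lower().rstrip(".")
    found = set()
    for entry in entries:
        # One certificate can list several names, separated by newlines.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            # Treat a wildcard name as evidence of its parent hostname.
            if name.startswith("*."):
                name = name[2:]
            # Keep the apex itself and true subdomains only.
            if name == apex or name.endswith("." + apex):
                found.add(name)
    return sorted(found)


# Hand-written sample entries for illustration only.
sample_entries = [
    {"name_value": "example.com\nwww.example.com"},
    {"name_value": "*.api.example.com"},
    {"name_value": "WWW.example.com\nnotexample.com"},
]

print(subdomains_from_ct_entries(sample_entries, "example.com"))
```

Matching on `"." + apex` rather than a bare suffix avoids counting look-alike domains such as `notexample.com` as subdomains.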

RDAP / WHOIS Registries

Domain age

rdap.org ↗

RDAP (Registration Data Access Protocol) is the IETF-standardized replacement for WHOIS, now operated by all major domain registries and registrars. Unlike WHOIS — which returns free-form text requiring brittle parsing — RDAP returns structured JSON directly from authoritative registry operators (Verisign for .com/.net, ARIN/RIPE/APNIC for IP space, etc.). We use RDAP to compute domain age, identify registrars, and surface registration and expiry dates. Data comes directly from the authoritative registry, not from a third-party aggregator.
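Because RDAP responses are structured JSON, domain age reduces to reading one event from the `events` array. The sketch below looks for the event whose `eventAction` is `registration`, as defined in the RDAP JSON response format, and computes the age in whole days. The function names and the sample response are hypothetical, and the network lookup is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def _parse_rdap_date(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 1995-08-14T04:00:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def registration_date(rdap: dict) -> Optional[datetime]:
    """Return the registration event date from an RDAP domain response."""
    for event in rdap.get("events", []):
        if event.get("eventAction") == "registration":
            return _parse_rdap_date(event["eventDate"])
    return None


def domain_age_days(rdap: dict, now: datetime) -> Optional[int]:
    """Whole days between registration and `now`, or None if unknown."""
    registered = registration_date(rdap)
    if registered is None:
        return None
    return (now - registered).days


# Hand-written sample response fragment for illustration only.
sample_rdap = {
    "objectClassName": "domain",
    "ldhName": "EXAMPLE.COM",
    "events": [
        {"eventAction": "registration", "eventDate": "1995-08-14T04:00:00Z"},
        {"eventAction": "expiration", "eventDate": "2026-08-13T04:00:00Z"},
    ],
}

# Pick a reference time exactly 100 days after registration.
reference_time = registration_date(sample_rdap) + timedelta(days=100)
print(domain_age_days(sample_rdap, reference_time))
```

When a registry omits the registration event, the function returns `None` rather than guessing an age.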

Freshness and cadence

The core link graph is updated quarterly, aligned with Common Crawl release cadence. Enrichment signals (domain age, popularity rank, technology stack, DNS data) are refreshed on the same cycle. Quarterly freshness covers the overwhelming majority of use cases: link building, lead enrichment, competitive analysis, AI agent workflows, and programmatic SEO. Most developers building on an API don't need daily link change monitoring. If you do, that's a different product category, and Ahrefs is the right tool for it.

What we don't do

We do not scrape websites on demand. We do not purchase data from third-party data brokers. We do not interpolate or estimate metrics where primary data is unavailable — if a domain has no backlinks in the dataset, the API returns an empty result, not a guess. Every value returned by the API traces back to one of the primary sources listed above.

See it in action

100 free requests. No credit card required.
