Page Inspect

https://commoncrawl.org/

Page Content

Title: Common Crawl - Open Repository of Web Crawl Data

Description: We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.

HTML Size: 25 KB

Markdown Size: 3 KB

Fetched At: November 18, 2025

Page Structure

h1: Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

h2: Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.

h2: Over 300 billion pages spanning 15 years.

h2: Free and open corpus since 2007.

h2: Cited in over 10,000 research papers.

h2: 3–5 billion new pages added each month.

h3: Featured Papers:

h5: Lukas Kriesch, Sebastian Losacker

h3: A geolocated dataset of German news articles

h5: Mostafa Ansar, Anna Sperotto, Ralph Holz

h3: Web Crawl Refusals: Insights From Common Crawl

h5: Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau

h3: Banned Books: Analysis of Censorship on Amazon.com

h5: Xian Gong, Paul X. McCarthy, Marian-Andrei Rizoiu, Paolo Boldi

h3: Harmony in the Australian Domain Space

h5: Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal

h3: Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains

h5: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

h3: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

h5: Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

h3: esCorpius: A Massive Spanish Crawling Corpus

h5: Marius Løvold Jørgensen, UiT Norges Arktiske Universitet

h3: BacklinkDB: A Purpose-Built Backlink Database Management System

h3: Latest Blog Post:

h2: Common Crawl Celebrates World Digital Preservation Day

h2: The Data

h3: Overview

h3: Web Graphs

h3: Latest Crawl

h3: Crawl Stats

h3: Graph Stats

h3: Errata

h2: Resources

h3: Get Started

h3: AI Agent

h3: Blog

h3: Examples

h3: Use Cases

h3: CCBot

h3: Infra Status

h3: Opt-out Registry

h3: FAQ

h2: Community

h3: Research Papers

h3: Mailing List Archive

h3: Hugging Face

h3: Discord

h3: Collaborators

h2: About

h3: Team

Markdown Content

Common Crawl - Open Repository of Web Crawl Data

- The Data: Overview, Web Graphs, Latest Crawl, Crawl Stats, Graph Stats, Errata
- Resources: Get Started, AI Agent, Blog, Examples, Use Cases, CCBot, Infra Status, Opt-out Registry, FAQ
- Community: Research Papers, Mailing List Archive, Hugging Face, Discord, Collaborators
- About: Team, Jobs, Mission, Impact, Privacy Policy, Terms of Use
- Search: AI Agent
- Contact Us

# Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
## Common Crawl is a 501(c)(3) non-profit founded in 2007.

We make wholesale extraction, transformation and analysis of open web data accessible to researchers.

## Over 300 billion pages spanning 15 years.
## Free and open corpus since 2007.
## Cited in over 10,000 research papers.
## 3–5 billion new pages added each month.

### Featured Papers:

Geolocating and embedding 50M German news articles for semantic analysis

##### Lukas Kriesch, Sebastian Losacker
### A geolocated dataset of German news articles

A study on web crawlers facing inconsistent and poorly-signalled blocking

##### Mostafa Ansar, Anna Sperotto, Ralph Holz
### Web Crawl Refusals: Insights From Common Crawl

Research on Free Expression Online

##### Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau
### Banned Books: Analysis of Censorship on Amazon.com

Analyzing the Australian Web with Web Graphs: Harmonic Centrality at the Domain Level

##### Xian Gong, Paul X. McCarthy, Marian-Andrei Rizoiu, Paolo Boldi
### Harmony in the Australian Domain Space

The Dangers of Hijacked Hyperlinks

##### Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal
### Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains

Enhancing Computational Analysis

##### Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo
### DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Computation and Language

##### Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas
### esCorpius: A Massive Spanish Crawling Corpus

The Web as a Graph (Master's Thesis)

##### Marius Løvold Jørgensen, UiT Norges Arktiske Universitet
### BacklinkDB: A Purpose-Built Backlink Database Management System

More on Google Scholar | Curated BibTeX Dataset

### Latest Blog Post:

News

## Common Crawl Celebrates World Digital Preservation Day

Common Crawl celebrates World Digital Preservation Day on Nov. 6, an event that invites the community to unite in answering a powerful question: Why Preserve?

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

© 2025 Common Crawl