Page Inspect
Internal Links: 24
External Links: 18
Images: 7
Headings: 55
Page Content
Title: Common Crawl - Open Repository of Web Crawl Data
Description: We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
HTML Size: 25 KB
Markdown Size: 3 KB
Fetched At: November 18, 2025
Page Structure
h1 Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
h2 Common Crawl is a 501(c)(3) non–profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
h2 Over 300 billion pages spanning 15 years.
h2 Free and open corpus since 2007.
h2 Cited in over 10,000 research papers.
h2 3–5 billion new pages added each month.
h3 Featured Papers:
h5 Lukas Kriesch, Sebastian Losacker
h3 A geolocated dataset of German news articles
h5 Mostafa Ansar, Anna Sperotto, Ralph Holz
h3 Web Crawl Refusals: Insights From Common Crawl
h5 Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau
h3 Banned Books: Analysis of Censorship on Amazon.com
h5 Xian Gong, Paul X. McCarthy, Marian-Andrei Rizoiu, Paolo Boldi
h3 Harmony in the Australian Domain Space
h5 Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal
h3 Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains
h5 Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo
h3 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
h5 Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas
h3 esCorpius: A Massive Spanish Crawling Corpus
h5 Marius Løvold Jørgensen, UiT Norges Arktiske Universitet
h3 BacklinkDB: A Purpose-Built Backlink Database Management System
h3 Latest Blog Post:
h2 Common Crawl Celebrates World Digital Preservation Day
h2 The Data
h3 Overview
h3 Web Graphs
h3 Latest Crawl
h3 Crawl Stats
h3 Graph Stats
h3 Errata
h2 Resources
h3 Get Started
h3 AI Agent
h3 Blog
h3 Examples
h3 Use Cases
h3 CCBot
h3 Infra Status
h3 Opt-out Registry
h3 FAQ
h2 Community
h3 Research Papers
h3 Mailing List Archive
h3 Hugging Face
h3 Discord
h3 Collaborators
h2 About
h3 Team
Markdown Content
Common Crawl - Open Repository of Web Crawl Data

- The Data: Overview, Web Graphs, Latest Crawl, Crawl Stats, Graph Stats, Errata
- Resources: Get Started, AI Agent, Blog, Examples, Use Cases, CCBot, Infra Status, Opt-out Registry, FAQ
- Community: Research Papers, Mailing List Archive, Hugging Face, Discord, Collaborators
- About: Team, Jobs, Mission, Impact, Privacy Policy, Terms of Use
- Search AI Agent
- Contact Us

# Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

## Common Crawl is a 501(c)(3) non–profit founded in 2007.

We make wholesale extraction, transformation and analysis of open web data accessible to researchers.

Overview

## Over 300 billion pages spanning 15 years.

## Free and open corpus since 2007.

## Cited in over 10,000 research papers.

## 3–5 billion new pages added each month.

### Featured Papers:

Geolocating and embedding 50M German news articles for semantic analysis

##### Lukas Kriesch, Sebastian Losacker

### A geolocated dataset of German news articles

A study on web crawlers facing inconsistent and poorly-signalled blocking

##### Mostafa Ansar, Anna Sperotto, Ralph Holz

### Web Crawl Refusals: Insights From Common Crawl

Research on Free Expression Online

##### Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau

### Banned Books: Analysis of Censorship on Amazon.com

Analyzing the Australian Web with Web Graphs: Harmonic Centrality at the Domain Level

##### Xian Gong, Paul X. McCarthy, Marian-Andrei Rizoiu, Paolo Boldi

### Harmony in the Australian Domain Space

The Dangers of Hijacked Hyperlinks

##### Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal

### Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains

Enhancing Computational Analysis

##### Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

### DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Computation and Language

##### Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

### esCorpius: A Massive Spanish Crawling Corpus

The Web as a Graph (Master's Thesis)

##### Marius Løvold Jørgensen, UiT Norges Arktiske Universitet

### BacklinkDB: A Purpose-Built Backlink Database Management System

More on Google Scholar

Curated BibTeX Dataset

### Latest Blog Post:

News

## Common Crawl Celebrates World Digital Preservation Day

Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

## The Data
### Overview
### Web Graphs
### Latest Crawl
### Crawl Stats
### Graph Stats
### Errata

## Resources
### Get Started
### AI Agent
### Blog
### Examples
### Use Cases
### CCBot
### Infra Status
### Opt-out Registry
### FAQ

## Community
### Research Papers
### Mailing List Archive
### Hugging Face
### Discord
### Collaborators

## About
### Team
### Jobs
### Mission
### Impact
### Privacy Policy
### Terms of Use

© 2025 Common Crawl