
Async Web Scraping at Scale: Curating NeurIPS Papers

NeurIPS publishes thousands of accepted papers across decades of conferences. Scraping them one at a time with requests.get() would take hours. The fix is straightforward: replace blocking HTTP calls with async I/O, and run one coroutine per year concurrently.

This is the core of a scraper I built that collects metadata, abstracts, and PDFs for every NeurIPS accepted paper from 1987 through the most recent conference. The code is at github.com/bhuvan454/NeurIPS-Papers-Crawler.

The bottleneck

Standard requests.get() is synchronous. Each HTTP call blocks until the server responds. For a few pages this is fine; for 30+ years of conference listings, each with hundreds of papers, it is the entire runtime.
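For contrast, a blocking version looks like this (a sketch; urls stands in for the per-year listing URLs). Its runtime is the sum of every response time:

import requests

pages = []
for url in urls:                    # one conference year at a time
    response = requests.get(url)    # blocks until this response arrives
    pages.append(response.text)
# total runtime is roughly the sum of all response times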

aiohttp solves this by wrapping HTTP as coroutines. Combined with asyncio, you can fire off requests for all years simultaneously and process responses as they arrive.

Fetching pages asynchronously

import aiohttp

# default_headers: request headers used for every call (defined elsewhere
# in the scraper, e.g. a User-Agent string).
async def fetch(session: aiohttp.ClientSession, url: str) -> str | None:
    try:
        async with session.get(url, headers=default_headers) as response:
            response.raise_for_status()  # raises ClientResponseError on 4xx/5xx, caught below
            return await response.text()
    except aiohttp.ClientError as e:
        print(f"Error fetching {url}: {e}")
        return None

The key difference from requests.get(): session.get() returns an async context manager, and the awaits inside it hand control back to the event loop while the coroutine waits on the server. Other coroutines run in the meantime.
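A quick way to see this is to fire two fetches through the same session with asyncio.gather and time them: the elapsed time is roughly the slower of the two responses, not their sum. A small sketch, with placeholder URLs:

import asyncio
import time

async def demo(url_a: str, url_b: str) -> None:
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        # While one request waits on the network, the other makes progress.
        page_a, page_b = await asyncio.gather(
            fetch(session, url_a),
            fetch(session, url_b),
        )
        print(f"fetched both pages in {time.perf_counter() - start:.2f}s")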

Extracting paper paths

Once a conference page has loaded, BeautifulSoup parses the HTML to pull out each paper's abstract URL, and the metadata and PDF paths are derived from that URL.

from bs4 import BeautifulSoup

# get_conference_url(year) and base_url are defined elsewhere in the scraper:
# base_url is the proceedings host, get_conference_url builds a year's listing URL.
async def get_paper_paths(
    session: aiohttp.ClientSession,
    year: int,
) -> tuple[list, list, list, list]:
    url = get_conference_url(year)
    page_content = await fetch(session, url)
    if page_content is None:
        return [], [], [], []

    soup = BeautifulSoup(page_content, "html.parser")
    paper_ids, abstract_paths, metadata_paths, pdf_paths = [], [], [], []

    # Each accepted paper is an <li> whose link points at its abstract page.
    for li in soup.find("div", class_="container-fluid").find_all("li"):
        paper_temp_url = li.a.get("href")
        paper_id = paper_temp_url.split("/")[-1].split("-")[0]

        paper_ids.append(paper_id)
        abstract_paths.append(f"{base_url}{paper_temp_url}")

        # Derive the metadata and PDF URLs from the abstract URL by string replacement.
        paper_base_url = f"{base_url}{paper_temp_url.rsplit('.', 1)[0]}"
        metadata_paths.append(
            f"{paper_base_url.replace('Abstract', 'Metadata').replace('hash', 'file')}.json"
        )
        pdf_paths.append(
            f"{paper_base_url.replace('Abstract', 'Paper').replace('hash', 'file')}.pdf"
        )

    return paper_ids, abstract_paths, metadata_paths, pdf_paths

The URL pattern for NeurIPS is consistent enough that string replacement produces valid metadata and PDF URLs from the abstract URL.
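To make the derivation concrete, here it is applied to an illustrative abstract path (the hash and the exact prefix are made up; the real values come from the listing page):

abstract_path = "/paper/2021/hash/abc123-Abstract.html"   # illustrative only
base = abstract_path.rsplit(".", 1)[0]                     # drop the .html extension
metadata_url = base.replace("Abstract", "Metadata").replace("hash", "file") + ".json"
pdf_url = base.replace("Abstract", "Paper").replace("hash", "file") + ".pdf"
# metadata_url -> /paper/2021/file/abc123-Metadata.json
# pdf_url      -> /paper/2021/file/abc123-Paper.pdf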

Running it concurrently

With get_paper_paths as a coroutine, collecting all years is a single asyncio.gather call:

import asyncio
 
async def scrape_all(start_year: int, end_year: int) -> dict:
    async with aiohttp.ClientSession() as session:
        tasks = [
            get_paper_paths(session, year)
            for year in range(start_year, end_year + 1)
        ]
        results = await asyncio.gather(*tasks)
    return dict(zip(range(start_year, end_year + 1), results))

All year pages are fetched concurrently. The total runtime is roughly that of the slowest single request, not the sum of all requests.
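Driving it from a script is a single asyncio.run call, continuing from the block above (the end year here is just an example):

if __name__ == "__main__":
    # One event loop run collects the listings for every requested year.
    paths_by_year = asyncio.run(scrape_all(1987, 2023))
    print(f"collected listings for {len(paths_by_year)} years")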

Output structure

Each paper gets its own directory under its conference year:

data/
└── 2021/
    └── {paper_id}/
        ├── {paper_id}_abstract.json
        ├── {paper_id}_metadata.json
        └── {paper_id}.pdf

This layout makes it easy to scan a specific year, check which papers are already downloaded, or feed the directory into a downstream pipeline (a RAG system, for instance).
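For example, a few lines of pathlib are enough to list the papers in a year that are still missing their PDF (a sketch; the paths assume the layout above):

from pathlib import Path

year_dir = Path("data") / "2021"

# Paper directories whose PDF has not been downloaded yet.
missing_pdfs = [
    d.name
    for d in year_dir.iterdir()
    if d.is_dir() and not (d / f"{d.name}.pdf").exists()
]
print(f"{len(missing_pdfs)} papers from 2021 still need PDFs")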

CLI

python crawler.py \
  --start_year 2020 \
  --end_year 2023 \
  --output_dir ./data/ \
  --type all

--type accepts abstract, metadata, pdf, or all. Useful when you only want text and not the full PDFs.
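The exact flag handling lives in the repository; a minimal argparse wiring consistent with the invocation above could look like this (a sketch, not the repo's code):

import argparse
import asyncio

def main() -> None:
    parser = argparse.ArgumentParser(description="NeurIPS paper crawler")
    parser.add_argument("--start_year", type=int, default=1987)
    parser.add_argument("--end_year", type=int, required=True)
    parser.add_argument("--output_dir", default="./data/")
    parser.add_argument(
        "--type", choices=["abstract", "metadata", "pdf", "all"], default="all"
    )
    args = parser.parse_args()
    # scrape_all collects the URLs; writing files into output_dir is a separate step.
    asyncio.run(scrape_all(args.start_year, args.end_year))

if __name__ == "__main__":
    main()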

What makes it fast

The speed gain is not parallelism in the CPU sense; asyncio is single-threaded. The gain comes from I/O concurrency: while one request waits on the network, other coroutines run. For a workload that is almost entirely network-bound (HTTP requests, file downloads), this behaves much like running hundreds of threads, without the per-thread overhead.

For a workload with CPU-bound processing between requests, you would want ProcessPoolExecutor or true parallelism instead.
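If heavy parsing or other CPU-bound work did creep in, one option is to hand it off with run_in_executor so the event loop stays free for I/O. A sketch under that assumption:

import asyncio
from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def parse_page(html: str) -> int:
    # CPU-bound work runs in a worker process, not in the event loop.
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("li"))

async def parse_in_pool(html_pages: list[str]) -> list[int]:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        futures = [loop.run_in_executor(pool, parse_page, html) for html in html_pages]
        return await asyncio.gather(*futures)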