Async Web Scraping at Scale: Curating NeurIPS Papers
NeurIPS publishes thousands of accepted papers across decades of conferences.
Scraping them one at a time with requests.get() would take hours. The fix is
straightforward: replace blocking HTTP calls with async I/O, and run one
coroutine per year concurrently.
This is the core of a scraper I built that collects metadata, abstracts, and PDFs for every NeurIPS accepted paper from 1987 through the most recent conference. The code is at github.com/bhuvan454/NeurIPS-Papers-Crawler.
The bottleneck
Standard requests.get() is synchronous. Each HTTP call blocks until the
server responds. For a few pages this is fine; for 30+ years of conference
listings, each with hundreds of papers, that waiting is essentially the entire runtime.
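For a sense of what that looks like, here is a minimal synchronous baseline (the listing URL shape is an assumption, not taken from the crawler); each call must finish before the next one starts:

import requests

# Hypothetical synchronous baseline: each get() blocks until the server
# responds, so 30+ years of listing pages are fetched strictly one at a time.
def scrape_all_sync(start_year: int, end_year: int) -> dict:
    pages = {}
    for year in range(start_year, end_year + 1):
        # Assumed shape of a year's listing URL on papers.nips.cc.
        url = f"https://papers.nips.cc/paper_files/paper/{year}"
        pages[year] = requests.get(url, timeout=30).text
    return pages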
aiohttp solves this by wrapping HTTP as coroutines. Combined with asyncio,
you can fire off requests for all years simultaneously and process responses as
they arrive.
Fetching pages asynchronously
import aiohttp

# A module-level header dict; the exact value is up to you, but a User-Agent
# that identifies the client is a polite default.
default_headers = {"User-Agent": "neurips-papers-crawler"}

async def fetch(session: aiohttp.ClientSession, url: str) -> str | None:
    try:
        async with session.get(url, headers=default_headers) as response:
            response.raise_for_status()
            return await response.text()
    except aiohttp.ClientError as e:
        print(f"Error fetching {url}: {e}")
        return None

The key difference from requests.get(): session.get() returns a context
manager that yields control while waiting for the server. Other coroutines run
in the meantime.
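To exercise fetch on its own, a throwaway session can wrap a single call; the listing URL below is only an example:

import asyncio

async def fetch_one(url: str) -> str | None:
    # In the real crawler one session is shared across all requests; a
    # throwaway session is enough for a quick check.
    async with aiohttp.ClientSession() as session:
        return await fetch(session, url)

html = asyncio.run(fetch_one("https://papers.nips.cc/paper_files/paper/2021"))
print(html is not None)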
Extracting paper paths
Once a conference page loads, BeautifulSoup parses the HTML to extract every paper's abstract URL, then derives the metadata and PDF paths from it.
from bs4 import BeautifulSoup

# base_url and get_conference_url(year) are defined elsewhere in the crawler;
# they point at the proceedings site and a given year's listing page.
async def get_paper_paths(
    session: aiohttp.ClientSession,
    year: int,
) -> tuple[list, list, list, list]:
    url = get_conference_url(year)
    page_content = await fetch(session, url)
    if page_content is None:
        return [], [], [], []

    soup = BeautifulSoup(page_content, "html.parser")
    paper_ids, abstract_paths, metadata_paths, pdf_paths = [], [], [], []

    for li in soup.find("div", class_="container-fluid").find_all("li"):
        paper_temp_url = li.a.get("href")
        paper_id = paper_temp_url.split("/")[-1].split("-")[0]
        paper_ids.append(paper_id)
        abstract_paths.append(f"{base_url}{paper_temp_url}")
        paper_base_url = f"{base_url}{paper_temp_url.rsplit('.', 1)[0]}"
        metadata_paths.append(
            f"{paper_base_url.replace('Abstract', 'Metadata').replace('hash', 'file')}.json"
        )
        pdf_paths.append(
            f"{paper_base_url.replace('Abstract', 'Paper').replace('hash', 'file')}.pdf"
        )
    return paper_ids, abstract_paths, metadata_paths, pdf_paths

The URL pattern for NeurIPS is consistent enough that string replacement produces valid metadata and PDF URLs from the abstract URL.
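To make that concrete, here is the transformation applied to a stand-in href ("0123abcd" is a placeholder, not a real paper hash):

# Shaped like the hrefs on a year's listing page.
abstract_path = "/paper_files/paper/2021/hash/0123abcd-Abstract.html"

paper_base = abstract_path.rsplit(".", 1)[0]  # drop the .html extension
metadata_path = paper_base.replace("Abstract", "Metadata").replace("hash", "file") + ".json"
pdf_path = paper_base.replace("Abstract", "Paper").replace("hash", "file") + ".pdf"

print(metadata_path)  # /paper_files/paper/2021/file/0123abcd-Metadata.json
print(pdf_path)       # /paper_files/paper/2021/file/0123abcd-Paper.pdf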
Running it concurrently
With get_paper_paths as a coroutine, collecting all years is a single
asyncio.gather call:
import asyncio

async def scrape_all(start_year: int, end_year: int) -> dict:
    async with aiohttp.ClientSession() as session:
        tasks = [
            get_paper_paths(session, year)
            for year in range(start_year, end_year + 1)
        ]
        results = await asyncio.gather(*tasks)
        return dict(zip(range(start_year, end_year + 1), results))

All year pages are fetched in parallel. The total runtime is roughly the slowest single request, not the sum of all requests.
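The entry point is then one asyncio.run call; the year range here is only an example:

if __name__ == "__main__":
    paths_by_year = asyncio.run(scrape_all(1987, 2023))
    for year, (ids, abstracts, metadata, pdfs) in paths_by_year.items():
        print(f"{year}: {len(ids)} papers, {len(pdfs)} PDF links")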
Output structure
Each paper gets its own directory under its conference year:
data/
└── 2021/
└── {paper_id}/
├── {paper_id}_abstract.json
├── {paper_id}_metadata.json
└── {paper_id}.pdf
This layout makes it easy to scan a specific year, check which papers are already downloaded, or feed the directory into a downstream pipeline (a RAG system, for instance).
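A downstream consumer only has to walk that tree. A small sketch with pathlib, assuming the layout above:

from pathlib import Path

def papers_for_year(data_dir: str, year: int) -> list[dict]:
    papers = []
    for paper_dir in sorted(Path(data_dir, str(year)).iterdir()):
        if not paper_dir.is_dir():
            continue
        pid = paper_dir.name
        papers.append({
            "id": pid,
            "abstract": paper_dir / f"{pid}_abstract.json",
            "metadata": paper_dir / f"{pid}_metadata.json",
            # The PDF may be absent if only text was downloaded.
            "pdf": paper_dir / f"{pid}.pdf",
        })
    return papers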
CLI
python crawler.py \
    --start_year 2020 \
    --end_year 2023 \
    --output_dir ./data/ \
    --type all

--type accepts abstract, metadata, pdf, or all. Useful when you only
want text and not the full PDFs.
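The flags map naturally onto argparse; this is a sketch of how such a CLI could be wired, not the repository's exact code:

import argparse

parser = argparse.ArgumentParser(description="NeurIPS papers crawler")
parser.add_argument("--start_year", type=int, default=1987)
parser.add_argument("--end_year", type=int, default=2023)
parser.add_argument("--output_dir", default="./data/")
parser.add_argument("--type", choices=["abstract", "metadata", "pdf", "all"], default="all")
args = parser.parse_args()

# The parsed arguments then drive scrape_all and the per-paper downloads.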
What makes it fast
The speed gain is not parallelism in the CPU sense — asyncio is single-threaded. The gain comes from I/O concurrency: while one request is waiting on the network, other coroutines are running. For a workload that is almost entirely network-bound (HTTP requests, file downloads), this is equivalent in practice to running hundreds of threads with none of the overhead.
For a workload with CPU-bound processing between requests, you would want
ProcessPoolExecutor or true parallelism instead.
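One common pattern for that case is to keep the downloads async and push the CPU-bound step into a process pool with run_in_executor; parse_pdf below is a hypothetical stand-in for whatever the heavy work is:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def parse_pdf(pdf_bytes: bytes) -> str:
    # Hypothetical CPU-bound step, e.g. PDF text extraction.
    return ""

async def process_paper(pdf_bytes: bytes, pool: ProcessPoolExecutor) -> str:
    loop = asyncio.get_running_loop()
    # Offload the heavy call to a worker process so the event loop stays
    # free to keep issuing network requests.
    return await loop.run_in_executor(pool, parse_pdf, pdf_bytes)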