
uiuc-kang-lab/scrape_cves


scrape_cves

Scrape CVE records from the NVD API and flatten the relevant CVSS v3.1 metrics into a CSV, along with the most-starred GitHub repository referenced by each CVE.

Setup

Requires Python 3.11+. Dependencies are managed with uv:

uv sync

Create a .env file with API keys:

NVD_API_KEY=your_nvd_api_key
GITHUB_API_KEY=your_github_personal_access_token

  • NVD_API_KEY: request one at https://nvd.nist.gov/developers/request-an-api-key. Note: the key is currently read from the environment but not sent with NVD requests (the key-based rate limit did not work in testing), so the scraper runs against the unauthenticated rate limit.
  • GITHUB_API_KEY: a GitHub personal access token (public repo scope is sufficient), used to query star counts.
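For readers who want to see how the two keys could reach the scripts, here is a minimal sketch of a .env loader using only the standard library. The function name load_env_file is hypothetical, and the project may well rely on a dotenv package instead; the sketch assumes plain KEY=VALUE lines and lets variables that are already set in the environment take precedence.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a file into os.environ (existing values win)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in loaded.items():
        # setdefault keeps a value that was exported in the shell.
        os.environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() once at startup would make os.environ["GITHUB_API_KEY"] available to the rest of the script.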

Usage

1. Scrape raw CVE JSON from NVD

scrape_cves.py queries the NVD API one day at a time for the given date range and severity, and writes one JSON file per day to --output_dir. Existing files are skipped, so the scraper is resumable.

uv run scrape_cves.py \
    --start_date 2025-01-01 \
    --end_date 2025-06-30 \
    --severity CRITICAL \
    --output_dir data

Arguments:

  • --start_date (required): YYYY-MM-DD
  • --end_date: YYYY-MM-DD; defaults to today
  • --severity: CRITICAL, HIGH, MEDIUM, or LOW (default CRITICAL)
  • --output_dir: defaults to the current directory
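The argument list above maps naturally onto argparse. The following is a sketch of how such a command line could be declared, not the script's actual code; the helper names parse_day, build_parser, and parse_args are hypothetical. Note that date.fromisoformat on Python 3.11+ also accepts a few other ISO 8601 spellings besides YYYY-MM-DD.

```python
import argparse
from datetime import date


def parse_day(text):
    """Parse a YYYY-MM-DD string into a date, with a readable argparse error."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Scrape CVE JSON from NVD.")
    parser.add_argument("--start_date", type=parse_day, required=True)
    # The default is filled in after parsing so "today" is evaluated at run time.
    parser.add_argument("--end_date", type=parse_day, default=None)
    parser.add_argument("--severity", default="CRITICAL",
                        choices=["CRITICAL", "HIGH", "MEDIUM", "LOW"])
    parser.add_argument("--output_dir", default=".")
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    if args.end_date is None:
        args.end_date = date.today()
    return args
```

Restricting --severity with choices makes a typo such as CRTICAL fail before any request is sent.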

Output files are named cves-{YYYY}-{MM}-{DD}-{SEVERITY}.json.
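The day-by-day, resumable behaviour can be sketched as a generator that plans one request per missing file. This is an illustration under stated assumptions rather than the script's implementation: the function name daily_requests is hypothetical, the end date is treated as inclusive, and the scraper is assumed to filter on publication date using the NVD API 2.0 parameters pubStartDate, pubEndDate, and cvssV3Severity.

```python
from datetime import date, timedelta
from pathlib import Path

# Base URL of the NVD CVE API 2.0.
NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def daily_requests(start, end, severity, output_dir):
    """Yield (output_path, query_params) for each day that still needs fetching."""
    day = start
    while day <= end:  # assumption: the end date is inclusive
        name = f"cves-{day:%Y}-{day:%m}-{day:%d}-{severity}.json"
        path = Path(output_dir) / name
        if not path.exists():  # days already on disk are skipped, so reruns resume
            params = {
                # Assumption: one calendar day of publication dates per request.
                "pubStartDate": f"{day.isoformat()}T00:00:00.000",
                "pubEndDate": f"{day.isoformat()}T23:59:59.999",
                "cvssV3Severity": severity,
            }
            yield path, params
        day += timedelta(days=1)
```

Each yielded pair would then be sent as a GET request to NVD_URL and the response body written to the path, with a pause between requests to stay under the unauthenticated rate limit.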

2. Parse CVE JSON into a CSV

parse_cves.py walks the daily JSON files in --input_dir, extracts CVSS v3.1 metrics, and for each CVE resolves the GitHub reference with the highest star count. Results are appended to --output_file; CVEs already present in the file are skipped.

uv run parse_cves.py \
    --start_date 2025-01-01 \
    --end_date 2025-06-30 \
    --severity CRITICAL \
    --input_dir data \
    --output_file cves_github.csv
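Appending to the output file while skipping CVEs that are already present could look roughly like the sketch below. The helper names existing_cve_ids and append_new_rows are hypothetical, and the sketch assumes the CSV has a header row containing the cveid column.

```python
import csv
from pathlib import Path


def existing_cve_ids(csv_path):
    """Return the set of cveid values already stored in the CSV, if it exists."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="") as handle:
        return {row["cveid"] for row in csv.DictReader(handle)}


def append_new_rows(csv_path, rows, fieldnames):
    """Append rows whose cveid is not yet in the file; return how many were written."""
    seen = existing_cve_ids(csv_path)
    path = Path(csv_path)
    # Write the header only when starting a new (or empty) file.
    write_header = not path.exists() or path.stat().st_size == 0
    written = 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        for row in rows:
            if row["cveid"] in seen:
                continue
            writer.writerow(row)
            seen.add(row["cveid"])
            written += 1
    return written
```

Checking the existing IDs before resolving star counts, rather than after, would also avoid repeating GitHub API calls on a rerun.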

CSV columns:

cveid, publishedDate, baseScore, exploitabilityScore, attackVector, attackComplexity, privilegesRequired, userInteraction, scope, impactScore, confidentialityImpact, integrityImpact, availabilityImpact, github_url, n_stars, references
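To make the column list concrete, here is a sketch of flattening one entry of the NVD API 2.0 "vulnerabilities" array into a row. The function name flatten_cve is hypothetical. Two choices are assumptions, not facts about parse_cves.py: the first cvssMetricV31 entry is used when several are present, and references is returned as a list of URLs with serialization left open. The github_url and n_stars columns are left to the GitHub lookup step.

```python
# Fields that live inside metric["cvssData"] in the NVD 2.0 schema.
VECTOR_FIELDS = ["attackVector", "attackComplexity", "privilegesRequired",
                 "userInteraction", "scope"]
IMPACT_FIELDS = ["confidentialityImpact", "integrityImpact", "availabilityImpact"]


def flatten_cve(item):
    """Flatten one NVD 'vulnerabilities' entry; return None if it has no CVSS v3.1."""
    cve = item["cve"]
    metrics = cve.get("metrics", {}).get("cvssMetricV31")
    if not metrics:
        return None  # CVEs without cvssMetricV31 are skipped
    metric = metrics[0]  # assumption: take the first v3.1 entry
    data = metric["cvssData"]
    row = {
        "cveid": cve["id"],
        "publishedDate": cve.get("published"),
        "baseScore": data.get("baseScore"),
        # exploitabilityScore and impactScore sit beside cvssData, not inside it.
        "exploitabilityScore": metric.get("exploitabilityScore"),
    }
    for field in VECTOR_FIELDS:
        row[field] = data.get(field)
    row["impactScore"] = metric.get("impactScore")
    for field in IMPACT_FIELDS:
        row[field] = data.get(field)
    row["references"] = [ref["url"] for ref in cve.get("references", []) if "url" in ref]
    return row
```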

CVEs without cvssMetricV31 are skipped. n_stars is N/A when no GitHub reference is found or the GitHub API call fails.
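Picking the most-starred GitHub reference, with the N/A fallback described above, can be sketched as follows. The names github_repos, fetch_stars, and most_starred are hypothetical, and using N/A for github_url as well as n_stars is an assumption. fetch_stars calls the GitHub REST endpoint GET /repos/{owner}/{repo} and reads stargazers_count; it is passed in as a parameter so the selection logic can be exercised without network access.

```python
import json
import os
import re
import urllib.request

# Matches the owner and repository segments of a github.com URL.
GITHUB_REPO = re.compile(r"https?://github\.com/([^/\s]+)/([^/\s#?]+)")


def github_repos(urls):
    """Return unique (owner, repo) pairs found in the URLs, in first-seen order."""
    repos = []
    for url in urls:
        match = GITHUB_REPO.match(url)
        if match:
            pair = (match.group(1), match.group(2).removesuffix(".git"))
            if pair not in repos:
                repos.append(pair)
    return repos


def fetch_stars(owner, repo):
    """Query the GitHub REST API for a repository's star count."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_API_KEY']}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["stargazers_count"]


def most_starred(urls, fetch=fetch_stars):
    """Return (github_url, n_stars) for the best repo, or ("N/A", "N/A")."""
    best_url, best_stars = "N/A", "N/A"
    for owner, repo in github_repos(urls):
        try:
            stars = fetch(owner, repo)
        except Exception:
            # A failed lookup (404, rate limit, missing token) is treated as no result.
            continue
        if best_stars == "N/A" or stars > best_stars:
            best_url, best_stars = f"https://github.com/{owner}/{repo}", stars
    return best_url, best_stars
```

Links such as github.com/advisories/... also match the pattern but are not repositories; in this sketch their lookup simply fails and they are skipped.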
