ZZ

Dataset processing, tokenization, and model inference utilities for ML training pipelines.

Setup

pip install -r requirements.txt

Directory Structure

scripts/
  download/     # Dataset download scripts (FineWeb, Wikimedia, HF mirrors)
  extract/      # Data extraction, tokenization, and renaming
  analysis/     # Training duration and metric evaluation
  deepseek/     # LLM inference scripts (DeepSeek-V2-Lite)
logs/           # Training logs and outputs
datasets/       # Downloaded dataset storage

Usage

Download Datasets

# Download FineWeb dataset (simple)
python scripts/download/download_fineweb.py --limit 1000 --output output.txt

# Plan and download shards to hit a token budget
python scripts/download/plan_and_download_fineweb.py --tokens 10B --output /mnt/data/zz/datasets/fineweb

# Download ~100B tokens for GPT-3 ablation (resumable, progress.json)
python scripts/download/plan_and_download_fineweb_gpt3.py --output /mnt/data/zz/datasets/fineweb-edu

# Download and extract in one step (pyarrow iter_batches, memory-safe)
python scripts/download/download_and_extract.py

# Download via wget scripts (hf-mirror.com for China access)
bash scripts/download/wget_fineweb_mirror_1.sh
bash scripts/download/wget_fineweb_mirror_2_5.sh
bash scripts/download/wget_fineweb_mirror_11_20.sh

# Wikimedia dumps
bash scripts/download/wget_wikimedia_1.sh
bash scripts/download/wget_wikimedia_4.sh
bash scripts/download/wget_wikimedia_4_accum.sh
bash scripts/download/wget_wikimedia_5.sh

Extract and Tokenize

# Extract from parquet files
python scripts/extract/extract_parquet.py

# Extract FineWeb data
python scripts/extract/extract_fineweb.py

# Extract GPT-3 ablation shards -> single train_fineweb.txt
python scripts/extract/extract_fineweb_gpt3.py

# Tokenize GPT-3 ablation shards -> uint16 .npy shards (GPT-2 BPE)
python scripts/extract/tokenize_fineweb_gpt3.py

# Rename downloaded parquet files (strip ?download=true suffix)
python scripts/extract/rename_fineweb.py

# Extract Wikipedia dumps
python scripts/extract/extract_wiki.py
python scripts/extract/extract_wiki_corpus.py

Analysis

# Calculate training duration
python scripts/analysis/calculate_duration.py

# Evaluate training metrics
python scripts/analysis/evaluate.py --file logs/train_log_openweb.txt

Model Inference

# DeepSeek-V2-Lite-Chat (4-bit quantized, fits 12GB VRAM)
python scripts/deepseek/run_lite.py                       # interactive chat
python scripts/deepseek/run_lite.py -p "Your prompt"      # single prompt

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
datasets		datasets
finetune		finetune
logs		logs
scripts		scripts
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ZZ

Setup

Directory Structure

Usage

Download Datasets

Extract and Tokenize

Analysis

Model Inference

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ZZ

Setup

Directory Structure

Usage

Download Datasets

Extract and Tokenize

Analysis

Model Inference

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages