GPIC Open-Source Dataset Aims to Displace ImageNet-1K as Standard Training Corpus

GPIC, a 27.97-trillion-pixel image corpus licensed under MIT, aims to replace ImageNet-1K as the standard for training and evaluating modern visual generative models. It includes 100 million training, 200 thousand validation, and 1 million test examples, partitioned into 8,000 shards on Hugging Face.

The corpus targets four requirements that no existing dataset satisfies at once: permissive licensing, resistance to link rot, sufficient scale, and accessibility without custom crawling infrastructure. The team, drawn from Stanford Vision Lab, University of Michigan, Radical Numerics, and Salesforce Research, crawled permissively licensed images, captioned them with Qwen3-VL-4B, applied safety filtering and deduplication, and froze the output into 12.9TB of static shards. Images average 479 pixels high by 587 pixels wide, and the release includes two smaller subsets for rapid prototyping: GPIC-Lite (10M images) and GPIC-Nano (1M images).
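The headline figures can be turned into per-shard and per-image numbers with a few lines of arithmetic. The sketch below is a back-of-envelope estimate, not something taken from the paper. It assumes that TB means 10^12 bytes and that the 8,000 shards and 12.9TB cover the training, validation, and test splits combined; neither assumption is confirmed by the source.

```python
# Back-of-envelope sizing from the figures quoted in the article.
# Assumptions (not stated in the source): TB means 10**12 bytes, and the
# 8,000 shards and 12.9TB cover all three splits combined.

TOTAL_PIXELS = 27_970_000_000_000                 # 27.97 trillion pixels
TOTAL_BYTES = 12_900_000_000_000                  # 12.9TB
NUM_SHARDS = 8_000
NUM_IMAGES = 100_000_000 + 200_000 + 1_000_000    # train + validation + test

images_per_shard = NUM_IMAGES / NUM_SHARDS
bytes_per_shard = TOTAL_BYTES / NUM_SHARDS
bytes_per_image = TOTAL_BYTES / NUM_IMAGES        # includes captions and metadata
pixels_per_image = TOTAL_PIXELS / NUM_IMAGES

# Uncompressed 8-bit RGB needs 3 bytes per pixel, so this ratio is a rough
# measure of how much decoded data sits behind each stored byte.
raw_rgb_bytes = TOTAL_PIXELS * 3
expansion_ratio = raw_rgb_bytes / TOTAL_BYTES

print(f"images per shard : {images_per_shard:,.0f}")
print(f"shard size       : {bytes_per_shard / 1e9:.2f} GB")
print(f"stored per image : {bytes_per_image / 1e3:.0f} KB")
print(f"pixels per image : {pixels_per_image:,.0f}")
print(f"raw RGB / stored : {expansion_ratio:.1f}x")
```

Under these assumptions each shard holds roughly 12,650 images in about 1.6 GB, each image occupies about 127 KB on disk, and decoding to raw RGB expands the data by a factor of about 6.5. The implied 276 thousand pixels per image sits close to the product of the reported average dimensions (479 × 587), which is a useful sanity check on the headline numbers.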

FIG. 02 GPIC is the only corpus satisfying all four requirements; competitors fail at least one. — Source: https://arxiv.org/html/2605.30341v1

The evaluation protocol replaces FID against ImageNet-1K with FD-DINOv2 computed against a held-out set of 1 million GPIC test images. The authors argue that ImageNet-1K FID is saturated and misleading: several recent methods score lower FID than held-out real images do. A pixel-space flow matching baseline trained on GPIC is provided, though training details are not reported.
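For reference, FD-DINOv2 follows the same recipe as FID with a different feature extractor. Both fit a Gaussian to the features of real images (mean \mu_r, covariance \Sigma_r) and of generated images (\mu_g, \Sigma_g), then report the Fréchet distance between the two Gaussians. FID takes its features from Inception-v3; FD-DINOv2 takes them from a DINOv2 encoder. The formula below is the standard definition rather than a transcription from the GPIC paper, and the specific DINOv2 variant and preprocessing are not stated here.

```latex
\mathrm{FD}(r, g) \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

Because the feature space and the reference set both change, a score under this protocol cannot be compared numerically with an ImageNet-1K FID score, even though the formula is identical.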

For ML platform teams, adopting GPIC means moving 12.9TB from Hugging Face to the training cluster and then paying a significant decompress-and-decode cost on images that average 479 pixels high by 587 pixels wide. The 8,000-shard layout aids parallelization, but it requires a data loader that can manage that granularity without letting GPU utilization fall below 90 percent. The paper reports neither the GPU-hours and cost of captioning 100 million images with Qwen3-VL-4B nor the hosting cost for 12.9TB of centralized storage, so teams budgeting derivative pipelines must estimate both themselves.
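Two of these planning questions reduce to simple calculations: how 8,000 shards divide across data-loader workers, and how long the transfer takes. The sketch below shows one common approach, strided assignment by global worker index, alongside an idealized transfer-time estimate. The cluster shape (64 ranks with 8 workers each) and the 10 Gbit/s link are hypothetical placeholders, and the helper names are illustrative rather than part of any GPIC tooling.

```python
# Sketch: strided shard assignment plus an idealized transfer-time estimate.
# WORLD_SIZE, WORKERS_PER_RANK and the link speed are hypothetical values.

def shards_for_worker(num_shards, world_size, rank, workers_per_rank, worker_id):
    """Return the shard indices owned by one data-loader worker."""
    consumers = world_size * workers_per_rank          # total workers in the job
    global_id = rank * workers_per_rank + worker_id    # unique index per worker
    return list(range(global_id, num_shards, consumers))

def transfer_hours(total_bytes, link_gbps):
    """Hours to move total_bytes over a link sustaining link_gbps gigabits/s."""
    seconds = total_bytes * 8 / (link_gbps * 1e9)
    return seconds / 3600

NUM_SHARDS = 8_000
WORLD_SIZE = 64          # hypothetical number of training ranks
WORKERS_PER_RANK = 8     # hypothetical data-loader workers per rank

assignments = [
    shards_for_worker(NUM_SHARDS, WORLD_SIZE, rank, WORKERS_PER_RANK, worker)
    for rank in range(WORLD_SIZE)
    for worker in range(WORKERS_PER_RANK)
]

# Every shard should be read by exactly one worker.
covered = sorted(index for owned in assignments for index in owned)
counts = sorted({len(owned) for owned in assignments})

print("shards per worker:", counts)
print("full coverage    :", covered == list(range(NUM_SHARDS)))
print(f"12.9TB at 10 Gbit/s: {transfer_hours(12.9e12, 10):.1f} h (ideal)")
```

With 512 workers the split is uneven by at most one shard (15 or 16 each), so a job this size is near the point where adding workers stops helping. The transfer estimate ignores protocol overhead, retries, and any rate limiting, so it is a lower bound: just under three hours on a fully sustained 10 Gbit/s link.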

Integration carries three risks. Every caption was generated by Qwen3-VL-4B, so downstream models will inherit its biases and inaccuracies. The safety filtering pipeline is described but not released, so teams cannot audit what was removed or recover removed content for domains that need it. And the shift from ImageNet-1K FID to FD-DINOv2 creates a benchmark discontinuity: scores under the two protocols are not comparable, so model cards and comparison tables need updating.

The takeaway: a static, permissively licensed, VLM-captioned corpus with tiered subsets reduces the legal and infrastructure risk of training generative models.

Sources

GPIC comprises 27.97 trillion pixels across 100M training, 200K validation, and 1M test examples captioned with Qwen3-VL-4B; hosted on Hugging Face as 8,000 shards totaling 12.9TB under MIT license
"GPIC comprises 27.97 trillion pixels across 100M training, 200K validation, and 1M test examples captioned with Qwen3-VL-4B. GPIC is centrally hosted on Hugging Face as 8,000 shards totaling 12.9TB and released under the MIT license."
arxiv.org ↗
All GPIC images are permissively licensed for both research and commercial use; the corpus is safety-filtered and deduplicated
"all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face."
arxiv.org ↗
GPIC images have an average height of 479 pixels and an average width of 587 pixels
"GPIC images have an average height of 479 pixels and an average width of 587 pixels."
arxiv.org ↗
Existing corpora fail at least one of four criteria—permissive, stable, large, accessible—while GPIC satisfies all four; ImageNet-1K is only permissive and stable; YFCC100M is only stable and large; OpenImages and DataComp are only large and accessible
"Existing image benchmark datasets fail to satisfy all four criteria. GPIC satisfies all four criteria."
arxiv.org ↗
Several recent generation methods achieve lower FID scores on the ImageNet-1K benchmark than held-out real images, motivating the switch to FD-DINOv2 evaluated against 1 million held-out GPIC test images
"several recent methods achieve lower FID scores on the ImageNet-1K benchmark than held-out real images... we provide a new benchmarking protocol based on FD-DINOv2 against a held-out set of one million GPIC images."
arxiv.org ↗
GPIC-Lite (10M images) and GPIC-Nano (1M images) subsets are provided for development
"GPIC-Lite (10M) and GPIC-Nano (1M) provide smaller subsets for development."
arxiv.org ↗
A reference pixel-space flow matching baseline trained on GPIC is provided
"we provide a reference baseline for pixel-space flow matching on GPIC."
arxiv.org ↗
Dataset and benchmark hosted at HuggingFace; evaluation toolkit and code at gpic.stanford.edu
"Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic."
huggingface.co ↗

Written and edited by AI agents · Methodology
