About
SchemaPile is a collection of database schemas, extracted from DDL/DML statements in SQL files on public code repositories. We publish code to reproduce SchemaPile and the ready-to-use permissively licensed subset SchemaPile-Perm.
Summary statistics
Dataset | SchemaPile | SchemaPile-Perm |
---|---|---|
#Schemas | 221,171 | 22,989 |
#Tables | 1.7M | 199K |
#Schemas with data | 75.6K | 7.1K |
#Tables with data | 347.0K | 34.9K |
#Columns with data | 2.2M | 219.0K |
#Total data values | 58.2M | 5.9M |
Median #tables per schema | 4 | 4 |
Mean #columns per schema | 6.5 | 6.7 |
Mean #values per column | 27.6 | 28.7 |
Usage Example
-- Download from Zenodo
curl -O https://zenodo.org/records/12682521/files/schemapile-perm.json.gz
import gzip
import json
from collections import Counter
# Read file
with gzip.open("schemapile-perm.json.gz", 'r') as f:
schemapile = json.loads(f.read())
# Look at example schema
print(schemapile['015036_schema.sql'])
# {'INFO': {'URL': 'https://github.com/nages103/k8s-petclinic/blob/bb75e895591...
# 'LICENSE': 'APACHE-2.0',
# 'PERMISSIVE': True},
# 'TABLES': {'vets': {'COLUMNS': {'id': {'TYPE': 'UnsignedInt',
# 'NULLABLE': False,
# 'UNIQUE': True,
# 'DEFAULT': None,
# 'CHECKS': [],
# ...
# Get the 5 most common column names
column_names = []
for schema in schemapile:
for table in schemapile[schema]["TABLES"]:
for column_name in schemapile[schema]["TABLES"][table]["COLUMNS"]:
column_names.append(column_name.lower())
column_names_schemapile = Counter(column_names)
print(column_names_schemapile.most_common(5))
# [('id', 84487),
# ('name', 28426),
# ('description', 14324),
# ('created_at', 12624),
# ('user_id', 10718)]
Links
💾 Dataset: SchemaPile (Zenodo)
🤗 Foreign Key Detection Models:
📄 Code: Github Repo
📝 Paper: PDF
Citation
@article{do2024schemapile,
title={SchemaPile: A Large Collection of Relational Database Schemas},
author={D{\"o}hmen, Till and Geacu, Radu and Hulsebos, Madelon and Schelter, Sebastian},
journal={Proceedings of the ACM on Management of Data},
volume={2},
number={3},
year={2024},
publisher={ACM New York, NY, USA}
}
Contact
Please consider reporting cases of personal or otherwise undesired tables in SchemaPile to the email below. Feedback, suggestions and results from projects with SchemaPile are also very welcome!
t . r . doehmen @ uva.nl .