About

SchemaPile is a collection of database schemas, extracted from DDL/DML statements in SQL files on public code repositories. We publish code to reproduce SchemaPile and the ready-to-use permissively licensed subset SchemaPile-Perm.

Summary statistics

Dataset SchemaPile SchemaPile-Perm
#Schemas 221,171 22,989
#Tables 1.7M 199K
#Schemas with data 75.6K 7.1K
#Tables with data 347.0K 34.9K
#Columns with data 2.2M 219.0K
#Total data values 58.2M 5.9M
Median #tables per schema 4 4
Mean #columns per schema 6.5 6.7
Mean #values per column 27.6 28.7

Usage Example

-- Download from Zenodo
curl -O https://zenodo.org/records/12682521/files/schemapile-perm.json.gz
import gzip
import json
from collections import Counter

# Read file
with gzip.open("schemapile-perm.json.gz", 'r') as f:
    schemapile = json.loads(f.read())

# Look at example schema
print(schemapile['015036_schema.sql'])

# {'INFO': {'URL': 'https://github.com/nages103/k8s-petclinic/blob/bb75e895591...
# 'LICENSE': 'APACHE-2.0',
# 'PERMISSIVE': True},
# 'TABLES': {'vets': {'COLUMNS': {'id': {'TYPE': 'UnsignedInt',
#                                       'NULLABLE': False,
#                                       'UNIQUE': True,
#                                       'DEFAULT': None,
#                                       'CHECKS': [],
#  ...

# Get the 5 most common column names
column_names = []
for schema in schemapile:
    for table in schemapile[schema]["TABLES"]:
        for column_name in schemapile[schema]["TABLES"][table]["COLUMNS"]:
            column_names.append(column_name.lower())
column_names_schemapile = Counter(column_names)

print(column_names_schemapile.most_common(5))

# [('id', 84487),
# ('name', 28426),
# ('description', 14324),
# ('created_at', 12624),
# ('user_id', 10718)]

💾 Dataset: SchemaPile (Zenodo)

🤗 Foreign Key Detection Models:

📄 Code: Github Repo

📝 Paper: PDF

Citation

@article{do2024schemapile,
title={SchemaPile: A Large Collection of Relational Database Schemas},
author={D{\"o}hmen, Till and Geacu, Radu and Hulsebos, Madelon and Schelter, Sebastian},
journal={Proceedings of the ACM on Management of Data},
volume={2},
number={3},
year={2024},
publisher={ACM New York, NY, USA}
}

Contact

Please consider reporting cases of personal or otherwise undesired tables in SchemaPile to the email below. Feedback, suggestions and results from projects with SchemaPile are also very welcome!

t . r . doehmen @ uva.nl .