TFBS Benchmark Dataset
A benchmark dataset for evaluating variant effect prediction models on transcription factor binding site (TFBS) disruption tasks.
Dataset Description
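This dataset contains 90,758 variants in transcription factor binding sites. Each variant is annotated with a mammalian constraint score, an expression variability score, and its distance to the transcription start site, for benchmarking variant effect prediction models, particularly the LOL-EVE model.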
Dataset Summary
- Total variants: 90,758
- Species: Primarily human (homo_sapiens)
- Genes: Multiple genes with TFBS variants
- Transcription factors: Various TFs including CAMTA1, CAMTA2, CLOCK, E2F6, EBF3, ETV6, HAP1, HIC2, HIF1A, HSF1, KLF15, KLF7, MAX, MNT, and others
- Sequence context: 500bp promoter sequences centered on variants
Dataset Structure
Features
| Feature | Type | Description |
|---|---|---|
| variant_id | string | Unique identifier for each variant |
| chromosome | string | Chromosome (e.g., chr1) |
| position | int64 | Genomic position |
| ref | string | Reference allele |
| alt | string | Alternative allele |
| gene | string | Gene symbol |
| species | string | Species name (e.g., homo_sapiens) |
| tf | string | Transcription factor name |
| wt_seq | string | Wild-type sequence (500bp) |
| var_seq | string | Variant sequence (500bp) |
| mammalian_constraint | float32 | Mammalian constraint score |
| expression_variability | float32 | Expression variability score |
| distance_tss | int64 | Distance to transcription start site |
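The feature table can be turned into a lightweight record check. The sketch below is illustrative only: `EXPECTED_TYPES` and `validate_record` are hypothetical names, and mapping `int64`/`float32`/`string` to Python `int`/`float`/`str` is an assumption about how a row looks once it has been loaded as a plain dict.

```python
# Hypothetical mapping from the feature table to Python types (an assumption:
# int64 -> int, float32 -> float, string -> str once a row is a plain dict).
EXPECTED_TYPES = {
    "variant_id": str,
    "chromosome": str,
    "position": int,
    "ref": str,
    "alt": str,
    "gene": str,
    "species": str,
    "tf": str,
    "wt_seq": str,
    "var_seq": str,
    "mammalian_constraint": float,
    "expression_variability": float,
    "distance_tss": int,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record matches the table."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(record[name]).__name__}"
            )
    return problems
```

Calling `validate_record(row)` on a loaded row returns an empty list when every listed field is present with the expected type.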
Data Splits
The dataset contains a single split (train) with all 90,758 variants.
Usage
Loading the Dataset
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("cshearer/LOL-EVE-TFBS-Benchmark")
# Access the data
train_data = dataset["train"]
print(f"Dataset size: {len(train_data)}")
print(f"Features: {list(train_data.features.keys())}")
# Example usage
example = train_data[0]
print(f"Variant ID: {example['variant_id']}")
print(f"Gene: {example['gene']}")
print(f"TF: {example['tf']}")
print(f"Position: {example['chromosome']}:{example['position']}")
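Once the rows are loaded, a quick per-field tally helps check how variants are distributed across transcription factors. This is a minimal sketch: `count_by_field` is a hypothetical helper, and the three stand-in rows are made up so the snippet runs without downloading anything.

```python
from collections import Counter


def count_by_field(rows, field):
    """Count how many rows share each value of `field`."""
    return Counter(row[field] for row in rows)


# Made-up stand-in rows; with the real dataset, pass `train_data` instead.
rows = [
    {"tf": "CAMTA1"},
    {"tf": "CAMTA1"},
    {"tf": "MAX"},
]
tf_counts = count_by_field(rows, "tf")
print(tf_counts.most_common(2))
```

The same helper works for other string columns such as `gene` or `species`.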
Example Data
# Example variant
{
'variant_id': 'chr1_1231946_CCGCCAACG_C',
'chromosome': 'chr1',
'position': 1231946,
'ref': 'CCGCCAACG',
'alt': 'C',
'gene': 'b3galt6',
'species': 'homo_sapiens',
'tf': 'CAMTA1',
'wt_seq': 'GAGCCAGACATCAAGGGCTCCACACAGCCGACTTCACATCTCCAAATCCTACTAACTGGGGATGAGGGTCCACGCGGTTCAGAAGCGGAAGCGCAGGCGCAGGGAAGCGGGGCAGCTTGTCCAAGGTCGCCTCGCCGATAAACGCGAGTCCAACCAGACCCCTTGGGCCTCCGTTTCCCGGTGGCATTCGTAGGTTTTGGCCAGTAGGAGACCAGACGTGCCGGCGGCCGGGGAGGCCAGCGTCGTCGGCCTGTCCCTGCCCCCGGGAACCCCGGGAGCCCCGGTGGCGGCGGAGTCTCGCCAGGGCTCAAGGCCGAGCGGACGGACGATGCCCCAGCCCAAGGCGGGAGGCGGCGGCGGCCTCCAGACCCGCCCTCGCCGTCCGGCCGGCGTACACTTGGCCCCGCGGCCTGCAGCGGCCGTCCCGGGCCCCTCACTCACCGGTCTGCCTCCCCGCGCTCGGGATCCGAGGACCGGAGCGAAGCGTCAGTGACGCCGCCAACGGGCCCGGATCAGGCCACTGCCATCTTTCTTGCGGGCGGGGGCGGTGCGAACGGGCGCGACCTCACGGAGGGGACGCCGGCGCCACCATCTCTCCTCCGGGCGGAAGCGGTCGCGGGGCCGCTCCGAGGTTGACCAATGACAAGGGTGCCCGAGGCCACGTGACGGCCGCCGATTGGCCGCCGGCCTCCGAGCGCCCCGGGGCTCGGCGTCTGCGGAAGGCCCCGGCGCGCTCCCAGGAGCGCCGTGCGCACGCGCACCGCCCCGAGCCGGCGGCGCCTGCGCA',
'var_seq': 'GAGCCAGACATCAAGGGCTCCACACAGCCGACTTCACATCTCCAAATCCTACTAACTGGGGATGAGGGTCCACGCGGTTCAGAAGCGGAAGCGCAGGCGCAGGGAAGCGGGGCAGCTTGTCCAAGGTCGCCTCGCCGATAAACGCGAGTCCAACCAGACCCCTTGGGCCTCCGTTTCCCGGTGGCATTCGTAGGTTTTGGCCAGTAGGAGACCAGACGTGCCGGCGGCCGGGGAGGCCAGCGTCGTCGGCCTGTCCCTGCCCCCGGGAACCCCGGGAGCCCCGGTGGCGGCGGAGTCTCGCCAGGGCTCAAGGCCGAGCGGACGGACGATGCCCCAGCCCAAGGCGGGAGGCGGCGGCGGCCTCCAGACCCGCCCTCGCCGTCCGGCCGGCGTACACTTGGCCCCGCGGCCTGCAGCGGCCGTCCCGGGCCCCTCACTCACCGGTCTGCCTCCCCGCGCTCGGGATCCGAGGACCGGAGCGAAGCGTCAGTGACGCGGCCCGGATCAGGCCACTGCCATCTTTCTTGCGGGCGGGGGCGGTGCGAACGGGCGCGACCTCACGGAGGGGACGCCGGCGCCACCATCTCTCCTCCGGGCGGAAGCGGTCGCGGGGCCGCTCCGAGGTTGACCAATGACAAGGGTGCCCGAGGCCACGTGACGGCCGCCGATTGGCCGCCGGCCTCCGAGCGCCCCGGGGCTCGGCGTCTGCGGAAGGCCCCGGCGCGCTCCCAGGAGCGCCGTGCGCACGCGCACCGCCCCGAGCCGGCGGCGCCTGCGCACCTGCGCA',
'mammalian_constraint': 0.9789,
'expression_variability': 0.3259307772539296,
'distance_tss': 708
}
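The example suggests that `variant_id` concatenates chromosome, position, reference allele, and alternative allele with underscores. That format is inferred from this single record, not documented, so the sketch below treats it as an assumption; `parse_variant_id` and `id_matches_fields` are hypothetical helpers.

```python
def parse_variant_id(variant_id):
    """Split an ID of the assumed form '<chrom>_<position>_<ref>_<alt>'.

    rsplit is used so that a chromosome name containing underscores
    would still end up intact in the first part.
    """
    chromosome, position, ref, alt = variant_id.rsplit("_", 3)
    return {
        "chromosome": chromosome,
        "position": int(position),
        "ref": ref,
        "alt": alt,
    }


def id_matches_fields(record):
    """Check that the parsed ID agrees with the record's own columns."""
    parsed = parse_variant_id(record["variant_id"])
    return all(record[key] == value for key, value in parsed.items())
```

Running `id_matches_fields` over all rows is a cheap way to confirm whether the assumed ID format holds for the whole dataset.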
Applications
This dataset is designed for:
- Variant Effect Prediction: Benchmarking models that predict the functional impact of genetic variants
- Transcription Factor Binding: Evaluating models that predict TF binding site disruption
- Regulatory Genomics: Studying the impact of variants on gene regulation
- Model Comparison: Comparing different variant effect prediction approaches
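For these applications, one common pattern is to score the wild-type and variant sequences with the same model and take the difference. The sketch below assumes that pattern; `variant_effect_scores` is a hypothetical helper, and `gc_fraction` is a toy stand-in scorer used only to make the example runnable. It is not LOL-EVE and not a meaningful predictor.

```python
def gc_fraction(seq):
    """Toy stand-in scorer: fraction of G/C bases. Not a real effect predictor."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def variant_effect_scores(rows, score_fn):
    """Return {variant_id: score_fn(var_seq) - score_fn(wt_seq)} for each row."""
    return {
        row["variant_id"]: score_fn(row["var_seq"]) - score_fn(row["wt_seq"])
        for row in rows
    }
```

To compare models, swap `gc_fraction` for each model's own sequence-scoring function and keep the rest unchanged.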
Data Collection
Source Data
The dataset is derived from variants in transcription factor binding sites with the following characteristics:
- Genomic regions: Promoter regions (500bp sequences)
- Variant types: Substitutions, insertions, and deletions
- Annotation sources: Mammalian constraint scores, expression variability, and TSS distance
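The three variant types can be told apart from the `ref` and `alt` columns. A minimal sketch, assuming that comparing allele lengths is sufficient (the dataset does not document its own classification rule, and `classify_variant` is a hypothetical helper):

```python
def classify_variant(ref, alt):
    """Label a variant by comparing allele lengths (a simplifying assumption)."""
    if len(ref) == len(alt):
        return "substitution"
    if len(ref) > len(alt):
        return "deletion"
    return "insertion"
```

For instance, a record with `ref` = `CCGCCAACG` and `alt` = `C` is labeled a deletion under this rule.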
Preprocessing
- Variants were filtered to include only those in TFBS regions
- Sequences were extracted as 500bp windows centered on variants
- Functional annotations were added from external databases
- Data was standardized and validated for consistency
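The relationship between a wild-type window and its variant counterpart can be sketched as a simple string edit. This is an illustration under stated assumptions, not the pipeline used to build the dataset: `apply_variant` is a hypothetical helper, `offset` is a 0-based index into the window, and the result is not padded or trimmed to a fixed length.

```python
def apply_variant(wt_seq, offset, ref, alt):
    """Replace `ref` with `alt` at 0-based `offset` in `wt_seq`.

    Raises ValueError if the window does not contain `ref` at that offset.
    """
    if wt_seq[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the window at this offset")
    return wt_seq[:offset] + alt + wt_seq[offset + len(ref):]


# Short excerpt around a CCGCCAACG -> C deletion, used as a toy window.
window = "GACGCCGCCAACGGGCCC"
edited = apply_variant(window, 4, "CCGCCAACG", "C")
print(edited)
```

The reference check guards against off-by-one coordinate errors, which are easy to introduce when converting genomic positions to window offsets.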
Citation
If you use this dataset in your research, please cite:
@dataset{tfbs_benchmark_2024,
title={TFBS Benchmark Dataset for Variant Effect Prediction},
author={Marks Lab},
year={2024},
url={https://huggingface.co/datasets/cshearer/LOL-EVE-TFBS-Benchmark},
license={MIT}
}
License
This dataset is released under the MIT License. See the LICENSE file for details.
Contact
For questions or issues related to this dataset, please contact the Marks Lab or open an issue on the dataset repository.
Dataset Statistics
- Total variants: 90,758
- Unique genes: Multiple
- Unique TFs: 15+ transcription factors
- Sequence length: 500bp
- Species: Primarily human (homo_sapiens)
- Variant types: Substitutions, insertions, deletions
- Genomic coverage: Multiple chromosomes