TFBS Benchmark Dataset

A benchmark dataset for evaluating variant effect prediction models on transcription factor binding site (TFBS) disruption tasks.

Dataset Description

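This dataset contains 90,758 variants in transcription factor binding sites. Each variant is annotated with a mammalian constraint score, an expression variability score, and its distance to the transcription start site (TSS). It is intended for benchmarking variant effect prediction models, particularly the LOL-EVE model.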

Dataset Summary

Total variants: 90,758
Species: Primarily human (homo_sapiens)
Genes: Multiple genes with TFBS variants
Transcription factors: Various TFs including CAMTA1, CAMTA2, CLOCK, E2F6, EBF3, ETV6, HAP1, HIC2, HIF1A, HSF1, KLF15, KLF7, MAX, MNT, and others
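Sequence context: Promoter sequences centered on variants, described as 500bp (the `wt_seq` and `var_seq` strings in the example record below are longer than 500 characters, so 500bp may refer to the flank on each side of the variant)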

Dataset Structure

Features

Feature	Type	Description
`variant_id`	string	Unique identifier for each variant
`chromosome`	string	Chromosome (e.g., chr1)
`position`	int64	Genomic position
`ref`	string	Reference allele
`alt`	string	Alternative allele
`gene`	string	Gene symbol
`species`	string	Species name (e.g., homo_sapiens)
`tf`	string	Transcription factor name
`wt_seq`	string	Wild-type sequence (500bp)
`var_seq`	string	Variant sequence (500bp)
`mammalian_constraint`	float32	Mammalian constraint score
`expression_variability`	float32	Expression variability score
`distance_tss`	int64	Distance to transcription start site
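
A lightweight guard against schema drift is to check each record against the feature table above. The following is a minimal sketch, not part of the dataset's tooling: `EXPECTED_FIELDS` and `validate_record` are hypothetical names, the `float32` and `int64` columns are checked only as Python `float` and `int`, and the sample record uses shortened placeholder sequences.

```python
# Hypothetical schema mirroring the feature table; not shipped with the dataset.
EXPECTED_FIELDS = {
    "variant_id": str,
    "chromosome": str,
    "position": int,
    "ref": str,
    "alt": str,
    "gene": str,
    "species": str,
    "tf": str,
    "wt_seq": str,
    "var_seq": str,
    "mammalian_constraint": float,
    "expression_variability": float,
    "distance_tss": int,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            found = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {found}")
    return problems


# Field values follow the card's example record, except the two sequences,
# which are short placeholders rather than real data.
sample = {
    "variant_id": "chr1_1231946_CCGCCAACG_C",
    "chromosome": "chr1",
    "position": 1231946,
    "ref": "CCGCCAACG",
    "alt": "C",
    "gene": "b3galt6",
    "species": "homo_sapiens",
    "tf": "CAMTA1",
    "wt_seq": "ACGTACGT",
    "var_seq": "ACGAACGT",
    "mammalian_constraint": 0.9789,
    "expression_variability": 0.3259307772539296,
    "distance_tss": 708,
}

print(validate_record(sample))  # prints []
```

The check treats missing values (`None`) as type errors, which may or may not be appropriate for the real data.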

Data Splits

The dataset contains a single split, `train`, with all 90,758 variants.

Usage

Loading the Dataset

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("cshearer/LOL-EVE-TFBS-Benchmark")

# Access the data
train_data = dataset["train"]
print(f"Dataset size: {len(train_data)}")
print(f"Features: {list(train_data.features.keys())}")

# Example usage
example = train_data[0]
print(f"Variant ID: {example['variant_id']}")
print(f"Gene: {example['gene']}")
print(f"TF: {example['tf']}")
print(f"Position: {example['chromosome']}:{example['position']}")
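
Iterating over a `datasets.Dataset` yields plain dicts, so ordinary Python is enough for simple subsetting. The sketch below assumes rows shaped like the example above. `select_variants` is a hypothetical helper, the 0.5 threshold is arbitrary, and the toy rows are invented so the snippet runs without downloading anything.

```python
def select_variants(rows, tf=None, min_constraint=None):
    """Yield rows matching an optional TF name and minimum mammalian constraint."""
    for row in rows:
        if tf is not None and row["tf"] != tf:
            continue
        if min_constraint is not None and row["mammalian_constraint"] < min_constraint:
            continue
        yield row


# Invented rows for illustration; with the real data, pass `train_data` instead.
toy_rows = [
    {"variant_id": "v1", "tf": "CAMTA1", "mammalian_constraint": 0.98},
    {"variant_id": "v2", "tf": "CAMTA1", "mammalian_constraint": 0.10},
    {"variant_id": "v3", "tf": "MAX", "mammalian_constraint": 0.95},
]

selected = [
    row["variant_id"]
    for row in select_variants(toy_rows, tf="CAMTA1", min_constraint=0.5)
]
print(selected)  # prints ['v1']
```

For larger subsets, the library's own `Dataset.filter` method is likely the better fit, since it returns a new `Dataset` rather than a Python generator.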

Example Data

# Example variant
{
    'variant_id': 'chr1_1231946_CCGCCAACG_C',
    'chromosome': 'chr1',
    'position': 1231946,
    'ref': 'CCGCCAACG',
    'alt': 'C',
    'gene': 'b3galt6',
    'species': 'homo_sapiens',
    'tf': 'CAMTA1',
    'wt_seq': 'GAGCCAGACATCAAGGGCTCCACACAGCCGACTTCACATCTCCAAATCCTACTAACTGGGGATGAGGGTCCACGCGGTTCAGAAGCGGAAGCGCAGGCGCAGGGAAGCGGGGCAGCTTGTCCAAGGTCGCCTCGCCGATAAACGCGAGTCCAACCAGACCCCTTGGGCCTCCGTTTCCCGGTGGCATTCGTAGGTTTTGGCCAGTAGGAGACCAGACGTGCCGGCGGCCGGGGAGGCCAGCGTCGTCGGCCTGTCCCTGCCCCCGGGAACCCCGGGAGCCCCGGTGGCGGCGGAGTCTCGCCAGGGCTCAAGGCCGAGCGGACGGACGATGCCCCAGCCCAAGGCGGGAGGCGGCGGCGGCCTCCAGACCCGCCCTCGCCGTCCGGCCGGCGTACACTTGGCCCCGCGGCCTGCAGCGGCCGTCCCGGGCCCCTCACTCACCGGTCTGCCTCCCCGCGCTCGGGATCCGAGGACCGGAGCGAAGCGTCAGTGACGCCGCCAACGGGCCCGGATCAGGCCACTGCCATCTTTCTTGCGGGCGGGGGCGGTGCGAACGGGCGCGACCTCACGGAGGGGACGCCGGCGCCACCATCTCTCCTCCGGGCGGAAGCGGTCGCGGGGCCGCTCCGAGGTTGACCAATGACAAGGGTGCCCGAGGCCACGTGACGGCCGCCGATTGGCCGCCGGCCTCCGAGCGCCCCGGGGCTCGGCGTCTGCGGAAGGCCCCGGCGCGCTCCCAGGAGCGCCGTGCGCACGCGCACCGCCCCGAGCCGGCGGCGCCTGCGCA',
    'var_seq': 'GAGCCAGACATCAAGGGCTCCACACAGCCGACTTCACATCTCCAAATCCTACTAACTGGGGATGAGGGTCCACGCGGTTCAGAAGCGGAAGCGCAGGCGCAGGGAAGCGGGGCAGCTTGTCCAAGGTCGCCTCGCCGATAAACGCGAGTCCAACCAGACCCCTTGGGCCTCCGTTTCCCGGTGGCATTCGTAGGTTTTGGCCAGTAGGAGACCAGACGTGCCGGCGGCCGGGGAGGCCAGCGTCGTCGGCCTGTCCCTGCCCCCGGGAACCCCGGGAGCCCCGGTGGCGGCGGAGTCTCGCCAGGGCTCAAGGCCGAGCGGACGGACGATGCCCCAGCCCAAGGCGGGAGGCGGCGGCGGCCTCCAGACCCGCCCTCGCCGTCCGGCCGGCGTACACTTGGCCCCGCGGCCTGCAGCGGCCGTCCCGGGCCCCTCACTCACCGGTCTGCCTCCCCGCGCTCGGGATCCGAGGACCGGAGCGAAGCGTCAGTGACGCGGCCCGGATCAGGCCACTGCCATCTTTCTTGCGGGCGGGGGCGGTGCGAACGGGCGCGACCTCACGGAGGGGACGCCGGCGCCACCATCTCTCCTCCGGGCGGAAGCGGTCGCGGGGCCGCTCCGAGGTTGACCAATGACAAGGGTGCCCGAGGCCACGTGACGGCCGCCGATTGGCCGCCGGCCTCCGAGCGCCCCGGGGCTCGGCGTCTGCGGAAGGCCCCGGCGCGCTCCCAGGAGCGCCGTGCGCACGCGCACCGCCCCGAGCCGGCGGCGCCTGCGCACCTGCGCA',
    'mammalian_constraint': 0.9789,
    'expression_variability': 0.3259307772539296,
    'distance_tss': 708
}
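
The example suggests that `variant_id` follows the pattern `<chromosome>_<position>_<ref>_<alt>`. That pattern is inferred from this single record and is not stated by the card, so the sketch below is best treated as a consistency check. `parse_variant_id` and `id_matches_fields` are hypothetical helpers.

```python
def parse_variant_id(variant_id):
    """Split an ID of the assumed form '<chromosome>_<position>_<ref>_<alt>'."""
    # Split from the right so a chromosome name containing '_' stays intact.
    chromosome, position, ref, alt = variant_id.rsplit("_", 3)
    return {"chromosome": chromosome, "position": int(position), "ref": ref, "alt": alt}


def id_matches_fields(record):
    """Check that the parsed ID agrees with the record's own columns."""
    parsed = parse_variant_id(record["variant_id"])
    return all(parsed[key] == record[key] for key in parsed)


# Subset of the example record shown above (sequences omitted).
record = {
    "variant_id": "chr1_1231946_CCGCCAACG_C",
    "chromosome": "chr1",
    "position": 1231946,
    "ref": "CCGCCAACG",
    "alt": "C",
}

print(parse_variant_id(record["variant_id"]))
print(id_matches_fields(record))  # prints True
```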

Applications

This dataset is designed for:

Variant Effect Prediction: Benchmarking models that predict the functional impact of genetic variants
Transcription Factor Binding: Evaluating models that predict TF binding site disruption
Regulatory Genomics: Studying the impact of variants on gene regulation
Model Comparison: Comparing different variant effect prediction approaches
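
The card does not specify an evaluation protocol. One common approach for sequence likelihood models is to score each variant as the log-likelihood of `var_seq` minus that of `wt_seq`. The sketch below illustrates that idea under stated assumptions: `toy_log_likelihood` is a stand-in with made-up base frequencies, not LOL-EVE or any real model, and the sequences are invented.

```python
import math

# Made-up base frequencies for a toy, position-independent sequence model.
BASE_FREQ = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}


def toy_log_likelihood(seq):
    """Sum of per-base log-probabilities under the toy model (A/C/G/T only)."""
    return sum(math.log(BASE_FREQ[base]) for base in seq.upper())


def variant_score(record, log_likelihood=toy_log_likelihood):
    """Log-likelihood ratio of variant vs. wild-type sequence.

    Negative values mean the model finds the variant sequence less likely.
    """
    return log_likelihood(record["var_seq"]) - log_likelihood(record["wt_seq"])


# Invented short sequences; real records carry much longer promoter windows.
record = {"wt_seq": "ACGTCCGG", "var_seq": "ACGTACGG"}
print(round(variant_score(record), 4))  # prints -0.4055
```

To benchmark a real model, one would pass its own scoring function as `log_likelihood` and compare the resulting scores across models.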

Data Collection

Source Data

The dataset is derived from variants in transcription factor binding sites with the following characteristics:

Genomic regions: Promoter regions (500bp sequences)
Variant types: Substitutions, insertions, and deletions
Annotation sources: Mammalian constraint scores, expression variability, and TSS distance
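
The dataset has no explicit variant-type column, but a type can be derived from the `ref` and `alt` alleles. The sketch below uses a length-based rule, which is a common convention and an assumption here, since the card does not say how types were assigned. `classify_variant` is a hypothetical helper.

```python
def classify_variant(ref, alt):
    """Label a variant by comparing allele lengths.

    Equal-length alleles, including multi-base ones, count as substitutions.
    """
    if len(ref) == len(alt):
        return "substitution"
    if len(ref) > len(alt):
        return "deletion"
    return "insertion"


print(classify_variant("CCGCCAACG", "C"))  # prints deletion
print(classify_variant("A", "G"))          # prints substitution
print(classify_variant("C", "CTT"))        # prints insertion
```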

Preprocessing

Variants were filtered to include only those in TFBS regions
Sequences were extracted as 500bp windows centered on variants
Functional annotations were added from external databases
Data was standardized and validated for consistency
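
The windowing and allele substitution steps can be sketched as follows. The card does not describe the actual pipeline, so this illustration rests on assumptions: positions are taken to be 1-based, the window is anchored on the first base of the variant, and `window_bounds` and `apply_variant` are hypothetical helpers.

```python
def window_bounds(position, width=500):
    """Return 0-based, half-open (start, end) bounds centered on a 1-based position."""
    start = (position - 1) - width // 2
    # Note: start can be negative near the beginning of a chromosome.
    return start, start + width


def apply_variant(wt_seq, offset, ref, alt):
    """Replace `ref` with `alt` at a 0-based offset within `wt_seq`."""
    if wt_seq[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the wild-type sequence")
    return wt_seq[:offset] + alt + wt_seq[offset + len(ref):]


print(window_bounds(1231946, width=500))  # prints (1231695, 1232195)

# Toy sequence, not real data: delete 'CG' after the anchor base 'A'.
print(apply_variant("TTACGTT", offset=2, ref="ACG", alt="A"))  # prints TTATT
```

Checking the reference allele before substituting catches coordinate mistakes such as off-by-one offsets or a mismatched genome build.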

Citation

If you use this dataset in your research, please cite:

@dataset{tfbs_benchmark_2024,
  title={TFBS Benchmark Dataset for Variant Effect Prediction},
  author={Marks Lab},
  year={2024},
  url={https://huggingface.co/datasets/cshearer/LOL-EVE-TFBS-Benchmark},
  license={MIT}
}

License

This dataset is released under the MIT License. See the LICENSE file for details.

Contact

For questions or issues related to this dataset, please contact the Marks Lab or open an issue on the dataset repository.

Dataset Statistics

Total variants: 90,758
Unique genes: Multiple
Unique TFs: 15+ transcription factors
Sequence length: 500bp
Species: Primarily human (homo_sapiens)
Variant types: Substitutions, insertions, deletions
Genomic coverage: Multiple chromosomes
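
Exact counts behind entries such as "Multiple" can be recomputed from the rows themselves. The sketch below assumes dict-like rows with the `gene`, `tf`, and `chromosome` fields. `summarize` is a hypothetical helper, and the toy rows are invented so the snippet runs offline.

```python
from collections import Counter


def summarize(rows):
    """Tally basic statistics over an iterable of dict-like rows."""
    genes, chromosomes = set(), set()
    tf_counts = Counter()
    total = 0
    for row in rows:
        total += 1
        genes.add(row["gene"])
        chromosomes.add(row["chromosome"])
        tf_counts[row["tf"]] += 1
    return {
        "total_variants": total,
        "unique_genes": len(genes),
        "unique_tfs": len(tf_counts),
        "chromosomes": sorted(chromosomes),
        "variants_per_tf": dict(tf_counts),
    }


# Invented rows for illustration; with the real data, pass the `train` split.
toy_rows = [
    {"gene": "g1", "tf": "CAMTA1", "chromosome": "chr1"},
    {"gene": "g1", "tf": "MAX", "chromosome": "chr1"},
    {"gene": "g2", "tf": "CAMTA1", "chromosome": "chr2"},
]

stats = summarize(toy_rows)
print(stats["total_variants"], stats["unique_genes"], stats["unique_tfs"])  # prints 3 2 2
```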
