TFBS Benchmark Dataset

A benchmark dataset for evaluating variant effect prediction models on transcription factor binding site (TFBS) disruption tasks.

Dataset Description

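This dataset contains 90,758 variants in transcription factor binding sites. Each variant is annotated with a mammalian constraint score, an expression variability score, and its distance to the transcription start site. The dataset is intended for benchmarking variant effect prediction models, particularly the LOL-EVE model.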

Dataset Summary

  • Total variants: 90,758
  • Species: Primarily human (homo_sapiens)
  • Genes: Multiple genes with TFBS variants
  • Transcription factors: Various TFs including CAMTA1, CAMTA2, CLOCK, E2F6, EBF3, ETV6, HAP1, HIC2, HIF1A, HSF1, KLF15, KLF7, MAX, MNT, and others
  • Sequence context: 500bp promoter sequences centered on variants

Dataset Structure

Features

Feature                 Type     Description
variant_id              string   Unique identifier for each variant
chromosome              string   Chromosome (e.g., chr1)
position                int64    Genomic position
ref                     string   Reference allele
alt                     string   Alternative allele
gene                    string   Gene symbol
species                 string   Species name (e.g., homo_sapiens)
tf                      string   Transcription factor name
wt_seq                  string   Wild-type sequence (500bp)
var_seq                 string   Variant sequence (500bp)
mammalian_constraint    float32  Mammalian constraint score
expression_variability  float32  Expression variability score
distance_tss            int64    Distance to transcription start site
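
As a quick sanity check, the feature list above can be turned into a small schema validator. This is a minimal sketch, not part of the dataset tooling: the helper name `schema_problems` is hypothetical, and it assumes that `float32` and `int64` columns are returned as plain Python `float` and `int` values when a row is read as a dictionary. The toy record uses shortened placeholder sequences.

```python
# Expected column names and Python-side types, taken from the feature table.
# Assumption: float32/int64 columns come back as plain Python float/int.
EXPECTED_FIELDS = {
    "variant_id": str,
    "chromosome": str,
    "position": int,
    "ref": str,
    "alt": str,
    "gene": str,
    "species": str,
    "tf": str,
    "wt_seq": str,
    "var_seq": str,
    "mammalian_constraint": float,
    "expression_variability": float,
    "distance_tss": int,
}


def schema_problems(record):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            found = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {found}")
    return problems


# Toy record: sequences are shortened placeholders, not real dataset values.
record = {
    "variant_id": "chr1_1231946_CCGCCAACG_C",
    "chromosome": "chr1",
    "position": 1231946,
    "ref": "CCGCCAACG",
    "alt": "C",
    "gene": "b3galt6",
    "species": "homo_sapiens",
    "tf": "CAMTA1",
    "wt_seq": "GACGCCGCCAACGGGC",
    "var_seq": "GACGCGGGC",
    "mammalian_constraint": 0.9789,
    "expression_variability": 0.3259,
    "distance_tss": 708,
}
print(schema_problems(record))
```

Running the same function over rows of the loaded dataset would flag any record whose fields differ from the table.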

Data Splits

The dataset contains a single split, train, with all 90,758 variants.

Usage

Loading the Dataset

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("cshearer/LOL-EVE-TFBS-Benchmark")

# Access the data
train_data = dataset["train"]
print(f"Dataset size: {len(train_data)}")
print(f"Features: {list(train_data.features.keys())}")

# Example usage
example = train_data[0]
print(f"Variant ID: {example['variant_id']}")
print(f"Gene: {example['gene']}")
print(f"TF: {example['tf']}")
print(f"Position: {example['chromosome']}:{example['position']}")
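
Once the rows are loaded, a common first step is to see how variants are distributed across transcription factors. The sketch below is illustrative only: `count_by_tf` is a hypothetical helper, and the toy rows stand in for `dataset["train"]` so the example runs without downloading anything.

```python
from collections import Counter


def count_by_tf(records):
    """Count variants per transcription factor from an iterable of row dicts."""
    return Counter(row["tf"] for row in records)


# Toy rows standing in for dataset["train"]; only the "tf" field is used here.
toy_rows = [{"tf": "CAMTA1"}, {"tf": "MAX"}, {"tf": "CAMTA1"}, {"tf": "HIF1A"}]
counts = count_by_tf(toy_rows)

# Print TFs from most to least frequent.
for tf, n in counts.most_common():
    print(f"{tf}: {n}")
```

With the real data, passing `train_data` in place of `toy_rows` should work the same way, since iterating over the split yields one dictionary per variant.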

Example Data

# Example variant
{
    'variant_id': 'chr1_1231946_CCGCCAACG_C',
    'chromosome': 'chr1',
    'position': 1231946,
    'ref': 'CCGCCAACG',
    'alt': 'C',
    'gene': 'b3galt6',
    'species': 'homo_sapiens',
    'tf': 'CAMTA1',
    'wt_seq': 'GAGCCAGACATCAAGGGCTCCACACAGCCGACTTCACATCTCCAAATCCTACTAACTGGGGATGAGGGTCCACGCGGTTCAGAAGCGGAAGCGCAGGCGCAGGGAAGCGGGGCAGCTTGTCCAAGGTCGCCTCGCCGATAAACGCGAGTCCAACCAGACCCCTTGGGCCTCCGTTTCCCGGTGGCATTCGTAGGTTTTGGCCAGTAGGAGACCAGACGTGCCGGCGGCCGGGGAGGCCAGCGTCGTCGGCCTGTCCCTGCCCCCGGGAACCCCGGGAGCCCCGGTGGCGGCGGAGTCTCGCCAGGGCTCAAGGCCGAGCGGACGGACGATGCCCCAGCCCAAGGCGGGAGGCGGCGGCGGCCTCCAGACCCGCCCTCGCCGTCCGGCCGGCGTACACTTGGCCCCGCGGCCTGCAGCGGCCGTCCCGGGCCCCTCACTCACCGGTCTGCCTCCCCGCGCTCGGGATCCGAGGACCGGAGCGAAGCGTCAGTGACGCCGCCAACGGGCCCGGATCAGGCCACTGCCATCTTTCTTGCGGGCGGGGGCGGTGCGAACGGGCGCGACCTCACGGAGGGGACGCCGGCGCCACCATCTCTCCTCCGGGCGGAAGCGGTCGCGGGGCCGCTCCGAGGTTGACCAATGACAAGGGTGCCCGAGGCCACGTGACGGCCGCCGATTGGCCGCCGGCCTCCGAGCGCCCCGGGGCTCGGCGTCTGCGGAAGGCCCCGGCGCGCTCCCAGGAGCGCCGTGCGCACGCGCACCGCCCCGAGCCGGCGGCGCCTGCGCA',
    'var_seq': 'GAGCCAGACATCAAGGGCTCCACACAGCCGACTTCACATCTCCAAATCCTACTAACTGGGGATGAGGGTCCACGCGGTTCAGAAGCGGAAGCGCAGGCGCAGGGAAGCGGGGCAGCTTGTCCAAGGTCGCCTCGCCGATAAACGCGAGTCCAACCAGACCCCTTGGGCCTCCGTTTCCCGGTGGCATTCGTAGGTTTTGGCCAGTAGGAGACCAGACGTGCCGGCGGCCGGGGAGGCCAGCGTCGTCGGCCTGTCCCTGCCCCCGGGAACCCCGGGAGCCCCGGTGGCGGCGGAGTCTCGCCAGGGCTCAAGGCCGAGCGGACGGACGATGCCCCAGCCCAAGGCGGGAGGCGGCGGCGGCCTCCAGACCCGCCCTCGCCGTCCGGCCGGCGTACACTTGGCCCCGCGGCCTGCAGCGGCCGTCCCGGGCCCCTCACTCACCGGTCTGCCTCCCCGCGCTCGGGATCCGAGGACCGGAGCGAAGCGTCAGTGACGCGGCCCGGATCAGGCCACTGCCATCTTTCTTGCGGGCGGGGGCGGTGCGAACGGGCGCGACCTCACGGAGGGGACGCCGGCGCCACCATCTCTCCTCCGGGCGGAAGCGGTCGCGGGGCCGCTCCGAGGTTGACCAATGACAAGGGTGCCCGAGGCCACGTGACGGCCGCCGATTGGCCGCCGGCCTCCGAGCGCCCCGGGGCTCGGCGTCTGCGGAAGGCCCCGGCGCGCTCCCAGGAGCGCCGTGCGCACGCGCACCGCCCCGAGCCGGCGGCGCCTGCGCACCTGCGCA',
    'mammalian_constraint': 0.9789,
    'expression_variability': 0.3259307772539296,
    'distance_tss': 708
}
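
The example suggests that `variant_id` is built as chromosome, position, reference allele, and alternative allele joined by underscores. Assuming that pattern holds for every record (the card does not state it explicitly), a hypothetical helper such as `parse_variant_id` can recover the four parts:

```python
from typing import NamedTuple


class ParsedVariant(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str


def parse_variant_id(variant_id):
    """Split an ID of the assumed form <chromosome>_<position>_<ref>_<alt>."""
    # Split from the right so a chromosome name containing "_" stays intact.
    chromosome, position, ref, alt = variant_id.rsplit("_", 3)
    return ParsedVariant(chromosome, int(position), ref, alt)


parsed = parse_variant_id("chr1_1231946_CCGCCAACG_C")
print(parsed)
```

Comparing the parsed values against the `chromosome`, `position`, `ref`, and `alt` columns is an inexpensive way to confirm the assumed ID format before relying on it.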

Applications

This dataset is designed for:

  1. Variant Effect Prediction: Benchmarking models that predict the functional impact of genetic variants
  2. Transcription Factor Binding: Evaluating models that predict TF binding site disruption
  3. Regulatory Genomics: Studying the impact of variants on gene regulation
  4. Model Comparison: Comparing different variant effect prediction approaches

Data Collection

Source Data

The dataset is derived from variants in transcription factor binding sites with the following characteristics:

  • Genomic regions: Promoter regions (500bp sequences)
  • Variant types: Substitutions, insertions, and deletions
  • Annotation sources: Mammalian constraint scores, expression variability, and TSS distance
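
The dataset has no explicit variant-type column, but the type can be inferred from the allele lengths. The sketch below assumes VCF-style alleles, where an indel keeps a shared anchor base (as in the example record, `CCGCCAACG` to `C`); `classify_variant` is a hypothetical helper, not part of the dataset.

```python
def classify_variant(ref, alt):
    """Label a variant by comparing reference and alternative allele lengths."""
    if len(ref) == len(alt):
        # Equal lengths cover single-base and multi-base substitutions alike.
        return "substitution"
    if len(ref) > len(alt):
        return "deletion"
    return "insertion"


print(classify_variant("CCGCCAACG", "C"))  # the example record's alleles
print(classify_variant("A", "G"))
print(classify_variant("A", "ATT"))
```

A length-only rule is deliberately simple; complex variants that both delete and insert bases would need a more careful comparison.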

Preprocessing

  • Variants were filtered to include only those in TFBS regions
  • Sequences were extracted as 500bp windows centered on variants
  • Functional annotations were added from external databases
  • Data was standardized and validated for consistency
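
The window-extraction step can be illustrated on a toy sequence. This is a sketch under stated assumptions rather than the authors' pipeline: it uses a 0-based index for the variant position within a longer sequence, and it shifts the window inward near the sequence ends instead of padding. The function name `centered_window` is hypothetical.

```python
def centered_window(sequence, center, width=500):
    """Return up to `width` bases of `sequence` around 0-based index `center`."""
    if not 0 <= center < len(sequence):
        raise IndexError("center lies outside the sequence")
    start = max(0, center - width // 2)
    end = min(len(sequence), start + width)
    # If the window was clipped at the right end, shift the start left so the
    # window keeps its full width whenever the sequence is long enough.
    start = max(0, end - width)
    return sequence[start:end]


toy_sequence = "ACGT" * 300  # 1,200 bases of placeholder sequence
window = centered_window(toy_sequence, center=600)
print(len(window))
```

On the real data, comparing `len(example["wt_seq"])` and `len(example["var_seq"])` against the stated window size is a cheap consistency check on how the sequences were built.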

Citation

If you use this dataset in your research, please cite:

@dataset{tfbs_benchmark_2024,
  title={TFBS Benchmark Dataset for Variant Effect Prediction},
  author={Marks Lab},
  year={2024},
  url={https://huggingface.co/datasets/cshearer/LOL-EVE-TFBS-Benchmark},
  license={MIT}
}

License

This dataset is released under the MIT License. See the LICENSE file for details.

Contact

For questions or issues related to this dataset, please contact the Marks Lab or open an issue on the dataset repository.

Dataset Statistics

  • Total variants: 90,758
  • Unique genes: Multiple
  • Unique TFs: 15+ transcription factors
  • Sequence length: 500bp
  • Species: Primarily human (homo_sapiens)
  • Variant types: Substitutions, insertions, deletions
  • Genomic coverage: Multiple chromosomes