Feature: Add Pretraining Data

#4
by Tonic - opened
Data Quests for Open Science org

The SimpleFold paper makes use of the following datasets:

This project makes use of several publicly available datasets for training 
models. We would like to acknowledge the following sources:

- The Protein Data Bank [1]
- The AlphaFold Database [2,3]
- The AFESM Metagenomic Atlas [4]
- The Atlas MD dataset [5]


[1] Protein Data Bank: the single global archive for 3D macromolecular structure 
data. Nucleic Acids Res 47: D520-D528 doi: https://doi.org/10.1093/nar/gky949
[2] Jumper, J et al. Highly accurate protein structure prediction with AlphaFold.
Nature (2021).
[3] Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the
structural coverage of protein-sequence space with high-accuracy models. Nucleic 
Acids Research (2021). 
[4] Yeo J, Han Y, Bordin N, Lau AM, Kandathil SM, Kim H, Karin EL, Mirdita M, 
Jones DT, Orengo C, Steinegger M. Metagenomic-scale analysis of the predicted
protein structure universe. bioRxiv, 2025.
[5] Yann Vander Meersche, Gabriel Cretin, Aria Gheeraert, Jean-Christophe Gelly, 
and Tatiana Galochkina. Atlas: protein flexibility description from atomistic molecular
dynamics simulations. Nucleic Acids Research, 52(D1):D384–D392, 2024.

It would be extremely convenient to follow their filtering protocol and make the filtered structures available.
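To make the request concrete, here is a rough sketch of what one filtering step could look like for AlphaFold Database models, which store per-residue pLDDT in the B-factor column of their PDB files. The function names and the 80.0 cutoff are placeholders I made up for illustration; they are not the thresholds or the code from the SimpleFold paper, which would need to be checked before reproducing their protocol.

```python
from statistics import mean


def mean_plddt(pdb_text: str) -> float:
    """Mean pLDDT over C-alpha atoms, read from the B-factor column.

    Assumes fixed-column PDB ATOM records (B-factor in columns 61-66),
    as used by AlphaFold Database model files.
    """
    scores = [
        float(line[60:66])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return mean(scores)


def keep_structure(pdb_text: str, min_mean_plddt: float = 80.0) -> bool:
    """Return True if the model passes the confidence cutoff.

    The default cutoff is a hypothetical placeholder, not the paper's value.
    """
    return mean_plddt(pdb_text) >= min_mean_plddt
```

A dataset builder could call `keep_structure` on each downloaded file and upload only the ones that pass, with the real cutoffs swapped in once they are confirmed from the paper.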
