Accelerating AI for Drug Discovery: Ginkgo’s GDPx Functional Genomics and GDPa Antibody Developability Dataset Series

Community Article Published June 24, 2025

By John Androsavich, General Manager, Ginkgo Datapoints
And Georgia Channing, ML for Science Engineer, Hugging Face

We’re excited to share that a new suite of high-quality, large-scale biological datasets from Ginkgo Datapoints is now available on Hugging Face: the complete GDPx functional genomics and GDPa antibody developability series.

This release provides data for exploring molecular interactions in the cell, between genes, proteins, antibodies, and more, with applications in biological research and drug discovery. From transcriptomic response prediction to antibody property inference, these datasets support use cases such as perturbation-response modeling, mechanism of action (MoA) characterization, and antibody developability prediction.


TL;DR: Download the latest datasets here!

Functional Genomics:

Antibody Developability:


The Challenge: Data Bottlenecks in AI-based Drug Discovery

Despite the rapid progress of machine learning for drug discovery, a major bottleneck remains: access to standardized, large-scale, and well-annotated biological data. Functional genomics and protein engineering tasks often require datasets that are not only big, but also consistent, diverse, and paired with rich metadata.

Until recently, most such datasets were either proprietary or fragmented across labs and formats.

The GDPx and GDPa datasets aim to close that gap. Created through high-throughput experimental platforms at Ginkgo, they bring together:

  • Thousands of chemical perturbation conditions across diverse human cell types
  • Dose–response and time-course gene expression & imaging data
  • Biophysical developability profiles for hundreds of IgG antibodies, with matched sequence data

Specifically, the release includes:

  • GDPx1–2: Transcriptomic responses to small molecules
  • GDPx3: Morphological phenotypes captured via high-content imaging
  • GDPa1: Developability metrics for therapeutic antibodies

All datasets are now fully accessible on the Hugging Face Hub with built-in dataset loaders—ready to be plugged into your model training workflows or used for exploratory analysis.

Whether you're building predictive models for drug mechanisms, generating new antibody sequences, or integrating RNA, morphology, and sequence data into unified representations, these datasets are for you!


Quick Primer on Genomics and Antibody Developability

If you're an ML researcher new to biological data, here’s a brief primer to help you navigate the GDPx and GDPa datasets:

Functional Genomics

Functional genomics explores how genes and their products (like RNA or proteins) behave under different conditions, such as drug treatments or genetic modifications.

In GDPx, functional genomics data is captured through RNA sequencing (DRUG-seq) to observe how small molecules affect pathways and gene expression.

DRUG-seq

A high-throughput, low-cost RNA sequencing method optimized for screening compounds. Outputs are UMI count matrices, ideal for gene expression modeling, perturbation clustering, MoA prediction, and contrastive learning tasks.
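Raw UMI counts depend on how deeply each well was sequenced, so a common first step before modeling is library-size normalization followed by a log transform. The sketch below assumes a wells-by-genes matrix; the toy numbers and the `log_cpm` helper are illustrative and not part of the GDPx release.

```python
import numpy as np

def log_cpm(counts, scale=1e6):
    """Library-size normalize a (wells x genes) UMI count matrix, then apply log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    lib_size = counts.sum(axis=1, keepdims=True)
    # Guard against empty wells (zero total counts) to avoid division by zero.
    lib_size = np.where(lib_size == 0, 1.0, lib_size)
    return np.log1p(counts / lib_size * scale)

# Toy matrix: 3 wells x 4 genes (made-up numbers, not GDPx data).
toy_counts = np.array([[10, 0, 5, 85],
                       [2, 2, 2, 2],
                       [0, 0, 0, 0]])
normalized = log_cpm(toy_counts)
print(normalized.shape)
```

Counts-per-million is only one option; other normalizations may suit a given model better.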

Cell Painting

A high-content imaging assay that stains cells with multiple fluorescent dyes targeting distinct organelles.

  • GDPx3 includes raw 16-bit TIFF images
  • Suitable for vision-based learning and cross-modal analysis

Perturbation

Any treatment or modification that changes a cell’s internal state—e.g., drug, gene edit, stressor. In GDPx, perturbations are small molecule compounds tested across dose, cell type, and time.

Perturbation-Response Modeling

Predicting how a cell responds to an intervention based on gene expression or morphology. Critical for drug discovery and systems biology.
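A simple target for such models is the per-gene log2 fold change between treated wells and vehicle-control wells. The sketch below assumes both inputs are already normalized expression matrices (wells by genes); the function name, the pseudocount, and the toy values are assumptions made for illustration.

```python
import numpy as np

def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2 ratio of mean treated expression to mean control expression."""
    treated = np.asarray(treated, dtype=np.float64)
    control = np.asarray(control, dtype=np.float64)
    treated_mean = treated.mean(axis=0)
    control_mean = control.mean(axis=0)
    # The pseudocount keeps the ratio finite for genes with zero expression.
    return np.log2((treated_mean + pseudocount) / (control_mean + pseudocount))

# Toy data: 2 control wells and 2 treated wells, 3 genes (made-up values).
control_wells = np.array([[3.0, 7.0, 1.0], [3.0, 7.0, 1.0]])
treated_wells = np.array([[15.0, 7.0, 0.0], [15.0, 7.0, 0.0]])
lfc = log2_fold_change(treated_wells, control_wells)
print(lfc)
```

Here the first gene is up four-fold (log2 fold change of 2), the second is unchanged, and the third is halved.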

Mechanism of Action (MoA) Characterization

Inferring how a compound works within a cell, including pathways targeted and downstream effects.
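One common heuristic, not necessarily the approach used by the authors, is to compare compounds by the similarity of their response signatures: compounds with similar signatures are candidates for a shared mechanism. The sketch below uses cosine similarity on made-up signatures; the compound names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(signatures):
    """Pairwise cosine similarity between rows of a (compounds x features) matrix."""
    sig = np.asarray(signatures, dtype=np.float64)
    norms = np.linalg.norm(sig, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)  # leave all-zero signatures as zeros
    unit = sig / norms
    return unit @ unit.T

def nearest_neighbor(similarity):
    """Index of the most similar other compound for each compound."""
    sim = similarity.copy()
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    return sim.argmax(axis=1)

# Hypothetical compounds with made-up 3-gene signatures.
compound_names = ["cmpd_A", "cmpd_B", "cmpd_C"]
signatures = np.array([[2.0, 0.0, -1.0],
                       [4.0, 0.0, -2.0],
                       [-1.0, 3.0, 0.0]])
sim = cosine_similarity_matrix(signatures)
neighbors = nearest_neighbor(sim)
print(compound_names[neighbors[0]])
```

In this toy case cmpd_A and cmpd_B point in the same direction, so each is the other's nearest neighbor even though their magnitudes differ.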

Developability (Antibodies)

Refers to an antibody’s manufacturability, stability, and clinical viability. GDPa1 includes 10 assays for 246 antibodies: aggregation, hydrophobicity, thermostability, polyreactivity, etc.

Multi-omics

Combining transcriptomic, proteomic, and imaging data for comprehensive biological modeling. GDPx aligns DRUG-seq and Cell Painting data across conditions.

LOPAC1280

A reference library of 1,280 bioactive compounds used in GDPx1 and GDPx2.


So, how was this data collected?

Perturbation response profiling

Ginkgo has developed the Response Analytics from Perturbations for Intelligent Discovery (RAPID) platform to profile cellular responses to chemical, biological, and genetic perturbations. Through automation and analytics, the platform characterizes compound effects and pathway modulation. It automates the entire process, from cell culturing to perturbation to data analysis, and offers the following readouts:

  • DRUG-seq: generates large transcriptomic datasets for use cases ranging from target identification to toxicity profiling. DRUG-seq sequences the 3' ends of transcripts to quantify mRNA abundance efficiently.
  • High-content imaging: provides high-throughput readouts of cellular morphology. Fluorescent dyes label cellular components to build rich morphological profiles of cellular responses.
  • Combined approach: integrating transcriptomic and morphological data gives a fuller picture of cellular responses to perturbations than either readout alone, supporting mechanism-of-action analysis and prediction of compound effects.

Antibody developability

To generate the large-scale, structured datasets needed for training machine learning models on antibody properties, Ginkgo has developed a high-throughput experimental platform called PROPHET-Ab.

This platform automates the process of producing and characterizing therapeutic antibodies—essentially turning the wet-lab workflow into a data-generation pipeline designed with ML in mind.

The process starts with transient expression of antibody candidates in mammalian cell systems (HEK or CHO cells). The platform is compatible with a wide range of antibody formats, including full-length IgGs, single-domain antibodies (VHHs), and more complex multispecifics.

Each antibody is then evaluated through a suite of standardized biophysical and functional assays that measure:

  • Production quality (e.g. yield, purity, aggregation)
  • Biophysical traits (e.g. hydrophobicity, self-interaction, thermostability)
  • Pharmacokinetics (e.g. clearance potential via Fc receptor binding)
  • Functionality (e.g. how well the antibody binds to its target)

The results from these assays are automatically tracked, quality-controlled, and formatted into structured, tabular data with both raw measurements and curated features.


What's in the GDPx and GDPa Data?

GDPx1: DRUG-seq + Chemical Perturbation in A549 Cells

  • Context: 1,264 compounds from LOPAC1280 tested at 2 concentrations in A549 lung carcinoma cells
  • Data:
    • Metadata
    • UMI count table
  • Uses:
    • Predicting perturbation–response
    • Transcriptomic representation learning
    • Drug response benchmarking

GDPx2: DRUG-seq + Chemical Perturbation in 4 Primary Cell Types

  • Context: 85 compounds, 6 doses, 4 cell types (e.g., myoblasts, melanocytes)
  • Data:
    • Metadata
    • UMI count table
    • Dose-response and pathway tables (see preprint)
  • Uses:
    • Cell-type-specific drug modeling
    • Dose-dependent gene program learning
    • Transfer learning
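With several doses per compound, one way to summarize a response is to fit a Hill curve and read off an EC50. The sketch below is a minimal illustration rather than the method behind the dataset's dose-response tables: it fixes the Hill slope, scans candidate EC50 values on a log grid, and solves for the bottom and top plateaus by linear least squares. The doses and responses are made up.

```python
import numpy as np

def hill(dose, bottom, top, ec50, slope=1.0):
    """Hill curve rising from `bottom` to `top` with midpoint `ec50`."""
    return bottom + (top - bottom) / (1.0 + (ec50 / dose) ** slope)

def fit_ec50(doses, responses, slope=1.0, n_grid=200):
    """Grid search over EC50; bottom and top are solved by least squares at each step."""
    doses = np.asarray(doses, dtype=np.float64)
    responses = np.asarray(responses, dtype=np.float64)
    candidates = np.logspace(np.log10(doses.min()) - 1,
                             np.log10(doses.max()) + 1, n_grid)
    best = None
    for ec50 in candidates:
        frac = 1.0 / (1.0 + (ec50 / doses) ** slope)
        # response = bottom * (1 - frac) + top * frac is linear in (bottom, top)
        design = np.column_stack([1.0 - frac, frac])
        coef, *_ = np.linalg.lstsq(design, responses, rcond=None)
        sse = float(np.sum((design @ coef - responses) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], ec50)
    _, bottom, top, ec50 = best
    return bottom, top, ec50

# Six hypothetical doses and a noiseless synthetic response with EC50 = 0.3.
toy_doses = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0])
toy_responses = hill(toy_doses, bottom=0.0, top=2.0, ec50=0.3)
bottom_hat, top_hat, ec50_hat = fit_ec50(toy_doses, toy_responses)
print(round(ec50_hat, 2))
```

The recovered EC50 is only as precise as the grid spacing; a nonlinear optimizer that also fits the slope would be a natural refinement for real data.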

GDPx3: Cell Painting + Chemical Perturbation in 3 Primary Cell Types

  • Context: 40 compounds, 4 doses, 2 timepoints, 4 cell types (e.g., fibroblasts, endothelial cells)
  • Data:
    • Metadata
    • .tiff images
  • Uses:
    • Modeling morphological responses
    • Cross-modal embedding comparison
    • Multi-modal learning
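Raw 16-bit microscopy images often use only part of the 0 to 65535 range, so vision models usually benefit from per-channel rescaling. The sketch below assumes the channels have already been read into a (channels, height, width) array; that layout, the five-channel stack, and the percentile cutoffs are assumptions for illustration, and the image itself is random data.

```python
import numpy as np

def rescale_channels(image, low_pct=1.0, high_pct=99.0):
    """Percentile-clip each channel of a (channels, H, W) image into [0, 1]."""
    image = np.asarray(image, dtype=np.float32)
    out = np.empty_like(image)
    for c in range(image.shape[0]):
        lo, hi = np.percentile(image[c], [low_pct, high_pct])
        span = hi - lo if hi > lo else 1.0  # avoid dividing by zero on flat channels
        out[c] = np.clip((image[c] - lo) / span, 0.0, 1.0)
    return out

# Random stand-in for a 5-channel, 16-bit field of view (not real GDPx3 data).
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 60000, size=(5, 64, 64), dtype=np.uint16)
scaled = rescale_channels(fake_image)
print(scaled.shape, scaled.dtype)
```

Clipping at percentiles rather than the absolute minimum and maximum keeps a few saturated pixels from compressing the rest of the intensity range.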

GDPa1: Antibody Developability Dataset (246 IgGs, 10 Assays)

  • Context: Biophysical metrics from 10 developability assays
  • Data:
    • Sequences
    • Processed assay data
    • Raw tidy assay data
    • Literature comparison tables (see preprint)
  • Uses:
    • Sequence-to-property prediction
    • Pretrained model evaluation
    • Thermostability and developability benchmarks
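As a point of reference for sequence-to-property prediction, a very simple baseline is ridge regression on amino-acid composition. The sketch below is a toy setup under stated assumptions: the sequences are short made-up fragments, the target values are invented, and the helper names are hypothetical. The actual column names in GDPa1 should be checked against the dataset card.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=np.float64)
    total = counts.sum()
    return counts / total if total > 0 else counts

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept

# Made-up sequence fragments and made-up property values, for illustration only.
toy_sequences = ["EVQLVESGGG", "QVQLQQSGAE", "DIQMTQSPSS", "EIVLTQSPGT"]
toy_targets = np.array([68.0, 71.5, 65.2, 70.1])

X = np.stack([composition_features(s) for s in toy_sequences])
weights, intercept = fit_ridge(X, toy_targets, alpha=0.1)
predictions = X @ weights + intercept
print(predictions.shape)
```

Composition features ignore residue order, so this is best treated as a floor against which learned sequence embeddings can be compared.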

Get Started

By releasing the full GDPx and GDPa dataset series on Hugging Face, Ginkgo Datapoints is supporting open research in drug discovery.

Learn more: https://datapoints.ginkgo.bio/

Explore the datasets now:

Functional Genomics:

Antibody Developability:


Load with Hugging Face Datasets:

from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("ginkgo-datapoints/GDPx1")
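Because split and column names differ between the datasets, it can help to inspect what was loaded before writing any modeling code. The helper below is a small sketch that makes no assumptions about those names; `summarize_splits` is a hypothetical function, and the stand-in object only exists so the snippet runs offline.

```python
from types import SimpleNamespace

def summarize_splits(dataset_dict):
    """Map each split name to its row count and column names."""
    return {
        name: {"num_rows": split.num_rows, "columns": list(split.column_names)}
        for name, split in dataset_dict.items()
    }

# Offline stand-in for a loaded DatasetDict; the split and column names are made up.
fake_ds = {"train": SimpleNamespace(num_rows=3, column_names=["sample_id", "counts"])}
summary = summarize_splits(fake_ds)
print(summary)
```

With the real data, calling `summarize_splits(ds)` on the object returned by `load_dataset` should give the same kind of overview. If a repository defines several configurations, `datasets.get_dataset_config_names` can list them before loading.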

We can't wait to see what you build!
