Structure-derived synthetic sequences guide a protein language model toward metalloproteins

This paper is a preprint and has not been certified by peer review.


Authors

Peteani, G.; Sgueglia, G.; Lemmin, T.; Chino, M.

Abstract

Motivation

Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality.

Results

We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated by the inverse-folding model ProteinMPNN, constructing training sets with controlled variation in size and diversity. Fine-tuning increased recovery of canonical metal-binding motifs from 43% in the baseline model to 91% in the fine-tuned models. Generated sequences retained high predicted structural confidence and structural similarity to known folds, despite low sequence identity. Analysis of latent representations from ProtGPT2 indicated that fine-tuned models occupy distinct regions of embedding space relative to both the baseline model and structure-conditioned sequences, consistent with partial incorporation of structural constraints while preserving sequence diversity. A multi-step filtering pipeline applied to sequences lacking canonical motifs identified candidate metal-binding sites in four-helical bundle topologies not detected in a non-redundant subset of Protein Data Bank structures or in AlphaFold-predicted proteomes.

Availability and implementation

Code, trained models, and datasets are available at: https://doi.org/10.5281/zenodo.18672158 and https://huggingface.co/gsgueglia.
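To make the fine-tuning step concrete, the sketch below shows one standard way to continue causal language-model training of ProtGPT2 on a file of ProteinMPNN-generated sequences using the Hugging Face transformers Trainer. The file name, sequence length cap, and hyperparameters are illustrative assumptions, not the authors' settings; consult the linked Zenodo and Hugging Face repositories for the actual training configuration.

# Hypothetical sketch: fine-tuning ProtGPT2 on structure-derived sequences.
# Paths and hyperparameters are illustrative, not the paper's settings.
from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

# Assumed input: one ProteinMPNN-designed sequence per line.
dataset = load_dataset("text", data_files={"train": "mpnn_sequences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM loss

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="protgpt2-metallo",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()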
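The motif-recovery rate (43% baseline versus 91% fine-tuned) implies a scan of generated sequences against a fixed set of metal-binding patterns. A minimal sketch of such a scan follows; the CXXC and HXXXH patterns are common metal-coordination motifs chosen here for illustration only, as the abstract does not specify which canonical motifs were scored.

# Hypothetical sketch of a canonical-motif scan; the patterns are
# illustrative examples, not necessarily those used in the paper.
import re

MOTIFS = {
    "CXXC": re.compile(r"C..C"),    # e.g. thiolate-based metal coordination
    "HXXXH": re.compile(r"H...H"),  # e.g. di-histidine sites on a helix
}

def motif_hits(sequence: str) -> list[str]:
    """Return the names of motifs present in a protein sequence."""
    return [name for name, pat in MOTIFS.items() if pat.search(sequence)]

def recovery_rate(sequences: list[str]) -> float:
    """Fraction of sequences containing at least one canonical motif."""
    return sum(bool(motif_hits(s)) for s in sequences) / len(sequences)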
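The embedding-space comparison between baseline and fine-tuned models presumably rests on per-sequence representations extracted from ProtGPT2's hidden states. One common recipe, shown below under the assumption of mean pooling over the final hidden layer (the abstract does not state the pooling choice), is:

# Hypothetical sketch: a mean-pooled ProtGPT2 embedding for one sequence,
# usable for comparing baseline and fine-tuned models in latent space.
import torch

@torch.no_grad()
def embed(model, tokenizer, sequence: str) -> torch.Tensor:
    inputs = tokenizer(sequence, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    # Average the last hidden layer over sequence positions -> (hidden_dim,)
    return outputs.hidden_states[-1].mean(dim=1).squeeze(0)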
