Science Cast

Guidance for high-quality functional gene embeddings from large language models

librarian, May 5, 2026 4:56am

This paper is a preprint and has not been certified by peer review.

bioRxiv, May 4, 2026 12:00am

Authors

Huang, R.; Hou, Y.; Zhao, W.; Zhang, J.; Lu, J.; Kong, Y.; Xu, P.

Abstract

Large language models (LLMs) are increasingly used to generate gene embeddings, yet systematic benchmarks of prompting strategies and practical guidance for obtaining biologically meaningful representations remain limited. Here we present GEbench, an evaluation framework for assessing LLM-derived gene embeddings across different tasks, prompting strategies, and LLM architectures. GEbench revealed that embedding quality depends primarily on whether the input text contains explicit functional information, rather than on sparse gene identifiers or model size. Identifier-based embeddings showed weak biological organization, whereas embeddings derived from functional descriptions consistently achieved stronger functional separation and predictive performance. Notably, Self-Des, which extracts embeddings from model-generated gene function descriptions, enabled locally deployable LLMs to generate high-fidelity representations that approach the quality of expert-curated databases. Genome-scale analyses further supported these findings, indicating that explicit functional descriptions are an effective design principle for generating high-quality gene embeddings from LLMs.
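The contrast between identifier-based and description-based embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: `describe_gene` is a hypothetical stand-in for prompting an LLM, the descriptions are hand-written placeholders, and `embed_text` is a toy bag-of-words embedding used in place of a real LLM encoder. The function name `self_des_embedding` is likewise an assumption, chosen only to mirror the two-step idea of describing a gene first and then embedding the description.

```python
import math
from collections import Counter

# Hand-written placeholder descriptions (assumption: stand-ins for LLM output).
DESCRIPTIONS = {
    "TP53": "tumor suppressor that responds to dna damage and regulates cell cycle arrest",
    "BRCA1": "tumor suppressor that repairs dna damage through homologous recombination",
    "INS": "hormone that regulates glucose uptake and metabolism",
}


def describe_gene(symbol: str) -> str:
    """Hypothetical stand-in for asking an LLM to describe a gene's function."""
    return DESCRIPTIONS[symbol]


def embed_text(text: str) -> Counter:
    """Toy bag-of-words embedding; a real pipeline would use an LLM encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identifier_embedding(symbol: str) -> Counter:
    """Embed the bare gene symbol, with no functional text."""
    return embed_text(symbol)


def self_des_embedding(symbol: str) -> Counter:
    """Embed a generated functional description instead of the symbol."""
    return embed_text(describe_gene(symbol))


if __name__ == "__main__":
    for other in ("BRCA1", "INS"):
        by_id = cosine(identifier_embedding("TP53"), identifier_embedding(other))
        by_desc = cosine(self_des_embedding("TP53"), self_des_embedding(other))
        print(f"TP53 vs {other}: identifier={by_id:.3f} description={by_desc:.3f}")
```

In this toy setting the bare symbols share no tokens, so their similarity is zero, while the two tumor suppressor descriptions overlap more with each other than with the insulin description. The zero similarity is an artifact of the bag-of-words stand-in and only illustrates the design principle, not the magnitude of any effect reported in the paper.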
