Deciphering the links between metabolism and health by building small-scale knowledge graphs: application to endometriosis and persistent pollutants
Mathe, M.; Laisney, G.; Filangi, O.; Giacomoni, F.; Delmas, M.; Cano-Sancho, G.; Jourdan, F.; Frainay, C.
Abstract
Knowledge graphs (KGs) are a robust formalism for structuring biomedical knowledge, but large-scale KGs often require complex queries, are difficult for non-experts to explore, and lack real-world context (such as experimental data, clinical conditions, and patient symptoms). This limits their usability for addressing specific research questions. We present Kg4j, a computational framework built on FORVM (a large-scale KG containing 82 million associations between compounds and biological concepts) that constructs local, keyword-based sub-graphs tailored to specific biomedical research questions. The resulting graphs support hypothetical relationships and can integrate experimental datasets, enabling the discovery of plausible but as yet unknown connections. Starting from a conceptual definition of a research field of interest (e.g., disease, symptoms, exposure), the framework extracts relevant associations from FORVM and identifies potential biological mechanisms and chemical compounds. We applied this approach to endometriosis, exploring links between exposure to Persistent Organic Pollutants (POPs) and disease risk. We propose a novel validation strategy that compares the resulting sub-graph (2,706 nodes and 23,243 edges, 0.002% of FORVM) with recent scientific literature, showing consistency with known findings while also revealing new hypothetical associations that require further investigation. We also show that removing duplicated nodes and edges from the KG increases the proportion of validated nodes (from 8.4% to 16%) and more than doubles the precision (from 0.085 to 0.197) while leaving the recall essentially unchanged (0.954 to 0.952). This illustrates a trade-off between the loss of potentially relevant but redundant information and the reliability of the remaining associations. By combining automated knowledge mining with experimental data integration, the framework supports reproducible, context-based exploration of biomedical knowledge and systematic hypothesis generation. Applied to endometriosis, it highlights potential mechanisms linking exposure to POPs to the aetiology of the disease, offering a scalable strategy for constructing disease-specific KGs.
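The effect of deduplication on precision and recall reported above can be illustrated with a minimal sketch. It is not the Kg4j implementation: the node representation (identifier, label), the label normalisation rule, the helper names `normalise`, `precision_recall` and `deduplicate`, and the toy data are all assumptions made for illustration. Here a node counts as validated when its normalised label appears in a literature-derived reference set.

```python
def normalise(label):
    """Lower-case a label and collapse whitespace (hypothetical matching rule)."""
    return " ".join(label.lower().split())


def precision_recall(nodes, reference):
    """Score a list of (node_id, label) pairs against a set of reference labels.

    precision = validated nodes / all nodes in the sub-graph
    recall    = reference concepts found in the sub-graph / all reference concepts
    """
    ref = {normalise(r) for r in reference}
    labels = [normalise(label) for _, label in nodes]
    validated = sum(1 for label in labels if label in ref)
    precision = validated / len(labels) if labels else 0.0
    recall = len(ref & set(labels)) / len(ref) if ref else 0.0
    return precision, recall


def deduplicate(nodes):
    """Keep the first node seen for each normalised label."""
    unique = {}
    for node_id, label in nodes:
        unique.setdefault(normalise(label), (node_id, label))
    return list(unique.values())


# Toy data (invented for illustration, not taken from FORVM or the study).
nodes = [
    ("n1", "Estradiol"),
    ("n2", "Dioxins"),
    ("n3", "Inflammation"),
    ("n4", "inflammation"),
    ("n5", "Oxidative stress"),
    ("n6", "oxidative  stress"),
]
reference = {"estradiol", "dioxins", "progesterone"}

before = precision_recall(nodes, reference)
after = precision_recall(deduplicate(nodes), reference)
print("before deduplication:", before)
print("after deduplication: ", after)
```

In this toy case the duplicated nodes are all unvalidated, so removing them raises precision while recall is unchanged. If validated nodes were duplicated instead, precision could move the other way, which is why the direction of the effect depends on the data.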