Machine learning approaches for the identification and analysis of enterotoxin genes in Staphylococcus aureus genomes
Uttin, A.; Leggett, R.; Moulton, V.; Dicks, J.
Abstract
Staphylococcus aureus produces a broad range of enterotoxins that act as superantigens, disrupting host immune responses and resulting in a myriad of clinical symptoms. However, large-scale analyses determining enterotoxin gene diversity, lineage structure and isolate metadata remain scarce. We analysed 15,887 S. aureus RefSeq genomes using a machine learning pipeline combining profile Hidden Markov Model-based enterotoxin gene identification, lineage typing, gene profile-based strain clustering and association rule mining using a broad range of gene and metadata features. This approach identified 35 distinct enterotoxin genes and five variant forms, including two putative novel enterotoxin genes, sel34 and sel35. HDBSCAN clustering distinguished 45 enterotoxin gene profile groups, revealing strong associations between the two major egc enterotoxin gene cluster variants (OMIWNG and OMIUNG) and Clonal Complex membership: CC5, CC22 and CC45 with OMIWNG; CC30 and CC121 with OMIUNG. Integration of isolate metadata exposed distinct geographic and temporal trends, including a recent rise in non-egc lineages derived from Asia and animal sources. These findings show that S. aureus enterotoxin diversity is structured by lineage, mobile genetic element composition and Clonal Complex association. The discovery of sel34 and sel35, together with the comprehensive overview of lineage-specific enterotoxin profiles, expands current understanding of S. aureus virulence evolution and provides a scalable analytical framework for monitoring toxin gene dynamics in clinical and environmental populations.
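To illustrate how association rule mining can relate egc variants to Clonal Complex membership, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the abstract does not state which algorithm, thresholds or feature encoding were used. The isolate records, the feature labels, the function names `support` and `pairwise_rules`, and the threshold values are all assumptions made for this example, and the toy data are invented rather than taken from the study.

```python
from itertools import permutations

# Toy, invented isolates (NOT data from the study). Each isolate is a set of
# feature labels: one egc variant label and one Clonal Complex label.
isolates = [
    {"egc=OMIWNG", "CC=CC5"},
    {"egc=OMIWNG", "CC=CC5"},
    {"egc=OMIWNG", "CC=CC22"},
    {"egc=OMIUNG", "CC=CC30"},
    {"egc=OMIUNG", "CC=CC30"},
    {"egc=OMIUNG", "CC=CC121"},
    {"egc=none", "CC=CC8"},
    {"egc=none", "CC=CC8"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pairwise_rules(transactions, min_support=0.2, min_confidence=0.6):
    """Score single-item rules A -> C by support, confidence and lift.

    The thresholds are arbitrary illustrative defaults.
    """
    items = sorted(set().union(*transactions))
    rules = []
    for antecedent, consequent in permutations(items, 2):
        s_both = support({antecedent, consequent}, transactions)
        if s_both < min_support:
            continue
        confidence = s_both / support({antecedent}, transactions)
        if confidence < min_confidence:
            continue
        # Lift > 1 means the two features co-occur more often than expected
        # if they were independent.
        lift = confidence / support({consequent}, transactions)
        rules.append({
            "antecedent": antecedent,
            "consequent": consequent,
            "support": s_both,
            "confidence": confidence,
            "lift": lift,
        })
    # Strongest associations first.
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


rules = pairwise_rules(isolates)
for r in rules:
    print(f'{r["antecedent"]} -> {r["consequent"]}: '
          f'support={r["support"]:.2f}, '
          f'confidence={r["confidence"]:.2f}, lift={r["lift"]:.2f}')
```

A real analysis over thousands of genomes and many metadata features would normally use a dedicated frequent-itemset algorithm such as Apriori or FP-Growth and would consider multi-item antecedents. The pairwise version here is only meant to show what the support, confidence and lift of a rule such as "CC5 implies OMIWNG" measure.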