
Development of a fast feature extraction method for SARS-CoV-2 spike sequences using amino acid physicochemical properties


COVID-19 continues to spread, and the resulting accumulation of SARS-CoV-2 mutations in databases has made large genomic datasets available. However, the size of these datasets makes it challenging to use all of the sequence data without random sampling. A major difficulty for downstream analyses is the increase in dimensionality that occurs when sequences are converted into numerical values with conventional amino acid representation methods, such as one-hot encoding and k-mer-based approaches, which directly reflect the sequences. Moreover, these representations lack physicochemical characteristics, such as structural information and hydrophilicity; hence, they fail to accurately represent the inherent function of the given sequences. In this study, we used the physicochemical properties of amino acids to develop a rapid and efficient approach for extracting feature parameters suitable for downstream machine learning processes, such as clustering. A fixed-length feature vector representation of a spike sequence with reduced dimensionality was obtained by converting amino acid residues into physicochemical parameters. Next, t-distributed stochastic neighbor embedding (t-SNE), a method for dimensionality reduction and visualization of high-dimensional data, was performed, followed by density-based spatial clustering of applications with noise (DBSCAN). The results show that by using the physicochemical properties of amino acids, rather than conventional methods that convert sequences directly into numerical values, SARS-CoV-2 spike sequences can be clustered with sufficient accuracy and a shorter runtime. Interestingly, the clusters obtained using amino acid properties include subclusters that are distinct from those produced by the direct representation of amino acid sequences.
A more detailed analysis indicated that the parameters contributing to the novel subclusters, which were identified only when using the physicochemical properties of amino acids, differ significantly from one another. This suggests that representing amino acid sequences by their physicochemical properties might enable the identification of clusters with greater sensitivity than conventional methods.
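
The pipeline described above (residues converted to physicochemical parameters, a fixed-length vector, then t-SNE and DBSCAN) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which properties were used or how the vector was built, so the choice of the Kyte-Doolittle hydropathy index and a simplified side-chain charge, the per-segment averaging, and the names `physchem_features` and `embed_and_cluster` are all assumptions made for illustration. The second function assumes NumPy and scikit-learn are installed.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge near neutral pH (assumption: histidine
# and all other residues are treated as uncharged).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def physchem_features(seq, n_bins=4):
    """Return a fixed-length vector of length 2 * n_bins.

    The sequence is split into n_bins contiguous segments; the vector
    holds the mean hydropathy of each segment, followed by the mean
    charge of each segment. Its length does not depend on len(seq).
    """
    # Drop gaps and ambiguous codes (e.g. '-', 'X') before averaging.
    residues = [r for r in seq.upper() if r in HYDROPATHY]
    if len(residues) < n_bins:
        raise ValueError("sequence is shorter than the number of bins")

    scales = (
        lambda r: HYDROPATHY[r],
        lambda r: CHARGE.get(r, 0.0),
    )
    features = []
    for scale in scales:
        for b in range(n_bins):
            start = b * len(residues) // n_bins
            end = (b + 1) * len(residues) // n_bins
            segment = residues[start:end]
            features.append(sum(scale(r) for r in segment) / len(segment))
    return features


def embed_and_cluster(vectors, perplexity=30.0, eps=0.5, min_samples=5):
    """Embed feature vectors in 2D with t-SNE, then cluster with DBSCAN.

    The parameter values are placeholders, not values from the study.
    t-SNE requires perplexity to be smaller than the number of samples.
    """
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.manifold import TSNE

    X = np.asarray(vectors, dtype=float)
    embedding = TSNE(
        n_components=2, perplexity=perplexity, random_state=0
    ).fit_transform(X)
    # DBSCAN labels noise points as -1.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding)
    return embedding, labels
```

With `n_bins=4`, each sequence maps to 8 numbers regardless of its length, whereas a one-hot encoding of a sequence of length L over 20 residue types needs 20 × L values. Under these assumptions, a list of spike sequences could be processed as `embed_and_cluster([physchem_features(s) for s in sequences])`.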