Preprocessing of amino acid chains of antibody structure for machine learning analysis

Authors

  • Manuel Erazo Valadez, Centro Nacional de Investigación y Desarrollo Tecnológico
  • María Yasmin Hernández Pérez, Centro Nacional de Investigación y Desarrollo Tecnológico
  • Elizabeth Ernestina Godoy Lozano, https://orcid.org/0000-0001-6927-9132
  • Javier Ortiz Hernández, Centro Nacional de Investigación y Desarrollo Tecnológico
  • Juan Mauricio Téllez Sosa, Instituto Nacional de Salud Pública
  • Juan José Flores Sedano, Centro Nacional de Investigación y Desarrollo Tecnológico
  • Patricia Alejandra Cuevas Chavez, Centro Nacional de Investigación y Desarrollo Tecnológico

DOI:

https://doi.org/10.61467/2007.1558.2026.v17i1.1030

Keywords:

Antibody classification, amino acid sequence representation

Abstract

Antibody classification is a task of growing importance in bioinformatics. In recent years, identifying antibodies capable of recognising and neutralising SARS-CoV-2 has become a central focus of immunological research. Representing antibodies is challenging because their structure and function are highly variable, which complicates the development of a universal classification framework. Antibodies are composed of heavy and light chains containing hypervariable complementarity-determining regions (CDRs), which define their specificity. This variability creates substantial challenges for sequence alignment, feature extraction, and classification. This study compared three methods for representing amino acid sequences: TF-IDF, Atchley factors, and ProtVec. A separate dataset was generated for each representation, and each was evaluated using decision trees, logistic regression, and support vector machines. The results suggest that the representation based on Atchley factors achieved comparatively stronger performance in antibody classification.
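
To make the idea of a sequence representation concrete, the following is a minimal sketch of one of the three approaches named in the abstract, TF-IDF, applied to amino acid sequences. The abstract does not state how sequences were tokenised or weighted, so everything here is an assumption for illustration: sequences are split into overlapping 3-mers, term frequency is the relative k-mer count, and inverse document frequency uses one common smoothed variant. The function names `kmers` and `tfidf_vectors` are hypothetical, and the three input strings are invented toy fragments, not data from the study.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split an amino acid sequence into overlapping k-mers (the "words")."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_vectors(sequences, k=3):
    """Return (vocabulary, vectors) with one TF-IDF vector per sequence.

    Assumed weighting (not taken from the paper):
      tf  = k-mer count / number of k-mers in the sequence
      idf = ln((1 + n) / (1 + df)) + 1   (smoothed)
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Document frequency: in how many sequences each k-mer occurs.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1  # guard against sequences shorter than k
        vectors.append([d[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Invented toy fragments, used only to show the shape of the output.
toy = ["CARDYW", "CARGYW", "CQQSYT"]
vocab, vectors = tfidf_vectors(toy, k=3)
for seq, vec in zip(toy, vectors):
    print(seq, [round(w, 3) for w in vec])
```

Each sequence becomes a fixed-length numeric vector indexed by the shared k-mer vocabulary, which is the form of input that classifiers such as decision trees, logistic regression, and support vector machines expect. A k-mer shared by several sequences (here `CAR`) receives a lower weight than one that occurs in a single sequence.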

 



Published

2026-01-02

How to Cite

Erazo Valadez, M., Hernández Pérez, M. Y., Godoy Lozano, E. E., Ortiz Hernández, J., Téllez Sosa, J. M., Flores Sedano, J. J., & Cuevas Chavez, P. A. (2026). Preprocessing of amino acid chains of antibody structure for machine learning analysis. International Journal of Combinatorial Optimization Problems and Informatics, 17(1), 196–214. https://doi.org/10.61467/2007.1558.2026.v17i1.1030

Section

Articles
