Fine tuning Transformers models for converting handwritten scientific texts into LaTeX format

Authors

  • Ricardo Alvarez Perez Laboratorio de Inteligencia Artificial, CIC IPN https://orcid.org/0009-0006-1457-770X
  • Ricardo Barrón Fernandez Laboratorio de Inteligencia Artificial, CIC IPN

DOI:

https://doi.org/10.61467/2007.1558.2026.v17i2.1266

Keywords:

OCR, Text detection, Text recognition, DETR, TrOCR, Transformers, LaTeX

Abstract

This study introduces a methodology for optical character recognition (OCR) that leverages transformer-based architectures to enhance the detection and recognition of textual content within images. The approach integrates state-of-the-art models, employing DETR (Detection Transformer) to generate bounding boxes for candidate text sequences and TrOCR to transcribe the text contained within these regions. Both models were fine-tuned on a proprietary dataset of handwritten and digitized notes from mathematics-related subjects, including differential equations, calculus, linear algebra, and programming. The dataset consists predominantly of mathematical expressions represented in LaTeX format, allowing the proposed method to effectively address the recognition of complex symbolic content in mathematical texts.
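The abstract describes a two-stage pipeline: a detector (DETR) proposes bounding boxes for text regions, and a recognizer (TrOCR) transcribes each region into LaTeX. The glue between the stages can be sketched as below. This is an illustrative sketch, not the authors' implementation: the recognizer is passed in as a callable (in practice a fine-tuned TrOCR model applied to each crop), and the names `Detection`, `reading_order`, `transcribe_page`, the confidence threshold, and the line tolerance are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class Detection:
    box: Box      # region proposed by the detector
    score: float  # detector confidence in [0, 1]

def reading_order(dets: List[Detection], line_tol: int = 20) -> List[Detection]:
    """Sort boxes roughly top-to-bottom, then left-to-right within a line.

    Boxes whose top edges fall in the same `line_tol`-pixel band are
    treated as one text line and ordered by their x-coordinate.
    """
    return sorted(dets, key=lambda d: (d.box[1] // line_tol, d.box[0]))

def transcribe_page(dets: List[Detection],
                    recognize: Callable[[Box], str],
                    min_score: float = 0.7) -> str:
    """Filter low-confidence detections, order the rest, and join the
    per-region transcriptions (e.g. LaTeX strings) into one sequence."""
    kept = [d for d in dets if d.score >= min_score]
    return " ".join(recognize(d.box) for d in reading_order(kept))
```

In a full system, `recognize` would crop the region from the page image and run it through the fine-tuned TrOCR processor/model; the filtering and reading-order step above is the part that turns independent box-level outputs into a coherent document transcription.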

 


References

Blecher, L., Cucurull, G., Scialom, T., & Stojnic, R. (2023). Nougat: Neural optical understanding for academic documents. arXiv. https://doi.org/10.48550/arXiv.2308.13418

Carion, N., Massa, F., Synnaeve, G., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. arXiv. https://arxiv.org/abs/2005.12872

Chaudhuri, A., Mandaviya, K., Badelia, P., & Ghosh, S. K. (2016). Optical character recognition systems for different languages with soft computing. In Studies in Fuzziness and Soft Computing. Springer. https://doi.org/10.1007/978-3-319-50252-6

Javed, M., Nagabhushan, P., & Chaudhuri, B. (2013). Extraction of projection profile, run-histogram and entropy features straight from run-length compressed text-documents. In Proceedings of the 2nd IAPR Asian Conference on Pattern Recognition (ACPR). IEEE. https://doi.org/10.1109/ACPR.2013.147

Khanam, R., & Hussain, M. (2024). YOLOv11: An overview of the key architectural enhancements. arXiv. https://arxiv.org/abs/2410.17725

Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., & Park, S. (2022). OCR-free document understanding transformer. arXiv. https://arxiv.org/abs/2111.15664

Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2021). TrOCR: Transformer-based optical character recognition with pre-trained models. arXiv. https://arxiv.org/abs/2109.10282

Li, Z., Liu, F., Yang, W., Peng, S., & Zhou, J. (2022). A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems, 33(12), 6999–7019. https://doi.org/10.1109/TNNLS.2021.3136258

Mienye, I. D., Swart, T. G., & Obaido, G. (2024). Recurrent neural networks: A comprehensive review of architectures, variants, and applications. Information, 15(9), 517. https://doi.org/10.3390/info15090517

Mor, B., Garhwal, S., & Kumar, A. (2020). A systematic review of hidden Markov models and their applications. Archives of Computational Methods in Engineering, 28(3), 1429–1448. https://doi.org/10.1007/s11831-020-09422-4

Mutlag, W. K., Ali, S. K., Aydam, Z. M., & Taher, B. H. (2020). Feature extraction methods: A review. Journal of Physics: Conference Series, 1591(1), 012028. https://doi.org/10.1088/1742-6596/1591/1/012028

Raisi, Z., Naiel, M. A., Younes, G., Wardell, S., & Zelek, J. S. (2021). Transformer-based text detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 3162–3171). IEEE.

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. arXiv. https://arxiv.org/abs/1506.02640

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. https://arxiv.org/abs/1706.03762

Yaseen, M. (2024). What is YOLOv8: An in-depth exploration of the internal features of the next-generation object detector. arXiv. https://arxiv.org/abs/2408.15857v1

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., & Shum, H.-Y. (2022). DINO: DETR with improved de-noising anchor boxes for end-to-end object detection. arXiv. https://arxiv.org/abs/2203.03605

Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv. https://arxiv.org/abs/2010.04159

Published

2026-02-16

How to Cite

Alvarez Perez, R., & Barrón Fernandez, R. (2026). Fine tuning Transformers models for converting handwritten scientific texts into LaTeX format. International Journal of Combinatorial Optimization Problems and Informatics, 17(2), 38–54. https://doi.org/10.61467/2007.1558.2026.v17i2.1266

Section

CINIAI
