Fine-tuning Transformer models for converting handwritten scientific texts into LaTeX format
DOI: https://doi.org/10.61467/2007.1558.2026.v17i2.1266

Keywords: OCR, Text detection, Text recognition, DETR, TrOCR, Transformers, LaTeX

Abstract
This study introduces a methodology for optical character recognition (OCR) that leverages transformer-based architectures to enhance the detection and recognition of textual content within images. The approach integrates state-of-the-art models, employing DETR (Detection Transformer) to generate bounding boxes for candidate text sequences and TrOCR to transcribe the text contained within these regions. Both models were fine-tuned on a proprietary dataset comprising handwritten and digitized notes from mathematics-related subjects, including differential equations, calculus, linear algebra, and programming. The dataset predominantly consists of mathematical expressions represented in LaTeX format, thereby allowing the proposed method to effectively address the recognition of complex symbolic content in mathematical texts.
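The two-stage pipeline described above can be sketched as follows. This is a minimal illustration with stubbed models, not the authors' implementation: in the real system, `detect_fn` would wrap a fine-tuned DETR detector and `recognize_fn` a fine-tuned TrOCR recognizer (e.g. via the Hugging Face `transformers` library); the function names and the `row_tolerance` reading-order heuristic are assumptions introduced here for illustration.

```python
def transcribe_page(image, detect_fn, recognize_fn, row_tolerance=10):
    """Sketch of a detect-then-recognize OCR pipeline.

    detect_fn(image)          -> list of (x, y, w, h) boxes for candidate text regions
    recognize_fn(image, box)  -> LaTeX string transcribed from the cropped region
    """
    boxes = detect_fn(image)
    # Stitch regions in reading order: top-to-bottom, then left-to-right,
    # treating boxes whose y-coordinates fall in the same `row_tolerance`
    # band as belonging to one text line.
    boxes = sorted(boxes, key=lambda b: (b[1] // row_tolerance, b[0]))
    lines = [recognize_fn(image, box) for box in boxes]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy run with stand-in models: two "detected" regions on a fake page.
    fake_image = None  # placeholder; a real pipeline would pass pixel data
    stub_detect = lambda img: [(120, 5, 80, 20), (10, 8, 90, 20)]
    stub_transcripts = {
        (10, 8, 90, 20): r"\int_0^1 x\,dx",
        (120, 5, 80, 20): r"= \frac{1}{2}",
    }
    stub_recognize = lambda img, box: stub_transcripts[box]
    print(transcribe_page(fake_image, stub_detect, stub_recognize))
```

In the toy run, the two boxes share a row band, so the left-most region (`\int_0^1 x\,dx`) is transcribed before the one to its right, mirroring natural reading order.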
References
Blecher, L., Cucurull, G., Scialom, T., & Stojnic, R. (2023). Nougat: Neural optical understanding for academic documents. arXiv. https://doi.org/10.48550/arXiv.2308.13418
Carion, N., Massa, F., Synnaeve, G., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. arXiv. https://arxiv.org/abs/2005.12872
Chaudhuri, A., Mandaviya, K., Badelia, P., & Ghosh, S. K. (2016). Optical character recognition systems for different languages with soft computing. In Studies in Fuzziness and Soft Computing. Springer. https://doi.org/10.1007/978-3-319-50252-6
Javed, M., Nagabhushan, P., & Chaudhuri, B. (2013). Extraction of projection profile, run-histogram and entropy features straight from run-length compressed text-documents. In Proceedings of the IEEE APSIPA Annual Summit and Conference. IEEE. https://doi.org/10.1109/ACPR.2013.147
Khanam, R., & Hussain, M. (2024). YOLOv11: An overview of the key architectural enhancements. arXiv. https://arxiv.org/abs/2410.17725
Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., & Park, S. (2022). OCR-free document understanding transformer. arXiv. https://arxiv.org/abs/2111.15664
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2021). TrOCR: Transformer-based optical character recognition with pre-trained models. arXiv. https://arxiv.org/abs/2109.10282
Li, Z., Liu, F., Yang, W., Peng, S., & Zhou, J. (2022). A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems, 33(12), 6999–7019. https://doi.org/10.1109/TNNLS.2021.3136258
Mienye, I. D., Swart, T. G., & Obaido, G. (2024). Recurrent neural networks: A comprehensive review of architectures, variants, and applications. Information, 15(9), 517. https://doi.org/10.3390/info15090517
Mor, B., Garhwal, S., & Kumar, A. (2020). A systematic review of hidden Markov models and their applications. Archives of Computational Methods in Engineering, 28(3), 1429–1448. https://doi.org/10.1007/s11831-020-09422-4
Mutlag, W. K., Ali, S. K., Aydam, Z. M., & Taher, B. H. (2020). Feature extraction methods: A review. Journal of Physics: Conference Series, 1591(1), 012028. https://doi.org/10.1088/1742-6596/1591/1/012028
Raisi, Z., Naiel, M. A., Younes, G., Wardell, S., & Zelek, J. S. (2021). Transformer-based text detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3162–3171). IEEE.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. arXiv. https://arxiv.org/abs/1506.02640
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. https://arxiv.org/abs/1706.03762
Yaseen, M. (2024). What is YOLOv8: An in-depth exploration of the internal features of the next-generation object detector. arXiv. https://arxiv.org/abs/2408.15857v1
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., & Shum, H.-Y. (2022). DINO: DETR with improved de-noising anchor boxes for end-to-end object detection. arXiv. https://arxiv.org/abs/2203.03605
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv. https://arxiv.org/abs/2010.04159
License
Copyright (c) 2026 International Journal of Combinatorial Optimization Problems and Informatics

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.