AUTOMATED PAYMENT RECEIPT CLASSIFICATION SYSTEM USING OCR AND BERT EMBEDDINGS WITH A MULTI-MODEL MACHINE LEARNING APPROACH

Authors

  • Mitchella Sinta Larasati Universitas Kristen Satya Wacana
  • Suryasatriya Trihandaru Universitas Kristen Satya Wacana
  • Hanna Arini Parhusip Universitas Kristen Satya Wacana

DOI:

https://doi.org/10.23969/jp.v11i01.40994

Keywords:

Optical Character Recognition (OCR), BERT Embedding, Machine Learning, Document Classification, Payment Verification, Multi-Model Classification.

Abstract

Payment receipts in school environments are still verified largely by hand, which is slow and prone to human error. This study proposes an automated system that classifies digital payment receipts as valid or invalid by combining Optical Character Recognition (OCR), BERT (Bidirectional Encoder Representations from Transformers) embeddings, and several machine learning classifiers. The system uses EasyOCR to extract text from receipt images, BERT to generate semantic representations of that text, and four classification algorithms: Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Naive Bayes (NB), and Logistic Regression (LR). The dataset consists of 185 payment receipt samples (149 valid and 36 invalid), collected via Google Forms and stored in a SQLite database. In the experiments, the MLP model achieves the highest accuracy, 97%, at a test size of 0.2 (20% of the samples held out for testing), followed by Logistic Regression at 96.2%, while Naive Bayes performs worst at 85.7%. The system is implemented as a Streamlit-based application that verifies payment receipts in real time, with an average processing time of 1.16 seconds per sample.
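The pipeline described in the abstract (EasyOCR text extraction, BERT embedding, four classifiers compared on a 0.2 test split) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the BERT checkpoint, the pooling strategy, the Naive Bayes variant, or any hyperparameters, so `bert-base-multilingual-cased`, the [CLS] vector, `GaussianNB`, the random seed, and all function names here are assumptions. The `majority_baseline` helper is likewise an added convenience for reading accuracies on an imbalanced dataset.

```python
from typing import Dict, List, Sequence


def majority_baseline(n_valid: int, n_invalid: int) -> float:
    """Accuracy of always predicting the larger class (context for imbalanced data)."""
    return max(n_valid, n_invalid) / (n_valid + n_invalid)


def ocr_text(image_path: str, languages: Sequence[str] = ("id", "en")) -> str:
    """Extract receipt text with EasyOCR and join it into one string."""
    import easyocr  # imported lazily: heavy dependency, fetches models on first use

    reader = easyocr.Reader(list(languages), gpu=False)
    # detail=0 returns only the recognized strings, without boxes or confidences
    return " ".join(reader.readtext(image_path, detail=0))


def bert_embed(texts: List[str], model_name: str = "bert-base-multilingual-cased"):
    """Embed texts as the final-layer [CLS] vector (pooling choice is an assumption)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=512, return_tensors="pt")
        hidden = model(**batch).last_hidden_state
    return hidden[:, 0, :].numpy()  # shape: (len(texts), hidden_size)


def build_classifiers() -> Dict[str, object]:
    """The four model families named in the abstract, with assumed settings."""
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    return {
        "SVM": SVC(),
        "MLP": MLPClassifier(max_iter=1000, random_state=0),
        "NB": GaussianNB(),  # Gaussian variant assumed: embeddings are real-valued
        "LR": LogisticRegression(max_iter=1000),
    }


def evaluate(X, y, test_size: float = 0.2, seed: int = 0) -> Dict[str, float]:
    """Fit each classifier on a stratified split and report test accuracy."""
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    scores = {}
    for name, clf in build_classifiers().items():
        clf.fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, clf.predict(X_te))
    return scores


if __name__ == "__main__":
    # With 149 valid and 36 invalid receipts, always answering "valid" is
    # already right about 80.5% of the time.
    print(f"majority baseline: {majority_baseline(149, 36):.3f}")
```

With labeled receipt images available, usage would look like `evaluate(bert_embed([ocr_text(p) for p in paths]), labels)`, where `paths` and `labels` are hypothetical inputs. Because the classes are imbalanced, the majority baseline gives a useful reference point when comparing the reported accuracies.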

Published

2026-01-24