Applying supervised machine learning algorithms & ensemble models to enhance credit card fraud detection.

Source

Master's thesis

Author

Al-Balushiyah, Abrar Ahmed.

Other titles

تطبيق خوارزميات التعلم الآلي الخاضعة للإشراف والتعلم الجماعي لتعزيز اكتشاف الاحتيال على بطاقات الائتمان.

Country

Oman

City

Muscat

Publisher

Sultan Qaboos University

Gregorian

2024

Language

English

Subject

Credit card fraud--Prevention--Mathematical models--Oman

Machine learning--Mathematical models--Oman

Thesis Type

Master's thesis

English abstract

The alarming rise of credit card fraud poses a significant threat to individuals, financial institutions, businesses, and governments. Fraudsters employ phishing activities to commit fraud and cause significant annual economic loss. To address this challenge, efficient fraud detection systems must be used to identify fraud. Yet identifying credit card fraud is a challenging task due to various factors, including the absence of straightforward techniques, imbalanced credit card datasets, and the lack of standard evaluation metrics to assess the performance of existing techniques. To overcome these challenges, a feasible solution is to leverage data mining and machine learning techniques to detect suspicious transactions. This research aims to enhance credit card fraud detection using machine learning algorithms and ensemble models. Various supervised machine learning algorithms were implemented, including Decision Tree, Logistic Regression, Naïve Bayes, Random Forest, Artificial Neural Network, and XGBoost. Additionally, to tackle the problem of imbalanced datasets, several resampling methods, such as under-sampling and over-sampling, were employed to balance the dataset. Moreover, various techniques were applied for feature selection. The models' performance was assessed using diverse criteria, including Accuracy, Precision, Recall, F1-Score, and Area Under the Curve (AUC). Further models were developed by applying threshold variation to the best-performing models (Random Forest and XGBoost). Ensemble learning models were used to further refine the models' predictions for fraud and non-fraud instances. Three ensemble techniques (Bagging, Boosting, and Stacking) were employed, leveraging the main base models to boost the final fraud predictions and improve the robustness of the accuracy, recall, and precision metrics.
Based on the additional work performed, the ensemble model using the bagging technique gave the best performance, with 0.99 accuracy, ~0.90 recall, and 0.77 precision. The ensemble model employed Decision Tree, Random Forest, and Neural Network base models, each utilizing a different resampling technique. This approach gave the ensemble model diversity and robustness, demonstrating its effectiveness even when tested on an unseen dataset.
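
The evaluation steps named in the abstract (resampling, threshold variation, the reported metrics, and combining base models) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption for illustration: the function names, the toy labels and scores, and the use of a simple majority vote as a stand-in for the ensemble step. It is not the thesis's implementation and does not reproduce its results.

```python
import random

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

def apply_threshold(scores, threshold):
    """Threshold variation: flag a transaction when its score >= threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def random_undersample(X, y, seed=0):
    """Keep every fraud row and an equally sized random sample of the rest.

    Assumes class 1 is the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]

def majority_vote(predictions_per_model):
    """Combine base-model predictions by strict majority (hypothetical stand-in)."""
    n_models = len(predictions_per_model)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*predictions_per_model)]

# Toy data (invented): 8 legitimate and 2 fraudulent transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.4, 0.45, 0.9]

for threshold in (0.5, 0.4):
    print(threshold, metrics(y_true, apply_threshold(scores, threshold)))
```

On this toy data, lowering the threshold from 0.5 to 0.4 raises recall while accuracy stays the same, which is one reason accuracy alone says little on imbalanced fraud data and precision and recall are reported alongside it.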

Arabic abstract

يشكل الارتفاع المثير للقلق في عمليات الاحتيال على بطاقات الائتمان تهديدًا كبيرًا للأفراد، والمؤسسات المالية، والشركات، والحكومات. يستخدم المجرمون أنشطة التصيد الاحتيالي لارتكاب عمليات الاحتيال والتسبب في خسائر اقتصادية سنوية كبيرة. ولمواجهة هذه المشكلة، يجب استخدام أنظمة فعالة للتعرف على الاحتيال واكتشافه بسرعة. ومع ذلك، فإن اكتشاف الاحتيال في بطاقات الائتمان يمثل تحديًا بسبب عدة عوامل، مثل الافتقار إلى تقنيات وأنظمة قياسية، وبيانات بطاقات الائتمان غير متوازنة ولا توجد مصفوفة تقييم قياسية تستخدم لتقييم وقياس أداء التقنيات الحالية. للتغلب على التحديات المذكورة سابقًا، تتمثل أحد الحلول القابلة للتطبيق في استخدام تحليل البيانات وتقنيات التعلم الآلي للكشف عن المعاملات المشبوهة. يهدف هذا البحث إلى تعزيز اكتشاف الاحتيال في بطاقات الائتمان باستخدام خوارزميات التعلم الآلي ونماذج المجموعة. تم تنفيذ العديد من خوارزميات التعلم الآلي الخاضعة للإشراف بما في ذلك شجرة القرار، والانحدار اللوجستي، وسذاجة بايز، والغابات العشوائية، والشبكة العصبية الاصطناعية، وXGBoost. بالإضافة إلى ذلك، لمعالجة مشكلة مجموعات البيانات غير المتوازنة، تم استخدام عدة طرق لإعادة أخذ العينات، مثل أخذ العينات الناقص والإفراط في أخذ العينات، لتحقيق توازن مجموعة البيانات. علاوة على ذلك، تم تنفيذ تقنيات مختلفة لاختيار البيانات. تم تقييم أداء النماذج باستخدام معايير متنوعة، بما في ذلك الدقة والاستدعاء ودرجة F1 والمنطقة تحت المنحنى (AUC). تم تطوير المزيد من النماذج باستخدام تباين العتبة في النماذج الأفضل أداءً (Random Forest وXGBoost). تم استخدام نماذج التعلم المجمعة لتحسين تنبؤات النماذج لحالات الاحتيال وغير الاحتيال. يتم استخدام ثلاث تقنيات مجمعة (التعبئة، والتعزيز، والتكديس)، للاستفادة من النماذج الأساسية الرئيسية لتعزيز تنبؤات الاحتيال النهائية وتحسين قوة الدقة والاستدعاء ومقاييس الدقة. واستنادًا إلى العمل الإضافي الذي تم تنفيذه، أعطى نموذج المجموعة الذي يستخدم تقنية التعبئة أفضل نتائج الأداء بدقة 0.99، واستدعاء ~0.90، ودقة 0.77. استخدم نموذج المجموعة النماذج الأساسية لشجرة القرار والغابة العشوائية والشبكة العصبية، حيث يستخدم كل منها تقنيات إعادة تشكيل مختلفة.
أعطى هذا النهج تنوع نموذج المجموعة وقوته، مما أثبت فعالية النموذج حتى عند اختباره على مجموعة بيانات غير مرئية.