High-performance non-negative matrix factorization for intrusion detection system.

Identifier

Al-Farsi, Ahmed Saleh Sulaiman (2023). High-performance non-negative matrix factorization for intrusion detection system. (Master's thesis, Sultan Qaboos University, Muscat, Oman).

Author

Al-Farsi, Ahmed Saleh Sulaiman.

Other titles

عامل المصفوفة غير السلبي عالي الأداء لنظام كشف التسلل

Publisher

Sultan Qaboos University.

Gregorian

2023

Language

English

Subject

Computer networks--Security measures--Oman

Computer security--Oman

English abstract

With the emergence of connected objects and the Internet of Things (IoT), millions of networked users generate massive network traffic datasets. These datasets (Big Data) are challenging to store, process, and analyze on a regular computer. They also contain millions of cyber-attacks, and reports published by CERT institutions around the world show a steady increase in the number of attacks, so an efficient intrusion detection system that can handle data at this scale with fast, accurate detection is a must. Many intrusion detection systems have been proposed for network traffic datasets. Among them, systems based on machine learning algorithms such as SVM, KNN, and K-means have proven highly efficient and have given promising results; however, regular machine learning algorithms suffer from slow training and testing when the dataset is large. In this study, an intrusion detection system was built on distributed parallel non-negative matrix factorization (NMF) and run on Luban, the high-performance computing system of Sultan Qaboos University. We used the Message Passing Interface (MPI) for inter-processor communication. The algorithm keeps the input matrix A and the factor matrices W and H in memory, divided across the processors, and the matrices are distributed carefully to avoid unnecessary communication. Our parallel NMF achieved excellent training speedup on very large datasets as the number of processors increased. Two datasets were used to verify the performance of the proposed solution: KDD99 and CIC. In our experiments, the proposed solution gave better results than traditional ML-based intrusion detection systems: we trained the model on datasets of one million samples in only 31 seconds and obtained an excellent detection accuracy of 97%.
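The data distribution described in the abstract can be illustrated with a minimal sketch. It is not the thesis code. It assumes the standard multiplicative-update rules for NMF, a row-block split of A and W, and a replicated H (the thesis also divides H across processors). Processors are simulated with an ordinary loop, and the summation marked in the comments is where an MPI all-reduce would go. All function names are hypothetical.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mult_update(M, numer, denom, eps=1e-9):
    """Element-wise multiplicative update M * numer / (denom + eps)."""
    return [[m * n / (d + eps) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, numer, denom)]

def frob_error(blocks_A, blocks_W, H):
    """Squared Frobenius norm of A - W H, summed over all row blocks."""
    err = 0.0
    for A_i, W_i in zip(blocks_A, blocks_W):
        WH = matmul(W_i, H)
        err += sum((a - b) ** 2 for ra, rb in zip(A_i, WH) for a, b in zip(ra, rb))
    return err

def nmf_row_blocks(A, k, workers, iters, seed=0):
    """Factor A (m x n) into W (m x k) and H (k x n); requires workers <= m."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Split the rows of A into contiguous blocks, one per simulated worker.
    bounds = [m * p // workers for p in range(workers + 1)]
    blocks_A = [A[bounds[p]:bounds[p + 1]] for p in range(workers)]
    blocks_W = [[[rng.random() + 0.1 for _ in range(k)] for _ in blk]
                for blk in blocks_A]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    errors = [frob_error(blocks_A, blocks_W, H)]
    for _ in range(iters):
        # Each worker forms small local products (k x n and k x k).
        # Summing them here stands in for an MPI all-reduce.
        WtA = [[0.0] * n for _ in range(k)]
        WtW = [[0.0] * k for _ in range(k)]
        for A_i, W_i in zip(blocks_A, blocks_W):
            Wt = transpose(W_i)
            WtA = add(WtA, matmul(Wt, A_i))
            WtW = add(WtW, matmul(Wt, W_i))
        H = mult_update(H, WtA, matmul(WtW, H))
        # The W update uses only the local block and H, so no rows of A
        # or W are exchanged between workers.
        Ht = transpose(H)
        HHt = matmul(H, Ht)
        blocks_W = [mult_update(W_i, matmul(A_i, Ht), matmul(W_i, HHt))
                    for A_i, W_i in zip(blocks_A, blocks_W)]
        errors.append(frob_error(blocks_A, blocks_W, H))
    W = [row for blk in blocks_W for row in blk]
    return W, H, errors

if __name__ == "__main__":
    data_rng = random.Random(1)
    A = [[data_rng.random() for _ in range(6)] for _ in range(12)]
    W, H, errors = nmf_row_blocks(A, k=3, workers=3, iters=50)
    print("initial error:", errors[0], "final error:", errors[-1])
```

In this layout the only quantities that cross worker boundaries are the k-by-n and k-by-k partial products, which are small when k is much smaller than the number of samples. This is one common way to limit communication; the scheme actually used in the thesis may differ.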

Member of

Theses and Dissertations

Resource URL

https://hdl.handle.net/20.500.12408/12855

Arabic abstract

مع ظهور الكائنات المتصلة وإنترنت الأشياء (IoT)، ينتج ملايين المستخدمين المتصلين بالشبكة مجموعات بيانات ضخمة لحركة مرور الشبكة. تشكل مجموعات البيانات الضخمة هذه لحركة مرور الشبكة (البيانات الكبيرة) تحديًا لتخزينها والتعامل معها وتحليلها باستخدام جهاز كمبيوتر عادي. بالإضافة إلى ذلك، تحتوي هذه الملفات كبيرة الأبعاد على ملايين الهجمات الإلكترونية، حيث أثبتت التقارير المنشورة من قبل مؤسسات CERT حول العالم زيادة مطردة في عدد الهجمات الإلكترونية، وبالتالي بناء نظام فعال للكشف عن التسلل يمكنه التعامل مع هذه البيانات الضخمة (البيانات الكبيرة) ولديه قدرة كشف عالية وسريعة ودقة أمر مهم جدًا. تم اقتراح العديد من أنظمة الكشف عن التسلل للتعامل مع مجموعات بيانات حركة مرور الشبكة. من بين هذه الحلول أنظمة كشف التسلل التي تعتمد على خوارزميات التعلم الآلي، والتي أثبتت فعاليتها العالية في اكتشاف التسلل وقدمت نتائج واعدة. مثال على بعض هذه الخوارزميات هو SVM و KNN و K-means؛ ومع ذلك، تعاني خوارزميات التعلم الآلي العادية من بطء التدريب والاختبار عندما يكون حجم مجموعة البيانات كبيرًا. في هذه الدراسة، تم بناء نظام كشف التسلل بناءً على عامل المصفوفة الموزعة غير السلبي، والذي تم إنشاؤه في نظام الحوسبة عالية الأداء (لبان) بجامعة السلطان قابوس. تم تصميم الخوارزمية بحيث يتم تقسيم جميع المصفوفات A (مصفوفة الإدخال) و W و H في الذاكرة على المعالجات؛ تم تصميم طريقة توزيع المصفوفات بين المعالجات بعناية لتجنب الاتصالات غير الضرورية. استخدمنا واجهة تمرير الرسائل MPI للاتصالات بين المعالجات. أعطانا NMF الموازي نتائج تسريع تدريب ممتازة بينما قمنا بزيادة عدد المعالجات في مجموعة بيانات واسعة. تم استخدام مجموعتي بيانات للتحقق من أداء الحل المقترح: KDD99 و CIC. أثبتت التجارب التي أجريناها على الحل المقترح أنه يعطي نتائج أفضل من أنظمة الكشف عن التسلل التقليدية القائمة على ML. ومن ثم، تمكنا من تدريب النموذج بمجموعات بيانات من مليون عينة في 31 ثانية فقط، وحصلنا على معدل دقة كشف ممتاز بلغ 97٪.