English Abstract
With the freedom offered by the Deep Web, people have the opportunity to express themselves freely and discreetly, and sadly, this is one of the reasons why people carry out illicit activities there. In this work, a novel dataset of active Dark Web domains, known as crawler-DB, is presented. To build crawler-DB, The Onion Routing network (Tor) was sampled, and then a web crawler capable of following links was built. The link addresses gathered by the crawler are then automatically classified into five classes. The algorithm built in this study demonstrated good performance, achieving an accuracy of 85%.
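As a purely illustrative sketch (not the crawler built in this study), the following Python code shows one way such a link-following crawler over Tor could look, assuming the requests library with SOCKS support, beautifulsoup4, and a local Tor SOCKS proxy on port 9050; the function name, seed URL handling, and page limit are hypothetical.

```python
# Illustrative sketch only -- not the crawler described in this work.
# Assumes Tor exposes a SOCKS proxy on 127.0.0.1:9050 and that
# requests[socks] and beautifulsoup4 are installed.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


def crawl_onion(seed_url, max_pages=50):
    """Breadth-first crawl that follows links and collects .onion addresses."""
    seen, found = set(), set()
    queue = deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, proxies=TOR_PROXIES, timeout=30)
        except requests.RequestException:
            continue  # unreachable hidden services are simply skipped
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            host = urlparse(link).hostname
            if host and host.endswith(".onion"):
                found.add(link)   # candidate address for the dataset
                queue.append(link)
    return found
```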
A popular text representation method was combined with the proposed crawler-DB and two different supervised classifiers to facilitate the categorization of Tor concealed services. The results of the experiments conducted in this study show that using the Term Frequency-Inverse Document Frequency (TF-IDF) word representation with a linear support vector classifier (SVC) achieves 91% five-fold cross-validation accuracy when classifying a subset of illegal activities from crawler-DB, while Naïve Bayes achieves an accuracy of 80.6%. The good performance of the linear SVC could support tools that help the authorities detect these activities. Moreover, the outcomes are expected to be significant for both practical and theoretical aspects, and they may pave the way for further research.
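To make the reported evaluation setup concrete, the sketch below assumes scikit-learn is used: TF-IDF features are fed to a linear support vector classifier and to a multinomial Naïve Bayes classifier, each scored with five-fold cross-validation. The texts and labels arguments stand in for the crawler-DB documents and their class labels; the function name is hypothetical and the snippet is not the study's actual code.

```python
# Illustrative sketch of a TF-IDF + classifier comparison with 5-fold CV,
# assuming scikit-learn; `texts` and `labels` are supplied by the caller.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(texts, labels):
    """Report mean 5-fold cross-validation accuracy for two TF-IDF pipelines."""
    for name, clf in [("Linear SVC", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
        model = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(model, texts, labels, cv=5)  # accuracy per fold
        print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Wrapping the vectorizer and the classifier in a single pipeline ensures the TF-IDF vocabulary is refit on each training fold, so the cross-validation accuracy is not inflated by information from the held-out fold.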