New DarkBERT AI Trained on Data From Hackers & Cyber Criminals
We are still witnessing the early stages of the AI wave set off by the release of Large Language Models (LLMs) like ChatGPT, and the new DarkBERT AI is part of it. The availability of open-source GPT (Generative Pre-trained Transformer) models has led to a surge in AI applications. It is worth noting, however, that LLMs, including ChatGPT, can also be misused, for example to develop advanced forms of malware.
Over time, the use of applied LLMs will continue to grow, with each model specializing in a particular area and trained on carefully curated data for a specific purpose. A recent example is DarkBERT, developed by South Korean researchers and trained on data obtained directly from the dark web. You can find more information about DarkBERT in the linked release paper, which also provides an introduction to the dark web itself.
Untapped Performance Potential
DarkBERT is built upon the RoBERTa architecture, a transformer-based language model first released in 2019. Surprisingly, the researchers found that the model had untapped performance potential beyond what was extracted from it at its 2019 release; the original model, it appears, was undertrained rather than fully optimized.
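To make the architecture concrete, below is a minimal sketch of loading the stock RoBERTa model with the Hugging Face transformers library. DarkBERT keeps this architecture but swaps in weights pretrained on dark web text; the checkpoint used here is the public roberta-base, not DarkBERT itself.

```python
# Minimal sketch: load the stock RoBERTa encoder with Hugging Face
# Transformers. DarkBERT reuses this architecture with dark-web
# pretrained weights; "roberta-base" here is the public checkpoint.
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

inputs = tokenizer("Onion services host hidden marketplaces.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```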
To train DarkBERT, the researchers crawled the dark web through the anonymizing Tor network. They then filtered the raw data, using techniques such as deduplication, category balancing, and data pre-processing, to create a text corpus specific to the dark web. This corpus was used to pretrain the RoBERTa-based model, enabling DarkBERT to analyze new content from the dark web, which is often written in its own dialects and heavily coded messages, and extract valuable information from it.
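As an illustration of that filtering stage, here is a hedged Python sketch of exact-duplicate removal and basic text cleanup of crawled pages. The function names and length threshold are assumptions for illustration, not the researchers' actual pipeline, and category balancing is omitted for brevity.

```python
# Illustrative sketch of corpus filtering of the kind described above:
# exact-duplicate removal plus basic text cleanup of crawled pages.
# Function names and the length threshold are assumptions, not the
# researchers' actual code; category balancing is omitted for brevity.
import hashlib

def deduplicate(pages: list[str]) -> list[str]:
    """Drop exact duplicate pages by hashing their full text."""
    seen, unique = set(), []
    for text in pages:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

def preprocess(text: str) -> str:
    """Minimal cleanup: strip very short lines and collapse whitespace."""
    lines = (line.strip() for line in text.splitlines())
    return " ".join(line for line in lines if len(line) > 20)

raw_pages = ["...crawled page 1...", "...crawled page 2..."]
corpus = [preprocess(page) for page in deduplicate(raw_pages)]
```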
Dark Web Based Model
While it wouldn’t be entirely accurate to call English the business language of the dark web, the language used there has its own distinctive characteristics, which necessitated training a dedicated LLM like DarkBERT. The researchers have demonstrated that DarkBERT outperforms other large language models on dark web text, giving security researchers and law enforcement agencies a tool to delve deeper into the hidden corners of the web, where much of the illicit activity takes place.
Similar to other LLMs, DarkBERT is a work in progress, and further training and fine-tuning can enhance its performance. Its ultimate applications and the knowledge it can uncover are yet to be fully explored and understood.
Understanding and navigating the unindexed parts of the internet, which are hidden from search engines like Google and commonly accessed through special software, is a challenging endeavor.
In a paper titled “DarkBERT: A Language Model for the Dark Side of the Internet,” the researchers describe their approach: they crawled the dark web through the Tor network, the system used to access it, and compiled a comprehensive database of the raw pages they encountered.
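For context, here is a hedged sketch of how a page can be fetched over Tor from Python. It assumes a local Tor daemon listening on the default SOCKS port 9050 and that requests was installed with SOCKS support (pip install requests[socks]); the onion address is a placeholder, and this is not the researchers' crawler.

```python
# Hedged sketch: fetch a page over Tor with the requests library.
# Assumes a local Tor daemon on the default SOCKS port 9050 and
# requests installed with SOCKS support (pip install requests[socks]).
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h resolves .onion names inside Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_via_tor(url: str) -> str:
    response = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    response.raise_for_status()
    return response.text

# html = fetch_via_tor("http://example.onion/")  # placeholder onion address
```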
Cybercrime Fighter
According to the team, their new LLM demonstrated superior comprehension of the dark web compared to other models designed for similar tasks, including RoBERTa, a language model developed by Facebook researchers in 2019 that learns by predicting deliberately concealed (masked) sections of text within unannotated language examples.
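To make that prediction objective concrete, the short demo below runs the public roberta-base checkpoint (not DarkBERT) through the Hugging Face fill-mask pipeline, which asks the model to fill in a deliberately masked token.

```python
# Demonstrates the masked-language-modeling objective described above:
# RoBERTa predicts a deliberately hidden <mask> token from its context.
# Uses the public roberta-base checkpoint, not DarkBERT.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="roberta-base")
for prediction in unmasker("The stolen data was posted on a dark web <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```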
“Our evaluation results indicate that the DarkBERT-based classification model surpasses those of well-known pretrained language models,” the researchers stated in their paper.
The team envisions various cybersecurity applications for DarkBERT, such as detecting websites involved in selling ransomware or leaking confidential data. It could also be deployed to scour the vast number of dark web forums and monitor them for the illicit information exchanged there daily.
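As a rough illustration of that kind of monitoring, the sketch below wires a RoBERTa-style encoder to a two-label classification head that could be fine-tuned to flag suspicious pages. The checkpoint name, labels, and example text are placeholders; DarkBERT's actual fine-tuned classifiers are not shown here.

```python
# Hypothetical sketch of page classification for dark web monitoring:
# a RoBERTa-style encoder with a two-label head (e.g. "leak site" vs.
# "benign"). Checkpoint, labels, and example text are placeholders; the
# head is untrained, so outputs are meaningless until fine-tuned.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

page_text = "Full database dump for sale, payment in Monero only."
inputs = tokenizer(page_text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))  # probabilities; untrained head for now
```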
While we await practical implementation, it is important to question whether we truly desire AI to assume a policing role on the internet, even if the system performs as intended.