SMS spam detection by operating on byte-level distributions using Hidden Markov Models

M. Zubair Rafique Next Generation Intelligent Networks Research Center
Muddassar Farooq Next Generation Intelligent Networks Research Center


The volume of spam SMS received by mobile users has increased dramatically in recent years. SMS is an attractive channel for spam and is widely exploited through arbitrary advertising campaigns and the propagation of scam schemes. This growing threat can be controlled through efficient and robust filtering systems. Filtering SMS spam is a significant challenge because SMS has a specified syntax and structure, and because the filtering module must execute on resource-constrained mobile devices.

In this paper we present a novel method that uses the underlying byte-level data coding scheme of SMS to detect spam messages. Our proposed scheme is robust to word adulteration techniques and language transformations because it works on the GSM layer of the mobile phone. The framework first builds a model of the byte-level distributions of benign and spam messages and then transforms the features of these models into Hidden Markov Models (HMMs). This process leads to a new learning algorithm for the classification of spam SMS, which is based on the probabilistic variation from the trained models. Our framework is lightweight, requiring little processing power and memory, and hence can easily be deployed on mobile devices.
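As a rough illustration of classifying a message by how well its byte sequence fits a per-class model, the sketch below is a minimal example and not the framework described above. It makes several assumptions: it swaps the HMMs for a simpler first-order (visible) Markov chain over bytes, it uses add-one smoothing, it works on plain Python byte strings rather than the GSM-layer encoding, and the function names and toy messages are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = 256  # number of possible byte values


def train_byte_chain(messages):
    """Count byte-to-byte transitions over a list of byte strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for msg in messages:
        for prev, cur in zip(msg, msg[1:]):
            counts[prev][cur] += 1
    return counts


def log_likelihood(model, msg):
    """Sum of log transition probabilities of msg, with add-one smoothing."""
    total = 0.0
    for prev, cur in zip(msg, msg[1:]):
        row = model.get(prev, {})  # .get avoids creating new rows
        row_sum = sum(row.values())
        total += math.log((row.get(cur, 0) + 1) / (row_sum + ALPHABET))
    return total


def classify(msg, spam_model, benign_model):
    """Pick the class whose byte-transition model fits the message better."""
    spam_score = log_likelihood(spam_model, msg)
    benign_score = log_likelihood(benign_model, msg)
    return "spam" if spam_score > benign_score else "benign"


# Toy training data, invented purely for illustration.
spam_model = train_byte_chain([b"WIN a FREE prize now",
                               b"FREE cash prize, call now"])
benign_model = train_byte_chain([b"see you at lunch",
                                 b"call me when you are home"])

print(classify(b"FREE prize now", spam_model, benign_model))
print(classify(b"see you at home", spam_model, benign_model))
```

Because the models are built from byte transitions rather than words, a sketch like this needs no tokenizer or dictionary, which is the property that makes byte-level approaches appealing on constrained devices.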

The results of carefully designed experiments, which account for rigorous test cases, demonstrate that our framework provides a high detection rate and a low false alarm rate in the classification of spam SMS. We report our experiments on real-world benign and spam datasets collected from Grumbletext and through various social communities.