If you’ve never received spam or a scam over any means of communication, you are probably stuck in 1975! Bad actors use every possible channel to achieve their ends, and telcos provide them the perfect conduit. Fighting this plague is not an easy task. In this blog, I will describe an approach, already implemented by cutting-edge providers, for fighting SMS spam in real time with AI technology.
Before diving in, let’s take a quick trip down memory lane. When SMS spam took off over a decade ago, it quickly became a huge problem. It was all over the news, impacting everybody, and it still is. Everyone was trying to come up with a solution: third parties, cell phone manufacturers, Apple with iOS, Google with Android, etc. That said, the biggest impact was on telcos. Not only did spam create a bad customer experience, it also overloaded telco networks and the SMSC (short message service center), driving up costs because of the additional capacity required to cope with growing but non-value-add traffic.
While content-scanning solutions exist to block specific messages, they are complex and costly, and privacy policies often make them difficult to use. On top of this, blocking one message at a time only improves the customer experience; it does not reduce the traffic generated by spam, since messages reach the network before being blocked. Because of this limitation, solutions like pattern recognition were implemented based on SMS metadata to detect bad actors and disable their SIM cards, preventing them from sending messages altogether.
The first versions of this pattern detection were rule-based solutions operating on metrics (number of messages per minute, message size, etc.) calculated in batches. Spammers, being smart, figured out the patterns and modified their behavior to circumvent the rules. The approach was very successful at first, but quickly became unmanageable as it turned into a game of hide-and-seek between the spammers and the teams maintaining the rules (point 6 in image 1). Operational costs went up, and new solutions had to be implemented.
With the advent of machine learning, the rule-based algorithms started getting replaced. Detection improved greatly, and ML eliminated the need to update rules manually. While spammers realized it was becoming more and more difficult to fly under the radar, detection in these ML-based solutions was still done in batches. Spammers came to realize they had a limited time window, between batches, in which to act, and started blasting spam heavily within that “allowed” window before being blocked. Batch processing was continually optimized (point 6 in image 2) until each batch started immediately after the previous one completed; at that point, the batch cycle time itself became the hard limit on how far the “allowed” window could be shrunk. The use of AI with these ML-based solutions has brought detection to the next level, but batch processing remains an obstacle to improving the spam problem. What’s next to clear this hurdle?
To unleash the full power of artificial intelligence, batches must be removed from the equation. Data must transit, be transformed and enriched, and features must be calculated, all in real time. To achieve this, the SMSC sends records to Confluent as they are created (1). To enrich the data, information from various databases must be brought into Confluent as well. Connectors are pre-built software integrations that facilitate data movement from sources or to sinks (2).
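To illustrate step (1), here is a minimal Java producer sketch. The topic name (sms-events), the JSON event shape, and the credential placeholders are assumptions for this example, not any provider’s actual implementation; keying by sender keeps each sender’s events in one partition, which simplifies per-sender metrics downstream.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SmsEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Confluent Cloud endpoint and credentials (placeholders).
        props.put("bootstrap.servers", "<BOOTSTRAP_SERVER>");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
          + "username=\"<API_KEY>\" password=\"<API_SECRET>\";");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by sender MSISDN so all of a sender's events land in the
            // same partition for local per-sender aggregation downstream.
            String senderMsisdn = "+15550100123";
            String event = "{\"sender\":\"" + senderMsisdn + "\","
                         + "\"timestamp\":1700000000000,\"size\":112}";
            producer.send(new ProducerRecord<>("sms-events", senderMsisdn, event));
        }
    }
}
```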
As an example, if your customer records are held in an Oracle database, changes to a profile are captured by the Oracle CDC Source Connector and written to a topic in Confluent Cloud. As some of the data might contain sensitive information, field-level encryption can be used to protect it. Then, leveraging stream processing (Kafka Streams, ksqlDB, or Flink), the metrics are calculated in real time over a specified time window, and the data is enriched with data fetched from the various sources (3).

With the prepared data, the application running the machine learning model evaluates all the features provided with the metrics and returns a score used to detect anomalies (4). Based on that score, a score-handling microservice can decide whether a specific customer has to be blocked and call the blocking API. As regulators may require an audit trail of what happened, a Snowflake sink connector can also be leveraged to write the data to a table for forensic and reporting purposes. Having this data in a central nervous system makes it super easy to reuse for different use cases. Finally, everything in this solution is automated; no manual tuning is required, reducing OPEX cost.
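To make step (3) concrete, here is a minimal Kafka Streams sketch of one such metric: messages per sender per minute. The topic names (sms-events, sms-metrics) and the string-encoded output are assumptions for illustration; a real deployment would compute many more features and likely use Avro or Protobuf with Schema Registry.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class SmsMetricsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sms-metrics");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVER>");

        StreamsBuilder builder = new StreamsBuilder();

        // Messages per sender per 1-minute tumbling window: one of the
        // features later handed to the anomaly detection model.
        builder.stream("sms-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               .toStream()
               // Unwrap the windowed key back to the sender MSISDN and publish
               // the count for the scoring application to consume.
               .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(),
                                                          String.valueOf(count)))
               .to("sms-metrics", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```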
The benefits of this approach are the following:
Spammers can be blocked in seconds instead of hours, making it unprofitable for them to continue their attacks.
Customer experience is improved by reducing unsolicited messages.
Network load is reduced by cutting non-value-add traffic, hence reducing cost.
Looking at it from a data/topic flow perspective (numbers from image 3, matching numbers from image 4), it would look something like this. (All the data is available within Confluent.)
The data stored in the “Data & Metrics” bucket can also be forwarded to the AI platform so the model can be retrained on a periodic basis without impacting the anomaly detection flow.
Image 5 combines images 3 and 4 into a single high-level architecture diagram of the whole use case. Confluent Cloud is a fully managed, cloud-agnostic solution that can run in AWS, Azure, or GCP, and private networking enables secure connectivity with your VPC or VNet. Since security is a key topic for telcos, you can check out our Confluent Cloud security portal to get all the relevant information.
The quicker you get to the data, the faster you can act on it. Telcos can no longer wait for batches of data to be processed before taking action. We live in a real-time world: there’s no more time for batches!
Having a central nervous system that makes data readily accessible drives new approaches and capabilities. Take the example above: why would sales data matter for detecting spammers? Well, spammers usually buy prepaid SIM cards in bundles. They are cheap, anonymous, and quickly swapped when blocked. If we can pair a sales ID with the SIM card IDs it sold, detecting two or three SIMs from the same sales ID as spammers makes the other SIMs in that bundle highly suspicious, allowing our ML model to learn from this and return a higher anomaly score, as sketched below.
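Here is a minimal Kafka Streams sketch of that correlation, assuming two hypothetical topics: sim-activations (SIM ID to sales ID, fed by a sales-system connector) and blocked-sims (SIMs flagged by the score handler). The topic names and the join shape are illustrative assumptions, not the provider’s actual pipeline.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class SalesBundleCorrelation {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-bundle-correlation");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVER>");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Table of SIM ID -> sales ID, fed by the sales/CRM source connector.
        KTable<String, String> simToSale = builder.table("sim-activations");

        // Each blocked SIM, joined to its sales ID, becomes a signal that the
        // other SIMs sold under the same sales ID deserve a higher anomaly score.
        builder.<String, String>stream("blocked-sims")
               .join(simToSale, (blockEvent, salesId) -> salesId)
               // Re-key by sales ID so downstream consumers can fan out to
               // every sibling SIM in the bundle.
               .selectKey((simId, salesId) -> salesId)
               .to("suspicious-sales-ids");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Downstream, a consumer of suspicious-sales-ids could look up the sibling SIMs for each flagged sales ID and feed that signal back to the model as an extra feature.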
AI requires as much data as possible: the more data you have in a central location, the better the predictions will be. It then becomes super easy to transform the data into a data product available for any AI model to use. This is what Confluent brings with the concept of a central nervous system. Not only is the data available in real time, it can also be retained indefinitely for any new use cases that might need it. The Stream Governance Advanced package makes it easy to tag data and add business metadata, making it discoverable for the rest of the organization.
The use case described in this blog is just one use case amongst a million possible others. The only limit here is your imagination!
I’d highly recommend taking a look at Confluent’s full library of solutions and use cases. And before you read too much more, give it a try yourself with Confluent Cloud’s free trial and launch your very own cluster on AWS, Google Cloud, or Azure.