$5,600 per minute.
That sounds like a long-distance call to the moon. It was actually Gartner’s estimated cost of network downtime. For telecommunication companies (telcos) facing risks of equipment failures, software misconfiguration, network overload, and power outages, annual service outage costs can exceed billions of dollars. How can some of these costs be avoided?
With vast global networks, telcos have a mission-critical imperative to provide seamless connectivity for tens of millions of customers while ensuring optimal performance and swift issue resolution. However, the growing complexity of these networks as they evolve from 3G to 4G and 5G poses challenges for service providers, particularly when it comes to proactively identifying and resolving issues. When mean time to recovery (MTTR) can make the difference between minimal impact and critical outage, every minute counts. Continuously improving MTTR helps ensure a 99.999% SLA, or fewer than 6 minutes of downtime per year.
Telco networks are segmented, and separate teams monitor customer and network health. Each team manages one part of the network but often lacks the holistic view needed to assess the health of the whole. These teams own service-level responsibilities but not the underlying network infrastructure, which can lead to inefficiencies in problem detection and resolution.
Enter predictive customer support. When powered by data streaming, this approach can transform how telcos monitor and address network health and support issues. A data streaming platform ingests and processes real-time data at scale, from customer behavioral data and network performance metrics to subnetwork data from towers and switches. Teams can train predictive models with this data to detect and rectify anomalies before they escalate and impact a large swath of customers. Streaming data for predictive customer support positively impacts business revenue, productivity, and brand reputation:
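The downtime budget implied by an availability target is simple arithmetic, and a short sketch makes the "fewer than 6 minutes" figure concrete. The function names below are illustrative only, and the default cost per minute simply reuses the $5,600 estimate quoted earlier; real outage costs vary widely by operator.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def downtime_cost(minutes: float, cost_per_minute: float = 5600.0) -> float:
    """Estimate outage cost from minutes of downtime and an assumed cost per minute."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.999)
    print(f"Five nines allows about {budget:.2f} minutes of downtime per year")
    print(f"Cost of using the whole budget: ${downtime_cost(budget):,.0f}")
```

At 99.999% availability the budget works out to roughly 5.26 minutes per year, which is why each minute shaved off MTTR matters.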
Ensure SLAs by leveraging predictive insights, such as problems affecting phones of a particular model or software version, and getting ahead of those problems before they affect customers.
Significant time and cost savings from fewer customer support calls and tickets. Each call could otherwise cost thousands to diagnose and fix, requiring expensive escalation that involves numerous teams across customer service, ops/architecture/engineering teams in radio network, core network, and underlay network architecture.
Ability to focus on value-add work, improving resource allocation and accelerating new feature rollout to add more value to the network and deliver new products for greater customer satisfaction.
Increase trust and transparency by communicating to customers that you’re aware of their issue (e.g., dropped calls) and have recommended actions or a fix status update.
Greater customer satisfaction and reduced churn, lowering the risk that frustrated customers leave without ever calling in or reporting an issue.
Real-time and predictive alerting for telco ops teams, automatically flagging abnormal patterns in network or user phones to be fixed (e.g., software updates, restarts).
Continuous monitoring of buggy cell phone software to ensure its behavior is as expected.
Better partnerships with hundreds or thousands of third-party providers with reliable operational-level agreements.
One of the key objectives is to save customers from the frustration of experiencing service disruptions and the hassle of reaching out to support and waiting for resolution. Telcos can demonstrate their commitment to customer satisfaction and build trust by preemptively resolving issues, even before customers are aware of them. To do this, telcos need to overcome some key technical challenges:
Siloed data with difficulties integrating disparate data in real time.
Unprecedented volumes of data, with petabytes per day generated across millions of devices around the world.
Vast, disconnected teams, with the ops team in one part of the business and the architecture team in another, leading to misaligned service because teams lack visibility into each other's work and have no shared view of the data.
Batch ETL/ELT data pipelines with days-long processing leading to stale data.
Legacy technologies such as mainframes, messaging queues, and on-premises databases.
Lack of scalability in running after-the-fact jobs and queries against sensitive, yet operationally important databases.
With Confluent, telcos can analyze real-time data holistically, and train predictive algorithms to identify patterns indicative of potential problems, such as software glitches or network irregularities, enabling swift intervention to avoid widespread disruptions.
To overcome the existing data and infrastructure challenges, telcos can use Confluent’s data streaming platform to:
Stream data on-premises and across any cloud environment to support global 24/7 telco operations. On-prem Confluent Platform enables telco organizations to take action faster, unaffected by potential network outages while Confluent Cloud is the most elastic, resilient, and performant platform powered by Kora Engine. Cluster linking mirrors topics to seamlessly share data across hybrid, multicloud deployments.
Connect data in real time, wherever it may reside across your data architecture. Pre-built, fully managed connectors help build streaming data pipelines to bring together all telemetry data—customer activity, phone model, software, device status and throughput, network, radio technology, cell tower, GPS, calls made. Integrate countless data points from myriad sources for a comprehensive, real-time view of network performance. Bridge operational and analytical data from on-prem relational databases to cloud-native tools such as BigQuery, Grafana, and Datadog.
Process data streams to build live data products that can be used across many teams. Apache Flink® stream processing helps transform and enrich real-time data to train and improve predictive models. For example, the call drop rate in a certain area can be joined with network information, such as health statistics for the serving network node, to detect abnormal spikes and predict issues that are about to develop.
Govern reliably by safeguarding customer and network data with Stream Governance. Leverage Stream Lineage for visibility, client-side field-level encryption and RBAC for security and compliance, and Schema Registry to ensure compatibility as data requirements change. Data Portal allows teams across the telco organization to share data products to build new predictive models faster.
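The call-drop example in the processing step above can be sketched in a few lines. This is a batch illustration in plain Python, not Flink code: it joins per-area call drop rates with node records and flags an area whose latest rate sits far above its recent history. The `NodeHealth` fields, the z-score rule, and the threshold of 3 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class NodeHealth:
    node_id: str
    area: str
    cpu_load_pct: float  # hypothetical health statistic for the node


def is_spike(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` when it is more than `threshold` standard deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest > mu
    return (latest - mu) / sigma > threshold


def enrich_and_flag(history_by_area, latest_by_area, nodes):
    """Join each node with its area's call drop rate and keep only abnormal spikes."""
    flagged = []
    for node in nodes:
        latest = latest_by_area.get(node.area)
        history = history_by_area.get(node.area, [])
        if latest is not None and is_spike(history, latest):
            flagged.append({
                "node_id": node.node_id,
                "area": node.area,
                "call_drop_rate": latest,
                "cpu_load_pct": node.cpu_load_pct,
            })
    return flagged
```

In a streaming deployment the history would be a windowed aggregate maintained by the stream processor rather than an in-memory list.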
By fixing network issues before they cascade into widespread outages, telcos can mitigate churn risk and preserve their service-level agreements for customers, ensuring minimal downtime and rapid MTTR.
The diagram below illustrates the hybrid deployment architecture and streaming data pipelines for this predictive customer service use case on Confluent Platform and Confluent Cloud:
The implementation comprises the following:
Customer and network telemetry data (e.g., call drop rate, signal strength) are stored in a data center. Connectors for IBM MQ, Splunk, PostgreSQL, and SQL Server write data in real time to topics in Confluent Platform.
Cluster linking mirrors topics from Confluent Platform into Confluent Cloud. There, stream processing joins data streams to create data products: network_perf_enriched, subscriber_perf_enriched, call_perf_enriched, and subscriber_perf_issues.
Enriched data is sent downstream to GCS, BigQuery, etc. to predictively analyze KPIs as they come in and fire off alerts when a node or customer is detected as being unhealthy due to a spike in call drop rate.
Enriched data trains machine learning (ML) models and powers real-time dashboards.
Predictive alerts notify teams via email, Slack, or SMS (e.g., PagerDuty) as well as network management systems (i.e., NOCC) to take immediate action.
The following provides a closer look at stream processing. Call failure rate is not simply how many times a customer fails to make or receive a call. Rather, we need to know if a call failure is due to a network issue or whether it’s an isolated mobile device issue. After a call ends, it releases a clear code to the network, which is captured and stored in a database. Ops teams can stream process data to detect fluctuations in this clear code, to determine normal/abnormal clearing. Raw data containing customer PII can be structured and cleaned in order to be reusable by other teams. In the diagram below, stream processing is used to join RAN, Core, and underlay network performance data to form an enriched view. This is then joined with clear codes to form a holistic view of call_perf_enriched. This can further be combined with subscriber data to identify affected customers.
Throughout this process, raw real-time data is continuously transformed into data products (in green). Once ops teams share these data products, data science teams can then immediately use them to build predictive models.
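The chain of joins described above can be sketched with plain Python dictionaries. This is a minimal batch illustration, assuming every record carries a `node_id` or `subscriber_id` key; the field names and the `"NORMAL"` clear-code value are hypothetical, and a real pipeline would express these joins in the stream processor over continuously arriving data.

```python
def join_on(key, left, right):
    """Inner-join two lists of dicts on `key`; `right` must have unique keys."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def build_data_products(ran, core, underlay, clear_codes, subscribers):
    """Return (network_perf_enriched, call_perf_enriched, affected_subscribers)."""
    # RAN + Core + underlay metrics form one enriched view per network node.
    network_perf_enriched = join_on("node_id", join_on("node_id", ran, core), underlay)
    # Each call's clear code is joined with the health of the node that served it.
    call_perf_enriched = join_on("node_id", clear_codes, network_perf_enriched)
    # Subscriber data identifies who was affected by abnormally cleared calls.
    affected_subscribers = [
        row
        for row in join_on("subscriber_id", call_perf_enriched, subscribers)
        if row["clear_code"] != "NORMAL"
    ]
    return network_perf_enriched, call_perf_enriched, affected_subscribers


if __name__ == "__main__":
    ran = [{"node_id": "n1", "ran_drop_rate": 0.04}]
    core = [{"node_id": "n1", "core_latency_ms": 80}]
    underlay = [{"node_id": "n1", "link_util_pct": 95}]
    clear_codes = [
        {"call_id": "c1", "node_id": "n1", "subscriber_id": "s1", "clear_code": "ABNORMAL"},
        {"call_id": "c2", "node_id": "n1", "subscriber_id": "s2", "clear_code": "NORMAL"},
    ]
    subscribers = [
        {"subscriber_id": "s1", "phone_model": "model-a"},
        {"subscriber_id": "s2", "phone_model": "model-b"},
    ]
    for row in build_data_products(ran, core, underlay, clear_codes, subscribers)[2]:
        print(row)
```

Each intermediate result corresponds to a reusable data product, so another team could consume `call_perf_enriched` without repeating the upstream joins.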
A Flink SQL query fits this use case well. It continuously processes real-time data the moment it’s written to the subscriber_perf_issues topic, filtering for call success rates that dip below 100% with non-normal clear codes. This enables telcos to instantly find mobile users who are experiencing problems across all areas of visibility (RAN network, Core network, and actual call results), which indicates an unhealthy network situation. Ops teams can immediately take action to proactively fix the issue before success rates decline further and affect more customers.
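The filter described above can be illustrated with a small stand-in. This is not Flink SQL from a production pipeline: it runs a one-off SELECT against an in-memory SQLite table purely to show the predicate, whereas a streaming query would evaluate the same condition continuously as records arrive. The column names (`call_success_rate`, `clear_code_status`) and the `'NORMAL'` value are assumptions; only the subscriber_perf_issues name comes from the text.

```python
import sqlite3

# Hypothetical schema for the subscriber_perf_issues data product.
CREATE_TABLE = """
CREATE TABLE subscriber_perf_issues (
    subscriber_id TEXT,
    node_id TEXT,
    call_success_rate REAL,
    clear_code_status TEXT
)
"""

# Keep subscribers whose success rate dipped below 100% with a non-normal clear code.
QUERY = """
SELECT subscriber_id, node_id, call_success_rate, clear_code_status
FROM subscriber_perf_issues
WHERE call_success_rate < 100
  AND clear_code_status <> 'NORMAL'
"""


def find_affected(rows):
    """Load rows into an in-memory table and return those matching the filter."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(CREATE_TABLE)
        conn.executemany("INSERT INTO subscriber_perf_issues VALUES (?, ?, ?, ?)", rows)
        return sorted(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("s1", "n1", 100.0, "NORMAL"),
        ("s2", "n1", 92.5, "ABNORMAL"),
        ("s3", "n2", 97.0, "NORMAL"),
        ("s4", "n2", 88.0, "ABNORMAL"),
    ]
    for row in find_affected(sample):
        print(row)
```

Requiring both conditions avoids flagging isolated device problems, where a call fails but the network cleared it normally.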
Predictive customer support on a data streaming platform enables telcos to leverage real-time insights to improve SLAs, making products more appealing to customers and gaining a competitive edge in the market. Moreover, it enables telco organizations to streamline operations, save customers and support teams time, and lower operational expenses. The reduced TCO and greater ROI free up valuable resources, allowing development teams to focus on innovation, build new products and features around 5G, and work with phone vendors to enrich network services.
In essence, predictive customer support signals a new era of proactive network management, where data-driven insights empower telcos to anticipate and resolve issues swiftly, bolstering customer satisfaction and operational efficiency. As telcos embrace this approach, they can optimize key performance indicators against industry benchmarks, staying ahead of the curve and delivering superior services to drive sustainable growth in a highly competitive market.