
How BT Group Built a Smart Event Mesh with Confluent


BT Group's Smart Event Mesh: Centralized Event Streaming With Decentralized Customer Experience, Automation, and a Foundation Built on Confluent.

BT Group is a British multinational telecommunications holding company headquartered in London, England. It has operations in around 180 countries and is one of the largest telecommunications companies in the world, providing a range of products and services including fixed-line, broadband, mobile, and TV.

BT's Digital unit is responsible for leading BT Group’s digital transformation, driving innovation, and delivering the products and services that customers need.

The Digital unit works collaboratively with BT Group’s customer-facing and corporate units, and has an Architecture function that owns the technology vision across the organization, taking a direct role in creating and delivering strategic cloud, integration, software engineering, and test capabilities.

The Integration sub-domain within the Architecture unit provides robust, scalable, and automated platforms and solutions where services, events, and data are completely democratized, readily discoverable, easily accessed, and self-served.  This includes a range of components from APIs to Microservices, ETL, Managed File Transfers, and Event Streaming.

Background

In 2020, the Digital unit's integration architecture lacked a key capability: large-scale event streaming that could move data across the application landscape at speed and at scale.

The group came to understand that event-sourced, real-time situational awareness was a key characteristic of many modern business solutions, and many internal requirements included some form of low-latency event processing, whether confined to on-premises applications, applications in the cloud, or third-party/SaaS services.

An event streaming platform is an enabler to the business, encouraging more freedom and flexibility to explore wider use cases that wouldn't otherwise be possible with batch data.  It provides a step-change experience with improved visibility, event democratization, and the reusability of near real-time data, direct from strategic sources of truth in order to facilitate accurate decision-making, improved insight opportunities, and an overall richer customer experience.

At the time, BT Group's cloud adoption was already established with the AWS platform, which was actively used for application modernization and hosting.  A decision had also been made to explore Google Cloud Platform (GCP) for business intelligence, analytics, and AI/ML use cases.

The Integration team had a clear remit—design and deliver a connected event streaming backbone to support BT Group's entire hybrid cloud landscape.

Why Confluent?

Apache Kafka® provides an industry standard foundation for data streaming, with Confluent continually building on this with committer-driven expertise, bringing new fully supported, enterprise-grade products, functionality, and services.

Below are some of the reasons the Integration team opted for a Confluent-based platform:

- Resilience - A single, multi-regional cluster deployed across data centers mitigates concerns about cluster availability, minimizes the impact on interfacing applications, and maintains operational stability for all connected services.

- Flexibility - Native integration via the Kafka protocol for a multitude of development languages (Java, non-Java, REST) provides flexibility to a vast array of BT applications needing to integrate with Kafka.

- Connect Architecture - Kafka Connect offers 120+ pre-built configuration-driven connectors that can be deployed out of the box for sources and sinks to produce to or consume from Kafka with little effort or overhead.

- Scalability and Performance - Having clusters capable of supporting all customer-facing units means handling voluminous throughput, low latencies, and multiple workload profiles in a manner that is easy to deploy, scale, balance, and maintain.

- Run Anywhere - A single technology present across the hybrid cloud landscape ensures consistency, compatibility, and security. Integration's roadmap includes leveraging the benefits of fully managed SaaS services that Confluent supports.

- Enterprise Support Wrap - Best-in-class support from Confluent, covering upskilling, training, and operational support for a team relatively new to the technology.

Phase 1 – The Common Event Broker

The event streaming platform centered around what has been called the "Common Event Broker" (CEB). Its purpose was to provide a common layer of abstraction for the transient persistence of well-curated and meaningful events produced by any upstream source, and to make those events available for consumption by any interested downstream parties.

Common use cases include:

  1. Incident Management and Network Health Monitoring - Streaming information and events from network equipment into Kafka and onwards to applications like ServiceNow (SNOW).  These events are applicable to a range of business interests including network health, performance, and addressing customer equipment faults.

  2. Google Insights - Insight analysis performed against event streams sourced from numerous applications (e.g., CRM, Sales, Orders). These streams feed machine-learning algorithms that help predict areas of concern and drive decision-making for products, services, and marketing purposes. The same data is also often synchronized to the AWS CEB, where it is used for order journeys and other operational processes. A common CRM pattern is change data capture (CDC) from applications emitting trail logs that track the changes occurring within a database.  Whilst CDC is not considered a modern mechanism of event streaming, raw change logs can be important for the near real-time exposure of data, and these raw logs are often transformed to suit downstream applications as they are consumed.

  3. Event Driven Reconciliation - Desk agents use SNOW to log customer requests and faults.  These logs are sent to downstream applications like Salesforce via Kafka.  This data is used to organize the dispatch of field engineers to jobs.  When status updates are made in Salesforce as a result of field engineers progressing or completing a job, an update is produced back to Kafka for SNOW to consume for reconciliation purposes.  These applications are not synchronously bound by these interactions and operate independently, but exchange data in an event-driven fashion via the Common Event Broker.
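The reconciliation loop in the third use case can be pictured with a small in-memory sketch. It is illustrative only: `InMemoryBroker` is a toy stand-in for Kafka that delivers events synchronously, and the topic names (`snow.faults`, `salesforce.job-status`), ticket fields, and handler functions are all assumptions rather than anything BT Group has published.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]


class InMemoryBroker:
    """Toy stand-in for a Kafka cluster: topics with subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Real Kafka delivery is asynchronous; this sketch calls handlers inline.
        for handler in list(self._subscribers[topic]):
            handler(event)


broker = InMemoryBroker()
snow_tickets: Dict[str, str] = {}     # ticket id -> status, as held by SNOW
salesforce_jobs: Dict[str, str] = {}  # ticket id -> status, as held by Salesforce


def salesforce_on_fault(event: Event) -> None:
    """Salesforce side: a new fault becomes a job awaiting an engineer."""
    salesforce_jobs[event["ticket_id"]] = "DISPATCHED"


def snow_on_job_status(event: Event) -> None:
    """SNOW side: reconcile the ticket with the status reported back."""
    snow_tickets[event["ticket_id"]] = event["status"]


# Hypothetical topic names; neither application calls the other directly.
broker.subscribe("snow.faults", salesforce_on_fault)
broker.subscribe("salesforce.job-status", snow_on_job_status)

# A desk agent logs a fault in SNOW, which produces an event.
snow_tickets["INC-1"] = "OPEN"
broker.publish("snow.faults", {"ticket_id": "INC-1", "summary": "router fault"})

# A field engineer completes the job; Salesforce produces a status update.
salesforce_jobs["INC-1"] = "COMPLETED"
broker.publish("salesforce.job-status", {"ticket_id": "INC-1", "status": "COMPLETED"})
```

Each side only knows the topic it produces to and the topic it consumes from, which is what keeps the two applications decoupled.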

Fig 1 – the Common Event Broker architecture


The high-level methodology implemented was publish-subscribe, and this worked well within a single CEB cluster on an individual platform (e.g., on-premises). However, with growing interest in cloud, the question of how to make those same events available across platforms in a consistent way needed to be addressed.

One of the principles agreed from the outset was "publish once locally, consume anywhere."  This meant that producers and consumers, wherever they were hosted, only had to worry about producing to or consuming from the CEB that was local to them, with Integration taking responsibility behind the scenes for the management and exposure of those events across the event mesh to wherever they were needed.  This simplified the approach significantly for both the upstream and downstream application teams that wanted to share data across platforms.
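As a rough illustration of "publish once locally, consume anywhere", the sketch below models each CEB as an in-memory log per topic and lets a mesh object copy new events to the clusters where they have been exposed. The `Cluster` and `Mesh` classes, the `expose` and `sync` methods, and the `orders` topic are hypothetical names chosen for this example; in the real platform this role is played by Confluent's Cluster Linking rather than by application code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Cluster:
    """Toy Common Event Broker: an append-only list of events per topic."""
    name: str
    topics: Dict[str, List[str]] = field(default_factory=dict)

    def produce(self, topic: str, event: str) -> None:
        self.topics.setdefault(topic, []).append(event)

    def consume(self, topic: str) -> List[str]:
        return list(self.topics.get(topic, []))


class Mesh:
    """Hypothetical mesh layer that mirrors topics between clusters."""

    def __init__(self, clusters: List[Cluster]) -> None:
        self.clusters = {c.name: c for c in clusters}
        self.routes: Set[Tuple[str, str, str]] = set()  # (topic, source, target)

    def expose(self, topic: str, source: str, target: str) -> None:
        self.routes.add((topic, source, target))

    def sync(self) -> None:
        # The mirror is written only by the mesh, so its length acts as the
        # offset up to which the source log has already been copied.
        for topic, source, target in sorted(self.routes):
            src_log = self.clusters[source].topics.get(topic, [])
            dst_log = self.clusters[target].topics.setdefault(topic, [])
            dst_log.extend(src_log[len(dst_log):])


on_prem, aws, gcp = Cluster("on-prem"), Cluster("aws"), Cluster("gcp")
mesh = Mesh([on_prem, aws, gcp])
mesh.expose("orders", source="on-prem", target="gcp")

on_prem.produce("orders", "order-1 created")  # producer talks to its local CEB only
mesh.sync()
on_prem.produce("orders", "order-1 shipped")
mesh.sync()

gcp_view = gcp.consume("orders")  # consumer talks to its local CEB only
aws_view = aws.consume("orders")  # not exposed to AWS, so nothing arrives there
```

The producer and the consumer never reference a remote cluster; only the mesh configuration knows where a topic has to travel.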

In terms of the architecture design, the team explored numerous options, including:

  1. Hub and Spoke - A CEB cluster on each platform with the on-premises instance nominated as a hub, and all events, regardless of their origin, traversing that hub to make the data available to other clusters.

    1. Drawbacks of this approach included:

      1. Hub becomes a bottleneck with an extra hop potentially needed in many circumstances.

      2. Hub becomes a single point of potential failure.

      3. Cloud-to-cloud integration would need to go back to ground before going back to the target cloud.

      4. Cost of the hub cluster likely to be prohibitive over time as all feeds would go via it.

      5. Deemed an overall rigid approach being tied to the hub instead of having flexibility to synchronize topics only between the clusters that need it.

  2. Fully Meshed - An all-to-all deployment where each cluster on its respective platform is effectively in charge of its own destiny, managing the synchronization of events directly and independently with other CEB clusters across the estate.

    1. Drawbacks of this approach included:

      1. Complex networking that could result in all-to-all connectivity.

      2. Governance over the events traversing the estate becomes challenging by having independent CEB clusters.

      3. Overheads introduced for auditing the whereabouts of data both across the hybrid landscape and within the CEB ecosystem itself.

      4. Overall cluster and egress management is difficult.

  3. Smart Event Mesh - The approach that was ultimately agreed upon, and which went on to form the basis of the second phase of the event streaming platform's evolution. It builds on the flexibility of the fully meshed architecture while providing improved autonomy and cross-cluster coherence, which bring simplicity and governance.  Each cluster is able to communicate directly with the others as required, but a central control plane manages the configurations that automate the inclusion or exclusion of clusters in the event synchronization process.
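The trade-off between the three options can be made concrete by counting the cluster-to-cluster links each one implies. The following is a back-of-the-envelope sketch, not BT Group's sizing: the function names are invented for this example, the fourth cluster `edge` is hypothetical, and the `required` pairs are assumed purely to show the shape of the comparison.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

Link = FrozenSet[str]  # an undirected connection between two clusters


def hub_and_spoke(clusters: List[str], hub: str) -> Set[Link]:
    """Every cluster connects only to the hub."""
    return {frozenset((hub, c)) for c in clusters if c != hub}


def fully_meshed(clusters: List[str]) -> Set[Link]:
    """Every cluster connects to every other cluster: n * (n - 1) / 2 links."""
    return {frozenset(pair) for pair in combinations(clusters, 2)}


def smart_mesh(required: Iterable[Tuple[str, str]]) -> Set[Link]:
    """Only the pairs that actually need to synchronize topics are linked."""
    return {frozenset(pair) for pair in required}


# "edge" is a hypothetical fourth cluster added to make the numbers diverge.
clusters = ["on-prem", "aws", "gcp", "edge"]

hub_links = hub_and_spoke(clusters, hub="on-prem")
mesh_links = fully_meshed(clusters)
smart_links = smart_mesh([("on-prem", "aws"), ("aws", "gcp")])

# With a hub there is no direct cloud-to-cloud link, so AWS-to-GCP traffic
# has to "go back to ground" through the on-premises cluster.
cloud_to_cloud = frozenset(("aws", "gcp"))
direct_in_hub = cloud_to_cloud in hub_links
direct_in_smart = cloud_to_cloud in smart_links
```

For these four clusters the hub needs three links, the full mesh six, and the configuration-driven mesh only as many as the declared pairs, while still allowing direct cloud-to-cloud paths.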

Fig 2 – a high-level view of configuration-driven event synchronization


At the high level, this approach requires:

  • Connectivity from each cluster to every other cluster (as required).

  • Topic-based synchronization capability through Confluent's Cluster Linking.

  • Kafka Connect infrastructure for sources and sinks on each platform (as required).

  • Access to a centrally managed Schema Registry with regular, autonomous propagation of updates to the local clusters.

  • A central control plane to manage the backend processes facilitating cluster automation and onboarding journeys.

Phase 2 – The Smart Event Mesh

Beyond the initial architecture discussions, the team expanded the Smart Event Mesh into a much bigger vision and evolutionary objective.

The Common Event Brokers form a foundational mesh of interconnected brokers that acts as a streaming integration layer between decoupled applications, cloud services, and devices, regardless of where they are deployed.

Creating a "Smart Event Mesh" means creating an event mesh that, through a series of enhancements and workflows, provides event discoverability, holistic automation, federated self-service, and governance.

The Digital Integration team is now, with the support of Confluent, in the process of implementing this phase two vision.

Fig 3 – the Smart Event Mesh architecture


Key, high-level objectives of the Smart Event Mesh are:

Discoverability

  • Use of an event portal to allow anyone at BT access to browse and discover available events.

  • Events to be cataloged with business descriptions, schemas, and a click-through interface for requesting to consume.

  • Metadata tagging to aid backend processes that determine data types (PII etc), consumer hosting, master topics, and any available remote topics.
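One way to picture the discoverability objectives is as a catalog of entries that carry a business description, a schema reference, and metadata tags. The sketch below is a minimal illustration under assumed names: `CatalogEntry`, `EventCatalog`, the field names, and the sample topics are all hypothetical and do not describe BT Group's actual event portal.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    topic: str
    description: str                         # business description shown in the portal
    schema_subject: str                      # name of the schema describing the event
    master_cluster: str                      # where the master topic lives
    remote_clusters: Tuple[str, ...] = ()    # clusters already holding a remote copy
    tags: FrozenSet[str] = frozenset()       # e.g. {"PII"}


class EventCatalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.topic] = entry

    def search(self, keyword: str) -> List[str]:
        """Browse by keyword across topic names and business descriptions."""
        kw = keyword.lower()
        return sorted(
            e.topic for e in self._entries.values()
            if kw in e.topic.lower() or kw in e.description.lower()
        )

    def contains_pii(self, topic: str) -> bool:
        return "PII" in self._entries[topic].tags

    def local_copy(self, topic: str, consumer_cluster: str) -> Optional[str]:
        """Return the consumer's own cluster if the topic is already there."""
        entry = self._entries[topic]
        if consumer_cluster == entry.master_cluster or consumer_cluster in entry.remote_clusters:
            return consumer_cluster
        return None  # a backend process would have to arrange a new remote topic


catalog = EventCatalog()
catalog.register(CatalogEntry(
    topic="crm.customer-updates",
    description="Customer profile changes captured from CRM",
    schema_subject="crm.customer-updates-value",
    master_cluster="on-prem",
    remote_clusters=("gcp",),
    tags=frozenset({"PII"}),
))
catalog.register(CatalogEntry(
    topic="network.health",
    description="Health events from network equipment",
    schema_subject="network.health-value",
    master_cluster="on-prem",
))
```

The tags are what let backend processes decide, without human involvement, whether a request to consume needs extra approval and whether the data is already available on the consumer's platform.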

Self-Service

  • Freedom, flexibility, and control for teams outside of Integration to interact directly with Confluent (Common Event Broker), with far fewer touchpoints.

  • Client libraries, Schema Registry and Broker access, with light-touch onboarding governance to ensure consumption and implementations are appropriate.

Automation

  • Building “behind the scenes” processes that autonomously control the Confluent clusters wherever they are: 

    • Topic creation and Schema Registry.

    • Access Control Lists (ACLs) and producer approval processes for consumers.

    • Configuration-driven Cluster Link invocation and revocation.
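Configuration-driven invocation and revocation can be thought of as a reconciliation step: compare the mirrors the configuration asks for with the mirrors that currently exist, and derive the actions needed to close the gap. The sketch below shows only that planning step, assuming a hypothetical `MirrorSpec` tuple and action strings; it does not call Cluster Linking or any Confluent API.

```python
from typing import List, Set, Tuple

# Hypothetical shape: (topic, source cluster, target cluster)
MirrorSpec = Tuple[str, str, str]


def plan_link_changes(desired: Set[MirrorSpec], actual: Set[MirrorSpec]) -> List[str]:
    """Derive the actions that move the estate from `actual` to `desired`."""
    actions: List[str] = []
    # Anything configured but not yet running must be invoked.
    for topic, src, dst in sorted(desired - actual):
        actions.append(f"CREATE mirror of {topic}: {src} -> {dst}")
    # Anything running but no longer configured must be revoked.
    for topic, src, dst in sorted(actual - desired):
        actions.append(f"REVOKE mirror of {topic}: {src} -> {dst}")
    return actions


desired = {("orders", "on-prem", "gcp"), ("faults", "on-prem", "aws")}
actual = {("orders", "on-prem", "gcp"), ("billing", "aws", "gcp")}

plan = plan_link_changes(desired, actual)
```

Because the plan is computed from declared state, running it again after the actions have been applied yields an empty list, which is the property that makes this kind of automation safe to schedule.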

Governance

  • Providing clarity around the existing best practices with additional functionality that delivers a seamless service:

    • Establishing clear boundaries of data ownership.

    • Managing workflows for all user interactions, applying appropriate checks and balances, and provisioning access and templates.

    • API-based schema registration and enforcing the propagation of schemas across the Kafka ecosystem.

    • Seamless metadata exchange and embedding common cross-platform functionality (data quality/policies) with BT Group's strategic Data Fabric. 

Proactive Monitoring

  • Bolstering existing monitoring and logging solutions to facilitate autonomous cluster management.

    • Threshold-based alerting:

      • Measuring volumetrics, throughputs, and early detection of potential bottlenecks.

      • Measuring topic consumption.

    • Automated housekeeping that is integrated with Cluster Linking.
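Threshold-based alerting on topic consumption is commonly expressed as consumer lag: the gap between the latest offset written to a partition and the offset a consumer group has committed. The sketch below is a minimal illustration with made-up numbers; `PartitionOffsets`, the group names, and the threshold value are assumptions, and in practice the offsets would come from the platform's monitoring tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PartitionOffsets:
    log_end_offset: int     # latest offset written to the partition
    committed_offset: int   # last offset the consumer group has committed


def consumer_lag(partitions: Dict[int, PartitionOffsets]) -> int:
    """Total number of events the group still has to process."""
    return sum(
        max(p.log_end_offset - p.committed_offset, 0)
        for p in partitions.values()
    )


def lag_alerts(groups: Dict[str, Dict[int, PartitionOffsets]], threshold: int) -> List[str]:
    """Names of consumer groups whose total lag exceeds the threshold."""
    return sorted(
        name for name, partitions in groups.items()
        if consumer_lag(partitions) > threshold
    )


# Hypothetical consumer groups and offsets, keyed by partition number.
groups = {
    "snow-sync": {
        0: PartitionOffsets(log_end_offset=1000, committed_offset=990),
        1: PartitionOffsets(log_end_offset=500, committed_offset=500),
    },
    "insights-loader": {
        0: PartitionOffsets(log_end_offset=20000, committed_offset=5000),
    },
}

alerts = lag_alerts(groups, threshold=1000)
```

A steadily growing lag is an early sign of a bottleneck, while a topic with no consumer groups at all is a candidate for the housekeeping described above.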

Phase 3 – Beyond the Smart Event Mesh

As work continues, using Confluent as a data streaming platform allows Digital Integration not only to build out the features of the Smart Event Mesh but also to realize its value. This brings a single user experience to pan-BT customers, who are able to self-serve with minimal overheads while control, security, and governance are retained across the estate.

The Common Event Broker on Google Cloud Platform is currently deployed as self-managed Confluent Platform. However, work is underway to assess a migration to Confluent Cloud, which would complete the picture and provide a consistent, fully managed service across both clouds.  This would further reduce the operational demands placed on the incumbent DevOps team.

The engineering and DevOps teams have enthusiastically built out the entire Kafka estate from scratch, starting with relatively little experience of the technology but working tirelessly to gain knowledge, upskill, and translate that into a fully fledged production deployment.  Kafka Connect and ksqlDB have both been successfully implemented as mechanisms for transforming and streaming near real-time data between sources and targets.

Additional work is also in flight to deliver BT Group's first Customer Facing Kafka (CFK), which is expected to dovetail into the Smart Event Mesh architecture over time and unlock yet more value both internally and externally.

BT Group is a fast-paced and dynamic organization with challenging technical demands.  The technology partnership with Confluent is not only keeping Integration abreast of those demands, but is helping to accelerate the team’s potential beyond them.

  • Paul Marsh is a Principal Solution Architect responsible for Data Integration at BT Group and has over 17 years of experience in the telecoms industry. Paul has spent most of his career designing and building batch and real-time integration applications using Ab Initio software, and played an intrinsic role in the design and delivery of EE's first Big Data Analytics platform (mData - now Active Intelligence). Since 2018 he's focused his attention on the wider integration architecture and has championed the deployment and evolution of a pan-BT event streaming platform. He enjoys spending time with his family and has a passion for horse racing.
