
The Missing Piece in the Modern Data Stack: Real-Time Transactional Integration

This article is part of the Crunchbase Community Contributor Series. The author is an expert in their field and we are honored to feature and promote their contribution on the Crunchbase blog.

Please note that the author is not employed by Crunchbase and the opinions expressed in this article do not necessarily reflect official views or opinions of Crunchbase, Inc.


Nnamdi Iregbulem recently penned a great article for Crunchbase News about “How the Modern Data Stack Is Going Real-Time,” in which he mentioned some of the most popular tools for building a real-time infrastructure. He also brought up some real-world use cases and showed how this trend is transforming our industry’s changing landscape. As someone closely involved in the very process of bringing about that change, though, I couldn’t help noticing that one very important piece of the puzzle was missing: real-time data ingestion from transactional databases.

In this blog, I’d like to continue the conversation that Nnamdi started by sharing my own perspectives on the modern data stack (MDS), why real-time data ingestion is an essential element of it, and how enterprises can make it part of their existing data infrastructure.

 

Defining the modern data stack

Competitive advantage in today’s world hinges on an organization’s ability to innovate and adapt rapidly to change. That means having a clear view of the environment, adequate information to assess the options, and the capacity to move quickly. Unfortunately, traditional data infrastructure and batch processing methods lack the speed and agility necessary to support that kind of decisive, effective leadership.

Decision-makers need visibility into events as they unfold, not after the fact. The “rearview mirror” perspective no longer works. That need is driving faster, higher-volume integration, with much higher stakes. For this reason, the modern data stack is gaining popularity among many enterprises.

Conventional elements of the data stack have included a fully managed data pipeline for ELT (extract, load, transform), a cloud-based columnar warehouse or data lake, a data transformation tool, and a business intelligence platform. The traditional approach with transactional systems has been batch-mode integration scheduled to run periodically during off-peak hours. That approach is resource-intensive, inefficient, expensive and prone to data inconsistencies, and it severely impedes an organization’s ability to get up-to-date, accurate information.

The modern data stack, in contrast, seeks to reimagine traditional data flows using cloud-based tools to build a faster, more scalable and more flexible data infrastructure. The MDS isn’t just about specific tools; it’s about removing friction to make modern enterprises more agile and responsive. Like any other tech stack, the MDS should always be evolving.

In the context of the MDS, integration has been focused on SaaS workflows. Unicorn companies like Fivetran have driven this trend, weaving together social media, web analytics, online payments and more. Unfortunately, SaaS application integrators fail to address some of the most important cases for enterprises: transactionally intensive databases like Oracle, SQL Server, SAP HANA, MySQL, DB2, and others. These systems house an organization’s most mission-critical data. As such, they offer the highest potential for adding business value through integration.

Large transactional systems typically have APIs that are limited in performance and/or scope. That, in turn, forces enterprises to extract data from online transaction processing databases using slow and inefficient batch replication and unreliable custom scripts. A few companies have responded with real-time options, but they aren’t cloud-native and they don’t offer a fully managed cloud service option.

 

Why does real-time matter so much?

Transactional enterprise systems typically collect and store an organization’s highest-value data. Unfortunately, that data is often inaccessible within the modern data stack or, at the very least, is only available after a considerable time lag. Real-time integration with transactional databases is difficult, primarily because of concerns about security and performance.

Innovation in digital customer experience is driving higher and higher expectations. It’s about speed, service and personalization. Consider, for example, that 80% of American consumers are more likely to purchase from a company that personalizes its sales offering. Unfortunately, you can’t deliver real-time personalization if your CRM takes hours or days to update your customer recommendation engine.

Strategists in the financial services industry likewise agree that real-time is the future for companies in their market space. Fraud detection alone is nearly a $30 billion industry; companies that fail to adopt the MDS will be at a distinct disadvantage in their quest to limit fraud-related expenses. Those firms will also be spending more than they need to on batch-mode integration, with higher resource expenditures, data inconsistencies, delayed analytics and lost revenue.

Regardless of the industry, companies that don’t begin implementing an MDS approach today will soon find themselves edged out by the competition.

The entire point of the MDS is to do away with data silos. Unfortunately, without real-time data integration between transactional systems of record and modern analytics platforms, the only data silos that remain are the ones that contain the organization’s most valuable information.

The missing piece in the modern data stack, therefore, is the capacity to push high-volume enterprise data from online transaction processing databases and data warehouses to the modern data platform of your choice, efficiently and in real time. Fortunately, that missing piece exists; it’s what I like to call “real-time ELT.”

 

Real-time ELT bridges the gap

The secret sauce for transactional integration is a technology called change data capture, or CDC for short.

CDC makes it possible to create streaming data pipelines by monitoring each source system to identify changes in real time. This is often accomplished by reading the database’s transaction log, although alternative approaches (such as triggers or timestamp-based queries) may also be applied, depending on the data source. Each change in the source system is captured as it happens and is then transmitted to the target system. This gives business users near-instant access to the organization’s most valuable data.
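
To make the capture-and-apply flow concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor’s implementation: the `ChangeEvent` type, its `position` field and the `apply_changes` helper are hypothetical names, and an in-memory list stands in for a real transaction log. Each change is applied to the target in log order, and a checkpoint records how far the target has caught up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change read from a (hypothetical) transaction log."""
    position: int         # log position; assumed to increase monotonically
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes


def apply_changes(target: dict, events: list, last_applied: int = 0) -> int:
    """Apply events to the target in log order and return the new checkpoint.

    Events at or below last_applied are skipped, so replaying the same
    batch after a failure does not change the result.
    """
    for event in sorted(events, key=lambda e: e.position):
        if event.position <= last_applied:
            continue  # already applied; safe to ignore on replay
        if event.op == "delete":
            target.pop(event.key, None)
        elif event.op in ("insert", "update"):
            target[event.key] = dict(event.row)
        else:
            raise ValueError(f"unknown operation: {event.op}")
        last_applied = event.position
    return last_applied


# Stand-in for changes captured from a source database's log.
log = [
    ChangeEvent(1, "insert", 101, {"name": "Ada", "tier": "basic"}),
    ChangeEvent(2, "insert", 102, {"name": "Lin", "tier": "basic"}),
    ChangeEvent(3, "update", 101, {"name": "Ada", "tier": "gold"}),
    ChangeEvent(4, "delete", 102, None),
]

replica = {}
checkpoint = apply_changes(replica, log)
print(replica, checkpoint)
```

Because the target only ever processes the changes themselves, it never has to rescan the source tables. The checkpoint is one simple way to resume after an interruption without a full resync.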

Enterprise CDC offers extraordinarily low latency without the need for resyncs. You get tremendous scalability, high performance and guaranteed delivery. CDC also allows companies to phase out the use of custom scripts that typically require ongoing maintenance and troubleshooting. And because log-based CDC reads changes from the transaction log rather than querying production tables, its impact on the performance and security of production databases is minimal.

Real-time/streaming ELT is ideal for organizations that have massive datasets requiring deduplication and other preprocessing before they’re ingested into a real-time analytics platform. With a modern, real-time data stack, simple incremental transformations can take place multiple times each second and give insights almost instantly. CDC supports enterprises in modernizing their data infrastructures and achieving real-time integration for a range of modern use cases.
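
As a rough illustration of deduplication followed by a simple incremental transformation, the Python sketch below collapses each micro-batch to the latest change per key and then adjusts running totals by the difference each change introduces, instead of recomputing them from scratch. The tuple layout `(position, key, amount)`, the `region_of` lookup and both function names are assumptions made for this example, and deletes are left out to keep it short.

```python
from collections import defaultdict


def compact(batch):
    """Keep only the latest change per key within one micro-batch."""
    latest = {}
    for position, key, amount in batch:
        if key not in latest or position > latest[key][0]:
            latest[key] = (position, key, amount)
    return sorted(latest.values())


def update_totals(totals, current, batch, region_of):
    """Adjust per-region totals by the delta each change introduces."""
    for _, key, amount in compact(batch):
        delta = amount - current.get(key, 0)  # new value minus last known value
        totals[region_of[key]] += delta
        current[key] = amount
    return totals


region_of = {"o1": "east", "o2": "west", "o3": "east"}  # hypothetical orders
totals = defaultdict(int)   # running order value per region
current = {}                # last known amount per order

# First micro-batch: order o1 is changed twice; only the later change counts.
update_totals(totals, current, [(1, "o1", 50), (2, "o2", 30), (3, "o1", 70)], region_of)

# Second micro-batch: the change to o2 arrives twice and is deduplicated.
update_totals(totals, current, [(4, "o2", 45), (4, "o2", 45), (5, "o3", 10)], region_of)

print(dict(totals))
```

Each batch touches only the rows that changed, which is what makes it practical to run a transformation like this many times per second.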

In short, CDC eliminates the last remaining silo in the modern data stack. It opens the door to real-time analytics that encompass the entirety of your enterprise data. It drives ML/AI workloads, reduces operational overhead, improves data quality and a lot more.


Rajkumar Sen is the founder and chief technology officer at Arcion, the only cloud-native, CDC-based data replication platform. In his previous role as director of engineering at MemSQL, he architected the query optimizer and the distributed query processing engine.

  • Originally published June 22, 2022