Document databases with a convergence of Graph, Stream, and AI
Reading Time: 25 minutes

How about more insights? Check out the video on this topic.

In this blog post we’ll explore this topic, touching on many aspects briefly for clarity and examining a few areas in depth. Over the past decade, document databases have revolutionized data handling, and while they continue to be effective, they also present new challenges.

What’s at the core and what is changing?

At its essence, our aim is to store data and harness its intelligence, recognizing its high value. However, the landscape is rapidly changing. Massive amounts of real-time data are being generated by machines, devices, sensors, and other smart things, which is changing the way data is consumed and processed. Just a few years ago, traditional data sources constituted more than 90% of all data. Now, device/machine data is becoming the main data source, and within the next seven to eight years it will account for 90% of the global data sphere, while traditional data will drop to less than 10%. This monumental shift brings both opportunities and challenges.

Use Cases

  1. IoT Devices: A surge in data from edge-level devices, at both the local and cloud levels. This includes real-time data from industrial IoT, vehicular metrics, and other typical IoT devices.
  2. Text + Video + Audio: With a broad array of data sources, including text, video, and audio, there’s a need to process them simultaneously rather than in isolation.
  3. Real-time Streaming: As devices mostly produce time-series data, there’s a shift towards analyzing this data in real time rather than processing it offline, especially because the immense volume of data carries perishable insights.
  4. Consumer Internet: Traditional data, like log files, clickstream data, and even detailed pixel data, is becoming real-time, which is crucial for improving e-commerce conversions, user experience, and lead quality.
  5. Fintech: The sector focuses on real-time transaction understanding, fraud analysis, and the recommendation of products, especially in mass-market finance.
  6. Root Cause Analysis: Rather than treating data in silos, there’s a drive to understand the root cause and interlink data as it arrives.
  7. Auto ML/AI: AI is being integrated more closely with data, emphasizing adaptability and agility. Consequently, there’s a rise in auto ML and AI applications that require real-time data handling.

These use cases highlight the industry’s move towards real-time data processing and the varied nature of data sources, both of which present challenges and opportunities.

Characteristics of Modern Data and Technological Advancements

The following are the characteristics of emerging data, which ultimately drive the design and architecture of the system.

  •   Most of the data is unstructured.

A significant 80-90% of data is unstructured, leading to the development and rise of document databases.

  •   Most of the data is Real-Time data.

Data is generated in real time, driving rapid growth in streaming analytics, with a notable 50% year-on-year growth over the past two years.

  •   Most of the data is Interconnected.

Despite data coming from varied sources, there’s an inherent connection between them. This interrelation has fueled the popularity of graph databases, which are growing non-linearly.

  •   AI is becoming an integral part of the process.

With the volume of data being massive and certain computations being time-intensive, predictive analysis and AI are becoming indispensable. Market trends corroborate this with increasing AI applications.

The market response is also in line with the above trends. The rapid growth of NoSQL databases (four to five times faster than traditional ones) and the rise in AI applications highlight the market’s acknowledgment of and reaction to these data characteristics.

Supporting Advancements: Yearly progress in the hardware and network sectors bolsters these data trends and their implications.

Overall, modern data’s features align with emerging technological advancements and are substantiated by market trends and responses.

Hybrid Deployment: Edge, Local, and Cloud

Device data is increasingly driving data trends and processing requirements. Therefore, processing at the edge level is necessary alongside computing at the cloud level.

Deployment and Processing Models [all connected]

  •   Edge Deployment: Systems or portions of systems are integrated directly within devices, even in an embedded manner.
  •   Local Deployment: These are onsite systems that work efficiently when connected to larger networks.
  •   Cloud Deployment: Centralized, vast repositories that can harness the vastness of the internet.
  •   Interconnectivity: All deployment models (edge, local, and cloud) should be interconnected to ensure seamless data flow and cooperation.
  •   Real-time Data Flow: There’s a continuous influx of data, with most being in real-time. This necessitates fast processing and immediate response mechanisms.
  •   Fusion of Systems: Modern applications and solutions demand the integration of various systems into a single platform. This means simultaneously handling streaming data, graph processing, and AI analytics.
  •   AI Integration: AI needs to be incorporated where the data resides, rather than exporting data to AI models and reintegrating the insights.
  •   Importance of Multi-modal Databases: Given the varied data types and the complexities associated with them, there’s a rising demand for databases that can handle multiple modes or types of data natively.

In essence, modern deployment strategies highlight the need for versatile, interconnected systems that can handle diverse data types and analytics requirements simultaneously.

Why is a document database alone not sufficient?

Document databases, while valuable, have certain limitations. For instance, they might not be apt for all data types. Enforcing a document structure on varied data can distort its inherent intelligence and context, leading to potential inefficiencies in performance. Moreover, these databases are not innately tailored for stream processing. Simply organizing data using timestamps isn’t true stream processing; it’s imperative to extract information as data comes in rather than after it’s logged. Activities such as continuous data processing, pattern or anomaly identification over sliding windows, and complex event processing should be integral to real-time data handling.

Furthermore, simply embedding reference IDs within a document doesn’t translate to genuine graph processing. Authentic graph processing requires the preservation of data links, optimization of join operations, and the safeguarding of data relationship integrity. Unlike document databases, true graph databases address graph-related problems more intuitively and effectively.

On the AI front, the traditional method of exporting data from databases for AI processing and then re-importing the AI model post-analysis can be cumbersome. Instead, incorporating AI directly into the data setting can refine and automate the entire process, from enhancing training efficiency to streamlining model versioning and deployment.

To sum it up, while document databases possess undeniable strengths, they aren’t a universal solution for all data-related challenges. Recognizing their restrictions and supplementing them with additional tools and systems is crucial.

Problems with stitched-together platforms

Stitched-together platforms pose significant challenges, particularly when trying to integrate multiple systems to create a unified solution. Merging three to five different systems can result in a complex platform that, while diverse, leads to inefficiencies, because each system has its own discipline for data ingestion, querying, and retrieval. For a single event, the communication overhead can be significant, especially if these systems run on separate machines or containers, leading to many network hops.

When integrating these systems, data aggregation becomes an issue, complicating the architectural design. With multiple silos in play, managing the platform requires expertise in each system, resulting in long development cycles. Some businesses have even shut down entire divisions due to the complexity and length of time required to realize such platforms.

Updates to one system can cascade, causing potential breakages elsewhere, which is not only inefficient but also costly. This complexity escalates when trying to scale; deciding where to add servers, partitioning data, and maintaining centralized or distributed data storage becomes almost untenable.

Furthermore, there’s an industry trend that leans towards in-memory computing, which, while advantageous in some respects, can also be a limitation. Data growth is outpacing available memory at a rapid rate, so relying solely on in-memory solutions can be brittle. The current data trend requires a balance between memory and disk storage, as depending solely on memory can hinder system robustness, yet transitioning to disk storage can slow down processing and affect real-time requirements.

Traditional NoSQL databases have proven helpful, but there’s also a growing need to integrate some relational database features. Greater control over data is essential, and this includes using features like buffer pools and page caches typically found in relational databases. Other requirements include concurrent reads and writes, transaction capabilities, durable data logging, automated cluster management, crash recovery, and consistent access models.
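
To make requirements like durable data logging and crash recovery concrete, here is a minimal write-ahead-log sketch in Python. It is an illustration of the principle only, not BangDB’s implementation: every write is appended and fsync’d to a log before it is applied, and the log is replayed on restart to rebuild state.

    import json, os

    class SimpleWAL:
        """Minimal write-ahead log: append a record durably before applying it."""

        def __init__(self, path):
            self.path = path
            self.log = open(path, "a", encoding="utf-8")

        def append(self, op, key, value=None):
            record = {"op": op, "key": key, "value": value}
            self.log.write(json.dumps(record) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())   # durable on disk before the write is acknowledged
            return record

        def replay(self, store):
            """After a crash, rebuild the in-memory store by re-applying the log."""
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["op"] == "put":
                        store[rec["key"]] = rec["value"]
                    elif rec["op"] == "delete":
                        store.pop(rec["key"], None)
            return store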

In essence, the current approach to data management, particularly with siloed systems and an over-reliance on in-memory computing, is not efficient. A more holistic, integrated system that considers newer data trends and incorporates AI, graph and streaming is crucial. This system should offer more control over data and adapt to modern challenges posed by the exponential growth of data and evolving technologies.

Convergence

Convergence is essential in today’s multifaceted problem landscape. When we examine issues from a high level, it becomes clear that problems present themselves across numerous dimensions simultaneously. To counter this, the solutions must also be multi-dimensional and brought together in a single, converged space.

By converging these solutions into one space, we equip ourselves with the tools to address these various challenges from multiple angles at once. This consolidation shifts us from intricate, hard-to-understand architectures to simpler, linearly scalable ones. A noteworthy analogy to consider here is the distribution model of water. Whether it’s a drop, a cup, or a bucket, water remains consistent in its properties. Similarly, when defining our “unit of compute” in technological solutions, it needs to be comprehensive and scalable. It should not be like separating hydrogen and oxygen and attempting to recombine them later at the processing or delivery stage. Rather, the choice of this unit is pivotal as it dictates the ease of distribution, high performance, massive scale, and manageability.

The primary advantage of this convergence is its ability to unify disparate elements into a cohesive unit and scale it. This eradicates isolated compartments and promotes linear scalability. The benefits also include reduced latencies, ensuring data remains within its designated compute unit, thus avoiding unnecessary network hops and data duplication. From a coding standpoint, this simplifies problem-solving and system management. Furthermore, by converging, you achieve faster performance, adaptability across various hardware, and cost efficiency.

However, the question remains: how can we successfully converge these elements, especially when they exist separately? The answer lies in modular development. By creating modular components, we can incorporate them within a staged event-driven architecture (SEDA), as proposed by Matt Welsh, David Culler, and Eric Brewer. This allows multiple stages to be created from a configuration file, enabling components to interact seamlessly. The data stays in one place, with references passed between stages, allowing for coherent processing. By treating these components as finite state machines, we can integrate them into various stages, providing a streamlined way to converge different systems or elements. Convergence also avoids repeated copying of data across layers and components, and the several network hops otherwise incurred for every single event. This alone is a game changer when it comes to extremely low processing latency and high performance; without convergence, these simply can’t be achieved.
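
As a rough, in-process sketch of this staged, event-driven idea (the stage names and handlers below are assumptions for illustration, not any particular engine’s API), each stage keeps an inbox, does its piece of work, and passes a reference to the same event object to the next stage, so the event is never copied or sent over the network.

    from collections import deque

    class Stage:
        """A stage owns an inbox and a handler; the handler mutates the event in place."""

        def __init__(self, name, handler, next_stage=None):
            self.name, self.handler, self.next_stage = name, handler, next_stage
            self.inbox = deque()

        def run(self):
            while self.inbox:
                event = self.inbox.popleft()        # a reference, not a copy
                self.handler(event)
                if self.next_stage is not None:
                    self.next_stage.inbox.append(event)

    # Three converged stages in one process: stream enrichment -> graph linking -> AI scoring.
    def enrich(event): event["rolling_avg"] = 0.5 * event["value"]
    def link(event):   event["edges"] = [("device:" + event["device"], "reading")]
    def score(event):  event["anomaly"] = event["value"] > 90

    ai     = Stage("ai", score)
    graph  = Stage("graph", link, next_stage=ai)
    stream = Stage("stream", enrich, next_stage=graph)

    stream.inbox.append({"device": "d1", "value": 97})
    for stage in (stream, graph, ai):
        stage.run()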

IO Layer

The IO layer is a critical component in optimizing data storage and retrieval, particularly when dealing with the limitations of memory and the performance disparity between memory and storage drives like SSDs. By creating a dedicated IO layer for the database that sits above standard file systems such as ext3 or ext4, we can achieve more disciplined and efficient data handling, flushing, and retrieval.

Simply relying on SSDs without an IO layer isn’t the solution. While SSDs are undoubtedly faster than traditional hard drives, they still lag considerably behind memory in terms of speed. An IO layer can bridge this gap. Instead of using an SSD merely as a replacement file system, a custom file or data IO design can treat it, virtually, as an extension of memory. This brings many advantages for both handling more data than the available RAM and keeping performance in an acceptable range as data grows beyond memory. It not only allows data operations to exceed the limits of built-in memory without severe degradation but also ensures that performance declines gradually within a defined boundary. More importantly, it enables elastic data operations rather than brittle ones that either drop the extra data (when it grows beyond memory) or degrade performance to an unacceptable level.
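
As a small illustration of treating the SSD as an extension of memory (a sketch under assumed names, not BangDB’s IO layer), the store below keeps hot keys in RAM and evicts cold ones to an SSD-backed file, so performance degrades gradually instead of data being dropped when memory runs out. Keys are assumed to be strings.

    import shelve
    from collections import OrderedDict

    class TieredStore:
        """Hot keys stay in RAM; cold keys are spilled to an SSD-backed file via shelve."""

        def __init__(self, spill_path, max_in_memory=1000):
            self.hot = OrderedDict()              # LRU order: oldest first
            self.cold = shelve.open(spill_path)   # file on the SSD
            self.max_in_memory = max_in_memory

        def put(self, key, value):
            self.hot[key] = value
            self.hot.move_to_end(key)
            if len(self.hot) > self.max_in_memory:
                old_key, old_value = self.hot.popitem(last=False)   # evict least recently used
                self.cold[old_key] = old_value                      # spilled, not dropped

        def get(self, key):
            if key in self.hot:
                self.hot.move_to_end(key)
                return self.hot[key]
            value = self.cold[key]     # slower path, but still served from the SSD
            self.put(key, value)       # promote back into memory
            return value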

In essence, an IO layer can provide seamless integration between memory and SSDs, enabling operations to go beyond memory constraints while still maintaining relatively high-performance levels.

Stream Processing in DB

Stream processing within a database involves handling data in real time, analyzing it, and acting upon it instantly. A key component for efficient stream processing is the continuous sliding window. Without it, the system risks being overwhelmed, consuming all available resources. This approach ensures data is processed in a flowing manner rather than in large, cumbersome chunks.
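
Here is a minimal continuous sliding-window sketch in Python, illustrating the idea rather than any specific engine: events older than the window span are expired as new ones arrive, so aggregates are computed over a flowing slice of the stream and resource usage stays bounded.

    from collections import deque

    class SlidingWindow:
        """Keeps only the last `span_seconds` of events, so memory stays bounded."""

        def __init__(self, span_seconds):
            self.span = span_seconds
            self.events = deque()              # (timestamp, value), oldest first

        def add(self, ts, value):
            self.events.append((ts, value))
            while self.events and self.events[0][0] < ts - self.span:
                self.events.popleft()          # expire old events instead of accumulating them

        def average(self):
            if not self.events:
                return None
            return sum(v for _, v in self.events) / len(self.events)

    window = SlidingWindow(span_seconds=60)
    for ts, value in [(0, 10), (30, 20), (90, 30)]:   # the reading at t=0 expires by t=90
        window.add(ts, value)
    print(window.average())                           # -> 25.0 over the last 60 seconds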

Furthermore, integrating complex event processing directly into the database is vital. It empowers the system to detect anomalies and patterns in streaming data swiftly. Real-time analytics plays a crucial role in stream processing, though the intricacies are vast. Additionally, it’s challenging for a sole document database to address all these needs. Integrating a graph database with a buffering system can enhance the efficiency and capabilities of stream processing.
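
And here is a small complex-event-processing style sketch, again illustrative only (the event fields and thresholds are assumptions): it watches the stream for a simple pattern, several failed logins from the same device within a short window, and raises an alert as the data arrives rather than in an offline query afterwards.

    from collections import defaultdict, deque

    WINDOW_SECONDS = 60     # how far back the pattern may stretch
    THRESHOLD = 3           # failed attempts that complete the pattern

    recent_failures = defaultdict(deque)   # device id -> timestamps of recent failures

    def on_event(event):
        """Called for every incoming event; returns an alert when the pattern completes."""
        q = recent_failures[event["device"]]
        if event["type"] != "login_failed":
            q.clear()                       # a success breaks the pattern
            return None
        q.append(event["ts"])
        while q and q[0] < event["ts"] - WINDOW_SECONDS:
            q.popleft()                     # only the recent window counts
        if len(q) >= THRESHOLD:
            return {"alert": "possible brute force", "device": event["device"]}
        return None

    for ts in (5, 20, 40):
        alert = on_event({"type": "login_failed", "device": "d7", "ts": ts})
    print(alert)   # -> {'alert': 'possible brute force', 'device': 'd7'}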

Graph in DB

Incorporating graph structures within a database is a complex but vital task. For optimal functioning, this graph layer needs to be integrated seamlessly with both the I/O layer and stream processing components. As data flows into the system and undergoes stream processing, relevant patterns are identified, processed, and then automatically funneled into updating the graph layer. This automated transition ensures that no separate graph processing task is demanded from the user’s end.

An additional benefit is the ability to run universal queries across various layers, enhancing coherence and user experience. This goes beyond just embedding reference IDs. A notable feature is the potential to abstract machine learning tasks. By using specific queries, like those in the Cypher language, it’s possible to implicitly train models, bringing AI capabilities directly into the database. This addition is essential for automation and advanced data processing functionalities.
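
To make this concrete, the sketch below folds a pattern detected on the stream into a tiny in-memory graph and shows a Cypher-style query for the same question. Both the toy structure and the query text are illustrative assumptions, not BangDB’s exact syntax.

    graph = {}   # node -> set of (relation, target node)

    def upsert_edge(src, rel, dst):
        graph.setdefault(src, set()).add((rel, dst))

    # As stream processing identifies a pattern, the graph layer is updated automatically.
    upsert_edge("device:d1", "LOCATED_IN", "site:plant-7")
    upsert_edge("device:d1", "RAISED", "anomaly:overheat-42")

    # A Cypher-style query (illustrative syntax) a converged engine could answer directly:
    cypher = """
    MATCH (d:Device)-[:RAISED]->(a:Anomaly), (d)-[:LOCATED_IN]->(s:Site)
    RETURN s.name, count(a) AS anomalies
    """

    # The same question answered over the toy structure above:
    def anomalies_by_site():
        counts = {}
        for node, edges in graph.items():
            sites = [dst for rel, dst in edges if rel == "LOCATED_IN"]
            raised = [dst for rel, dst in edges if rel == "RAISED"]
            for site in sites:
                counts[site] = counts.get(site, 0) + len(raised)
        return counts

    print(anomalies_by_site())   # -> {'site:plant-7': 1}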

AI in Databases

Incorporating artificial intelligence directly into databases can unlock significant performance advantages. A primary benefit is the elimination of the need to export data elsewhere for analysis, which instantly offers a performance boost. When AI operates directly within the database, it’s possible, after a few iterations, to set the AI processes on autopilot.

This automated approach not only trains models but also seamlessly deploys them, integrating them directly into the data layer.

A challenge often faced in enterprise settings is the storage of these AI models. Traditional document databases might not be apt for storing large binary model data. Solutions like S3 might be suggested, but many enterprises hesitate to store data outside their direct control or boundaries. This emphasizes the importance of having an internal mechanism, within the database system itself, to manage and store binary and large object data. By doing so, the database can optimally decide the storage and retrieval process of this data, catering specifically to its usage and purpose.
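
As a sketch of keeping model binaries inside the database boundary (the directory layout, field names, and chunk size here are assumptions for illustration), a serialized model can be chunked, stored as large objects, and described by a metadata record that the system can use for versioning and deployment.

    import hashlib, os

    CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB per chunk, an arbitrary choice for the sketch

    def store_model(store_dir, model_name, version, model_bytes):
        """Store a serialized model as chunks plus a metadata record, inside the DB boundary."""
        os.makedirs(store_dir, exist_ok=True)
        digest = hashlib.sha256(model_bytes).hexdigest()
        chunk_ids = []
        for offset in range(0, len(model_bytes), CHUNK_SIZE):
            chunk_id = f"{model_name}.{version}.{offset // CHUNK_SIZE}"
            with open(os.path.join(store_dir, chunk_id), "wb") as f:
                f.write(model_bytes[offset:offset + CHUNK_SIZE])
            chunk_ids.append(chunk_id)
        # Metadata lives next to the data, so model versioning and deployment can be automated.
        return {"model": model_name, "version": version,
                "sha256": digest, "chunks": chunk_ids}

    meta = store_model("/tmp/model_store", "churn_classifier", "v3", b"\x00" * 10_000_000)
    print(meta["version"], len(meta["chunks"]))   # -> v3 3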

BangDB

BangDB stands out for its comprehensive implementation of several database concepts. Distinctively, it was built from the ground up without borrowing any existing DB engine. This includes its own buffer pool, page cache, IO layer, and storage and query layers, leading to an extensive list of features. This unique approach to design and architecture provides several benefits, notably in terms of performance and enhanced functionality.

Although performance was a focal point, BangDB wasn’t designed solely for it. Yet its architecture led to significant performance advantages. In comparative tests, BangDB showed better performance than in-memory databases like Redis and databases known for sequential writes like LevelDB. In specific benchmarks where Redis failed to complete due to memory constraints, BangDB excelled. Similarly, in the YCSB benchmark, it outperformed several other databases.

Key to its success were the early architectural decisions to converge with the IO layer, move beyond purely in-memory computing, use a fine-grained threading model, and maintain full control over features like the buffer pool and page cache. The system is adept at leveraging infrastructure components such as CPUs, memory, and even GPUs.

The presentation concluded by highlighting these achievements, reinforcing BangDB’s position as a robust and efficient database solution.

Following is a concise list of features of BangDB.

Data Model: KV, Doc, binary data, large files, Graph, time-series
Buffer pool: Page cache, manages every single byte of data
Index: Primary, secondary, composite, nested, reversed, geo-index, Vector*
WAL: Transaction, durability, crash recovery, availability
IO Layer: SSD as RAM+, high-performance IO, predictive IO
Deployment: Embedded, Client/Server, p2p distributed, Hybrid, Cloud
Enterprise grade: Data replication, disaster recovery, business continuity
Security: End-to-end TLS/SSL, user service and API key for auth
Stream: Time-series, ETL, statistics, aggregates, CEP, anomaly, pattern
AI: ML, IE, DL, train, prediction on stream, AutoML, versioning, deployment
Graph: Graph data platform, Cypher query, Ontology
Cloud platform: Ampere, an interactive front-end platform on the cloud
Performance: 200K+ IOPS, 20K+ events/sec per commodity machine
Language: Database – C/C++; clients – C/C++, Java, C#, Python
Connect: Custom clients, CLI, REST API
License: Free – BSD 3, SaaS, Enterprise – Custom

About the author

Sachin Sinha

Founder, CEO and Developer at BangDB

Sachin is a technology enthusiast with over 20 years of experience in designing and building software products for large-scale systems. Sachin authored BangDB, a high-performance converged NoSQL database for real-time data processing and analytics for emerging use cases. Prior to founding BangDB, Sachin worked in senior roles at Amazon and Microsoft on database and analytics products. He has been an active participant in the database domain and has published papers and research work, along with several patents, in the core database and file systems areas. He holds a B.Tech from IIT Kanpur and lives in Bangalore, India.
