An enlightening conversation between Alexey Palashchenko, Co-founder & CTO at FerretDB (host), and Bruce Momjian, Vice President at EDB (guest), on PostgreSQL and advanced document data storage techniques. They discuss user stories, the specifics of storing data in text columns, the nuances of JSON vs. JSONB storage, and the potential of PostgreSQL’s flexible index types.
Highlights of the Interview:
- Delving into PostgreSQL’s various storage solutions.
- The advantages and challenges of different data storage methods, including Java and PHP serialization.
- The potential of using full-text search with the Simple dictionary.
- Real-world applications, with a spotlight on FerretDB.
- A comparative look at emulating MongoDB on PostgreSQL and the complexities therein.
- Bruce Momjian, a key figure in the PostgreSQL community, provides invaluable insights and clarifications, offering readers a chance to enhance their understanding of PostgreSQL’s capabilities and potential.
Hi Bruce, I am glad to see you at the Document Database Community webinars. Today we will talk about non-relational PostgreSQL. Firstly, you mentioned previously at your talk that XML is less popular now. Can you shed some light on its current usage? Additionally, when considering features like the XML data type that might not be widely used, is there a process for removing such features from PostgreSQL due to maintenance concerns?
I don’t anticipate any removal of XML for two primary reasons.
First, we generally don’t remove basic functionalities, and we regard XML as a foundational feature. Although modern developers might not use it as frequently, it’s still present in a lot of legacy development. For instance, Java had a strong reliance on XML in the past, so you still find XML usage occasionally.
Secondly, XML is part of the SQL standard. So, the idea of removing it doesn’t really align with our commitment to standards. One downside is that XML doesn’t receive the same level of performance improvements or feature enhancements as, say, JSONB. With JSONB, you can utilize the column for fast access and indexing. This capability isn’t present for XML. While there’s no technical impediment to enhancing XML, the demand simply hasn’t been there.
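As a minimal sketch of the JSONB indexing capability mentioned above (the table and column names are hypothetical):

```sql
-- A JSONB column can be GIN-indexed for fast containment lookups;
-- XML columns have no comparable built-in index support.
CREATE TABLE events (id bigint PRIMARY KEY, payload jsonb);

CREATE INDEX events_payload_idx ON events USING GIN (payload);

-- The @> (containment) operator can use the GIN index.
SELECT id FROM events WHERE payload @> '{"type": "login"}';
```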
Thank you for that clarification. How does PostgreSQL handle large XML and JSONB documents? Does MVCC pose any challenges? What about the write-ahead logging mechanism? Can you discuss potential solutions?
Absolutely. There are two key points here.
First, fields that exceed several kilobytes are usually stored in a separate entity named the “TOAST” table. Every primary table has an associated TOAST table for these longer values. It’s a concept well-documented in Postgres, and I’ve written blog entries on it which can be found by searching for “TOAST.” Typically, extensive documents like XML or JSON would reside in the TOAST table instead of the primary table’s main page. So, when you update a row without modifying its associated TOASTed value – say, an XML column – a new row version is generated, but both old and new rows reference the original TOAST value. This efficient design means that unless you modify the TOAST fields, they remain unaltered, reducing redundancy.
However, if you do modify these larger fields, Postgres will generate a new TOAST row. A slight change to a big document implies creating and storing a new copy of the entire document in the TOAST table. With time, vacuum processes clear old values, but this method contrasts with databases like MongoDB, which might allow in-place updates to specific document portions without generating an entirely new document.
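The two update paths described above can be sketched as follows; the table and columns are hypothetical, and whether a value is actually TOASTed depends on its size:

```sql
-- Large documents are stored out of line in the table's TOAST table.
CREATE TABLE docs (
    id         bigint PRIMARY KEY,
    updated_at timestamptz,
    body       jsonb
);

-- Find the TOAST table backing this table:
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'docs';

-- Updating without touching the TOASTed column: the new row version
-- points at the existing TOAST entry, so the document is not copied.
UPDATE docs SET updated_at = now() WHERE id = 1;

-- Modifying the document itself: a new copy of the whole document
-- is written to the TOAST table, like an insert.
UPDATE docs SET body = body || '{"status": "done"}' WHERE id = 1;
```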
So, does PostgreSQL support in-place document updates? Some solutions address this by inflating document sizes, causing them to be larger than necessary.
Indeed. We’ve considered, for example, a differential approach. You retain the base document and store only the differential changes when updates are made. Upon retrieval, the original document gets merged with the differential. But this leads to accumulating differentials over time. And eventually, when managing too many differentials becomes complex, you’d revert to creating a new comprehensive document, discarding old differentials. Given the intricacies, especially in concurrent environments with continuous operations, we haven’t solidified this approach. However, it remains an area of interest.
Regarding document storage, you’ve mentioned that there’s compression for indexes with duplicate values. But is there any specific compression mechanism for repeated values or fields within the document itself, not just the index?
Absolutely. When a document is stored in the TOAST table, we automatically compress that document. Any duplication within the document will be compressed. That’s a major benefit of TOAST – not only can we handle updates without rewriting unchanged values, but we also ensure everything in the TOAST table is compressed.
So, it’s a generalized compression mechanism. Are there specific implementations for frequently used field names? For instance, if every document has a consistently named field like “ID” or another common lengthy field, do we utilize Dictionary Compression or a similar technique?
That’s insightful. Our current compression mechanism focuses solely on a specific field and doesn’t compare with other fields in the same row or different rows. This approach stems from concurrency challenges; it’s complex to consider other rows because updating one row might necessitate changing another. However, we’re contemplating introducing a dictionary – either user-generated or internally crafted. This dictionary would identify common values within a field and substitute frequently occurring values with a more concise representation. So, introducing a “compression dictionary” that operates across rows but within the same field is a concept under exploration.
Another compression-related question: can users deactivate TOAST compression? If so, would such deactivation possibly expedite operations, given that there’s no need for decompression?
To clarify, you’re asking if it’s possible to disable the compression, correct? To possibly speed up processes? Yes, you can. Using the ALTER TABLE command followed by ALTER COLUMN, there’s a storage parameter that allows users to prevent compression.
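The ALTER TABLE form mentioned here looks like the following; the table and column names are hypothetical:

```sql
-- Store the column out of line but uncompressed, avoiding
-- decompression cost on read.
ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;

-- The other storage modes, for comparison:
--   PLAIN    in-line, uncompressed
--   MAIN     in-line, compressed
--   EXTENDED out of line, compressed (the default for TOASTable types)
```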
Let me clarify this point: I often notice user stories where objects are stored in text columns, like Java or PHP serialization. They start by saving information as text, then face challenges when they need to index or query individual fields inside that serialized object format. Can PostgreSQL’s flexible index types address this?
We offer multiple storage solutions. While one method is saving it as a text field, you can also save binary data using the bytea type. We also provide a large object facility to centrally store large objects. Another method involves saving a file system name within a row and referencing the file from the database. This method, however, has synchronization challenges, especially when rows are deleted or transactions aborted. Regarding the indexing part of your question, I’m unsure of the specifics.
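Three of the in-database storage options mentioned above can be sketched like this (the table and column names are hypothetical):

```sql
CREATE TABLE blobs (
    id       bigint PRIMARY KEY,
    as_text  text,    -- serialized object stored as text
    as_bytes bytea,   -- raw binary data
    as_lob   oid      -- reference into the large object facility
);

-- Large objects are managed with the lo_* functions; lo_create(0)
-- allocates a new large object and returns its OID.
INSERT INTO blobs (id, as_lob) VALUES (1, lo_create(0));
```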
To clarify, if users save data as, say, JSON, they can utilize PG’s indexes. But what if the data is saved in a specific structured format, unique to a language and not XML or JSON? Can they use a general index, or would they need to develop their own extension or datatype?
I see. Using full-text search with the Simple dictionary might be a good approach. This wouldn’t alter the wording or assume any particular language. It would allow word searches within the document. Yet, post-search filtering would be needed. To mimic our handling of JSON where we recognize the format and store keys/values, I’d recommend converting their data to JSON and utilizing our JSON capabilities. Otherwise, the full-text search would be the alternative.
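A sketch of the full-text search approach with the simple dictionary; the table and column names are hypothetical:

```sql
CREATE TABLE payloads (id bigint PRIMARY KEY, body text);

-- 'simple' does no stemming or stop-word removal, so it makes no
-- assumptions about the language of the serialized data.
CREATE INDEX payloads_fts_idx ON payloads
    USING GIN (to_tsvector('simple', body));

-- Word search inside the document; as noted above, matches may
-- still need post-filtering in the application.
SELECT id FROM payloads
WHERE to_tsvector('simple', body) @@ plainto_tsquery('simple', 'alice');
```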
Expanding on that: Suppose you have a JSON document suitable for JSONB with minor exceptions, like needing specific sorting or filtering methods, or wanting to preserve field order. We could employ various hacks, like ordering fields correctly or adding specific fields in the object. Postgres allows extensive customization with data types, extensions, and operator classes. Which method would be the most efficient?
If preserving field order is critical, using JSON over JSONB is advisable. However, this means the keys won’t be sorted, and indexing won’t be as efficient. One workaround could involve storing a document as both JSON and JSONB. Use JSONB for rapid lookups and indexing, and retrieve the original document from the JSON version, which is already compressed.
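A minimal sketch of that dual-column workaround; the table and column names are hypothetical:

```sql
CREATE TABLE documents (
    id        bigint PRIMARY KEY,
    doc_json  json,   -- preserves key order and duplicate keys
    doc_jsonb jsonb   -- binary form for fast lookups and indexing
);

CREATE INDEX documents_jsonb_idx ON documents USING GIN (doc_jsonb);

-- Filter via the indexed JSONB copy, return the order-preserving original.
SELECT doc_json FROM documents WHERE doc_jsonb @> '{"status": "active"}';
```

The cost of this design is that every document is stored twice, so writes and storage roughly double in exchange for fast indexed lookups plus a faithful original.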
As a real-world example, in FerretDB, we store envelope fields separately and within the document. While it’s a bit of a workaround, it functions well. A significant challenge arises when trying to emulate MongoDB on PostgreSQL, as both have distinct sorting and filtering mechanisms. To resolve this, would we need custom extensions, operator classes, or just SQL stored procedures?
Your challenge would likely require modifying how we store JSON data internally, recording the key positions alongside each key. This would change indexing and storage requirements. It’s a complex requirement, and that’s probably why it hasn’t been pursued much.
You mentioned that minor updates to the TOAST table trigger the creation of a new row. Does this mean an update is essentially the same as an insert?
In Postgres, every update does create a new main table row. If the TOAST table doesn’t need modification, no copy is made. Instead, the new row simply points to the existing TOAST entry. However, if the JSON document is modified, then it’s similar to an insert. Despite this, our performance remains strong. In Postgres, every update is essentially like an insert.
That clarifies things. Thank you for addressing our queries and to the listeners for their insightful questions. It was a productive conversation. Until next time. Goodbye.
Explore the fusion of document and relational databases. Bridging Two Worlds: An Introduction to Document-Relational Databases.
Delve deep into the architecture of Apache Cassandra, a distributed key-value store that serves as a critical component in managing and analyzing large volumes of data in high-scale applications, using Nvidia as a case study.
Explore the world of open standards and licensing in the dynamic realm of database technology. Join Mark Stone in discussing the future of document databases.
Discover FerretDB, an open-source MongoDB alternative. Explore its vision, compatibility, and roadmap for agile databases.
Discover the incredible capabilities of Postgres for document storage. Explore Postgres Document Storage for efficient data management.
Document databases revolutionized data in the last decade. Sachin Sinha (BangDB) delves into their convergence with Graph, Stream, & AI, highlighting benefits and emerging challenges.
A blog post by David Murphy (Udemy) about the benefits and use cases of document databases across various technologies.
Migration to document-oriented databases: best practices and common mistakes.
Document Databases: Introduction, flexible schema, document model and JSON documents.
Relational Databases: review, normalization, SQL language and joins.