Here’s a round-up of the big database influences of 2024 (vector stores, GraphQL, open table formats and more) and what they portend for 2025.
Jan 2nd, 2025 6:28am
For databases, 2024 was a year in which both classic capabilities and new features took priority at enterprises everywhere. As the pressure on organizations to operate in a data-driven fashion increased, new and evolving data protection regulations around the world necessitated better governance, usage, and organization of that data.
And, of course, the rise of artificial intelligence shined an even brighter light on the importance of data accuracy and hygiene, which make it possible to construct accurate, customized AI models and contextualized prompts. Key to managing all of this have been the databases themselves, be they relational, NoSQL, multimodel, specialized, operational, or analytical.
In 2025, databases will only continue to grow in importance. Even if today’s incumbent databases are eventually eclipsed, the need for a category of platforms that optimize the storage and retrieval of data will persist, because such platforms are the essential infrastructure for high-value, intelligent systems.
Remember, there’s nothing magic or ethereal about “data.” Data is simply a set of point-in-time recordings of things that happened, be they temperatures that changed, purchases that were made, links that were clicked, or stock levels that went up or down. Data is just a reflection of all the people, organizations, machines, and processes in the world. Tracking the current, past, and expected future state of all these entities, which is what databases do, is a timeless requirement.
The most dominant database platforms have been with us for decades and have achieved that longevity by adopting new features reflecting the tech world around them while staying true to their core mission of storage and querying with the best possible performance.
Decades ago, it was already apparent that all business software applications were also database applications, and that’s no less true today. But that truth has now expanded beyond applications to include embedded software at the edge for IoT, APIs in the cloud for services and SaaS offerings, and specialized cloud infrastructure for AI inferencing and retrieval.
What’s Our Vector, Victor?
One big change of late, and one that will continue into 2025, is the use of databases to store so-called vectors. A vector is a numerical representation of something complex. In physics, a vector can be as simple as a magnitude paired with a direction. In the data science world, a vector can be a concatenated encoding of machine learning model feature values.
In the generative AI world, the complex entities represented by vectors include the semantics and content of documents, images, audio, and video files, or pieces (“chunks”) thereof. A big trend that started in past years but gained significant momentum in 2024 and that will increase in 2025 is the use of mainstream databases to store, index, search and retrieve vectors. Some databases are serving as platforms on which to generate these vector embeddings, as well.
This goes beyond the business-as-usual practice of operational database players adding to their feature bloat. In this case, it’s a competitive move meant to counter vector database pure-play vendors like Pinecone, Zilliz, Weaviate and others. The big incumbent database platforms, including Microsoft SQL Server, Oracle Database, PostgreSQL, and MySQL on the relational side, and MongoDB, DataStax/Apache Cassandra, Microsoft Cosmos DB, and Amazon DocumentDB/DynamoDB on the NoSQL/multimodel side, have all added vector capabilities to their platforms.
These capabilities usually start with the addition of a few functions to the platform’s SQL dialect to determine vector distance and then extend to support for a native VECTOR data type, including string and binary implementations. Many platforms are also adding explicit support for the retrieval augmented generation (RAG) programming pattern that uses vectors to contextualize the prompts sent to large language models (LLMs).
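To make that concrete, here is a minimal sketch of the pattern using PostgreSQL with the pgvector extension. The table, column, and toy three-dimension vectors are invented for illustration; real embeddings typically have hundreds or thousands of dimensions.

```sql
-- Enable the extension that adds the VECTOR type and distance operators
CREATE EXTENSION IF NOT EXISTS vector;

-- A table of document chunks, each paired with its embedding
CREATE TABLE doc_chunks (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(3)  -- toy dimensionality for readability
);

INSERT INTO doc_chunks (content, embedding)
VALUES ('Q3 revenue grew 12%', '[0.12, -0.33, 0.95]');

-- Nearest-neighbor retrieval by cosine distance (the <=> operator);
-- this retrieval step is the "R" in RAG
SELECT content
FROM doc_chunks
ORDER BY embedding <=> '[0.10, -0.30, 0.90]'
LIMIT 5;
```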
Where does this leave specialist vector databases? It’s hard to say. While those platforms will emphasize their higher-end features, the incumbents will point out that using their platforms for vector storage and search will help avoid the complexity that adopting an additional, purpose-specific database platform can bring.
GenAI for Developers and Analysts
Vector capabilities are not the only touch point between databases and generative AI, to be sure. Sometimes, the impact of AI is not on the database itself but rather on the tooling around it. In that arena, the biggest tech at the intersection of databases and generative AI (GenAI) is a range of natural language-to-SQL interfaces. The ability to query a database using natural language is now so prevalent that it has become a “table stakes” capability. But there’s a lot of room left for innovation here.
For example, Microsoft provides not just a chatbot for generating SQL queries but also allows inline comments in SQL scripts to function as GenAI prompts that can generate whole blocks of code. It also provides on-the-fly code completion that pops up as developers compose their code.
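As a hypothetical illustration of the comment-as-prompt pattern (the schema and names below are invented, and actual generated code will vary):

```sql
-- Prompt, written as an ordinary comment:
-- Show the top 10 customers by total order value in 2024
-- The assistant then drafts a query like the following beneath it:
SELECT TOP 10
    c.customer_name,
    SUM(o.order_total) AS total_value
FROM dbo.customers AS c
JOIN dbo.orders AS o
    ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
  AND o.order_date < '2025-01-01'
GROUP BY c.customer_name
ORDER BY total_value DESC;
```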
On the analytics side, Microsoft Fabric Copilot technology lends a hand in notebooks, pipelines, dataflows, real-time intelligence assets, data warehouses and Power BI, both for reports and DAX queries. DAX (Data Analysis eXpressions) is Microsoft’s specialized query language for Power BI and the Tabular mode of SQL Server Analysis Services. It’s notoriously hard to write, in the opinion of many (including this author), and GenAI technology makes it much more accessible.
Speaking of BI, analytical databases have AI relevance, too. In fact, in mid-2024, OpenAI acquired Rockset, a company with one such platform, built on the open source RocksDB project, to accelerate its platform’s retrieval performance. Snowflake, a relational cloud data warehouse platform, also supports a native VECTOR type along with vector similarity functions, and its Cortex Search engine supports vector search operations. Snowflake also supports six different AI embedding models directly within its own platform. Other data warehouse platforms, including Google BigQuery, support vector embeddings. On the data lakehouse side, Databricks is in the vector game, too.
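For a sense of what in-platform embedding generation looks like, here is a rough sketch in Snowflake’s SQL dialect, using a built-in Cortex embedding model and a vector similarity function. The table, column, and query text are invented, and model availability should be checked against Snowflake’s current documentation.

```sql
-- A table whose embeddings are generated inside the platform
CREATE TABLE support_tickets (
    id        INT,
    body      VARCHAR,
    embedding VECTOR(FLOAT, 768)
);

-- Populate embeddings with a built-in 768-dimension Cortex model
UPDATE support_tickets
SET embedding = SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', body);

-- Rank tickets by cosine similarity to an embedded search phrase
SELECT body,
       VECTOR_COSINE_SIMILARITY(
           embedding,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m',
                                           'billing dispute')) AS score
FROM support_tickets
ORDER BY score DESC
LIMIT 5;
```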
‘OLxP’
Staying with analytical databases for a minute, another trend to watch in 2025 will be the bringing together of analytical and operational databases. This fusion of OLTP (online transactional processing) and OLAP (online analytical processing) is sometimes called operational analytics. It also garners names like “translytical” and HTAP (hybrid transactional/analytical processing).
No matter what you call it, many vendors are bullish on the idea. This includes SAP, whose HANA platform was premised on it, and SingleStore, whose very name (changed from MemSQL in 2020) references the platform’s ability to handle both. Snowflake’s Unistore and Hybrid Tables features are designed for this use case as well. Databricks’ Online Tables also use a rowstore structure, though they’re designed for feature store and vector store operations, rather than OLTP.
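Snowflake’s Hybrid Tables give a feel for the pattern: a row-oriented table with a required primary key and optional secondary indexes for fast point operations, which can still be queried analytically alongside conventional columnar tables. A minimal sketch, with invented names:

```sql
-- Row-oriented storage, required primary key, optional secondary index
CREATE HYBRID TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    status      VARCHAR,
    INDEX idx_customer (customer_id)
);

-- Fast transactional point lookup...
SELECT status FROM orders WHERE order_id = 42;

-- ...and the same table participates in analytical queries
SELECT status, COUNT(*) FROM orders GROUP BY status;
```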
Not everyone is enamored of this concept, however. For example, MongoDB announced in September 2024 that its Atlas Data Lake feature, which never made it out of preview, is being deprecated. MongoDB seems to be the lone contrarian here, though.
Data APIs and Mobile
That’s not the only territory from which MongoDB has retreated while others charge in. MongoDB also announced the deprecation of its Atlas GraphQL API. Meanwhile, Oracle REST Data Services (ORDS), Microsoft’s Data API builder, and AWS AppSync add GraphQL capabilities to Oracle Autonomous Database/Oracle 23ai; Microsoft Azure SQL Database, Cosmos DB and Azure Database for PostgreSQL; and Amazon DynamoDB/Aurora/RDS, respectively.
What about mobile databases? At one time, they were a major focus area for Couchbase, for Microsoft, and for MongoDB, with its Atlas Device SDKs/Realm platform. Couchbase Mobile is still a thing, but Microsoft Azure SQL Edge at least nominally shifts the company’s focus to IoT use cases, and MongoDB has officially deprecated Atlas Realm, its Device SDKs and Device Sync (though the on-device mobile database will continue to exist as an open source project). It’s starting to look like purpose-built small-footprint databases, including SQLite, and perhaps Google‘s Firebase, have withstood the shakeout here. Clearly, using one database platform for every single use case is not always an efficacious choice.
Multimodel or Single Platform?
Is the same true for NoSQL/multimodel databases, or can conventional relational databases be a one-stop shop for customers’ needs? It’s hard to say. Platforms like SQL Server, Oracle and Db2 added graph capabilities in past years, but adoption of them would seem to be modest. Platforms like MongoDB and Couchbase still dominate the document store world. Cassandra and DataStax are still big in the column-family category, and Neo4j, after years of competitive challenges, still seems to be king of the graph databases.
But the RDBMS jack-of-all-trades phenomenon isn’t all a mirage. Mainstream relational databases have bulked up their native JSON support immensely, with Microsoft having introduced in preview in 2024 a native JSON data type on Azure SQL Database and Azure SQL Managed Instance. Microsoft also announced at its Ignite conference in November 2024 that SQL Server 2025 (now in private preview) will support a native JSON data type as well.
Oracle Database, MySQL, Postgres and others have for some time now had robust JSON implementations too. And even if full-scale graph implementations in mainstream databases have had lackluster success, various in-memory capabilities in the major database platforms have nicely ridden out the storm.
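As a quick sketch of what a native JSON type enables, in T-SQL-style syntax with invented table and column names (functions like JSON_VALUE have long worked on JSON stored as text; the new part is the dedicated, validated type):

```sql
-- A column declared as json is stored and validated as JSON,
-- rather than as unchecked text in a varchar column
CREATE TABLE orders (
    id      int PRIMARY KEY,
    details json
);

INSERT INTO orders
VALUES (1, '{"customer": "Acme", "items": [{"sku": "A1", "qty": 2}]}');

-- Extract scalar values with the standard JSON functions
SELECT JSON_VALUE(details, '$.customer') AS customer
FROM orders;
```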
Multimodel NoSQL has shown real staying power as well. Microsoft’s Cosmos DB supports document, graph, column-family, native NoSQL and full-on Postgres relational capabilities, in a single umbrella platform. Similarly, DataStax explicitly supports column-family, tabular, graph and vector, while Couchbase supports document and key-value modes.
Data Lakes, Data Federation and Open Table Formats
One last area to examine is that of data virtualization and federation, along with increasing industry-wide support for open table formats. The requirement of cross-data-source querying has existed for some time. Decades ago, the technology existed in client-server databases for querying external databases, with technologies like Oracle Database Links, Sybase Remote Servers and Microsoft SQL Server linked servers. Similarly, a killer feature of Microsoft Access over 30 years ago was its Jet database engine’s remote table functionality, which could connect to data in CSV files, spreadsheets, and other databases.
With the advent of Hadoop and data in its storage layer (i.e., what later came to be known as data lakes), bridging conventional databases to “big data” became a priority, too. The concept of an “external” table began with a technology called SQL-H in the long-gone Aster Database (its maker was acquired by Teradata in 2011). By altering the CREATE TABLE syntax, a logical representation of a remote table could be created without physically copying the data, and the query engine could still treat it as part of the local database.
Treating remote data as local is also called data virtualization. Joining a remote table and a local one (or multiple remote tables across different databases) in a single SQL SELECT is called executing a federated query. To varying degrees, both data virtualization and federation have been elegant in theory but often lacking in performance.
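Here is a sketch of both ideas in SQL Server PolyBase-style syntax; the data source, file format, and table names are invented, and the exact DDL varies considerably by platform.

```sql
-- Virtualization: a logical, local-looking table over remote Parquet files
CREATE EXTERNAL TABLE dbo.web_clicks (
    user_id bigint,
    url     varchar(400),
    ts      datetime2
)
WITH (
    LOCATION    = '/clickstream/2024/',  -- path in the remote store
    DATA_SOURCE = my_data_lake,          -- previously defined external source
    FILE_FORMAT = parquet_format         -- previously defined file format
);

-- Federation: joining the remote table to a local one in a single query;
-- in practice, such joins often pay a steep performance penalty
SELECT c.customer_name, COUNT(*) AS clicks
FROM dbo.web_clicks AS w
JOIN dbo.customers AS c
    ON c.user_id = w.user_id
GROUP BY c.customer_name;
```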
To help address this, open table formats have come along, and in 2024 they became very important. The top contenders are Delta Lake and Apache Iceberg, with Apache Hudi coming in as somewhat of an also-ran. In practical terms, all three are based on Apache Parquet. Unlike CSV, other text-based file formats, or even Parquet itself, data stored in open table formats can be queried, updated and managed in a high-performance, high-consistency manner.
In fact, for its Fabric platform, Microsoft reworked both its original data warehouse engine and its Power BI engine to use Delta Lake as a native format. Snowflake did the same with Iceberg and other vendors have followed suit. Meanwhile, there are still a variety of database platforms that connect to data stored in open table formats as external tables, rather than as truly native ones.
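To illustrate why the native treatment matters, here is a minimal sketch in Spark SQL with Delta Lake, with invented table and column names: the data lives in open, Parquet-based files, yet supports transactional updates that a bare directory of Parquet files cannot.

```sql
-- An open-format table: Parquet data files plus a transaction log
CREATE TABLE sales (
    order_id bigint,
    amount   decimal(10, 2),
    sold_at  timestamp
) USING DELTA;

-- An ACID update in place, something raw Parquet or CSV cannot offer
UPDATE sales
SET amount = amount * 0.9
WHERE sold_at < '2024-01-01';
```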
Next year, look for open table format support to become increasingly robust, and get ready to devise a data strategy based upon it. With support for these formats, there’s a good chance that many, if not most, database engines will be able to share the same physical data stored in these formats, query it at native speeds, and operate on it in both a read and write capacity. Proprietary formats may slowly be giving way, and future platforms may succeed more on their innate capabilities than on the feedback loop between their dominance and the resulting data gravity.
Feature and Skillset Equilibrium
Ultimately, a single database platform cannot be all things to all customers and users. Many of the incumbent platforms try to support the full complement of use cases, but most end up having a litany of new features added over the years that turn out to be more gimmick than mainstay.
Luckily, the newfangled features in incumbent databases tend to go through a Darwinian process, eventually distilling down to a core set of capabilities that likely won’t achieve parity with those of specialized database platforms but will nonetheless be sufficient for a majority of customers’ needs. As the superfluous capabilities are whittled away, important workhorse features get onboarded, adopted, and added to the mainstream database and application development canon.
The market works as it should: incumbents add features in response to competitive pressures from new players that they likely would not have added on their own initiative. This allows room for innovative new players but also lets customers who wish to leverage their investments in existing platforms do just that.
It’s interesting how the fundamentals of relational algebra, SQL queries, and the like have stayed relevant over decades, but the utility, interoperability and applicability of databases keep increasing. It means the skillsets and technologies are investments that are not only safe but also, like good financial investments, grow in value and pay dividends.
Some customers will want a first-mover advantage, and so brand-new, innovative platforms will appeal to them. But many customers won’t want to re-platform or trade away their skillset investments, and will prefer that vendors widen the existing road rather than introduce detours in it. Those customers should demand such approaches and place their bets with the vendors that embrace them. And in 2025, they should expect vendors to welcome and accommodate them.