Let's talk about something I'm very excited about: tech. We've all seen inventions rise and fall fast in IT, and it's difficult to predict which will make it and which won't. Some tech, however, can blow your mind when you first discover it and make you feel certain it'll be part of the future. I think that's the case with Artificial Intelligence (AI), even though AI is not new. What's new is that we have a tremendous amount of data that makes this technology even more precise and impressive. So, let's talk about data, Postgres and AI!
Based on data growth between 2010 and 2020, Statista estimates that the volume of data/information created, captured, copied and consumed worldwide will reach 181 zettabytes by 2025 (1 ZB = 1 million PB). Given this huge amount of data, I don't see why companies wouldn't invest in data capabilities: what's the point of spending money to obtain this data if we don't leverage it?
That's always been the dream behind business analytics: trying to guess the future from past and present data. Do you think the name "Oracle" was created on a whim? In the same vein, Cassandra was named after the Trojan priestess who could utter true prophecies (even though she was cursed and no one believed her).
Without going that far, a few years ago I worked on a very old project. Right after World War II, in France, child mortality was very high. The French government ordered a study comparing the situation in France with other European countries. The lack of milk was identified as one of the main reasons for this. France needed more cows and more efficiency to increase milk production.
Since then, farmers have been collecting data, which is centralized and analyzed by public organizations. We have so much data, and the information is so detailed, that when a female calf is born we can predict not only the quantity of milk she will produce over her whole life, but also the quality of her milk (percentage of fat, protein, etc.), with minimal error. The predictions are precise because the sheer volume of data averages out the errors, even when some of the data is bad. Today, people would label this project AI, even though it's really SQL queries in a database and statistics. But when you think deeply about what's behind AI, isn't it simply querying data and calculating statistics?
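To make that claim concrete, here is a minimal sketch of what such a "prediction" can look like in plain SQL. The `milk_records` table and its columns are hypothetical, invented for illustration, not taken from the actual French study:

```sql
-- Hypothetical table of per-cow milk records; all names are illustrative.
-- "Predicting" the expected yield of a lineage boils down to aggregates:
SELECT sire_id,
       avg(liters_per_year)         AS expected_yield,
       stddev_samp(liters_per_year) AS yield_spread,
       avg(fat_pct)                 AS expected_fat_pct
FROM   milk_records
GROUP  BY sire_id
HAVING count(*) > 100;  -- only trust lineages with enough data points
```

The `HAVING` clause captures the point made above: with enough rows per group, the average washes out individual measurement errors.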
Of course, as I said, you need a lot of data, and this data needs to be as clean as possible (not too dirty) or your predictions for the future will be completely wrong! This is where relational database management systems (RDBMS) can be invaluable: they provide safe concurrent access and data management (data can easily become corrupted or wrong if not managed correctly), and they enforce data integrity through data domains, data types, constraints, etc.
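As a hedged sketch of what "integrity through constraints" means in practice, here is a hypothetical Postgres table (the schema and column names are mine, chosen to echo the dairy example above):

```sql
-- Hypothetical schema: Postgres constraints keep training data clean at write time.
CREATE TABLE cow (
    id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    born_on         date    NOT NULL CHECK (born_on <= current_date),
    fat_pct         numeric(4,2)     CHECK (fat_pct BETWEEN 0 AND 15),
    liters_per_year integer          CHECK (liters_per_year >= 0)
);

-- A bad row is rejected immediately instead of silently polluting later analysis:
-- INSERT INTO cow (born_on, fat_pct, liters_per_year)
--     VALUES ('2030-01-01', 42, -5);  -- fails all three CHECK constraints
```

Rejecting bad rows at insert time is far cheaper than discovering, months later, that a model was trained on impossible values.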
Here's one example of bad data leading to inaccurate predictions: during the early stages of the COVID-19 pandemic, doctors were overworked, especially those able to read chest X-rays to determine whether a patient had COVID. Researchers decided to train an AI model to distinguish COVID-damaged lungs from healthy lungs. The problem was that healthy adult chest X-rays were very difficult to find, so the researchers trained their model on children's X-rays, as children's lungs are typically much healthier than adults'. After several days of training, the AI had found a shortcut: it could distinguish, with a high success rate, children's chest X-rays from adults'. (Source here.)
Postgres is an old project. I'd even say a very old project, as the design note for Postgres was published in 1986, and the first version was released in 1989. This project reflects more than 30 years of intense reflection, refactoring, redesigning and extending capabilities! This explains why the Postgres Development Group includes very senior, experienced developers who know what they're doing and can help less experienced developers write better code.
Postgres has another characteristic: it was designed for stability first. That's why the first versions of Postgres were feature-poor. I believe Postgres has now reached the point where everyone knows it's stable. No wonder developers voted Postgres the most popular, most admired, most desired and most used database in Stack Overflow's 2023 survey! To maintain this stability, the project has several rules that allow no exceptions:
- There will never be new features in a minor release
- A feature will only be added to a major version if it is stable enough (meaning it has been tested thoroughly, there is no reasonable doubt it's safe, and there are no known bugs)
For Postgres 15, we were looking forward to JSON tables (a standard SQL feature), but this feature was removed at the last minute as it wasn't ready for production. Some of these changes made it into Postgres 16, and we hope the rest will land in Postgres 17, but there's no guarantee. What we can guarantee is that the feature will ship once we think it's production-ready.
Last but not least, Postgres is extensible. That's the second point of the design note from 1986, which emphasizes that Postgres "will provide extensibility for data types, operators and access methods". The next section of the note explains that the main goal of extensibility is that "the DBMS can be used in new application domains". It seems only natural, then, to make Postgres AI-ready, right?
As I mentioned, the tech world is often influenced by trends, and it's sometimes difficult to predict whether these trends will stick around. I'd say AI is here to stay because (brace for shocking news) it's not new at all!
In addition to working as a database consultant, I also teach databases at university. This year marks a change in how I grade my students, because in June 2023, 75% of the reports I received from my students were totally or partly written by ChatGPT! As my job does not consist of grading ChatGPT on its knowledge of Postgres (which could be better), I decided that this year's students will have to present a tech talk on a topic I'll give them beforehand.
Back when I was in engineering school (in the early 2000s), my team's project was to train a neural network to play checkers. Due to our lack of time and resources (CPUs weren't what they are now), we could only train our AI for 30 moves. It played quite well for those first 30 moves, then played at random, so any reasonable human could beat it simply by avoiding a loss during the opening 30 moves.
Another old example of AI is Deep Blue, the computer famous for beating Garry Kasparov in an epic contest of six chess games. What you might not know is that this was a rematch: Kasparov had already beaten Deep Blue in a previous contest. Some moves are still considered controversial (in particular, a pawn move where the computer declined a material advantage in favor of a strategic one, something no computer had ever done before; see a complete analysis here). This happened in the 1990s!
There’s more than one kind of artificial intelligence. The different types include:
- Reactive AI: This type of AI will only react to some events (like how Deep Blue reacts to a chess move, or an autonomous car reacts to outside events like a speed limit, a pedestrian, etc.)
- Generic AI: This type of AI is used for chatbots. It will identify keywords and give predefined answers based on those keywords.
- Limited AI: This is when an AI is limited to a specific domain, like banking fraud detection, for example. Another good example of limited AI is Amazon Alexa. It can only perform a very specific and limited set of actions.
- Super Smart AI: This is what every human imagines when they read or hear the words Artificial Intelligence (like HAL in the iconic movie 2001: A Space Odyssey). This kind of AI is supposed to be better than the human brain. It does not exist yet and won't exist for a long time.
So, what can we do when we put everything together?
The human brain is limited when it comes to analyzing large amounts of data. With its limited capacity, our brain tries to summarize the data, whereas AI can find patterns in huge datasets that our brains cannot process. As humans, we will still have to be careful with the patterns AI finds, as the AI might have difficulty excluding certain hypotheses and may confuse correlation with causation. For example, in France, 57% of deaths occur in a hospital bed. Does this mean hospital beds are dangerous for humanity? We all know that's not the real reason behind that number, and an AI will have to be trained to learn that it's not the case.
Still, AI can help you analyze your data and find patterns. As a general-purpose database, Postgres is, I think, the ideal candidate to help with this task. In fact, we already have tools built on top of Postgres to perform these kinds of queries. For example, the very popular pgvector extension lets you store vectors and perform similarity searches, supporting exact and approximate nearest-neighbor search with L2 distance, inner product, and cosine distance.
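As a minimal sketch of what that looks like (assuming pgvector is installed; the `items` table and its tiny 3-dimensional vectors are purely illustrative):

```sql
-- Minimal pgvector sketch; table name and dimensions are illustrative.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding vector(3)  -- real embeddings typically have hundreds of dimensions
);

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- Exact nearest-neighbor search by L2 distance (<->).
-- <#> gives the (negative) inner product and <=> the cosine distance.
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

-- An approximate index (IVFFlat here) trades a little accuracy for speed:
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
```

Without the index, the `ORDER BY ... LIMIT` query scans every row and is exact; with it, searches become approximate but scale to millions of vectors.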
We also have tools like EvaDB that will connect to your relational database and perform SQL queries on pre-trained models like Hugging Face, OpenAI, YOLO, and PyTorch, for example.
But what if we look at the problem the other way around and use AI to make Postgres better? As an experienced Postgres expert, I'm excited by this idea! Optimization, for example, is one of the most difficult tasks: a human brain has to focus on a small set of queries to optimize a system, but we know that optimizing one small part of a system can degrade global performance! With its large-scale view, AI could suggest better ideas.
I have a lot of examples in mind, but just think what we could do with automatic tuning of Postgres configuration parameters, automatic indexing (and I mean dropping indexes as well as creating them), a better optimizer for Postgres, or maybe suggestions for an architecture or a data model. We could even build a more natural language for querying our data.
I believe that combining AI with PostgreSQL will result in endless opportunities, especially when it comes to enhancing the extensibility and flexibility of PostgreSQL to make it even greater and relevant for even more domains and use cases. AI's evolution, from Reactive AI to Limited AI, opens the door to enhanced data analysis capabilities, and it seems to me that Postgres is the ideal tool for making this future a reality.