A new tool makes it easier for database users to perform complex statistical analyses of tabular data without needing to understand the underlying processes.
GenSQL, a generative AI system for databases, can help users make predictions, detect anomalies, estimate missing values, fix errors, or create synthetic data with just a few keystrokes.
For example, if the system analyzes medical data from a patient who usually has high blood pressure, it can identify a low blood pressure reading that is unusual for that patient, even if it falls within the normal range.
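To make the blood-pressure example concrete, here is a minimal Python sketch of the underlying idea: judging a reading against a patient's own history rather than a population-wide range. It is not GenSQL code and does not reflect how GenSQL's models work internally. The function name, the readings, and the three-standard-deviation threshold are all assumptions made for illustration, and a single Gaussian per patient is a deliberately crude stand-in for a real probabilistic model.

```python
from statistics import mean, stdev


def is_unusual_for_patient(history, reading, threshold=3.0):
    """Return True if `reading` is more than `threshold` standard
    deviations from the mean of this patient's own past readings.

    Hypothetical helper: a one-variable Gaussian stand-in for a
    richer generative model.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past readings identical: any different value is unusual.
        return reading != mu
    return abs(reading - mu) / sigma > threshold


# Made-up systolic readings (mmHg) for a patient who usually runs high.
history = [150, 155, 148, 152, 158, 151, 149, 154]

# 115 mmHg would look ordinary in a population-wide range check,
# but it is far below this patient's usual values.
print(is_unusual_for_patient(history, 115))  # prints True
```

A fixed population-wide range check would pass the 115 reading without comment; conditioning on the individual's history is what surfaces it.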
GenSQL automatically combines a tabular dataset with a generative probabilistic AI model, which can handle uncertainty and adjust its decision-making based on new data.
Moreover, GenSQL can be used to create and analyze synthetic data that imitates real data in a database. This is especially useful when sensitive data cannot be shared, like patient health records, or when real data is limited.
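The idea of synthetic data can be sketched in a few lines of Python: fit a simple generative model to a table, then draw new rows from the model instead of sharing the original ones. This is an illustrative toy under stated assumptions, not GenSQL's approach. The two-column table, the column meanings, and the helper names `fit_model` and `sample_synthetic` are hypothetical, and the model (a category frequency plus one Gaussian per category) is far simpler than the probabilistic programs the researchers describe.

```python
import random
from statistics import mean, stdev


def fit_model(rows):
    """Fit a tiny generative model to (group, value) rows:
    the frequency of each group, and a Gaussian over value per group."""
    by_group = {}
    for group, value in rows:
        by_group.setdefault(group, []).append(value)
    total = len(rows)
    return {
        g: {"weight": len(vs) / total, "mu": mean(vs), "sigma": stdev(vs)}
        for g, vs in by_group.items()
    }


def sample_synthetic(model, n, seed=0):
    """Draw n synthetic (group, value) rows from the fitted model."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    groups = list(model)
    weights = [model[g]["weight"] for g in groups]
    out = []
    for _ in range(n):
        g = rng.choices(groups, weights=weights)[0]
        out.append((g, rng.gauss(model[g]["mu"], model[g]["sigma"])))
    return out


# Made-up table: (diagnosis group, systolic blood pressure in mmHg).
rows = [
    ("hypertensive", 152), ("hypertensive", 158), ("hypertensive", 149),
    ("normal", 118), ("normal", 122), ("normal", 115), ("normal", 121),
]

model = fit_model(rows)
synthetic = sample_synthetic(model, 5)
for row in synthetic:
    print(row)
```

The synthetic rows follow the same broad patterns as the originals (which groups occur, and roughly what values each group takes) without reproducing any individual record. A toy like this carries no formal privacy guarantee.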
This new tool is built on SQL, a programming language for creating and managing databases, introduced in the late 1970s and used by millions of developers worldwide.
“Historically, SQL showed the business world what a computer could do. They didn’t have to write custom programs; they just had to ask questions of a database in a high-level language. We believe that as we move from just querying data to asking questions of models and data, we’ll need a similar language that teaches people the meaningful questions they can ask a computer with a probabilistic model of the data,” says Vikash Mansinghka ’05, MEng ’09, PhD ’09, senior author of a paper introducing GenSQL and a principal research scientist leading the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences.
When the researchers compared GenSQL to popular AI-based methods for data analysis, they found that it was both faster and more accurate. Importantly, the probabilistic models used by GenSQL are easy to understand, so users can read and edit them.
“Just looking at data and trying to find patterns using simple statistical rules might miss important interactions. You need to capture the correlations and dependencies of the variables, which can be quite complex, in a model. With GenSQL, we aim to allow many users to query their data and models without needing to know all the details,” says lead author Mathieu Huot, a research scientist in the Department of Brain and Cognitive Sciences and member of the Probabilistic Computing Project.
They are joined on the paper by Matin Ghavami and Alexander Lew, MIT graduate students; Cameron Freer, a research scientist; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, an MIT professor in the Department of Electrical Engineering and Computer Science and member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad ’15, MEng ’16, PhD ’22, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.