Data-centric project requirements?
Blog: Strategic Structures
Several times this week, in different circumstances, I was asked a question having these three words close together. That’s not new. It happened previously. But this concentration triggered the write-up that follows. Nothing original and neither is the reason to write it:
Everything that needs to be said has already been said. But since no one was listening, everything must be said again.
— André Gide
Let’s first clarify what data-centric is and then see why it doesn’t go well with “project” and even less so with “requirements”.
What is data-centric?
The short answer is in these three[1] principles:
- Data is self-describing and does not rely on an application for interpretation and meaning.
- Data is expressed in open, non-proprietary formats.
- Applications are allowed to visit the data, perform their magic and express the results of their process back into the data layer for all to share.
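To make the three principles a bit more tangible, here is a minimal sketch in Python. It is an illustration under assumptions, not how any particular data-centric platform works: the data layer is a plain set of subject-predicate-object statements, and every identifier (`ex:order-17`, `ex:netAmount`, `tax_app`, `invoice_app`) is hypothetical.

```python
import json

# Shared data layer: a set of (subject, predicate, object) statements.
# Every statement names its own meaning through the predicate, so no
# application code is needed to interpret it. All names are hypothetical.
data_layer = {
    ("ex:order-17", "rdf:type", "ex:Order"),
    ("ex:order-17", "ex:netAmount", 200.0),
    ("ex:order-17", "ex:vatRate", 0.25),
}


def objects(store, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]


def tax_app(store):
    """Visit the data, compute VAT per order, write the result back."""
    orders = [s for s, p, o in store if p == "rdf:type" and o == "ex:Order"]
    for order in orders:
        net = objects(store, order, "ex:netAmount")[0]
        rate = objects(store, order, "ex:vatRate")[0]
        store.add((order, "ex:vatAmount", net * rate))


def invoice_app(store, order):
    """A second application reuses what the first one left in the data layer."""
    net = objects(store, order, "ex:netAmount")[0]
    vat = objects(store, order, "ex:vatAmount")[0]
    return net + vat


tax_app(data_layer)
total = invoice_app(data_layer, "ex:order-17")

# Open format: the whole layer serialises to plain JSON text.
exported = json.dumps(sorted(data_layer, key=str))
```

Neither application owns the data. The second one never calls the first; it only reads what is in the shared layer, and the layer can be exported without either of them.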
I think these three are the most important and are context-independent. The data-centric manifesto they are taken from currently has an enterprise-only focus, which I find unfortunate. Yes, the problem they address is most severely felt – or rather not felt because of ignoring or misattributing the symptoms – in large organizations. Yet, what is behind these principles is equally important for personal information management and on the open web. Let’s go quickly through all three levels then, from big to small, and see what data-centric means for the world wide web, for corporate IT, and then for personal information management.
The web was designed to be a decentralized system where the agreement on a few standards, basically HTTP and HTML, enabled free choice on just about anything else. People were finally free to express themselves and to choose from where and how to get information. They became free to innovate by building new browsers, websites, and whatever web applications and services they could think of. A system like this, with a self-maintained organization, can work well and have a natural tendency for virtuous cycles. In other words, it can amplify goodness and develop its own immune system for whatever threatens its viability. All it needs is to have the right kind of enabling constraints, for example, the standards I mentioned above, and to allow autonomy of all subsystems. This is the balance between autonomy and cohesion. It works for animals, people, tribes, organizations, society, and a socio-technical system like the web.
So the web flourished as a decentralized system, where people were free to choose to create more choices. And then one day the platforms appeared. They offered good and free services. Or at least they looked good and free at first. In reality, they were (and are) neither good nor free. The platforms are not nearly as good at providing information as the decentralized web before them was. What we see is not what we are looking for, but what their algorithms decide to show us. And the services of these platforms are not free. Quite the contrary. We pay with our data, and we pay twice. Once by being their content providers and a second time by giving them our personal data. Importantly, we don’t give them only our current personal data but also our future data, by allowing them to track our online behaviour. Who are they? I’m talking of course about IT giants like Google, but the best example of extreme centralization and lock-in is Facebook[2]. In this way, the web, a decentralized system shaped by its users, turned into a hyper-centralized system shaped by a few powerful corporations[3]. It also formed users’ expectations. In 2019 Facebook and Google announced that it was now possible to copy images from Facebook to Google Photos. That’s the new norm for innovation. Only a few people noted the absurdity. As Ruben Verborgh pointed out, 50 years after being able to send video signals over a distance of 380,000 km, we celebrate that we can finally move a photo by 11 km (the distance between Facebook and Google headquarters). A bit dystopian, isn’t it?
Yet, the problems with this centralization are not widely understood. For example, very few people understand how platform-based political propaganda works, but even fewer relate it to the hyper-centralization of the web. Same with fake news and so on. Maybe the least understood of the damages is how it suffocates innovation. It’s easy to illustrate. Even when you use Google for product search, where it should excel after so many years of work, huge investments and massive feedback, it’s really lame. Try searching for a bike below a certain price and certain weight. You’ll get results for bikes above that, but okay, then you use the shopping filter. Currently, that will not allow you to specify the weight even though it’s available in most technical specifications published online. But even if they add it at some point, the final selection will still exclude the majority of the offerings by smaller companies.
A way out is to decouple data and applications.
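The bike search shows what that decoupling could buy. Below is a minimal sketch, assuming every seller, big or small, published its offers in the same open structure. The `Offer` fields, the site names and the bikes are all made up for illustration; no real shop or search API is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One published bike offer in a shared, hypothetical vocabulary."""
    seller: str
    model: str
    price_eur: float
    weight_kg: float


# Hypothetical offers, as they might be published by each seller itself.
offers_by_site = {
    "bigretailer.example": [
        Offer("bigretailer.example", "Roadster 1", 950.0, 11.5),
        Offer("bigretailer.example", "Roadster Pro", 1800.0, 8.9),
    ],
    "smallworkshop.example": [
        Offer("smallworkshop.example", "Featherlight", 1150.0, 9.2),
        Offer("smallworkshop.example", "Tourer", 700.0, 13.0),
    ],
    "citybikes.example": [
        Offer("citybikes.example", "Commuter", 890.0, 9.8),
    ],
}


def find_bikes(sites, max_price_eur, max_weight_kg):
    """Filter offers from all sites by price and weight, cheapest first."""
    matches = [
        offer
        for offers in sites.values()
        for offer in offers
        if offer.price_eur <= max_price_eur and offer.weight_kg <= max_weight_kg
    ]
    return sorted(matches, key=lambda offer: offer.price_eur)


results = find_bikes(offers_by_site, max_price_eur=1200.0, max_weight_kg=10.0)
models = [offer.model for offer in results]
```

With these sample offers, `models` holds the two bikes that satisfy both limits, one of them from the small workshop. The filter is a few lines because the weight is in the data and not locked behind somebody's shopping interface.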
This was for the web. Now for enterprises.
In enterprises, for decades the applications were built in a way that the data model is separate for each application, trapped inside it, and the interpretation of the data is in the application code. The applications themselves are built based on historical functional requirements. When some change needs to be made, coming from new business needs or changed legislation, it takes months, costs huge amounts of money, and leads to increased complexity and technical debt. The same happens when two or more applications need to be integrated. This is the application-centric way of building applications. It is dominant to this day. Most big enterprises have thousands of application silos. They try to integrate the data through data warehouses, data lakes, point-to-point interfaces and APIs. All these methods provide partial and temporary solutions and add to the technical debt.
A way to solve this is to decouple data from applications.
A lot more can be said about what data-centric means at the enterprise level. If you have a bit of time to learn about it, I’d recommend watching this video. If you have more time, it’s worth going through The Data-Centric Revolution and if you have even more, read it after its predecessor Software Wasteland to get a better understanding of the size and the nature of the problem. If you have spent more than a couple of years in a big organization, you’ll find many familiar patterns.
We have this problem not only on the open web and in enterprises. We have it also with our personal information management. Our emails are trapped in one application, our documents in another, and then we keep our bookmarks disconnected from them inside the browser. We use one application to search on the web, and another to search our files. Now we can combine them, but only if we forfeit our freedom to choose our operating system and browser. If we look for something we communicated in writing, we have to remember where we wrote or read it. If we don’t, we have to search our files, our email, Twitter, Facebook, and the web. When we write a Word document we have to open it with MS Word. But when we are in Word we don’t have access to our tasks. For that, we need to go to another application.
A way to solve this is to decouple data from applications.
At all these scales, societal, organizational and personal, when it comes to managing information, we have similar kinds of problems coming from the tight application-data coupling or platform-data coupling. I will focus on organizations from now on to the end of this article, but it’s important to keep the bigger picture in mind.
What does a digital transformation from an application-centric to a data-centric enterprise look like? In a perfect world it would look something like this:
EKG stands for Enterprise Knowledge Graph. It is something that complies with these design and governance principles.
A more realistic but still ambitious transformation will keep the data of the current applications where it is but will also have it duplicated (virtualised or streamed) in the enterprise knowledge graph, where it will live an independent life, together with its semantics.
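As a rough sketch of that duplication step, assume two application silos that name the same customer differently. A small mapping per application translates its private field names into shared predicates, and the records are copied into one graph. The silos, field names, predicates and the `ex:customer/` identifier scheme are all hypothetical, and a real enterprise knowledge graph would use a proper triple store, not a Python set.

```python
# Two hypothetical application silos, each with private field names.
crm_rows = [{"cust_no": "C-1", "nm": "Acme Ltd"}]
billing_rows = [{"customer": "C-1", "open_balance": 1200.0}]


def to_triples(rows, id_field, field_map):
    """Copy application records into shared (subject, predicate, object) form.

    field_map translates an application's local field names into
    predicates that carry the meaning outside the application.
    """
    triples = set()
    for row in rows:
        subject = f"ex:customer/{row[id_field]}"
        for local_name, predicate in field_map.items():
            if local_name in row:
                triples.add((subject, predicate, row[local_name]))
    return triples


# The source applications keep their data; the graph holds a mapped copy.
graph = (
    to_triples(crm_rows, "cust_no", {"nm": "ex:legalName"})
    | to_triples(billing_rows, "customer", {"open_balance": "ex:openBalance"})
)


def describe(store, subject):
    """Collect everything the graph states about one subject."""
    return {p: o for s, p, o in store if s == subject}


customer = describe(graph, "ex:customer/C-1")
```

After the copy, `customer` combines what the two silos know about the same entity, and any application can read it without knowing that `nm` or `open_balance` ever existed.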
However, all new applications should be built in a way that they “visit the data, perform their magic and express the results of their process back into the data layer for all to share”.
While data-centric is all about decoupling applications and data, by itself, as a goal, slogan and buzzword, it is problematic. Data-centric has problems similar to those of the preceding waves of process-centric and service-centric movements. Even worse, the word itself can impede the transformation it promotes. When a data department in a big organization is promoting data-centricity inside the organization, that can be easily misconstrued as just being self-centric. If the idea is sold, there will be more budget for the data department. It’s like a planet faking strong gravity to attract matter so that it gets more mass and consequently stronger actual gravity.
What is actually needed is not data-centrism[4], just decoupling data from applications. Not even that; loose coupling would suffice. It’s also easy to explain. You want to run, but your leg is in plaster. Once the plaster is removed you can bend your knee and ankle, you can run, jump, you can walk in one direction and then abruptly change it. But no, loose coupling is the language of SOA; it’s passé[5]. Nobody would listen. Not that there are many who hear the data-centric cries, but at least they stand a chance of being echoed off the more modern knowledge graphs and FAIR principles.