Apps Break Data
Information is not a first-class citizen in corporate information systems. Worse, it is neglected. Then why do we still call them information systems? We don’t. We call them applications. And applications, quite appropriately, are built or purchased with an application-centric mindset. Consequently, data is broken into diverse fragments, tightly coupled with applications, and expensive to reassemble into coherent information entities. Software engineering and procurement practices work in sync with market forces to maintain this trend.
How does this happen? And what can be done about it?
This essay is about the corporate management of data. Had an essay on corporate data management been written a decade ago and mentioned Artificial Intelligence, that would have been a weird essay. Today it's the opposite: a weird essay is a data management essay that is not about AI. And this will be such a weird essay. I think that AI, or more concretely LLMs, came too early, before we had managed to solve the problems of data quality and interoperability.
Speaking of AI and weird, a recent paper1 reported that all LLMs have WEIRD bias. They are trained on data coming exclusively from countries that are W.E.I.R.D.: Western, Educated, Industrialized, Rich, and Democratic. For LLMs, WEIRD is the norm.
In large organizations, weird is the norm as well, and here is one manifestation:
Information systems are not about information.
Now, the question is, of course,
How could that be?
An IT investment is a chain of decisions framed between two realizations. The first is becoming aware of a business need. The second is making the change that is supposed to address that business need. In between, we have to justify the investment and make a set of choices.
The event that triggers the process is the awareness of a business need. It takes some critical mass and energy to move forward. When there is enough of both, the need turns into a business case that has to justify the investment.
The business case defines the initial scope. The scope may change later on, but its initial state marks the application boundaries. Imagine staking out a house. The strings stretched between the stakes determine where the walls of the future silo will be built.
What is important to note here is that application boundaries are historical and accidental. They are determined by past experience and chance.
Since justification is an important factor, it's worth mentioning a common business case paradox. The more value a business case claims, the higher the chance that it gets supported. But the more is promised, the lower the probability that it will be delivered. Chris Potts calls this the "Project Probability Paradox."
And IT projects fail a lot. In the early 90s, the Standish Group alarmed everyone by reporting that only 16.2% of IT projects were successful; in other words, only that fraction met their KPI targets. In the two decades that followed, things did not change much.
Failed IT projects get all the attention, but they are not responsible for the current state of corporate IT landscapes. It is created by the successful ones. How? Projects create local optima, which end up as data silos.
Famous and familiar
Once we are in the project stage, two interesting things happen. One is related to requirements, and the other to representation.
Functional requirements bias
Functional requirements get all the attention. They are what the business case promises, what users ask for, and what can be demonstrated. The non-functional requirements come second. If some of them get more attention, their visibility is due to market forces. That's the case with scalability. It was neglected before the cloud. Now, many features take care of scalability, and the cloud providers make sure they are right in your face. As long as you can afford it, you are welcome to scale.
Among non-functional requirements, scalability got lucky. Security as well. Others, not so much. Worst of all is interoperability. Nobody so far has invented a way to make it sexy or to make a profit from it.
The functional requirement bias and the neglect of the non-demonstrable interoperability work in synergy with another pattern, that of underrepresentation.
Underrepresentation
Two things are not represented in project decisions: space and time. By space, I mean that the whole of the enterprise is not represented. Basically, there is nothing to offset the tendency of a project to achieve its KPIs at the expense of enterprise-wide benefits. It is a well-known problem, and many approaches have been proposed as solutions. Disciplines like IT Governance and Enterprise Architecture (EA) were born out of these concerns. But they have a marginal effect. IT Governance is limited to a few checkpoints and rarely focuses on data and interoperability. Enterprise Architecture is called enterprise, yet enterprise architects report to the CIO, which, in combination with some other pathologies,2 makes EA dysfunctional. But even when it's not, having working software is way more important than diagrams with boxes and arrows that may point to some risk.
Not only space but also time is missing from project boards and steering committees. More precisely, the future is not represented. All projects are driven by historical functional requirements, determined by concrete business cases (good in itself, but not balanced), and are not prepared for unforeseen ones. In other words, the software that gets built is not future-proof.
High cost of change and integration

Why is this the case?
The information models of corporate applications sit in the physical layer, separate for each application, and their interpretation is hidden in the application code.
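To make that tangible, here is a minimal sketch; the record, the codes and their meanings are all hypothetical. The data as stored is a bag of opaque values, and the interpretation exists only inside one application's code:

```python
# A hypothetical order record as an application would store it.
# The structure is opaque: "st" and "cat" mean nothing by themselves.
order_row = {"id": 10452, "st": 3, "cat": "B2"}

# The interpretation lives only in this application's code.
# Without these dictionaries, no other application can make
# sense of the stored values.
STATUS = {1: "received", 2: "paid", 3: "shipped", 4: "returned"}
CATEGORY = {"B2": "business customer, EU"}

def describe(row: dict) -> str:
    """Translate the opaque codes into information."""
    return f"Order {row['id']}: {STATUS[row['st']]}, {CATEGORY[row['cat']]}"

print(describe(order_row))  # Order 10452: shipped, business customer, EU
```

When this application is replaced, the rows remain, but the meaning goes with the code.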
What does it lead to?
High cost of change, integration and migration. Or, in other words, technical debt. In each application, the data structure is fixed by the functional requirements, which are accidental and historical. The functional scope determines the data scope. But both the organization and the environment change. There are new user needs, new business needs, and new compliance requirements. Implementing a change in an IT landscape built with an application-centric mindset takes a long time.
Even a small change requires changing a database schema, which can take months and cost millions. Or there is a need to integrate the data from several applications. It can be done by building interfaces, buying data-integration platforms and data lakes, and implementing API platforms. Investing in them means the technical debt is paid for by taking even bigger loans with higher interest.
What can be done about it?
Unify and Decouple
Looking back at history, we’ll see this is not a new problem. The spread of steam engines and other machinery during the industrial revolution necessitated the production of large amounts of screws and bolts. Each manufacturer had its own view on the pitch, depth and form of screw threads, and this resulted in a large variety of threads. Exchange and replacement were limited, and repairs were difficult and expensive. To address this “evil,” Joseph Whitworth, a prominent British engineer and inventor, gathered all types of screws and bolts and compared them. Then, he presented his proposal for unification to the Institution of Civil Engineers in 1841. His “Paper on an Uniform System of Screw Threads” marked the birth of standardization.
That’s the first history lesson: When the diversity of engineering design or measurement systems creates a problem, reduce it. Unify.
Unification, however, is not the only way to deal with this kind of situation. As explained in detail in Stimuli & Responses, one way to match the variety of the environment is to reduce it. The other is to increase our own variety. We've been doing so for centuries using various information technologies. So, let's have a second look at history, but this time, the history of information technologies.
There is an interesting trend: we tend to increase the flexibility of the information technologies we use so that they can be applied in more and different ways. The history of information technologies can be read as a history of increasing decoupling. When writing was invented, it decoupled content from its only medium until then, oral speech. That opened up new possibilities: messages and stories could travel in time and space. When symbols got decoupled from the objects they represent, it allowed for new ways of thinking. Later on, the printing press enabled individual ownership of books and, in this way, decoupled the interpretation of information from authoritative sources like priests and scholars. The decoupling of software from hardware marked the era of modern computing. The decoupling of service from implementation, brought by new protocols and architectural styles, resulted in advanced web clients and applications. In summary, an effective way to amplify our variety is to use information technologies and keep finding more and smarter ways to decrease the dependencies between their components.
So, the shortest answer to what can be done to improve the abysmal situation created by the application-centric mindset is to unify and decouple. What needs to be unified is identity, structure and semantics. And what needs to be decoupled is data from applications.
Regular readers of Link & Think have already recognised that this is the balance between Cohesion and Autonomy, and that it can only work if maintained at every level. When unifying identity, structure, and semantics, decoupling is also needed, and when decoupling data from applications, it should be done in a standardised way so that they keep working together and in combinations not possible before.
Unify
The first thing to unify is identity. The FAIR principles state it as F1: (meta)data are assigned globally unique and persistent identifiers.
The standardised, established and proven way to do so is with URIs, Uniform Resource Identifiers. Typically, these are HTTP URIs, but if there is a need to decouple identity from any particular host, we now have Decentralised Identifiers (DIDs) as another kind of URI.
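As an illustration, here is a small Python sketch of minting such identifiers. The namespace data.example.com is hypothetical, and did:example is the placeholder method from the W3C DID specification, not a resolvable identifier:

```python
# A minimal sketch of minting globally unique, persistent identifiers.
# The namespace is hypothetical; nothing here resolves to a real server.
import uuid

BASE = "https://data.example.com/id/"  # hypothetical enterprise namespace

def mint_uri(entity_type: str) -> str:
    """Mint a persistent HTTP URI; the UUID makes it globally unique."""
    return f"{BASE}{entity_type}/{uuid.uuid4()}"

customer_uri = mint_uri("customer")
print(customer_uri)  # e.g. https://data.example.com/id/customer/5f1c2e...

# When identity must not depend on any host, a DID is an option.
# "did:example" is the placeholder method from the W3C DID spec.
customer_did = "did:example:123456789abcdefghi"
```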
The second thing to unify is the structure. Data is stored using heterogeneous structures. Even when using the same storage paradigm, interoperability is deficient. Different proposals exist to solve this. From what I've experienced,3 the only mature standard that effectively unifies heterogeneous data structures is RDF. I explained some of RDF's benefits in the previous post.
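To illustrate the point, here is a small sketch using the rdflib Python library; the vocabulary is schema.org, while the URIs and values are made up. Two descriptions of the same customer, serialized differently by two hypothetical applications, merge into a single graph without any mapping code:

```python
# A sketch using the rdflib library (rdflib 6+ parses JSON-LD natively).
# Two applications describe the same customer in different serializations;
# RDF merges them into one graph with no mapping code.
from rdflib import Graph

turtle_data = """
@prefix schema: <https://schema.org/> .
<https://data.example.com/id/customer/42>
    schema:name "Ada Lovelace" ;
    schema:email "ada@example.com" .
"""

jsonld_data = """
{
  "@context": {"schema": "https://schema.org/"},
  "@id": "https://data.example.com/id/customer/42",
  "schema:telephone": "+44 20 7946 0000"
}
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")    # from application A
g.parse(data=jsonld_data, format="json-ld")   # from application B

# One graph, one way to ask, whatever the original structure was.
for s, p, o in g:
    print(s, p, o)
```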
The third thing that needs to be unified is the semantics. When discussing the IT investment earlier, I wrote that it is framed between two realizations. It wasn't a problem for you to figure out that the same word, realization, is used there with two different meanings. Within that single sentence, it referred to the initiation of the investment with its first meaning (becoming aware of something) and to the implementation with its second (causing something to happen). This shift of meaning was implicit for you, but for machines, when dealing with structured data, it needs to be made explicit.4
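One way to make it explicit is to give each sense its own identifier and definition. Below is a small sketch using rdflib and the SKOS vocabulary; the concept URIs are illustrative:

```python
# A sketch of explicit semantics: each sense of "realization" gets
# its own URI and definition, so data can point at the intended sense
# instead of the ambiguous word. URIs are made up.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

EX = Namespace("https://data.example.com/concept/")

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

senses = [
    ("realization-awareness", "Becoming aware of something, e.g. of a business need."),
    ("realization-achievement", "Causing something to happen, e.g. the change an investment delivers."),
]
for slug, definition in senses:
    concept = EX[slug]
    g.add((concept, SKOS.prefLabel, Literal("realization", lang="en")))
    g.add((concept, SKOS.definition, Literal(definition, lang="en")))

print(g.serialize(format="turtle"))
```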
Decouple
Now to the second part of the answer: decoupling data from applications. The data-centric way of working is captured in the principles of the Data-Centric Manifesto. Let's take three of them:
- Data is self-describing and does not rely on an application for interpretation and meaning.
- Applications are allowed to visit the data, perform their magic and express the results of their process back into the data layer.
- Access to and security of the data is the responsibility of the enterprise data layer or the personal data vault and is not managed by applications.
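Here is a small, hypothetical taste of what such a data layer can look like in practice, using the rdflib and pyshacl Python libraries: a validation rule is expressed as a SHACL shape, which is itself just RDF data, owned by no application. All URIs are made up:

```python
# A sketch using rdflib and the pyshacl library. The validation rule
# (a SHACL shape) is itself RDF data: it lives with the data, not in
# any application. All URIs are made up.
from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix schema: <https://schema.org/> .
@prefix ex: <https://data.example.com/shape/> .

ex:CustomerShape a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [ sh:path schema:email ; sh:minCount 1 ] .
""", format="turtle")

data = Graph().parse(data="""
@prefix schema: <https://schema.org/> .
<https://data.example.com/id/customer/42> a schema:Person ;
    schema:name "Ada Lovelace" .
""", format="turtle")

# Any application visiting the data is held to the same rule.
conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: the email requirement is not met
print(report)
```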
When data is self-describing, it will be interpreted in the same way by different applications. The evolution of the applications will not affect the meaning of the data. No application will be a bottleneck. Old applications can be phased out with little impact. New applications can come and use the existing data. Data models will be simpler, and less programming code will be needed. Changes in the data will be made once and reused by different applications.

For data to be self-describing also means that the validations and the business rules live in the data layer too, expressed in a unified way along all three dimensions: identity, structure and semantics. Applications don't own the data and don't store the results of the data processing they provide. They visit the data, use it in conformance with the policies, and store the results back into the data layer. The policies (for access, usage, and so on), just like the validations and rules, are also part of the data layer and expressed in a unified way.

Since applications will no longer have their own models or control access to the data, there won't be any application-induced fragmentation. Apps won't break data.
First published on Link & Think.