Organizations need to give unstructured data its rightful place if they want to get value out of data
Blog: Capgemini CTO Blog
According to IDC, the total volume of data will reach 163 zettabytes in 2025. It is expected that 80% of this will be unstructured data. That’s a mind-boggling number, though what is even more amazing is that companies have changed the way they handle their unstructured data only marginally.
Traditionally, companies mainly used structured (meaning that it fits well within the rows and columns of a database) and internal (meaning that it is created within the organization) data. Nowadays, the part of the data that is unstructured and external is growing the fastest. Sources of external data are social media platforms such as Facebook, Twitter, and WhatsApp, but also search phrases in Google, data streams from smart devices (IoT), video streams from security cameras, or geo info used by Uber or Lyft. All these sources, as well as many others, are adding to the enormous pile of unstructured data that is available to be used and analyzed.
Obviously, unstructured data has always been part of the data used by companies, consisting of text documents, presentations, notes, and to a lesser degree, photos, videos, and images. Traditionally, this has been addressed by either storing this information in a database (as BLOB or CLOB data), or by using an enterprise content management system (ECM). The drawback of storing a contract in a database, for instance, is that it can only be stored and retrieved. It cannot be searched or edited. In this context, an ECM can be seen as the next step. It provides the ability to not only store and edit the data, but also to share it, work on it simultaneously with other people, understand changes between versions, etc. Indeed, this is already a big improvement from the standpoint of handling and leveraging unstructured data.
Unfortunately, this won’t be enough for all the unstructured data that is out there on the internet. The emergence of what is called big data has led to a landslide of new products aimed at handling unstructured data quickly, even at large volumes. Hadoop, HDFS, and MapReduce can now almost be considered household names, but there are many more products that have emerged as solutions for situations where traditional databases fall short. Document stores, key-value stores, column family stores, and graph databases are all examples of new categories of databases that help manage the large amounts of unstructured data we are seeing today. Semi-structured data, such as documents, can best be handled by using a document store. In a document store, any combination of data can be stored as is and does not have to comply with a uniform format, something unheard of in a relational database.
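To make the document store idea concrete, here is a minimal sketch of storing records "as is" without a uniform format. It is a toy in-memory illustration, not the API of any real product; the class name DocumentStore, its methods, and the sample records are all assumptions made up for this example.

```python
from typing import Any


class DocumentStore:
    """Toy in-memory document store (hypothetical, for illustration only).

    Each document is a free-form dict, so no two documents need to share
    the same fields, unlike rows in a relational table.
    """

    def __init__(self) -> None:
        self._docs: dict[int, dict[str, Any]] = {}
        self._next_id = 1

    def insert(self, doc: dict[str, Any]) -> int:
        """Store a copy of the document as is and return its generated id."""
        doc_id = self._next_id
        self._docs[doc_id] = dict(doc)
        self._next_id += 1
        return doc_id

    def find(self, **criteria: Any) -> list[dict[str, Any]]:
        """Return documents that contain every given field with the given value."""
        return [
            doc
            for doc in self._docs.values()
            if all(key in doc and doc[key] == value for key, value in criteria.items())
        ]


store = DocumentStore()
# Three documents with three different shapes; no schema is declared anywhere.
store.insert({"type": "contract", "customer": "Acme", "pages": 12})
store.insert({"type": "tweet", "author": "@someone", "text": "Great service!"})
store.insert({"type": "contract", "customer": "Globex", "signed": True})

contracts = store.find(type="contract")
print(len(contracts))  # 2
```

A real document store adds persistence, indexing, and distribution on top of this, but the point stands: documents that lack a field are simply skipped by a query on that field, rather than rejected at insert time.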
There seems to be a gap between the potential business value that unstructured data holds and day-to-day practices. The challenges behind this gap include:
- Many possible applications are in use, each supported by its own solution (wiki, ECM, sentiment analysis, etc.) that partially overlaps in both data and functionality
- Over time, many companies have implemented multiple similar solutions (multiple wikis, ECM systems, search systems, etc.)
- Different search functions that only cover part of the data, delivering irrelevant result lists
- Missing, inconsistent, or incomplete metadata (where does the data originate, and who has edited it, where, and when?)
- Same data in multiple locations with small differences – which is the correct version?
The situation described above leads to high costs for maintenance and for finding relevant data, and to a low probability of actually finding the information you want. Running into contradictory data, and the effort needed to find out which set of data is correct, are further disadvantages. In short, there is still much to be won by organizing unstructured data better.
Organizing for value out of data
Organizations that want to get value out of data need to have a solid data foundation that covers both structured and unstructured data, but achieving such a foundation requires remedying the challenges stated above. Several capabilities are needed to better manage unstructured data:
- Text parsing: enabling the interpretation of text documents
- Tagging: applying one or more labels to a document to support the categorization (and thereby the retrievability) of the data
- Semantic analysis of text, and analytics of videos, photos and images
- Generating and maintaining a taxonomy (a classification of data)
- Ability to store (large amounts of) unstructured data
- Search functionality across structured and unstructured data that goes beyond text strings, combines multiple criteria, and lets the importance of each criterion be defined
- Availability of metadata: not only describing the data itself but also providing a full data lineage
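A few of the capabilities above (tagging against a taxonomy, and search that combines several criteria with a weight per criterion) can be sketched in a handful of lines. This is only an illustration under simple assumptions: the taxonomy, the keyword matching, the weights, and all names such as Document, tag_document, and search are hypothetical, and real semantic analysis goes well beyond keyword lookup.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    tags: set[str] = field(default_factory=set)
    metadata: dict[str, str] = field(default_factory=dict)  # e.g. origin, editor


# Hypothetical two-label taxonomy: label -> keywords that trigger the label.
TAXONOMY = {
    "legal": {"contract", "liability"},
    "support": {"complaint", "refund"},
}


def tag_document(doc: Document) -> None:
    """Add every taxonomy label whose keywords occur in the document text."""
    words = set(re.findall(r"[a-z]+", doc.text.lower()))
    for label, keywords in TAXONOMY.items():
        if words & keywords:
            doc.tags.add(label)


def search(docs: list[Document], term: str, wanted_tag: str,
           weights: dict[str, float]) -> list[str]:
    """Rank documents on two criteria, each with its own importance weight."""
    scored = []
    for doc in docs:
        score = 0.0
        if term.lower() in doc.text.lower():
            score += weights["text"]  # criterion 1: free-text match
        if wanted_tag in doc.tags:
            score += weights["tag"]   # criterion 2: taxonomy label match
        if score > 0:
            scored.append((score, doc.title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


docs = [
    Document("Acme agreement", "This contract limits liability.",
             metadata={"source": "ECM"}),
    Document("Customer mail", "I want a refund for this contract."),
    Document("Lunch menu", "Soup and salad."),
]
for d in docs:
    tag_document(d)

results = search(docs, term="refund", wanted_tag="legal",
                 weights={"text": 1.0, "tag": 2.0})
print(results)  # ['Customer mail', 'Acme agreement']
```

The second document ranks first because it matches both criteria, while the lunch menu matches neither and is left out. Changing the weights changes how much each criterion counts, which is the "importance per criterion" idea from the list.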
Analyzing all processes where unstructured data is involved and understanding how it is used will provide an integrated view of the unstructured data in the organization. This makes it possible to understand how this data can best be supported. The list above can help determine to what degree a certain application supports the required functionalities.
For all systems that store unstructured data, it can then be determined whether the system is a system of reference, a system of entry (input), or a system of use. While data can be entered and used in many different systems, there can only be one system of reference for the same data. Working in this way ensures that it is clear what constitutes the correct data at any point in time. The resulting simplification and alignment support the data foundation mentioned earlier and make it possible to get value out of unstructured data, whether in combination with structured data or not.
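The "one system of reference per data set" rule lends itself to a simple automated check on a system inventory. The sketch below assumes such an inventory exists as a list of (system, data set, role) entries; the system names, data sets, and the function reference_violations are invented for illustration.

```python
from collections import defaultdict
from enum import Enum


class Role(Enum):
    REFERENCE = "reference"  # the single source of the correct data
    ENTRY = "entry"          # where data is entered
    USE = "use"              # where data is consumed


# Hypothetical inventory: (system, data set, role of the system for that data set).
ASSIGNMENTS = [
    ("ECM", "contracts", Role.REFERENCE),
    ("CRM", "contracts", Role.ENTRY),
    ("Search portal", "contracts", Role.USE),
    ("Wiki A", "procedures", Role.REFERENCE),
    ("Wiki B", "procedures", Role.REFERENCE),
    ("Shared drive", "meeting notes", Role.ENTRY),
]


def reference_violations(assignments):
    """Return data sets that do not have exactly one system of reference.

    The result maps each offending data set to its list of reference
    systems (empty when none has been designated).
    """
    references = defaultdict(list)
    datasets = set()
    for system, dataset, role in assignments:
        datasets.add(dataset)
        if role is Role.REFERENCE:
            references[dataset].append(system)
    return {
        dataset: references.get(dataset, [])
        for dataset in sorted(datasets)
        if len(references.get(dataset, [])) != 1
    }


violations = reference_violations(ASSIGNMENTS)
print(violations)  # {'meeting notes': [], 'procedures': ['Wiki A', 'Wiki B']}
```

Here "contracts" passes, "procedures" has two competing reference systems, and "meeting notes" has none, which are exactly the situations where it becomes unclear which version of the data is correct.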
So for organizations, there is a big opportunity to get more value out of data by reorganizing the unstructured data they already have. Let’s wait no longer and build that data foundation!