6 Things to Remember When Preparing Big Data for Advanced Analytics
Blog: NASSCOM Official Blog
“Information is the oil of the 21st century, and analytics is the combustion engine” -Peter Sondergaard, Former Senior Vice President, Gartner
The quote above highlights the importance of big data management and analytics. Many organizations have ambitious plans for analytics and ML, and overall investment in big data and analytics increases every year. However, studies indicate that organizations are finding it difficult to realize the full potential of their data. Only about half of organizations think they can use data and analytics for competitive purposes. Even fewer consider themselves data-driven or believe they are getting results from their data and AI investments. 55% of the data collected by organizations is never used. Why is this happening?
On the other hand, we have come across organizations that have been very successful in using data for their business needs. When we looked at those examples, one thing that stood out was that technology was not the main barrier. Several open-source and proprietary platforms are available today. Hadoop led the technology landscape in the last decade, and many other options then emerged in specific areas such as MPP databases, NoSQL databases, data lakes, advanced analytics, stream analytics, BI, and ML. What we observed is that successful big data implementations have a high level of seamless integration between data assets, are agile enough to respond to changing business needs, and enable self-service BI and analytics for the end user. These attributes of seamless integration, agility, and self-service BI are achieved by focusing on six key data disciplines.
#1: High-Speed Data Acquisition and Processing
There are many options for data acquisition, extraction, and ingestion into the big data platform. The choice largely depends on factors such as how frequently you want to capture incoming data, whether the data arrives in batches or in real time, and whether you employ a push or a pull model. Irrespective of the tool or approach you choose, how fast your data platform performs will ultimately decide whether users adopt it widely.
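To make the batch and pull ideas concrete, here is a minimal sketch of a pull-model reader that groups incoming records into small batches. The function name `micro_batches`, the sensor records, and the batch size are our own illustrative assumptions, not part of any specific ingestion tool.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Pull records from a source and yield them in fixed-size batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull up to batch_size records; the last batch may be smaller.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical source: five sensor readings arriving from an upstream system.
source = ({"sensor_id": i, "reading": i * 1.5} for i in range(5))
batches = list(micro_batches(source, batch_size=2))
```

A smaller batch size moves the platform closer to real-time behavior at the cost of more frequent writes, which is the kind of trade-off the factors above are meant to settle.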
#2: Metadata Management and Data Catalog
Metadata is information about your data, and it serves three purposes. First, it helps us create accurate and consistent reports: if we know the metadata well, we can check where the data in our reports comes from. Second, it allows the end user to find data. This is especially important in a big data platform, where thousands of data sets may be collected from data sources daily, and a data scientist or data analyst can find it very difficult to locate the right data in a large data lake or data warehouse. Third, combined with other elements, metadata also helps you track data lineage.
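As an illustration of the "find data" use case, a data catalog can be thought of as a searchable registry of metadata entries. The sketch below is a toy in-memory version; the class names, fields, and sample data sets are hypothetical, and a real catalog product would persist and index this information.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    name: str
    source_system: str  # where the data comes from
    description: str
    tags: List[str] = field(default_factory=list)


class DataCatalog:
    """Toy in-memory catalog keyed by data set name."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return names of data sets whose name, description, or tags match."""
        needle = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in tag.lower() for tag in e.tags)
        )


catalog = DataCatalog()
catalog.register(DatasetEntry("sales_orders", "erp", "Daily order lines", ["sales", "finance"]))
catalog.register(DatasetEntry("web_clicks", "cdn_logs", "Raw clickstream events", ["marketing"]))
matches = catalog.search("sales")
```

Because each entry also records its source system, the same registry can answer the first question above: where the data in a report comes from.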
#3: Ensuring Data Quality
We are all aware that if we do not maintain data quality, the data platform soon turns into a ‘garbage in, garbage out’ scenario. Maintaining data quality is therefore paramount. There are different views about who is responsible for data quality. We tend to agree with the suggestion that data quality is the responsibility of the data owner. However, the data platform should provide tools to check the quality of data when it is integrated into the central platform.
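A platform-side quality check can be as simple as a set of rules applied when data lands. The following sketch checks only two rules, missing required values and duplicate keys; the function name `check_quality` and the customer records are assumptions made for illustration.

```python
from typing import Dict, List


def check_quality(rows: List[Dict], required: List[str], key: str) -> List[str]:
    """Report missing required values and duplicate keys in a batch of rows."""
    issues: List[str] = []
    seen = set()
    for index, row in enumerate(rows):
        for column in required:
            # Treat None and empty strings as missing values.
            if row.get(column) in (None, ""):
                issues.append(f"row {index}: missing {column}")
        value = row.get(key)
        if value is not None:
            if value in seen:
                issues.append(f"row {index}: duplicate {key}={value}")
            seen.add(value)
    return issues


# Hypothetical customer records arriving at the central platform.
rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 1, "email": "c@example.com"},
]
issues = check_quality(rows, required=["customer_id", "email"], key="customer_id")
```

The resulting list of issues can be sent back to the data owner, which keeps ownership of quality with the owner while the platform does the checking.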
#4: Master Data Management (MDM)
The next important point is Master Data Management (MDM). It is about maintaining master lists of entities such as products, vendors, suppliers, or customers. Your organization will usually require standardized master lists to bring consistency to reporting and analytics. These master lists act as a single source of truth and stay consistent across the data platform, which in turn helps produce accurate analytics. You can consolidate master lists by (1) matching and merging, (2) data standardization, and (3) data consolidation.
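The three consolidation steps can be sketched in a few lines. This is a simplified example that matches records on an exactly equal standardized name; real MDM tools typically use fuzzy matching and survivorship rules. The function names and the vendor records are hypothetical.

```python
from typing import Dict, List


def standardize(name: str) -> str:
    """Data standardization: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def match_and_merge(records: List[Dict[str, str]]) -> List[Dict]:
    """Match records on the standardized name and merge them into golden records."""
    master: Dict[str, Dict] = {}
    for record in records:
        key = standardize(record["name"])
        golden = master.setdefault(key, {"name": key, "source_ids": []})
        golden["source_ids"].append(record["id"])
        # Consolidation rule (an assumption): keep the first non-empty value seen.
        for attribute, value in record.items():
            if attribute not in ("id", "name") and value and attribute not in golden:
                golden[attribute] = value
    return list(master.values())


# Hypothetical vendor records from two source systems.
records = [
    {"id": "crm-1", "name": "Acme Corp.", "city": "Pune"},
    {"id": "erp-9", "name": "  ACME corp", "city": ""},
    {"id": "crm-2", "name": "Globex, Inc.", "city": "Delhi"},
]
master_list = match_and_merge(records)
```

Keeping the `source_ids` on each golden record preserves the link back to the original systems, so reports built on the master list can still be reconciled with their sources.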
#5: Data Security
The fifth discipline is data security and access control. When we bring data together in a central storage area, the data owners demand the highest level of data security, especially when the data is personal or financial. Data owners will not hand over their data without a guarantee that it will be secured. The data has to be protected from unauthorized access through measures such as user authentication, user access control through role-based access control (RBAC) or item-level security, data encryption, and network security. These security measures should be implemented in all components of the big data platform.
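To show what RBAC means in practice, here is a minimal sketch in which permissions are attached to roles and users only acquire permissions through their roles. The role names, user names, and data set are invented for the example; a production platform would rely on its own identity and authorization services.

```python
from typing import Dict, Set, Tuple

# Each role grants a set of (data set, action) permissions.
ROLE_PERMISSIONS: Dict[str, Set[Tuple[str, str]]] = {
    "analyst": {("sales_orders", "read")},
    "data_engineer": {("sales_orders", "read"), ("sales_orders", "write")},
}

# Users are assigned roles, never permissions directly.
USER_ROLES: Dict[str, Set[str]] = {
    "asha": {"analyst"},
    "ravi": {"data_engineer"},
}


def is_allowed(user: str, dataset: str, action: str) -> bool:
    """Return True if any of the user's roles grants the action on the data set."""
    return any(
        (dataset, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


can_read = is_allowed("asha", "sales_orders", "read")
can_write = is_allowed("asha", "sales_orders", "write")
```

Unknown users and unknown roles fall through to a denial, which reflects the default-deny posture that data owners usually expect.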
#6: Data Lineage
The last important aspect is data lineage, which is about tracing data back to its origin. To trust the data used for BI and analytics, users will want to know the flow of their data: where it originated, who had access to it, which changes it underwent and when, and where it resided across the organization’s multiple data systems.
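One simple way to picture lineage is as a graph in which every derived data set records its inputs and the transformation that produced it. The sketch below assumes the graph has no cycles; the class name `LineageGraph` and the data set names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


class LineageGraph:
    """Records which inputs and transformation produced each data set."""

    def __init__(self) -> None:
        self._steps: Dict[str, Tuple[List[str], str]] = {}

    def record(self, output: str, inputs: List[str], transformation: str) -> None:
        self._steps[output] = (list(inputs), transformation)

    def origins(self, dataset: str) -> Set[str]:
        """Trace a data set back to the source data sets it was derived from."""
        if dataset not in self._steps:
            # No recorded step means this data set is itself an origin.
            return {dataset}
        inputs, _ = self._steps[dataset]
        result: Set[str] = set()
        for parent in inputs:
            result |= self.origins(parent)
        return result


lineage = LineageGraph()
lineage.record("clean_orders", ["raw_orders"], "deduplicate")
lineage.record("revenue_report", ["clean_orders", "fx_rates"], "join and aggregate")
report_origins = lineage.origins("revenue_report")
```

Extending each recorded step with a timestamp and the user who ran it would cover the "who" and "when" questions raised above.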
These are the six key areas that make your data platform scalable, agile, searchable, secure, and traceable.