
17 More Must-Know Data Science Interview Questions and Answers, Part 3


The third and final part of 17 new must-know Data Science interview questions and answers covers A/B testing, data visualization, Twitter influence evaluation, and Big Data quality.

This post contains answers to:


Q13. What makes a good data visualization?

Gregory Piatetsky answers:

Note: This answer contains excerpts from the recent post What makes a good data visualization – a Data Scientist perspective.

Data Science is more than just building predictive models – it is also about explaining the models and using them to help people understand data and make decisions. Data visualization is an integral part of presenting data in a convincing way.

There is a ton of research on good data visualization and on how people best perceive information – see the work of Stephen Few and many others.

Guidelines on improving human perception include:

See 39 studies about human perception, by a Washington Post graphics editor, for a lot more detail.

From a Data Science point of view, what makes a visualization valuable is that it highlights the key aspects of the data: which variables are most important, what their relative importance is, and what the changes and trends are.

Data visualization should be visually appealing, but not at the expense of loading a chart with unnecessary junk.

How do we make a good data visualization?

To do that, choose the right type of chart for your data:

Here is an example of a visualization of US Presidential Elections, 1976-2016, that shows multiple variables at once: the electoral college votes difference (y-axis), the % popular vote difference (x-axis), the size of the popular vote (circle area), the winner's party (color), and the winner's name and year (label). See my post on What makes a good data visualization for more details.

US Presidential Elections, 1976-2016
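To make that kind of multi-variable encoding concrete, here is a minimal matplotlib sketch (not the original chart) showing how one scatter plot can carry five variables through position, area, color, and labels. The data rows are invented placeholders, not the actual 1976-2016 election figures.

```python
# Minimal sketch: encode five variables in one scatter plot
# (x position, y position, circle area, color, text label).
# The rows below are illustrative placeholders, not real election data.
import matplotlib.pyplot as plt

elections = [
    # (label, % popular vote diff, electoral vote diff, popular vote in millions, party)
    ("Example A 1980", 9.7, 440, 86.5, "R"),
    ("Example B 1996", 8.5, 220, 96.3, "D"),
    ("Example C 2000", -0.5, 5, 105.4, "R"),
]

fig, ax = plt.subplots(figsize=(8, 6))
for label, x, y, popular_vote, party in elections:
    color = "red" if party == "R" else "blue"
    ax.scatter(x, y, s=popular_vote * 20, color=color, alpha=0.6)  # area ~ popular vote
    ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, 5))

ax.axvline(0, color="gray", linewidth=0.5)
ax.set_xlabel("% popular vote difference")
ax.set_ylabel("Electoral college votes difference")
ax.set_title("Encoding five variables: x, y, area, color, label")
plt.show()
```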



Q14. What are some of the common data quality issues when dealing with Big Data? What can be done to avoid them or to mitigate their impact?

Anmol Rajpurohit answers:

The most common data quality issues observed when dealing with Big Data can be best understood in terms of the key characteristics of Big Data – Volume, Velocity, Variety, Veracity, and Value.

Volume:

In the traditional data warehouse environment, comprehensive data quality assessment and reporting was at least possible (if not ideal). In Big Data projects, however, the scale of the data makes this impossible. Thus, data quality measurements can at best be approximations (i.e. they need to be described in terms of probabilities and confidence intervals, not absolute values). We also need to re-define most of the data quality metrics based on the specific characteristics of the Big Data project, so that those metrics have a clear meaning, can be measured (to a good approximation), and can be used to evaluate alternative strategies for data quality improvement.
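As a rough illustration of what "approximation with a confidence interval" can look like in practice, here is a small Python sketch that estimates a hypothetical completeness metric (fraction of non-null values in a field) from a random sample rather than the full data set. The record layout and field name are assumptions for the example.

```python
# Sketch: estimate a data quality metric (completeness = fraction of non-null
# values in a field) from a random sample and report a 95% confidence
# interval instead of an absolute figure. The records are synthetic.
import math
import random

def sample_completeness(records, field, sample_size=10_000, z=1.96):
    sample = random.sample(records, min(sample_size, len(records)))
    hits = sum(1 for r in sample if r.get(field) is not None)
    p = hits / len(sample)                              # point estimate
    margin = z * math.sqrt(p * (1 - p) / len(sample))   # normal-approximation CI
    return p, (max(0.0, p - margin), min(1.0, p + margin))

# Illustrative usage with synthetic records
records = [{"customer_id": i if i % 20 else None} for i in range(1_000_000)]
estimate, (lo, hi) = sample_completeness(records, "customer_id")
print(f"completeness ~ {estimate:.3f} (95% CI: {lo:.3f} to {hi:.3f})")
```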

Despite the great volume of underlying data, it is not uncommon to find that some desired data was not captured or is not available for other reasons (such as high cost, delays in getting it, etc.). It is ironic but true that data availability continues to be a prominent data quality concern in the Big Data era.

Velocity:

The tremendous pace of data generation and collection makes it incredibly hard to monitor data quality with a reasonable overhead in time and resources (storage, compute, human effort, etc.). So, by the time a data quality assessment completes, its output might be outdated and of little use, particularly if the Big Data project serves any real-time or near real-time business needs. In such scenarios, you need to re-define the data quality metrics so that they are both relevant and feasible in a real-time context.

Sampling can help you gain speed for the data quality efforts, but this comes at the cost of bias (which eventually makes the end result less useful), because samples are rarely an accurate representation of the entire data. Smaller samples give higher speed, but with bigger bias.

Another impact of velocity is that you might have to do data quality assessment on the fly, i.e. plugged in somewhere within the data collection/transfer/storage processes, because the critical time constraint does not give you the privilege of making a copy of a selected data subset, storing it elsewhere, and running data quality assessments on it.
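One way to picture such an in-stream check is the minimal sketch below: a lightweight quality "tap" wrapped around the ingestion stream, so quality counters are updated as records flow through instead of copying a subset out for offline assessment. The field names and the incoming stream are assumptions made for illustration.

```python
# Sketch: plug a lightweight data quality check directly into the ingestion
# stream; records pass through unchanged while quality counters are updated.
from collections import Counter

REQUIRED_FIELDS = ("event_id", "timestamp", "user_id")  # assumed schema

def quality_tap(stream, stats):
    """Yield records unchanged while counting incomplete ones."""
    for record in stream:
        stats["seen"] += 1
        if any(record.get(f) is None for f in REQUIRED_FIELDS):
            stats["incomplete"] += 1
        yield record  # downstream processing continues untouched

stats = Counter()
incoming = ({"event_id": i, "timestamp": i * 10, "user_id": i % 3 or None}
            for i in range(1000))  # synthetic stream
for _ in quality_tap(incoming, stats):
    pass  # stand-in for the real transfer/storage step
print(f"incomplete rate: {stats['incomplete'] / stats['seen']:.2%}")
```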

Variety:

One of the biggest data quality issues in Big Data is that the data includes several data types (structured, semi-structured, and unstructured) coming in from different data sources. Thus, a single data quality metric will often not be applicable to the entire data set, and you will need to define data quality metrics separately for each data type. Moreover, assessing and improving the data quality of unstructured or semi-structured data is far trickier and more complex than that of structured data. For example, when mining physician notes from medical records across the world (related to a particular medical condition), even if the language (and the grammar) is the same, the meaning might be very different due to local dialects and slang. This leads to low data interpretability, another data quality measure.

Data from different sources often has serious semantic differences. For example, “profit” can have widely varying definitions across the business units of an organization or across external agencies. Thus, fields with identical names may not mean the same thing. This problem is made worse by the lack of adequate and consistent metadata from each data source. To make sense of data, you need reliable metadata (for example, to make sense of sales numbers from a store, you need other information such as date-time, items purchased, coupons used, etc.). Usually, many of these data sources are outside the organization, which makes it very hard to ensure good metadata for them.

Another common issue is syntactic inconsistency. For example, “time-stamp” values from different sources are incompatible unless they are captured along with time zone information.
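A toy sketch of both Variety problems: mapping semantically different field names from two invented sources onto one canonical schema, and normalizing time-stamps to UTC so values from different time zones stay comparable. The source layouts and field names here are assumptions, not any standard.

```python
# Sketch: reconcile two invented sources whose fields differ semantically
# ("profit" vs. "net_margin") and whose timestamps carry different time
# zones, by mapping both onto one schema with UTC timestamps.
from datetime import datetime, timezone, timedelta

FIELD_MAP = {  # assumed per-source mapping onto a canonical schema
    "source_a": {"profit": "profit_usd", "ts": "recorded_at"},
    "source_b": {"net_margin": "profit_usd", "created": "recorded_at"},
}

def normalize(record, source):
    out = {}
    for src_field, canonical in FIELD_MAP[source].items():
        value = record[src_field]
        if canonical == "recorded_at":
            value = value.astimezone(timezone.utc)  # only safe if tz-aware
        out[canonical] = value
    return out

a = {"profit": 120.0, "ts": datetime(2023, 5, 1, 9, 0, tzinfo=timezone(timedelta(hours=-5)))}
b = {"net_margin": 80.0, "created": datetime(2023, 5, 1, 16, 0, tzinfo=timezone.utc)}
print(normalize(a, "source_a"), normalize(b, "source_b"))
```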


Veracity:

Veracity, one of the most overlooked Big Data characteristics, is directly related to data quality, as it refers to the inherent biases, noise, and abnormality in data. Because of veracity, the data values might not be the exact real values; rather, they might be approximations. In other words, the data might have some inherent imprecision and uncertainty. Besides data inaccuracies, Veracity also includes data consistency (defined by the statistical reliability of data) and data trustworthiness (based on data origin, data collection and processing methods, security infrastructure, etc.). These data quality issues in turn impact data integrity and data accountability.

While the other V’s are relatively well-defined and can be easily measured, Veracity is a complex theoretical construct with no standard approach for measurement. In a way this reflects how complex the topic of “data quality” is within the Big Data context.

Data users and data providers are often different organizations with very different goals and operational procedures. Thus, it is no surprise that their notions of data quality are very different. In many cases, the data providers have no clue about the business use cases of data users (data providers might not even care about it, unless they are getting paid for the data). This disconnect between data source and data use is one of the prime reasons behind the data quality issues symbolized by Veracity.

Value:

The Value characteristic connects directly to the end purpose. Organizations are harnessing Big Data for many diverse business pursuits, and those pursuits are the real drivers of how data quality is defined, measured, and improved.

A common and old definition of data quality is “fitness for use” from the data consumer's perspective. This means that data quality depends on what you plan to do with the data. Thus, for the same data, two organizations with different business goals will most likely arrive at widely different measurements of data quality. This nuance is often not well understood – data quality is a “relative” term.

A Big Data project might involve incomplete and inconsistent data; however, it is possible that those data quality issues do not impact the utility of the data towards the business goal. In such a case, the business would say that the data quality is great (and will not be interested in investing in data quality improvements). For example, for a producer of canned mashed potatoes, a batch of small potatoes would be of the same quality as a batch of big potatoes. However, for a fast food restaurant making fries, the quality of the two batches would be radically different.

The Value aspect also brings a “cost-benefit” perspective to data quality – whether it is worth resolving a given data quality issue, which issues should be resolved first, etc.

Putting it all together:

Data quality in Big Data projects is a very complex topic, where theory and practice often differ. I haven't come across any widely accepted standard theory yet; rather, I see little interest in the industry towards this goal. In practice, however, data quality does play an important role in the design of Big Data architecture. All data quality efforts must start from a solid understanding of high-priority business use cases, and use that insight to navigate various trade-offs (samples given below) to optimize the quality of the final output.

Sample trade-offs related to data quality:

Given the enormous scope of work and the (relatively!) very limited resources, one common approach to data quality efforts on Big Data projects is the baseline approach, in which data users are surveyed to identify and document the bare minimum data quality needed to ensure that the business processes they support are not disrupted. These minimum satisfactory levels of data quality are referred to as the baseline, and the data quality efforts focus on ensuring that the quality of each data set does not fall below its baseline level. This is a good starting point, and you may later move on to more advanced efforts (based on business needs and available budget).
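A minimal sketch of what such a baseline might look like in practice: the surveyed minimum quality levels captured as per-data-set thresholds, plus a check that flags any metric falling below its baseline. The data sets, metrics, and numbers below are purely illustrative.

```python
# Sketch of the baseline approach: minimum acceptable quality levels
# (gathered from data-user surveys) stored per data set, and a check that
# flags anything falling below its baseline. All names/numbers are invented.
BASELINES = {
    "orders":     {"completeness": 0.95, "freshness_hours": 24},
    "web_clicks": {"completeness": 0.80, "freshness_hours": 2},
}

def below_baseline(dataset, measured):
    """Return the metrics whose measured value fails the agreed baseline."""
    failures = {}
    for metric, minimum in BASELINES[dataset].items():
        if metric == "freshness_hours":
            ok = measured[metric] <= minimum   # staler than allowed -> fail
        else:
            ok = measured[metric] >= minimum
        if not ok:
            failures[metric] = (measured[metric], minimum)
    return failures

print(below_baseline("orders", {"completeness": 0.91, "freshness_hours": 30}))
# -> {'completeness': (0.91, 0.95), 'freshness_hours': (30, 24)}
```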

Summary of Recommendations to improve data quality in Big Data projects:

Do not rely on machine learning to automatically take care of poor data quality (machine learning is science, not magic!).

