Top 24 Data Modeling Interview Questions With Detailed Answers
Blog: Think Data Analytics Blog
List of Most Frequently Asked Data Modeling Interview Questions And Answers to Help You Prepare For The Upcoming Interview:
Here I am going to share some Data Modeling interview questions and detailed answers based on my own experience during interview interactions in a few renowned IT MNCs.
Below question-answers can be of your great help if you get a chance to face or take an interview on Data Modelling.
Most Frequently Asked Data Modeling Interview Questions
Q #1) What do you understand by Data Modelling?
Answer: Data Modelling is the diagrammatic representation showing how the entities are related to each other. It is the initial step towards database design. We first create the conceptual model, then the logical model and finally move to the physical model.
Generally, the data models are created in data analysis & design phase of software development life cycle.
Q #2) Explain your understanding of different data models?
Answer: There are three types of data models – conceptual, logical and physical. The level of complexity and detail increases from conceptual to logical to a physical data model.
The conceptual model shows a very basic high level of design while the physical data model shows a very detailed view of design.
- Conceptual Model will be just portraying entity names and entity relationships. Figure 1 shown in the later part of this article depicts a conceptual model.
- Logical Model will be showing up entity names, entity relationships, attributes, primary keys and foreign keys in each entity. Figure 2 shown inside question#4 in this article depicts a logical model.
- Physical Data Model will be showing primary keys, foreign keys, table names, column names and column data types. This view actually elaborates how the model will be actually implemented in the database.
Q #3) Throw some light on your experience in Data Modelling with respect to projects you have worked on till date?
Note: This was the very first question in one of my Data Modelling interviews. So, before you step into the interview discussion, you should have a very clear picture of how data modeling fits into the assignments you have worked upon.
Answer: I have worked on a project for a health insurance provider company where we have interfaces build in Informatica that transforms and process the data fetched from Facets database and sends out useful information to vendors.
Note: Facets is an end to end solution to manage all the information for health care industry. The facets database in my project was created with SQL server 2012.
We had different entities that were linked together. These entities were subscriber, member, healthcare provider, claim, bill, enrollment, group, eligibility, plan/product, commission, capitation, etc.
Below is the conceptual Data Model showing how the project looked like on a high-level
Each of the data entities has its own data attributes. For Example, a data attribute of the provider will be provider identification number, few data attributes of the membership will be subscriber ID, member ID, one of the data attribute of claim will claim ID, each healthcare product or plan will be having a unique product ID and so on.
Q #4) What are the different design schemas in Data Modelling? Explain with the example?
Answer: There are two different kinds of schemas in data modeling
- Star Schema
- Snowflake Schema
Now, I will be explaining each of these schemas one by one.
The simplest of the schemas is star schema where we have a fact table in the center that references multiple dimension tables around it. All the dimension tables are connected to the fact table. The primary key in all dimension tables acts as a foreign key in the fact table.
The ER diagram (see Figure 2) of this schema resembles the shape of a star and that is why this schema is named as a star schema.
The star schema is quite simple, flexible and it is in de-normalized form.
In a snowflake schema, the level of normalization increases. The fact table here remains the same as in the star schema. However, the dimension tables are normalized. Due to several layers of dimension tables, it looks like a snowflake, and thus it is named as snowflake schema.
Q #5) Which scheme did you use in your project & why?
Q #6) Which schema is better – star or snowflake?
Answer: (Combined for Q #5&6): The choice of a schema always depends upon the project requirements & scenarios.
Since star schema is in de-normalized form, you require fewer joins for a query. The query is simple and runs faster in a star schema. Coming to the snowflake schema, since it is in normalized form, it will require a number of joins as compared to a star schema, the query will be complex and execution will be slower than star schema.
Another significant difference between these two schemas is that snowflake schema does not contain redundant data and thus it is easy to maintain. On the contrary, star schema has a high level of redundancy and thus it is difficult to maintain.
Now, which one to choose for your project? If the purpose of your project is to do more of dimension analysis, you should go for snowflake schema. For Example, if you need to find out that “how many subscribers are tied to a particular plan which is currently active?” – go with the snowflake model.
If the purpose of your project is to do more of a metrics analysis, you should go with a star schema. For Example, if you need to find out that “what is the claim amount paid to a particular subscriber?” – go with a star schema.
In my project, we used snowflake schema because we had to do analysis across several dimensions and generate summary reports for the business. Another reason for using snowflake schema was it is less memory consumption.
Q #7) What do you understand by dimension and attribute?
Answer: Dimensions represent qualitative data. For Example, plan, product, class are all dimensions.
A dimension table contains descriptive or textual attributes. For Example, the product category & product name are the attributes of the product dimension.
Q #8) What is a fact & a fact table?
Answer: Facts represent quantitative data.
For Example, the net amount due is a fact. A fact table contains numerical data and foreign keys from related dimensional tables. An example of the fact table can be seen from Figure 2 shown above.
Q #9) What are the different types of dimensions you have come across? Explain each of them in detail with an example?
Answer: There are typically five types of dimensions.
a) Conformed dimensions: A Dimension that is utilized as a part of different areas are called a conformed dimension. It might be utilized with different fact tables in a single database or over numerous data marts/warehouses.
For Example, if the subscriber dimension is connected to two fact tables – billing and claim then the subscriber dimension would be treated as a conformed dimension.
b) Junk Dimension: It is a dimension table comprising of attributes that don’t have a place in the fact table or in any of the current dimension tables. Generally, these are properties like flags or indicators.
For Example, it can be a member eligibility flag set as ‘Y’ or ‘N’ or any other indicator set as true/false, any specific comments, etc. if we keep all such indicator attributes in the fact table then its size gets increased. So, we combine all such attributes and put in a single dimension table called a junk dimension having unique junk IDs with a possible combination of all the indicator values.
c) Role-Playing Dimension: These are the dimensions that are utilized for multiple purposes in the same database.
For Example, a date dimension can be used for “Date of Claim”, “Billing date” or “Plan Term date”. So, such a dimension will be called a Role-playing dimension. The primary key of the Date dimension will be associated with multiple foreign keys in the fact table.
d) Slowly Changing Dimension (SCD): These are most important amongst all the dimensions. These are the dimensions where attribute values vary with time. Below are the varies types of SCDs
- Type-0: These are the dimensions where attribute value remains steady with time. For Example, Subscriber’s DOB is a type-0 SCD because it will always remain the same irrespective of the time.
- Type-1: These are the dimensions where the previous value of the attribute is replaced by the current value. No history is maintained in the Type-1 dimension. For Example, Subscriber’s address (where the business requires to keep the only current address of subscriber) can be a Type-1 dimension.
- Type-2: These are the dimensions where unlimited history is preserved. For Example, Subscriber’s address (where the business requires to keep a record of all the previous addresses of the subscriber). In this case, multiple rows for a subscriber will be inserted in the table with his/her different addresses. There will be some column(s) that will identify the current address. For Example, ‘Start date’ and ‘End date’. The row where ‘End date’ value will be blank would contain the subscriber’s current address and all other rows will be having previous addresses of the subscriber.
- Type-3: These are the type of dimensions where limited history is preserved. And we use an additional column to maintain the history. For Example, Subscriber’s address (where the business requires to keep a record of current & just one previous address). In this case, we can dissolve the ‘address’ column into two different columns – ‘current address’ and ‘previous address’. So, instead of having multiple rows, we will be having just one-row showing current as well as the previous address of the subscriber.
- Type-4: In this type of dimension, the historical data is preserved in a separate table. The main dimension table holds only the current data. For Example, the main dimension table will have only one row per subscriber holding its current address. All other previous addresses of the subscriber will be kept in the separate history table. This type of dimension is hardly ever used.
e) Degenerated Dimension: A degenerated dimension is a dimension that is not a fact but presents in the fact table as a primary key. It does not have its own dimension table. We can also call it as a single attribute dimension table.
But, instead of keeping it separately in a dimension table and putting an additional join, we put this attribute in the fact table directly as a key. Since it does not have its own dimension table, it can never act as a foreign key in the fact table.
Q #10) Give your idea regarding factless fact? And why do we use it?
Answer: Factless fact table is a fact table that contains no fact measure in it. It has only the dimension keys in it.
At times, certain situations may arise in the business where you need to have a factless fact table.
For Example, suppose you are maintaining an employee attendance record system, you can have a factless fact table having three keys.
You can see that the above table does not contain any measure. Now, if you want to answer the below question, you can do easily using the above single factless fact table rather than having two separate fact tables:
“How many employees of a particular department were present on a particular day?”
So, the factless fact table offers flexibility to the design.
Q #11) Distinguish between OLTP and OLAP?
Answer: OLTP stands for the Online Transaction Processing System & OLAP stands for the Online Analytical Processing System. OLTP maintains the transactional data of the business & is highly normalized generally. On the contrary, OLAP is for analysis and reporting purposes & it is in de-normalized form.
This difference between OLAP and OLTP also gives you the way to choosing the design of schema. If your system is OLTP, you should go with star schema design and if your system is OLAP, you should go with snowflake schema.
Q #12) What do you understand by data mart?
Answer: Data marts are for the most part intended for a solitary branch of business. They are designed for the individual departments.
For Example, I used to work for a health insurance provider company that had different departments in it like Finance, Reporting, Sales and so forth.
We had a data warehouse that was holding the information pertaining to all these departments and then we have few data marts built on top of this data warehouse. These DataMart were specific to each department. In simple words, you can say that a DataMart is a subset of a data warehouse.
Q #13) What are the different types of measures?
Answer: We have three types of measures, namely
- Non- additive measures
- Semi- additive measures
- Additive measures
Non-additive measures are the ones on top of which no aggregation function can be applied. For Example, a ratio or a percentage column; a flag or an indicator column present in fact table holding values like Y/N, etc. is a non-additive measure.
Semi- additive measures are the ones on top of which some (but not all) aggregation functions can be applied. For Example, fee rate or account balance.
Additive measures are the ones on top of which all aggregation functions can be applied. For Example, units purchased.
Q # 14) What is a Surrogate key? How is it different from a primary key?
Answer: Surrogate Key is a unique identifier or a system-generated sequence number key that can act as a primary key. It can be a column or a combination of columns. Unlike a primary key, it is not picked up from the existing application data fields.
Q #15) Is this true that all databases should be in 3NF?
Answer: It is not mandatory for a database to be in 3NF. However, if your purpose is the easy maintenance of data, less redundancy, and efficient access then you should go with a de-normalized database.
Q #16) Have you ever came across the scenario of recursive relationships? If yes, how did you handle it?
Answer: A recursive relationship occurs in the case where an entity is related to itself. Yes, I have come across such a scenario.
Talking about the health care domain, it is a possibility that a health care provider (say, a doctor) is a patient to any other health care provider. Because, if the doctor himself falls ill and needs surgery, he will have to visit some other doctor for getting the surgical treatment.
So, in this case, the entity – health care provider is related to itself. A foreign key to the health insurance provider’s number will have to present in each member’s (patient) record.
Q #17) List out a few common mistakes encountered during Data Modelling?
Answer: Few common mistakes encountered during Data Modelling are:
- Building massive data models: Large data models are like to have more design faults. Try to restrict your data model to not more than 200 tables.
- Lack of purpose: If you do not know that what is your business solution is intended for, you might come up with an incorrect data model. So having clarity on the business purpose is very important to come up with the right data model.
- Inappropriate use of surrogate keys: Surrogate key should not be used unnecessarily. Use surrogate key only when the natural key cannot serve the purpose of a primary key.
- Unnecessary de-normalization: Don’t denormalize until and unless you have a solid & clear business reason to do so because de-normalization creates redundant data which is difficult to maintain.
Q #18) What is the number of child tables that can be created out from a single parent table?
Answer: The number of child tables that can be created out of the single parent table is equal to the number of fields/columns in the parent table that are non-keys.
Q #19) Employee health details are hidden from his employer by the health care provider. Which level of data hiding is this? Conceptual, physical or external?
Answer: This is the scenario of an external level of data hiding.
Q #20) What is the form of fact table & dimension table?
Answer: Generally, the fact table is in normalized form and the dimension table is in de-normalized form.
Q #21) What particulars you would need to come up with a conceptual model in a health care domain project?
Answer: For a health care project, below details would suffice the requirement to design a basic conceptual model
- Different categories of health care plans and products.
- Type of subscription (group or individual).
- Set of health care providers.
- Claim and billing process overview.
Q #22) Tricky one: If a unique constraint is applied to a column then will it throw an error if you try to insert two nulls into it?
Answer: No, it will not throw any error in this case because a null value is unequal to another null value. So, more than one null will be inserted in the column without any error.
Q #23) Can you quote an example of a sub-type and super-type entity?
Answer: Yes, let’s say we have these different entities – vehicle, car, bike, economy car, family car, sports car.
Here, a vehicle is a super-type entity. Car and bike are its sub-type entities. Furthermore, economy cars, sports cars, and family cars are sub-type entities of its super-type entity- car.
A super-type entity is the one that is at a higher level. Sub-type entities are ones that are grouped together on the basis of certain characteristics. For Example, all bikes are two-wheelers and all cars are four-wheelers. And since both are vehicles, so their super-type entity is ‘vehicle’.
Q #24) What is the significance of metadata?
Answer: Metadata is data about data. It tells you what kind of data is actually stored in the system, what is its purpose and for whom it is intended.
- Practical understanding of the Data Modelling concept and how it fits into the assignments done by you is much needed to crack a data modeling interview.
- The most commonly asked topics in Data Modelling interview are – different types of data models, types of schemas, types of dimensions and normalization.
- Be well prepared for scenario based questions as well.
I would suggest that whenever you are answering a question to the interviewer, it’s better that you explain the idea through an example. This would show that you have actually worked into that area and you understand the core of the concept very well.
All the best!!