Airbnb Data Warehouse Schema - Data Engineering Mock Interview

The Art of Data Partitioning: A Step-by-Step Guide to Designing an Effective Data Model

When designing a data model, one of the most critical aspects to consider is partitioning. Partitioning refers to the process of dividing a large dataset into smaller, more manageable pieces based on specific characteristics or attributes. In this article, we will delve into the world of data partitioning and explore its various strategies, benefits, and best practices.

The Importance of Partitioning

Partitioning does not shrink a dataset; it reduces how much of it a query has to read. On large tables, queries slow down because more data must be scanned, held in memory, and moved from storage. When a table is partitioned on a column that queries filter on, the engine can skip partitions that cannot contain matching rows, which shortens execution time.

Types of Partitioning

There are several types of partitioning techniques used in data warehousing, including:

* Horizontal partitioning: This divides a table by rows. Rows are assigned to partitions based on the value of a specific attribute or key (for example, booking date), and each partition holds every column for its rows.

* Vertical partitioning: This divides a table by columns. Each partition holds a subset of the columns for every row, usually along with the key needed to join the pieces back together.

* Composite partitioning: This combines more than one technique, for example splitting rows by date range and then sub-partitioning each range by a second key, or pairing a horizontal split with a vertical one.
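The difference between the row-wise and column-wise splits above can be sketched in a few lines of Python. This is only an illustration on in-memory rows; the sample bookings, the column names, and the helper functions `horizontal_partition` and `vertical_partition` are assumptions made for the example, not part of any particular database.

```python
from collections import defaultdict

# Hypothetical booking rows; column names are illustrative only.
bookings = [
    {"booking_id": 1, "booking_date": "2024-01-15", "cost": 120.0, "notes": "late check-in"},
    {"booking_id": 2, "booking_date": "2024-01-28", "cost": 80.0, "notes": ""},
    {"booking_id": 3, "booking_date": "2024-02-03", "cost": 200.0, "notes": "two pets"},
]


def horizontal_partition(rows, key_fn):
    """Split by rows: every partition keeps all columns."""
    parts = defaultdict(list)
    for row in rows:
        parts[key_fn(row)].append(row)
    return dict(parts)


def vertical_partition(rows, key, column_groups):
    """Split by columns: every piece keeps all rows plus the join key."""
    return [
        [{key: row[key], **{col: row[col] for col in cols}} for row in rows]
        for cols in column_groups
    ]


# Horizontal: one partition per booking month (YYYY-MM prefix of the date).
by_month = horizontal_partition(bookings, lambda r: r["booking_date"][:7])

# Vertical: frequently read columns apart from a rarely read text column.
hot_cols, cold_cols = vertical_partition(
    bookings, "booking_id", [["booking_date", "cost"], ["notes"]]
)

# A January query only has to read the 2024-01 partition.
january_revenue = sum(r["cost"] for r in by_month["2024-01"])
```

In a real warehouse the engine performs this routing itself; the point of the sketch is only that a horizontal split narrows which rows are read, while a vertical split narrows which columns are read.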

Choosing a Partition Key

The partition key is the column, or set of columns, whose value determines which partition a row is stored in. It is not the same thing as the primary key, although a primary key can be used as one. The choice of partition key depends on the specific requirements of your application and the characteristics of your data. When selecting a partition key, consider factors such as:

* Data distribution: Is the data evenly distributed across the key's values? A skewed key concentrates rows, and often writes, in a few hot partitions, so a different key or strategy may be needed.

* Query patterns: What types of queries will be executed on the data? If queries frequently filter by a specific attribute, it's beneficial to use that attribute as the partition key.
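One rough way to compare candidate keys on the distribution criterion is to count rows per key value and compare the largest partition with the average. The sketch below assumes a small in-memory sample; the field names `region` and `booking_month`, the sample data, and the `partition_skew` helper are all hypothetical.

```python
from collections import Counter


def partition_skew(rows, key_fn):
    """Return per-partition row counts and the ratio of largest to mean size.

    A ratio near 1.0 means rows are spread evenly; a larger ratio means
    one partition holds a disproportionate share of the rows.
    """
    counts = Counter(key_fn(row) for row in rows)
    mean_size = sum(counts.values()) / len(counts)
    return counts, max(counts.values()) / mean_size


# Hypothetical sample: most bookings come from one region.
rows = (
    [{"region": "EU", "booking_month": "2024-01"}] * 4
    + [{"region": "EU", "booking_month": "2024-02"}] * 4
    + [{"region": "US", "booking_month": "2024-01"}] * 2
    + [{"region": "APAC", "booking_month": "2024-02"}] * 2
)

region_counts, region_skew = partition_skew(rows, lambda r: r["region"])
month_counts, month_skew = partition_skew(rows, lambda r: r["booking_month"])
```

On this sample, `region` puts two thirds of the rows in one partition while `booking_month` splits them evenly. Row counts are only part of the picture: a date key can still concentrate new writes in the most recent partition, so write patterns need to be checked separately.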

Best Practices for Partitioning

While partitioning can significantly improve query performance, it requires careful planning and execution. Here are some best practices to keep in mind:

* Start small: Begin with a limited number of partitions and gradually increase as needed.

* Use a consistent partitioning strategy: Avoid switching between different partitioning techniques mid-stream, as this can lead to inconsistent data distribution.

* Monitor query performance: Regularly monitor query performance and adjust the partitioning scheme as necessary.

Designing a Data Model for Airbnb's Review Facts

In the mock interview transcribed below, we explored designing a data model for Airbnb's review facts. We discussed various aspects of the data, including:

* Structured vs. unstructured data: The review facts contain both structured and unstructured data. To effectively handle this, consider using a hybrid approach that combines relational and NoSQL databases.

* Data quality: Ensure that the data is accurate, complete, and consistent. This may require implementing data validation checks and data cleansing procedures.
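As a minimal sketch of the review fact described in the interview, the following uses Python's built-in `sqlite3` module to create a small star schema and enforce two of the data quality checks in the database itself: ratings restricted to 1 through 5, and reviews that must point at an existing listing and user. The table and column names follow the interview loosely; the sample rows, the `review_ts` column name, and the `try_insert_review` helper are assumptions, and the date dimension and booking link discussed in the interview are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE host_dim (host_id INTEGER PRIMARY KEY, name TEXT, is_superhost INTEGER);
CREATE TABLE listing_dim (
    listing_id INTEGER PRIMARY KEY,
    host_id INTEGER NOT NULL REFERENCES host_dim(host_id),
    property_type TEXT,
    city TEXT
);
CREATE TABLE user_dim (user_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE review_fact (
    review_id INTEGER PRIMARY KEY,
    listing_id INTEGER NOT NULL REFERENCES listing_dim(listing_id),
    user_id INTEGER NOT NULL REFERENCES user_dim(user_id),
    review_ts TEXT NOT NULL,  -- timestamp grain, as suggested in the interview
    rating INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    text_review TEXT
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO host_dim VALUES (?, ?, ?)",
                 [(1, "Asha", 1), (2, "Ben", 0)])
conn.executemany("INSERT INTO listing_dim VALUES (?, ?, ?, ?)",
                 [(10, 1, "apartment", "Lisbon"), (11, 2, "house", "Austin")])
conn.executemany("INSERT INTO user_dim VALUES (?, ?, ?)",
                 [(100, "Guest A", "PT"), (101, "Guest B", "US")])
conn.executemany("INSERT INTO review_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1000, 10, 100, "2024-03-01T10:00:00", 5, "Great stay"),
    (1001, 10, 101, "2024-03-02T09:30:00", 4, "Clean and quiet"),
    (1002, 11, 100, "2024-03-05T18:45:00", 2, "Noisy street"),
])
conn.commit()


def try_insert_review(row):
    """Insert one review; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO review_fact VALUES (?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Validation: a rating outside 1..5 and an unknown listing are both rejected.
bad_rating_ok = try_insert_review((2000, 10, 100, "2024-03-06T08:00:00", 7, "x"))
bad_listing_ok = try_insert_review((2001, 999, 100, "2024-03-06T08:00:00", 4, "x"))

# Average rating and review count per host, joining fact -> listing -> host.
avg_by_host = conn.execute("""
    SELECT h.host_id, AVG(r.rating), COUNT(*)
    FROM review_fact AS r
    JOIN listing_dim AS l ON l.listing_id = r.listing_id
    JOIN host_dim AS h ON h.host_id = l.host_id
    GROUP BY h.host_id
    ORDER BY h.host_id
""").fetchall()
```

Analytical warehouses often do not enforce constraints like these at load time, so in practice the same checks may have to run as validation steps in the pipeline instead.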

Benefits of Using Text Mining Techniques

Text mining techniques can provide valuable insights into unstructured data. Some common text mining techniques include:

* Sentiment analysis: Analyzing the tone and emotions expressed in customer reviews to gain a better understanding of their opinions.

* Topic modeling: Identifying underlying themes and topics within large volumes of text data.
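To make the sentiment idea concrete, here is a deliberately simple keyword-based sketch that labels each review and counts negative reviews per listing, which is the kind of attribute the interview suggests attaching to the review fact. The word lists, sample reviews, and the `sentiment_label` function are assumptions for illustration; a production system would more likely use a trained model than a fixed lexicon.

```python
import re
from collections import Counter

# Tiny hypothetical lexicons; real ones would be far larger.
POSITIVE = {"great", "clean", "friendly", "comfortable", "quiet"}
NEGATIVE = {"dirty", "noisy", "rude", "broken", "late"}


def sentiment_label(text):
    """Label text by counting positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical (listing_id, text_review) pairs.
reviews = [
    (10, "Great stay, clean and quiet"),
    (10, "Host was rude and the room was dirty"),
    (11, "Noisy street, broken heater"),
    (11, "It was fine"),
]

labeled = [(listing_id, sentiment_label(text)) for listing_id, text in reviews]

# How many negative reviews did each listing receive?
negative_per_listing = Counter(
    listing_id for listing_id, label in labeled if label == "negative"
)
```

Storing the resulting label next to each review would let analytical queries filter or group by sentiment without reprocessing the text.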

Extending the Data Model to Competitive Pricing

Competitive pricing is another area of analysis the warehouse can support. By extending the data model to capture discounts, promotions, and other pricing strategies, we can analyze whether Airbnb is overpricing or underpricing relative to competitors and gain a deeper understanding of its business operations.

Additional Areas for Optimization

Other areas where optimization is possible include:

* Dynamic pricing: Implementing dynamic pricing algorithms that adjust prices in real-time based on demand.

* Discounts and promotions: Tracking and analyzing discounts and promotions to identify trends and opportunities for improvement.
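One possible way to start tracking discounts is to reuse the revenue-by-city join described in the interview (booking fact to listing dimension to location dimension) and add a discount measure to the booking fact. In the sketch below, the `discount_amount` column is hypothetical and does not appear in the interview's model, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE listing_dim (listing_id INTEGER PRIMARY KEY, location_id INTEGER);
CREATE TABLE booking_fact (
    booking_id INTEGER PRIMARY KEY,
    listing_id INTEGER,
    cost REAL,
    discount_amount REAL  -- hypothetical column, not in the interview's model
);
INSERT INTO location_dim VALUES (1, 'Lisbon'), (2, 'Austin');
INSERT INTO listing_dim VALUES (10, 1), (11, 2), (12, 1);
INSERT INTO booking_fact VALUES
    (1, 10, 200.0, 20.0),
    (2, 12, 100.0, 0.0),
    (3, 11, 300.0, 60.0);
""")

# Total booking cost and total discount per city.
revenue_by_city = conn.execute("""
    SELECT loc.city,
           SUM(b.cost) AS total_revenue,
           SUM(b.discount_amount) AS total_discount
    FROM booking_fact AS b
    JOIN listing_dim AS l ON l.listing_id = b.listing_id
    JOIN location_dim AS loc ON loc.location_id = l.location_id
    GROUP BY loc.city
    ORDER BY loc.city
""").fetchall()
```

The same grouping can be repeated by region or country, or combined with a date dimension to follow discount trends over time.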

Design data modeling for Airbnb.

Hello everyone, welcome to another mock interview with Exponent. Today we're going to talk a little bit about data modeling with our guest Anushka. Thank you so much for being here, Anushka. Would you take a moment to introduce yourself to our viewers?

Yes, absolutely. Hi, my name is Anushka Tak. I am a data engineer with Amazon, and I've been with Amazon for a little over four years now. I started my data engineering journey with the data lake team, where we built a data lake to store org-level data for analytics by data scientists, other data engineers, and product managers. Then I moved into a data warehousing team, which was more database-centric. Here I got to work with all the business teams and learn more about their core operational business. We design and deliver solutions, building the data models from scratch, and we translate business requirements into data pipelines.

Cool, thanks for joining in again. Let's jump right into our question for today, which is: design data modeling for Airbnb.

Okay, I think we have a lot of data to play with at Airbnb, so let's jump in. Before we do, I definitely want to understand the purpose of this data: is it more transactional or analytical?

That's a good question. This is going to be for analytical purposes, let's say for data warehousing, so we're going to be doing data modeling for analytical purposes.

Okay, that makes sense. In terms of schema, I'm leaning towards a star schema, because a star schema uses denormalized data, which means it adds a lot of redundant columns to some dimension tables, but it makes querying faster, so read operations are much faster. For analytical purposes, I'm assuming that we're going to be reading out of our database quite a lot, so I think a star schema is really well suited for
our usage here however it is not optimal for storage right um on the flip side we have snowflake schema which utilizes normalized data which in turn reduces redund but it also involves more join to produce the same view that we were uh to to produce the same view of the data right so it is less optimal for read operations but in terms of storage it's a better alternative um for the sake of our exercise I I think since we are focusing on analytical um per uh analytical rates and purposes um we can start our data model um in in Star schema right yeah sounds good okay um okay so now that we have established uh we want to perform analytics on Airbnb data as well as we we're choosing a star schema let's talk about the kind of metrics we want to derive from this data I think it's extremely critical that we work backwards from what are we trying to gain out of this data uh so that it gives us an outline or a direction to think in terms of what is the kind of data that we want to capture around it um so you do you have any directions on um metrics that we can SE out of here that's a good question so I think it's good to scope that one as well so I'm primarily looking at um two key metrics here so one is like customer Obsession or engagement um so um so shortly this is to improve the experience so we can so from my Airbnb point of view the reviews could be very critical about the satisfaction of the customers so how can we use the reviews to improve the experience that's on a on a very very high level uh I would U I would let you Del into the details of how you want to do it uh second is uh business profitability business profitability is mostly about pricing uh Revenue optimization so as a business how can we improve our revenues uh and then is there a trend in terms of the reservation that's happening uh that we can uh that we can get from this data model data model as well yeah these are the two two main uh objectives um Fair Point um I think you're right like bang on uh 
customer satisfaction is one of the key um key key components of any business and you know it it is a direct uh relationship to to um increase a business profitability from customer feedback so we must gather data around uh customer reviews uh how the customer experience has been with respect to bookings were there any complaints what the res resolution was like so I definitely want to um capture data around let's say like I said reviews um and or maybe think about a review fact that captures all of this right um and talking about the second thing um we can uh captured some data around bookings or reservations and um and see what the reservation Trend analysis has been like we can um gather insights around uh what what like if there are any factors regarding GE geography of an area if there is any specific time period that um that has more bookings than others I think things like that can help us determine uh what the areas or opportunities of improving businesses so that's again a good one so for the time being let's capture all of this information in a table called booking fact and you also touched upon pricing um or Revenue optimization right so maybe um yep we can also um analyze data around the revenue uh and let's call it Revenue fact to keep it simplistic uh simplistic um so yep basically I want to have some data around Revenue um which is like from a particular booking uh how much um Revenue was extracted uh whether we have higher Revenue in certain areas um whether we're getting more business from certain uh booking sources and stuff like that so I think uh that can also provide us a provide us good insights around data um which which can reveal with what is a good um Revenue optimization area right right um So based on this let's categorize um our three main fact tables to to to to drive the data model right so it's going to be B booking fact Revenue fact and review fact let's dive deeper into each of these and let's talk about uh what kind of metrics we 
want to be present uh in these columns and and how we can um further use that data to derive insights from it right um sounds good okay so let's start with review fact um mhm so I'm thinking uh in review fact we can let me draw a high level UHD here all right uh let's call it review fact there you go uh I definitely want a review ID in there which can serve as a primary key uh for this table and oops we definitely want uh reviews on our listings so I'm thinking that we have a listing ID present in there um which um I I plan to like this can serve as a foreign key to my Dimension table uh which which I will call uh listing let me listing Tim um basically I I want to capture all the background and contextual information of entities in dimension tables and I want to capture all the quantitative uh and numerical data of um of of our facts in the fact tables uh we will join them using uh the keys present in each of these tables so uh this becomes listing key which is the primary key of this table but becomes the foreign key of uh review fact right um foreign key we can also think about who was the user who um provided this review uh can insert okay all right let me insert below uh which now becomes a foreign key to user dim a need for another user listing dim okay uh user dim user ID becomes my um this uh let's let's finish user DM while we are here um we can have um ATT attributes of a user um in this user DM so contact information name email uh attributes of a user basically country and like as we want to um joining date uh and Etc like whatever we want the information on user themm to be here um going back to listing them while we're here let us also think think of um attributes of a listing um so basically I'm thinking the number of rooms um maybe number of rooms we can also think about who the host is so uh again this becomes um this this calls the need of another dimension table where I want to um capture host information um okay uh host Dimension and then host ID 
becomes the primary key of this one um going back to listing them what else can we in so host ID property type whether it's an apartment or um a house um we can also include what city it is um um yeah I think that should be a good place for listing Dimension um now let's go back to completing host Dimension um here I'm trying to capture all the information or attributes of a host basically so again um this is a type of user if we think about it but for the sake of Simplicity uh let's keep a host Dimension separate to the guest Dimension and uh as a user so let's capture that information here we have host uh ID the name and email um whether or not they are a superhost what um when did they join what the country is and maybe more attributes around phone number uh what their response rate is kind of like gives an idea of um how responsive they are whether or not they're a super host they may that may even be linked so something to uh yeah um yeah I think to start off with uh that is good now we go back to review fact um so I definitely have a a a listing ID that this review is associated with a user ID who has provided this review uh maybe a review date um that makes me think I do want a date Dimension to be a part of my um data model as well um yeah definitely and un avoidable isn't it dat yes uh and um yeah so really useful um in a um in data models um when especially when um in in retail environments when the calendar dates are not the same as uh retail dates uh and financial year it's different so definitely has its own perks um so and more along the line maybe we can think about quarter if you want to do some analysis on a quarter basis uh week of year what whether or not is it a holiday we uh we definitely see a different um increased business in when there is a holiday so I think that can be a useful attribute in the future um so we have date your week month yeah I think that's a good place to start off with yeah okay um going back to review fact now that we 
have review date which becomes a yep uh we which which can be an FK to dat okay um the most important thing what the rating is uh so I I think I'm envisioning rating to be on a scale of 1 125 um and that is basically a non key column here um we can also think about capturing uh reviews in text so let's call it a text review um and we can also capture uh response time uh basically the time taken by the host to respond to this review so I think that uh brings us to a good place to capture information around reviews here let's see just quick quick question on the grain that you have for the review fact so so it's like um so would you would you prefer review date or is it like a are you thinking in terms of a time stamp for this one for a review um or just like a um no so for the sake of Simplicity I call it date but it's always beneficial to have time stamps in there uh and we can always you know think in terms of slow like if we want historical data to be present uh it is it is best that we act uh that we capture data at a time stamp level yeah yeah makes sense because um if if a same user wants to put two different reviews on a date uh so makees sense to know when when that particular review was put but makes sense yeah pleas go ah yeah um okay so I was thinking I could also connect uh each of these to the review fact to highlight the relationships the fact and dimension tables have with each other here uh listing listing is connected to host right so yes so it's going to be like this here and sense um Yep this one yep so yeah uh because the shape of this data model resembles the shape of a star is where the name is derived from but I think this is a pretty good outline of how we can uh derive insights from reviews um and and um basically look at um analyze the distribution of ratings for across listings or host um we can also identify the top performing hosts based on Revenue um occupancy rates but for that we probably need uh booking fact uh or Revenue fact where 
I want to capture the payment related data one quick question on the response time so uh so you mentioned like the review um so should we have any controls in place to for the users um to leave a review in terms of one if they have stayed uh should be should they be able to review so do we have to accommodate for that what what's your thoughts on that yeah that's a good point um yeah that's a good point I think we can include a booking ID in here uh which will ensure that only if a user has stayed with a specific host uh the user is able to leave a review um so I think that referential um constraint will will enable the control on that and this I think becomes the FK to booking fact so yeah that's a good sense speaking of uh we can dive deeper into booking fact and and let's see how that um looks okay um let's draw booking fact in here okay um so yes I start off with the primary key of this table and I'll call it booking ID a booking has to be associated with a listing um so this uh again we will uh link it to the same listing Dimension um we also have um I also Envision user information to be associated with the booking fact who the user was associated with a particular booking ID so I want I would want um I would want to use ID present in this fact table um as well as a host ID which we can derive from host dimension um a host can have multiple listings in there so having all of that in the same fact table provides us the clarity of like which listing are we talking about uh and which host um is this we can capture information on what the booking date is what the checking date is what the checkout date is information around that check out date um what the number of nights uh are um what the cost uh of the booking is um and henceforth we can use this to to look into our Revenue um booking Source um so it's it's it's uh it's invaluable to analyze uh what source is deriving a a significant chunk of of our bookings or reservations so I think that can really help and 
number of guests um so y I think that Pro that gives us a good picture to capture information on them let me see if I can bring the dimensions here no maybe not um but basically the dimensions linked to booking fact are clearly listing them host di user di um and date di of course because we're going to be using a key here so let us see hopefully I can draw and drop here oh awesome so J um so the status of the booking is something that you would want to capture as well is like could could be cancelled or just like yep that is a good point um yeah you're right um we can um we can um have another attribute here let me see uh that can capture the booking status it can be either pending confirmed or cancelled uh and yeah uh we can look into cancellations at a later point to analiz whether it is associated with a specific host or what the reasons were whereever applicable so yeah that's a good point let me um include that as well we can call it booking status okay sounds good and also uh just to capture below um canellation reason um wherever applicable so I expect some nulls here uh which is not really optimal uh if because we don't expect a lot of cancellations to be here but um just just to make quering faster let us keep cancellation reason as part of the same booking fact for the time being good okay um and then listing them so now I think we have almost everything we need here we can proceed to join sorry um all right in there okay um let us highlight the user ID host ID dats listing ID host ID and then all these date columns we can uh link to the dates here um yep all right okay um just of you so holistically I'm looking at booking fact um related to date Dimension user Dimension listing Dimension host Dimensions um and uh I definitely care about what uh what the cost associated with a particular booking has been whether what the status of it is how many guests there are what the uh um check and checkout dates are um so yeah I think that that provides me a good 
information on uh that puts me at a good place to analyze uh Revenue uh or analyze the booking trends like we just yes good and then to cover the third Point uh which is pricing so let's also talk about Revenue okay so Revenue fact um I start off with a revenue ID which serves as the primary key of this table um a revenue is has is associated with the listing ID so I want to keep it there as is um I don't think I I I don't think I need a user ID but I definitely uh want booking ID to be present in there so let me rename it to booking ID that becomes a forign key to my booking fact table um we also would like uh the most important thing which is the revenue amount generated uh from a particular uh list booking um and then what the payment method was we can think about currency since it's a global platform um and maybe a date of payment or yeah payment date I think that makes more sense um okay yeah so I think um that brings us to a good place um of when we want to analyze the revenue uh coming out of an um a particular booking right um this one I expected I Envision it to be um to be related to listing Dimension and Bing fact so you know what I think we should move it to the first page second page and uh let us bring it here all right um maybe I can move it to this part shuffling here but all right um so booking ID I will link it here my listing ID gets linked from this point and my revenue is a primary key so that works uh we can also link payment date to date Dimension and uh yep here we see two fact tables are related to each other yeah yeah that's um yeah okay um so yeah I kind of like envisioned my data model to be looking like this we have some common Dimension tables and we have two fact tables that will help us uh derive further insights yeah quick question so do we so I just like maybe it's very trival so for the location uh do you are you thinking about say if you want to do it like city-wise Revenue if you have like multiple things that could add more 
value yes absolutely that's a good point uh we can have another dimension table over here uh to capture all the demographics so we can uh have a location Dimension with with attributes like city state country uh anything that yep uh attributes we we care about so yep that is a good point okay city state and then country region more of on the same lines okay um so that's a good point uh if if I were to derive um insights on um let's say Revenue by location like like you said right um I can um let's let me write it down over here oving it okay um I can definitely use a booking fact um so since I'm looking into Revenue by let's say cities right um I would want booking fact because that one has um the cost Associated uh with with a particular booking right um I can join it um with the listing Dimension and from the listing um I can so I have city which is a location attributes I can even uh I can even further um maybe uh divided into location ID joining um I I I will join it with with the location Dimension uh using the location ID and fetch the city from uh listing Dimension so we can do something like this on on let's say listing on listing let me have have an alias in there listing and let's call it location uh dot location ID is equal to listing dot location ID and I want to join the booking fact and listing on listing ID uh so I have listing ID present in both of them mhm look something like this listing ID so now um I have a cost I have the cost that I can derive from booking fact I have the city that I can get from location dim so I will group it by City comma take a sum of all the cost that I get and that can serve as a as my total revenue that I'm grouping by across City Y across cities right so I think this yeah so this kind of gives me an idea of the revenue by cities we can uh extend the same kind of analysis for across regions countries um and and other attributes yeah and the the nice thing about this one is since we have the de Dimensions across all 
facts um that helps us to have a trend as well so the city um like cost Trend if I want to look at uh which is good um and also I think the host and U listing uh you have a host ID as well isn't it the listing has a host as well so that means like we can even do some queries like a top performing hosts so which is good um which host has generating a more Revenue right we want to we want to keep them act so so that's something that we can do as well yeah that's good yep yeah that's a good point um in fact it's an interesting insight to derive out of this data model if you want to take a look at which host has had the most cancellations we can definitely again leverage booking fact um and I'm sorry uh we can leverage booking fact um filter um filter on the booking status uh and I'm assuming it to be cancelled and then again hosted by hosting ID so I'm thinking maybe something like um here let's see um so again host by cancellations we can do something like um take booking fact oops uh take booking okay sorry about that take book in fact uh join it with the host Dimension so that we can gather information on um on the host IDs basically so host dimension. 
host ID, join it with booking fact on host ID. We only care about cancellations, so I'm going to filter on the booking status being cancelled, and group it by host ID; let it be host dim's host ID. In order to count the number of cancellations, since we are grouping by host ID, we select host ID, comma, a count of the booking IDs that are associated with that host. I think that gives us a good number of cancellations by host.

Yes, sounds good. Nicely done. Could you also talk a bit about the tradeoffs or challenges, or things you would ideally want to take care of, if you were to design this data model for the scale of Airbnb?

Yeah, that's a good point. We've kept the data model quite simplistic here and assumed that it is suitable for the scale of data that we have. But as the scale of the data increases, booking fact and revenue fact will grow enormously. That means our database will slow down whenever a read operation is performed, so we need to be really careful about how we're indexing these tables in our database and how we're partitioning them, so that we have some guardrails around query optimization and around read operations even at that scale.

Okay. Would you have any recommendation on the partitioning, what we can do for these types of facts?

Yes. I think it depends on the purpose again. If we care about the latest data, I think date dimensions are good to partition the data by. However, if we care about analyzing the data for specific regions, I think region can serve as a good partition key. Primary keys are also a good candidate to partition upon, but at the same time we should also consider the
tradeoff of not making a specific partition a hot partition. We don't want to do a lot of writes on a specific partition; making it a hot partition makes it prone to failure. So while we're partitioning, we definitely want to keep these things in mind: what makes our database quicker for reading, as well as not prone to failure during write operations. We also want to make sure that we're not inducing skew within our data. I think these are some good points to consider when we are deciding what our partition key is.

Yes, definitely. I think we touched on quite a few points there, which is good. Depending on the purpose, partitioning is a good strategy, but not just a blanket date partition. Most of the facts we know are partitioned primarily by date, and there's a reason for that, but there could be some specific facts which don't really have to be partitioned by date; like you mentioned, it could be location-specific. Read-heavy partitions and the hot partition that you mentioned are also good things to take care of. These are some of the things you need to take care of before designing; otherwise, once it's in production, you have to take care of all these different things.

Yeah, it could induce some sleepless nights.

Okay. Anything else that you would want to take care of in this design?

Yes. Speaking of review fact, for the purpose of this exercise we have assumed that the text data we have is confined and structured, but in the real world I think we may have to deal with unstructured data. So we might have to refine this further and process it in a different schema altogether, or maybe apply some text mining techniques there. I think that is another
opportunity to um that that comes along with uh text Data yep yep exactly I think um now yeah the definitely the review fact is going to be um review text is going to be definitely unstructured probably we can look at um primarily Json type of data there so it's good to so it's interesting that you touched about the text mining here uh so even though text mining is something that we can like say there is there's a lot of scope for the data science related things there but still from a from a data warehousing point of view if we could extract information like uh whether it's a positive review or do some sentiment analysis positive review negative review and then like along with your review fact which you have designed if we can kind of like give an indication that this is this seems to be a positive review there is a negative review this kind this gives another dimension for our uh queries uh analytical queries okay how many of those uh reviews that came for that particular listing uh were negative MH all right so which is another dimension right so yeah that's uh that's a good point that you touched on that X mining thing I think this is a great place to pause um so let's assume you are an interviewer what do you think you went well and then like is there something else that you would have um ideally added to this um design uh or you would like to um I think I I really liked the activity of thinking out loud and you know brainstorming with you uh it gave me a good direction uh and a good structure to think about um to think about things I I missed right so I I think that was uh that was really useful um also working backwards from the business perspective is uh for example like uh thinking about the kind of metrics uh we care about the kind of insights we want to drive out of this data helped me structure the data model so I think that was also a good that that went really well um I given more time I think we can also uh brainstorm and dive deeper into areas like 
competitive pricing uh how Airbnb is um performing uh at um at par or whether whether or whether or not Airbnb is overpricing or underpricing with respect to other competitors in the market um that can also potentially be a good area uh to to optimize uh business operations I think that one is quite interesting Airbnb has Dynamic pricing as well um in in the real world scenario so they they do have some discounting there so I think um Gathering that data and reviewing the data can also be a good place for uh us to to to get gather some interesting insights there yeah definitely and um um and you you did really well I mean we had a couple of objectives to start with I think we we kind of covered end to end for that um along with the queries and then we talked about few extensions uh and then there were few um few extensions and possibilities that this model gave thanks for that it was really good uh and uh yeah definitely uh competitive pricing is definitely there right so it's just like that is another interesting point um and we we can uh extend this data model um to handle that in terms of the discounts and stuff if you enjoy this interview visit Tri exponent.com to view our data engineering interview course where you'll have access to Library of interview questions expert coaching and peer-to-peer mock interviews where you can practice yourself good luck on your next interview and thanks for watchingdesign um data modeling for arbnb hello everyone welcome to another mock interview with exponent today we're going to talk a little bit about data botling uh with our guest Anushka thank you so much for being here Anushka would you take a moment to introduce yourself to our viewers yes absolutely um so hi and my name is Anushka Tak I am a data engineer with Amazon uh I've been with Amazon for a little over four years now um I started my data engineering Journey with the data Lake team um and this is where we uh built a data Lake to store org level data for analytics 
by data scientists, other data engineers, as well as product managers. Then I moved into a data warehousing team, which was more database-centric. Here I got to work with all the business teams and learn more about their core operational business. Here we design and deliver solutions, building the data models from scratch, and we translate business requirements into data pipelines.

Cool, thanks for joining. Again, let's jump right into our question for today, which is: design data modeling for Airbnb.

Okay, I think we have a lot of data to play with at Airbnb, so all right, let's jump in. Before we jump in, I definitely want to understand the purpose of this data. Is it more transactional or analytical?

Okay, that's a good question. This is going to be for analytical purposes, let's say for data warehousing. So we're going to be doing data modeling for analytical purposes.

Okay, that makes sense. In terms of schema, I'm leaning towards a star schema, because star schema uses denormalized data, which means it adds a lot of redundant columns to some dimension tables, but it makes querying faster, so read operations are much, much faster. For analytical purposes, I'm assuming that we're going to be reading out of our database quite a lot, so I think star schema is really well suited for our usage here. However, it is not optimal for storage. On the flip side, we have snowflake schema, which utilizes normalized data, which in turn reduces redundancy, but it also involves more joins to produce the same view of the data. So it is less optimal for read operations, but in terms of storage it's a better alternative. For the sake of our exercise, since we are focusing on analytical reads and purposes, we can start our data model in star schema.

Right, yeah, sounds good.

Okay. So now that we have
established that we want to perform analytics on Airbnb data and that we're choosing a star schema, let's talk about the kind of metrics we want to derive from this data. I think it's extremely critical that we work backwards from what we are trying to gain out of this data, so that it gives us an outline or a direction for the kind of data that we want to capture around it. So do you have any directions on metrics that we can see out of here?

That's a good question, and I think it's good to scope that one as well. I'm primarily looking at two key metrics here. One is customer obsession or engagement. In short, this is to improve the experience. From an Airbnb point of view, the reviews could be very critical to the satisfaction of the customers, so how can we use the reviews to improve the experience? That's at a very high level; I would let you delve into the details of how you want to do it. Second is business profitability. Business profitability is mostly about pricing and revenue optimization: as a business, how can we improve our revenues? And then, is there a trend in terms of the reservations that are happening that we can get from this data model as well? These are the two main objectives.

Fair point. I think you're right, bang on. Customer satisfaction is one of the key components of any business, and there is a direct relationship between customer feedback and increased business profitability. So we must gather data around customer reviews: how the customer experience has been with respect to bookings, whether there were any complaints, what the resolution was like. So I definitely want to capture data around reviews, and maybe think about a review fact that captures all of this. And talking about the second thing, we can capture some data around
bookings or reservations and see what the reservation trend has been like. We can gather insights around whether there are any factors regarding the geography of an area, or whether there is any specific time period that has more bookings than others. I think things like that can help us determine the areas or opportunities for improving the business, so that's again a good one. For the time being, let's capture all of this information in a table called booking fact. You also touched upon pricing or revenue optimization, so we can also analyze data around the revenue, and let's call it revenue fact to keep it simple. Basically, I want to have some data around revenue: from a particular booking, how much revenue was extracted, whether we have higher revenue in certain areas, whether we're getting more business from certain booking sources, and so on. I think that can also provide us good insights, which can reveal what is a good revenue optimization area.

Right, right.

So based on this, let's categorize our three main fact tables to drive the data model. It's going to be booking fact, revenue fact, and review fact. Let's dive deeper into each of these and talk about what kind of metrics we want to be present in these columns and how we can further use that data to derive insights from it.

Sounds good.

Okay, so let's start with review fact. Let me draw a high-level diagram here. All right, let's call it review fact. There you go. I definitely want a review ID in there, which can serve as the primary key for this table. We definitely want reviews on our listings, so I'm thinking that we have a listing ID present in there, which can serve as a foreign key to my dimension table,
which I will call listing dim. Basically, I want to capture all the background and contextual information of entities in dimension tables, and I want to capture all the quantitative and numerical data of our facts in the fact tables. We will join them using the keys present in each of these tables. So this becomes listing key, which is the primary key of this table but becomes the foreign key of review fact. We can also think about who the user was who provided this review. Let me insert it below. That now becomes a foreign key to user dim, so there is a need for another dimension, user dim, next to listing dim. User ID becomes my key here. Let's finish user dim while we are here. We can have attributes of a user in this user dim: contact information, name, email, country, joining date, et cetera, whatever information on the user we want to be here.

Going back to listing dim, while we're here, let us also think of attributes of a listing. I'm thinking the number of rooms. We can also think about who the host is, so again this calls for another dimension table where I want to capture host information. Okay, host dimension, and then host ID becomes the primary key of this one. Going back to listing dim, what else can we include? Host ID; property type, whether it's an apartment or a house; we can also include what city it is. Yeah, I think that should be a good place for listing dimension.

Now let's go back to completing host dimension. Here I'm trying to capture all the information or attributes of a host. Again, this is a type of user if we think about it, but for the sake of simplicity, let's keep the host dimension separate from the guest dimension, that is, the user. So let's capture that
information here. We have host ID, the name and email, whether or not they are a superhost, when they joined, what the country is, and maybe more attributes like phone number and what their response rate is, which gives an idea of how responsive they are. Whether or not they're a superhost may even be linked to that. Yeah, I think to start off with, that is good.

Now we go back to review fact. I definitely have a listing ID that this review is associated with and a user ID for who has provided this review, and maybe a review date. That makes me think I do want a date dimension to be a part of my data model as well.

Yeah, definitely, and unavoidable, isn't it, the date dimension?

Yes. It's really useful in data models, especially in retail environments where the calendar dates are not the same as retail dates and the financial year is different, so it definitely has its own perks. Along the same lines, we can think about quarter, if we want to do some analysis on a quarterly basis; week of year; whether or not it is a holiday. We definitely see increased business when there is a holiday, so I think that can be a useful attribute in the future. So we have date, year, week, month. Yeah, I think that's a good place to start off with.

Okay, going back to review fact, now we have review date, which can be an FK to date dim. The most important thing is what the rating is. I'm envisioning rating to be on a scale of 1 to 5, and that is basically a non-key column here. We can also think about capturing reviews in text, so let's call it text review. And we can also capture response time, basically the time taken by the host to respond to this review. I think that brings us to a good place to capture information around reviews here.

Just a quick question on the grain that you
have for the review fact. Would you prefer review date, or are you thinking in terms of a timestamp for a review?

For the sake of simplicity I call it date, but it's always beneficial to have timestamps in there. If we want historical data to be present, it is best that we capture data at a timestamp level.

Yeah, makes sense, because if the same user wants to put two different reviews on a date, it makes sense to know when that particular review was put. Please go ahead.

Okay, so I was thinking I could also connect each of these to the review fact to highlight the relationships the fact and dimension tables have with each other. Listing is connected to host, right? So it's going to be like this. Yep, this one. Because the shape of this data model resembles the shape of a star, that is where the name is derived from. I think this is a pretty good outline of how we can derive insights from reviews and analyze the distribution of ratings across listings or hosts. We can also identify the top-performing hosts based on revenue or occupancy rates, but for that we probably need booking fact or revenue fact, where I want to capture the payment-related data.

One quick question, on the response time and the review you mentioned. Should we have any controls in place for users leaving a review? For one, only if they have stayed should they be able to review. Do we have to accommodate for that? What are your thoughts?

Yeah, that's a good point. I think we can include a booking ID in here, which will ensure that only if a user has stayed with a specific host is the user able to leave a review. So I think that
referential constraint will enable the control on that, and this becomes the FK to booking fact. So yeah, that's a good point. Speaking of which, we can dive deeper into booking fact and see how that looks.

Okay, let's draw booking fact in here. I start off with the primary key of this table, and I'll call it booking ID. A booking has to be associated with a listing, so again we will link it to the same listing dimension. I also envision user information to be associated with the booking fact, that is, who the user was for a particular booking ID, so I would want user ID present in this fact table, as well as a host ID, which we can derive from host dimension. A host can have multiple listings, so having all of that in the same fact table gives us clarity on which listing we are talking about and which host this is. We can capture information on what the booking date is, what the check-in date is, what the checkout date is, what the number of nights is, and what the cost of the booking is, and then we can use this to look into our revenue. Booking source: it's invaluable to analyze what source is driving a significant chunk of our bookings or reservations, so I think that can really help. And number of guests. I think that gives us a good picture. Let me see if I can bring the dimensions here. No, maybe not. But basically the dimensions linked to booking fact are clearly listing dim, host dim, user dim, and date dim, of course, because we're going to be using a key here. Let us see, hopefully I can drag and drop here. Oh, awesome.

So the status of the booking is something that you would want to capture as well? It could be cancelled or...

Yep, that is a good point. Yeah, you're right. We
can have another attribute here that can capture the booking status. It can be either pending, confirmed, or cancelled. And we can look into cancellations at a later point to analyze whether they are associated with a specific host or what the reasons were, wherever applicable. So yeah, that's a good point. Let me include that as well. We can call it booking status.

Okay, sounds good.

And also, just to capture below, cancellation reason, wherever applicable. I expect some nulls here, which is not really optimal, because we don't expect a lot of cancellations, but just to make querying faster, let us keep cancellation reason as part of the same booking fact for the time being.

Good.

Okay, and then listing dim. Now I think we have almost everything we need here, so we can proceed to join. All right. Let us highlight the user ID, host ID, dates, and listing ID, and then all these date columns we can link to the date dimension here. Yep, all right. So holistically, I'm looking at booking fact related to date dimension, user dimension, listing dimension, and host dimension. I definitely care about what the cost associated with a particular booking has been, what the status of it is, how many guests there are, and what the check-in and checkout dates are. I think that puts me at a good place to analyze revenue or analyze the booking trends.

Yes, good. And then to cover the third point, which is pricing, let's also talk about revenue.

Okay, so revenue fact. I start off with a revenue ID, which serves as the primary key of this table. Revenue is associated with the listing ID, so I want to keep it there as is. I don't think I need a user ID, but I definitely want booking ID to be present in there, so let me rename it to booking ID. That becomes a foreign key to my booking
fact table. We also would like the most important thing, which is the revenue amount generated from a particular booking, and then what the payment method was. We can think about currency, since it's a global platform, and maybe a date of payment, or yeah, payment date, I think that makes more sense. So I think that brings us to a good place for when we want to analyze the revenue coming out of a particular booking. This one I envision to be related to listing dimension and booking fact. So you know what, I think we should move it to the second page. Let us bring it here. All right, maybe I can move it to this part. Shuffling here, but all right. So booking ID I will link here, my listing ID gets linked from this point, and my revenue ID is a primary key, so that works. We can also link payment date to date dimension. And here we see two fact tables are related to each other.

Yeah.

Okay, so I envisioned my data model to be looking like this. We have some common dimension tables, and we have two fact tables that will help us derive further insights.

Quick question, maybe it's very trivial. For the location, say you want to do city-wise revenue: if you have multiple location attributes, that could add more value.

Yes, absolutely, that's a good point. We can have another dimension table over here to capture all the demographics. So we can have a location dimension with attributes like city, state, country, any attributes we care about. Okay: city, state, country, region, more along the same lines. So that's a good point. If I were to derive insights on, let's say, revenue by location, like you said, let me write it down over here. I can definitely use booking fact.
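A minimal sketch of the city-wise revenue idea raised here, written in Python with the standard-library sqlite3 module. The table names, column names, and sample rows are assumptions for illustration; the interview only names them verbally and draws them on a whiteboard. Only the columns needed for this one question are included.

```python
import sqlite3

# Hypothetical spellings of the tables described in the interview.
SCHEMA = """
CREATE TABLE location_dim (
    location_id INTEGER PRIMARY KEY,
    city TEXT, state TEXT, country TEXT, region TEXT
);
CREATE TABLE host_dim (
    host_id INTEGER PRIMARY KEY,
    name TEXT,
    is_superhost INTEGER
);
CREATE TABLE listing_dim (
    listing_id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES host_dim(host_id),
    location_id INTEGER REFERENCES location_dim(location_id),
    property_type TEXT,
    num_rooms INTEGER
);
CREATE TABLE booking_fact (
    booking_id INTEGER PRIMARY KEY,
    listing_id INTEGER REFERENCES listing_dim(listing_id),
    host_id INTEGER REFERENCES host_dim(host_id),
    user_id INTEGER,
    booking_status TEXT,
    cost REAL
);
"""

# Fact table joined to listing, listing joined to location, grouped by city.
# Like the spoken query, this sums cost for every booking regardless of status.
REVENUE_BY_CITY = """
SELECT loc.city, SUM(b.cost) AS total_revenue
FROM booking_fact AS b
JOIN listing_dim AS l ON b.listing_id = l.listing_id
JOIN location_dim AS loc ON loc.location_id = l.location_id
GROUP BY loc.city
ORDER BY loc.city
"""


def build_revenue_demo():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)",
        [(1, "Paris", "Ile-de-France", "France", "Europe"),
         (2, "Lisbon", "Lisboa", "Portugal", "Europe")],
    )
    conn.executemany(
        "INSERT INTO host_dim VALUES (?, ?, ?)",
        [(1, "Host A", 1), (2, "Host B", 0)],
    )
    conn.executemany(
        "INSERT INTO listing_dim VALUES (?, ?, ?, ?, ?)",
        [(10, 1, 1, "apartment", 2),
         (11, 2, 2, "house", 3),
         (12, 1, 2, "apartment", 1)],
    )
    conn.executemany(
        "INSERT INTO booking_fact VALUES (?, ?, ?, ?, ?, ?)",
        [(100, 10, 1, 501, "confirmed", 150.0),
         (101, 10, 1, 502, "confirmed", 100.0),
         (102, 11, 2, 503, "confirmed", 200.0),
         (103, 12, 1, 501, "cancelled", 100.0)],
    )
    return conn


def revenue_by_city(conn):
    """Return (city, total_revenue) rows, ordered by city name."""
    return conn.execute(REVENUE_BY_CITY).fetchall()
```

Calling `revenue_by_city(build_revenue_demo())` returns one row per city. Swapping `loc.city` for `loc.country` or `loc.region` gives the same analysis at a coarser level.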
So since I'm looking into revenue by, let's say, cities, I would want booking fact, because that one has the cost associated with a particular booking. I can join it with the listing dimension. From the listing I have city, which is a location attribute; I can further replace it with a location ID. I will join it with the location dimension using the location ID and fetch the city from there. So we can do something like this: on listing, let me have an alias in there, listing, and let's call the other one location, so location.location ID is equal to listing.location ID. And I want to join booking fact and listing on listing ID, since I have listing ID present in both of them. It looks something like this. So now I have the cost that I can derive from booking fact, and I have the city that I can get from location dim. I will group by city and take a sum of all the cost, and that can serve as my total revenue across cities. So this gives me an idea of the revenue by city. We can extend the same kind of analysis across regions, countries, and other attributes.

Yeah, and the nice thing about this one is that since we have the date dimension across all facts, that helps us to have a trend as well, like the cost trend by city if I want to look at it, which is good. And also the host and listing: you have a host ID as well, don't you? The listing has a host as well. So that means we can even do queries like top-performing hosts, which is good. Which host has been generating more revenue? We want to keep them active, so that's something that we can do as well.

Yeah, that's a good point. In fact, it's an interesting insight to derive out of this data model. If you want to take
a look at which host has had the most cancellations, we can definitely again leverage booking fact, filter on the booking status, which I'm assuming to be cancelled, and then group by host ID. So I'm thinking maybe something like this. Hosts by cancellations: take booking fact, sorry about that, take booking fact and join it with the host dimension so that we can gather information on the host IDs. So host dimension.host ID, join it with booking fact on host ID. We only care about cancellations, so I'm going to write the booking status to be cancelled, and group it by host ID, let it be from host dim. And in order to count the number of cancellations, since we're grouping by host ID, we select host ID, comma, a count of the number of booking IDs that are associated with it. I think that gives us the number of cancellations by host.

Yep, sounds good. Nicely done. Could you also talk a bit about the trade-offs or challenges, or things you would ideally want to take care of, if you were to design this data model for the scale of Airbnb?

Yeah, that's a good point. We've kept the data model quite simple here and assumed that it is suitable for the scale of data that we have. But as the scale of the data increases, booking fact and revenue fact will grow enormously. That means our database will slow down whenever a read operation is performed. So we need to be really careful about how we're indexing these tables in our database and how we're partitioning them, so that we have some guardrails around query optimization and around the read operations even with the
scale of the data.

Okay. Would you have any recommendation on the partitioning, what we can do for these types of facts?

Yes. I think it depends on the purpose, again. If we care about the latest data, I think date dimensions are good to partition the data by. However, if we care about analyzing the data for specific regions, I think region can serve as a good partition key. Primary keys are also a good candidate to partition on. But at the same time we should also consider the trade-off of not creating a hot partition. We don't want to do a lot of writes on a specific partition, because making it a hot partition makes it prone to failure. So while we're partitioning, we definitely want to keep these things in mind: what makes our database quicker for reading as well as not prone to failure during write operations. We also want to make sure that we're not inducing skew within our data. I think these are some good points to consider when we are deciding what our partition key is.

Yes, definitely. I think we touched on quite a few points there, which is good. So depending on the purpose, partitioning is a good strategy, and not just blanket date. Most of the facts we know are partitioned by dates primarily, and there's a reason for it, but there could be some specific facts which don't really have to be partitioned by date; like you mentioned, it could be location-specific. And also read-heavy partitions; the hot partition that you mentioned is also quite a good thing to take care of. These are some of the things that you need to take care of before designing, which is good. Otherwise, once it's in production, then you take care of all
these different things.

Yep, it could induce some sleepless nights.

Yeah. Okay, anything else that you would want to take care of in this design?

Yes. Speaking of review fact, for the purpose of this exercise we have assumed that the text data that we have is confined and structured, but in the real world I think we may have to deal with unstructured data. So we might have to further refine this and process it in a different schema altogether, or maybe apply some text mining techniques there. I think that is another opportunity that comes along with text data.

Yep, exactly. The review text is definitely going to be unstructured; probably we can look at primarily JSON-type data there. It's interesting that you touched on text mining here. Even though text mining leaves a lot of scope for data-science-related work, from a data warehousing point of view, if we could extract information like whether it's a positive or a negative review, or do some sentiment analysis, then along with the review fact you have designed we can give an indication that this seems to be a positive review and that one a negative review. This gives another dimension for our analytical queries: how many of the reviews that came in for a particular listing were negative? So that is another dimension, and that's a good point that you touched on with the text mining. I think this is a great place to pause. So let's assume you are the interviewer: what do you think went well, and is there something else that you would have ideally added to this design?

I think I really liked the activity of thinking
out loud and brainstorming with you. It gave me a good direction and a good structure to think about things I missed, so I think that was really useful. Also, working backwards from the business perspective, for example thinking about the kind of metrics we care about and the kind of insights we want to derive out of this data, helped me structure the data model, so I think that went really well. Given more time, I think we can also brainstorm and dive deeper into areas like competitive pricing: how Airbnb is performing, whether it is at par, or whether Airbnb is overpricing or underpricing with respect to other competitors in the market. That can also potentially be a good area to optimize business operations. I think that one is quite interesting. Airbnb has dynamic pricing as well in the real-world scenario, so they do have some discounting there. So I think gathering that data and reviewing it can also be a good place for us to gather some interesting insights.

Yeah, definitely. And you did really well. We had a couple of objectives to start with, and I think we covered them end to end, along with the queries, and then we talked about a few extensions and possibilities that this model gave. Thanks for that, it was really good. And yeah, competitive pricing is definitely there, right? That is another interesting point, and we can extend this data model to handle that in terms of the discounts and so on.

If you enjoyed this interview, visit tryexponent.com to view our data engineering interview course, where you'll have access to a library of interview questions, expert coaching, and peer-to-peer mock interviews where you can practice yourself. Good luck on your next interview, and thanks for
watching.
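For reference, the cancellations-by-host query talked through in the interview could be sketched as follows, again in Python with sqlite3. The table and column names, the lowercase status values, and the sample rows are assumptions for illustration, not the speakers' actual definitions.

```python
import sqlite3

# Join booking fact to host dim, keep only cancelled bookings,
# group by host, and count booking IDs per host.
CANCELLATIONS_BY_HOST = """
SELECT h.host_id, COUNT(b.booking_id) AS cancellations
FROM booking_fact AS b
JOIN host_dim AS h ON h.host_id = b.host_id
WHERE b.booking_status = 'cancelled'
GROUP BY h.host_id
ORDER BY cancellations DESC, h.host_id
"""


def build_cancellation_demo():
    """Create an in-memory database with hypothetical hosts and bookings."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE host_dim (
        host_id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE booking_fact (
        booking_id INTEGER PRIMARY KEY,
        host_id INTEGER REFERENCES host_dim(host_id),
        -- The interview lists three statuses: pending, confirmed, cancelled.
        booking_status TEXT CHECK (
            booking_status IN ('pending', 'confirmed', 'cancelled')
        ),
        -- Nullable: only populated for cancelled bookings, where applicable.
        cancellation_reason TEXT
    );
    """)
    conn.executemany(
        "INSERT INTO host_dim VALUES (?, ?)",
        [(1, "Host A"), (2, "Host B"), (3, "Host C")],
    )
    conn.executemany(
        "INSERT INTO booking_fact VALUES (?, ?, ?, ?)",
        [(100, 1, "cancelled", "plans changed"),
         (101, 1, "confirmed", None),
         (102, 2, "cancelled", "host unavailable"),
         (103, 2, "cancelled", None),
         (104, 3, "pending", None)],
    )
    return conn


def cancellations_by_host(conn):
    """Return (host_id, cancellations) rows, most cancellations first."""
    return conn.execute(CANCELLATIONS_BY_HOST).fetchall()
```

Hosts with no cancelled bookings do not appear in the result, because the filter runs before the grouping. A left join from the host dimension with a conditional count would be one way to include them with a count of zero.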
is not optimal for storage right um on the flip side we have snowflake schema which utilizes normalized data which in turn reduces redund but it also involves more join to produce the same view that we were uh to to produce the same view of the data right so it is less optimal for read operations but in terms of storage it's a better alternative um for the sake of our exercise I I think since we are focusing on analytical um per uh analytical rates and purposes um we can start our data model um in in Star schema right yeah sounds good okay um okay so now that we have established uh we want to perform analytics on Airbnb data as well as we we're choosing a star schema let's talk about the kind of metrics we want to derive from this data I think it's extremely critical that we work backwards from what are we trying to gain out of this data uh so that it gives us an outline or a direction to think in terms of what is the kind of data that we want to capture around it um so you do you have any directions on um metrics that we can SE out of here that's a good question so I think it's good to scope that one as well so I'm primarily looking at um two key metrics here so one is like customer Obsession or engagement um so um so shortly this is to improve the experience so we can so from my Airbnb point of view the reviews could be very critical about the satisfaction of the customers so how can we use the reviews to improve the experience that's on a on a very very high level uh I would U I would let you Del into the details of how you want to do it uh second is uh business profitability business profitability is mostly about pricing uh Revenue optimization so as a business how can we improve our revenues uh and then is there a trend in terms of the reservation that's happening uh that we can uh that we can get from this data model data model as well yeah these are the two two main uh objectives um Fair Point um I think you're right like bang on uh customer satisfaction is 
one of the key um key key components of any business and you know it it is a direct uh relationship to to um increase a business profitability from customer feedback so we must gather data around uh customer reviews uh how the customer experience has been with respect to bookings were there any complaints what the res resolution was like so I definitely want to um capture data around let's say like I said reviews um and or maybe think about a review fact that captures all of this right um and talking about the second thing um we can uh captured some data around bookings or reservations and um and see what the reservation Trend analysis has been like we can um gather insights around uh what what like if there are any factors regarding GE geography of an area if there is any specific time period that um that has more bookings than others I think things like that can help us determine uh what the areas or opportunities of improving businesses so that's again a good one so for the time being let's capture all of this information in a table called booking fact and you also touched upon pricing um or Revenue optimization right so maybe um yep we can also um analyze data around the revenue uh and let's call it Revenue fact to keep it simplistic uh simplistic um so yep basically I want to have some data around Revenue um which is like from a particular booking uh how much um Revenue was extracted uh whether we have higher Revenue in certain areas um whether we're getting more business from certain uh booking sources and stuff like that so I think uh that can also provide us a provide us good insights around data um which which can reveal with what is a good um Revenue optimization area right right um So based on this let's categorize um our three main fact tables to to to to drive the data model right so it's going to be B booking fact Revenue fact and review fact let's dive deeper into each of these and let's talk about uh what kind of metrics we want to be present uh in 
these columns and how we can further use that data to derive insights from it, right?

Sounds good.

Okay, so let's start with review fact. Let me draw a high-level diagram here. All right, let's call it review fact. There you go. I definitely want a review ID in there, which can serve as the primary key for this table. And we definitely want reviews on our listings, so I'm thinking that we have a listing ID present in there, which can serve as a foreign key to my dimension table, which I will call listing dim. Basically, I want to capture all the background and contextual information of entities in dimension tables, and I want to capture all the quantitative and numerical data of our facts in the fact tables. We will join them using the keys present in each of these tables. So this becomes the listing key, which is the primary key of this table but becomes the foreign key of review fact.

We can also think about who the user was who provided this review. Let me insert it below; this now becomes a foreign key to user dim, so there is a need for another dimension. Okay, user dim: user ID becomes the key for this one. Let's finish user dim while we are here. We can have attributes of a user in this user dim: contact information, name, email, country, joining date and so on, whatever information we want on the user to be here.

Going back to listing dim while we're here, let us also think of attributes of a listing. I'm thinking the number of rooms. We can also think about who the host is, so again this calls for another dimension table where I want to capture host information. Okay, host dimension, and then host ID becomes the primary key
of this one. Going back to listing dim, what else can we include? So host ID; property type, whether it's an apartment or a house; and we can also include what city it is. Yeah, I think that should be a good place for listing dimension.

Now let's go back to completing host dimension. Here I'm trying to capture all the information or attributes of a host. Again, this is a type of user if we think about it, but for the sake of simplicity let's keep the host dimension separate from the guest dimension. So let's capture that information here: we have host ID, name and email, whether or not they are a superhost, when they joined, what the country is, and maybe more attributes such as phone number and what their response rate is, which gives an idea of how responsive they are. Whether or not they're a superhost may even be linked to that. I think to start off with, that is good.

Now we go back to review fact. So I definitely have a listing ID that this review is associated with, a user ID for who has provided this review, and maybe a review date. That makes me think I do want a date dimension to be a part of my data model as well.

Yeah, definitely, and unavoidable, isn't it, the date dimension?

Yes. It's really useful in data models, especially in retail environments where the calendar dates are not the same as retail dates and the financial year is different, so it definitely has its own perks. Further along the line, maybe we can think about quarter, if you want to do some analysis on a quarterly basis, week of year, and whether or not it is a holiday. We definitely see increased business when there is a holiday, so I think that can be a useful attribute in the future. So we have date, year, week, month. Yeah, I think that's a good place to start off with.

Okay, going back to review fact, now that we have review date, which
becomes an FK to the date dimension. Okay, the most important thing: what the rating is. I'm envisioning the rating to be on a scale of 1 to 5, and that is basically a non-key column here. We can also think about capturing reviews in text, so let's call it text review. And we can also capture response time, basically the time taken by the host to respond to this review. So I think that brings us to a good place to capture information around reviews here.

Just a quick question on the grain that you have for the review fact. Would you prefer review date, or are you thinking in terms of a timestamp for a review?

For the sake of simplicity I call it date, but it's always beneficial to have timestamps in there. If we want historical data to be present, it is best that we capture data at a timestamp level.

Yeah, makes sense, because if the same user wants to put two different reviews on a date, it makes sense to know when that particular review was put. Makes sense. Please go ahead.

Okay, so I was thinking I could also connect each of these to the review fact to highlight the relationships the fact and dimension tables have with each other. Here, listing is connected to host, right? So it's going to be like this. Yep, this one. Because the shape of this data model resembles the shape of a star, that is where the name star schema comes from. But I think this is a pretty good outline of how we can derive insights from reviews and analyze the distribution of ratings across listings or hosts. We can also identify the top-performing hosts based on revenue or occupancy rates, but for that we probably need booking fact or revenue fact, where I want to capture the
payment-related data.

One quick question on the review. Should we have any controls in place for the users to leave a review, in the sense that only if they have stayed should they be able to review? Do we have to accommodate that? What are your thoughts on that?

Yeah, that's a good point. I think we can include a booking ID in here, which will ensure that only if a user has stayed with a specific host is the user able to leave a review. So I think that referential constraint will enable the control on that, and this becomes the FK to booking fact. Speaking of which, we can dive deeper into booking fact and see how that looks.

Okay, let's draw booking fact in here. I start off with the primary key of this table, and I'll call it booking ID. A booking has to be associated with a listing, so again we will link it to the same listing dimension. I also envision user information to be associated with the booking fact, meaning who the user associated with a particular booking ID was, so I would want user ID present in this fact table, as well as a host ID, which we can derive from the host dimension. A host can have multiple listings, so having all of that in the same fact table gives us clarity on which listing we are talking about and which host this is. We can capture information on what the booking date is, what the check-in date is, what the checkout date is, what the number of nights is, and what the cost of the booking is, and from there we can use this to look into our revenue. Then booking source: it's invaluable to analyze what source is driving a significant chunk of our bookings or reservations, so I think that can really help. And number of guests. So
yeah, I think that gives us a good picture to capture information on them. Let me see if I can bring the dimensions here. No, maybe not. But basically the dimensions linked to booking fact are clearly listing dim, host dim, user dim, and date dim of course, because we're going to be using a key here. So let us see, hopefully I can drag and drop here. Oh, awesome.

So the status of the booking is something that you would want to capture as well? It could be cancelled, for example.

Yep, that is a good point. You're right. We can have another attribute here that can capture the booking status. It can be either pending, confirmed or cancelled, and we can look into cancellations at a later point to analyze whether they are associated with a specific host or what the reasons were, wherever applicable. So let me include that as well; we can call it booking status.

Okay, sounds good.

And also, just to capture below, cancellation reason, wherever applicable. I expect some nulls here, which is not really optimal, because we don't expect a lot of cancellations. But just to make querying faster, let us keep cancellation reason as part of the same booking fact for the time being.

Good.

Okay, and then listing dim. So now I think we have almost everything we need here, and we can proceed to join. Let us highlight the user ID, host ID, dates and listing ID, and all these date columns we can link to the date dimension here. All right. So holistically, I'm looking at booking fact related to date dimension, user dimension, listing dimension and host dimension, and I definitely care about what the cost associated with a particular booking has been, what the status of it is, how many guests there are, and what the check-in and checkout dates are. So yeah, I think that
puts me at a good place to analyze revenue or the booking trends, like we just discussed.

Yes, good. And then to cover the third point, which is pricing, let's also talk about revenue.

Okay, so revenue fact. I start off with a revenue ID, which serves as the primary key of this table. Revenue is associated with a listing ID, so I want to keep that there as is. I don't think I need a user ID, but I definitely want booking ID to be present in there, so let me rename it to booking ID. That becomes a foreign key to my booking fact table. We would also like the most important thing, which is the revenue amount generated from a particular booking, and then what the payment method was. We can think about currency, since it's a global platform, and maybe a date of payment, or payment date, I think that makes more sense. So I think that brings us to a good place when we want to analyze the revenue coming out of a particular booking.

I envision this one to be related to listing dimension and booking fact, so I think we should move it to the second page and bring it here. All right, maybe I can move it to this part. So booking ID I will link here, my listing ID gets linked from this point, and my revenue ID is the primary key, so that works. We can also link payment date to date dimension, and here we see two fact tables are related to each other.

Yeah, okay.

So I envisioned my data model looking like this: we have some common dimension tables and we have two fact tables that will help us derive further insights.

Yeah, quick question, maybe it's very trivial. For the location, are you thinking about, say, city-wise revenue? If you have multiple location attributes, that could add more value.

Yes, absolutely,
that's a good point. We can have another dimension table over here to capture all the demographics, so we can have a location dimension with attributes like city, state, country, any attributes we care about. Okay: city, state, and then country, region, more along the same lines.

If I were to derive insights on, let's say, revenue by location, like you said, let me write it down over here. I can definitely use booking fact. Since I'm looking into revenue by, let's say, cities, I would want booking fact because that one has the cost associated with a particular booking. I can join it with the listing dimension, and from the listing I have city, which is a location attribute; I can even move it out into a location ID. I will join listing with the location dimension using the location ID and fetch the city.

So we can do something like this. Let me have an alias in there for listing, and let's call the other one location: location.location ID is equal to listing.location ID. And I want to join booking fact and listing on listing ID, since I have listing ID present in both of them. It looks something like this. So now I have the cost that I can derive from booking fact and the city that I can get from location dim, so I will group by city and take a sum of all the cost, and that can serve as my total revenue across cities. So this gives me an idea of the revenue by city, and we can extend the same kind of analysis across regions, countries and other attributes.

Yeah, and the nice thing about this one is, since we have the date dimension across all facts, that helps us to
have a trend as well, so the city cost trend, if I want to look at it, which is good. And also, I think the listing has a host ID as well, doesn't it? So that means we can even do some queries like top-performing hosts, which is good: which hosts are generating more revenue, right? We want to keep them active. So that's something that we can do as well.

Yeah, that's a good point. In fact, it's an interesting insight to derive out of this data model. If you want to take a look at which host has had the most cancellations, we can again leverage booking fact, filter on the booking status, which I'm assuming to be cancelled, and then group by host ID. So I'm thinking maybe something like this. For cancellations by host, we can take booking fact and join it with the host dimension so that we can gather information on the host IDs, basically. So host dimension
host ID is joined with booking fact on host ID. We only care about cancellations, so I'm going to write the booking status to be cancelled, and group it by host ID; let it be host ID from host dim. In order to count the cancellations, since we are grouping by host ID, we select host ID and a count of the number of booking IDs associated with it. So I think that gives us a good number of cancellations by host.

Yep, sounds good. Nicely done. Could you also talk a bit about the trade-offs or challenges, or things you would ideally want to take care of, if you were to design this data model for the scale of Airbnb?

Yeah, that's a good point. We've kept the data model quite simple here and assumed that it is suitable for the scale of data that we have. But as the scale of the data increases, booking fact and revenue fact will grow enormously. That means our database will slow down whenever a read operation is performed, so we need to be really careful about how we're indexing these tables in our database and how we're partitioning them, so that we have some guardrails around query optimization and around the read operations even at that scale.

Okay, would you have any recommendation on the partitioning, what we can do for these types of facts?

Yes. I think it depends on the purpose again. If we care about the latest data, I think date dimensions are good to partition the data by. However, if we care about analyzing the data for specific regions, I think region can serve as a good partition key. Primary keys are also a good candidate to partition upon, but at the same time we should also consider the
trade-off of not making a specific partition a hot partition. We don't want to be doing a lot of writes on a specific partition, because making it a hot partition makes it prone to failure. So I think while we're partitioning we definitely want to keep these things in mind: what makes our database quicker for reading as well as not prone to failure during write operations. We also want to make sure that we're not inducing skewness within our data. So I think these are some good points to consider when we are deciding what our partition key is.

Yes, definitely. I think we touched on quite a few points there, which is good. So depending on the purpose, partitioning is a good strategy, just not blanket date. I mean, most of the facts, as we know, are partitioned by date primarily, and there's a reason for it, but there could be some specific facts which don't really have to be partitioned by date; like you mentioned, it could be location-specific. And also read-heavy partitions; the hot partition that you mentioned is also quite a good thing to take care of. These are some of the things that you need to take care of before designing, which is good. Otherwise, once it's in production, then you have to take care of all these different things.

Yep, it could induce some sleepless nights.

Yeah. Okay, anything else that you would want to take care of in this design?

Yes. Speaking of review fact, for the purpose of this exercise we have assumed that the text data that we have is confined and structured, but in the real world I think we may have to deal with unstructured data. So we might have to further refine this and process it in a different schema altogether, or maybe apply some text mining techniques there. So I think that is another
opportunity that comes along with text data.

Yep, exactly. The review text is definitely going to be unstructured; probably we can look at primarily JSON type of data there. It's interesting that you touched on text mining here. There is a lot of scope for data-science-related things there, but still, from a data warehousing point of view, if we could extract information like whether it's a positive review, or do some sentiment analysis, positive review or negative review, and then, along with the review fact which you have designed, give an indication that this seems to be a positive review or a negative review, this gives another dimension for our analytical queries: how many of the reviews that came in for that particular listing were negative? So that is another dimension, right? So that's a good point that you touched on with the text mining. I think this is a great place to pause. So let's assume you are the interviewer: what do you think went well, and is there something else that you would have ideally added to this design?

I really liked the activity of thinking out loud and brainstorming with you. It gave me a good direction and a good structure to think about things I missed, so I think that was really useful. Also, working backwards from the business perspective, for example thinking about the kind of metrics we care about and the kind of insights we want to drive out of this data, helped me structure the data model, so I think that went really well. Given more time, I think we can also brainstorm and dive deeper into areas like
competitive pricing: how Airbnb is performing, whether it is at par, or whether Airbnb is overpricing or underpricing with respect to other competitors in the market. That can also potentially be a good area to optimize business operations.

I think that one is quite interesting. Airbnb has dynamic pricing as well in the real-world scenario, so they do have some discounting there. So I think gathering that data and reviewing it can also be a good place for us to gather some interesting insights. And you did really well. I mean, we had a couple of objectives to start with, and I think we covered them end to end, along with the queries, and then we talked about a few extensions and possibilities that this model gave. Thanks for that, it was really good. And yeah, competitive pricing is definitely there, that is another interesting point, and we can extend this data model to handle that in terms of the discounts and so on.

If you enjoyed this interview, visit tryexponent.com to view our data engineering interview course, where you'll have access to a library of interview questions, expert coaching, and peer-to-peer mock interviews where you can practice yourself. Good luck on your next interview, and thanks for watching.
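
For readers following along without the whiteboard, the two queries talked through in the interview (revenue by city and cancellations by host) can be sketched roughly as below. This is only an illustration under assumptions: the snake_case table and column names, the trimmed-down set of columns, and the sample rows are all hypothetical, since the interview only names the tables and keys aloud, and SQLite through Python's standard `sqlite3` module is used purely for convenience.

```python
import sqlite3

# Hypothetical names: the interview describes these tables verbally only.
SCHEMA = """
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE host_dim (host_id INTEGER PRIMARY KEY, name TEXT, is_superhost INTEGER);
CREATE TABLE listing_dim (
    listing_id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES host_dim(host_id),
    location_id INTEGER REFERENCES location_dim(location_id),
    property_type TEXT
);
CREATE TABLE booking_fact (
    booking_id INTEGER PRIMARY KEY,
    listing_id INTEGER REFERENCES listing_dim(listing_id),
    host_id INTEGER REFERENCES host_dim(host_id),
    cost REAL,
    booking_status TEXT,      -- 'pending', 'confirmed' or 'cancelled'
    cancellation_reason TEXT  -- NULL unless the booking was cancelled
);
"""

# Revenue by city: booking fact -> listing dim -> location dim, grouped by city.
REVENUE_BY_CITY = """
SELECT loc.city, SUM(b.cost) AS total_revenue
FROM booking_fact AS b
JOIN listing_dim AS l ON b.listing_id = l.listing_id
JOIN location_dim AS loc ON l.location_id = loc.location_id
GROUP BY loc.city
ORDER BY loc.city;
"""

# Cancellations by host: filter on booking status, then count bookings per host.
CANCELLATIONS_BY_HOST = """
SELECT h.host_id, COUNT(b.booking_id) AS cancellations
FROM booking_fact AS b
JOIN host_dim AS h ON b.host_id = h.host_id
WHERE b.booking_status = 'cancelled'
GROUP BY h.host_id
ORDER BY cancellations DESC, h.host_id;
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                     [(1, "Lisbon", "Portugal"), (2, "Austin", "USA")])
    conn.executemany("INSERT INTO host_dim VALUES (?, ?, ?)",
                     [(10, "Ana", 1), (11, "Ben", 0)])
    conn.executemany("INSERT INTO listing_dim VALUES (?, ?, ?, ?)",
                     [(100, 10, 1, "apartment"), (101, 11, 2, "house")])
    conn.executemany("INSERT INTO booking_fact VALUES (?, ?, ?, ?, ?, ?)",
                     [(1000, 100, 10, 200.0, "confirmed", None),
                      (1001, 100, 10, 150.0, "confirmed", None),
                      (1002, 101, 11, 300.0, "cancelled", "change of plans"),
                      (1003, 101, 11, 120.0, "confirmed", None),
                      (1004, 101, 11, 90.0, "cancelled", "host unavailable")])
    return conn


conn = build_demo_db()
revenue_by_city = conn.execute(REVENUE_BY_CITY).fetchall()
cancellations_by_host = conn.execute(CANCELLATIONS_BY_HOST).fetchall()
print(revenue_by_city)
print(cancellations_by_host)
```

As in the spoken version, the revenue query sums the cost of every booking regardless of status; a real report would probably exclude cancelled bookings or read from the revenue fact instead, but that choice is not discussed in the interview.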
