Predicting Airbnb users' desired travel destinations

Written by H. Ulfsson

Paper category: Term Paper, Computer Science

Thesis: Predicting the desired travel destination of Airbnb users

Today, companies such as Google use machine learning mainly to improve the user experience of their products, for example by identifying and classifying emails as spam. Another application is ensuring that ads reach the users who are most likely to be interested in their content. With enough data, such predictions can be very accurate. Many companies now use machine learning to understand their users at a deeper level by analyzing user-related data from different sources. Large amounts of data are collected every time we visit a website, and companies also collect data jointly, which increases the amount of data available for user analysis (Webb and Micheal J Pazzani, 2001). For example, this can happen through the interaction between Facebook or Google and the web page being visited.

1.1 Machine learning

Machine learning is the ability of computers to "learn" from data and make predictions based on what they have learned. It combines artificial intelligence with computer pattern recognition. A typical machine learning process has two basic stages: in the first, the computer learns from data; in the second, it makes predictions on new data. Several different techniques can be used in the learning stage: supervised, unsupervised, and reinforcement learning.

Supervised learning is when we give the computer a set of data together with the correct result for each example. The computer looks for the patterns in the data that produce a specific result, and every result is known in advance.

Unsupervised learning is when we give the computer data without providing the correct results. The computer generates a solution itself, for example by dividing the data into groups based on similar attributes.

Reinforcement learning is when the computer learns by trial and error: each solution it proposes is tested and receives positive or negative feedback, and the computer then uses this feedback to further improve its solution.

1.2 Problem

The problem this report addresses is how to create a model that predicts user intent from user data, such as the personal information users enter when creating an account or the data the website collects about their behavior on the site. We approach it through a practical problem provided on the Kaggle website: Airbnb's new user booking challenge. The challenge is to predict and rank the 5 most likely travel destinations for each user. There are 5 data sets of differing importance related to this problem. Once predictions have been made for all users, they are uploaded to the Kaggle website, which uses the NDCG scoring algorithm to calculate their correctness.

1.3 Existing research

The amount of research on this topic, which has only recently emerged for commercial use, is staggering. Machine learning has been a research topic since the mid-1900s and has been developing almost continuously since then. The research we focus on and use in this report mainly covers the basics of machine learning and their application to the problem of user intent prediction. The literature we selected on the basics of machine learning includes "Machine Learning for User Modeling" (Webb & Micheal J Pazzani, 2001), "User Intention Modeling in Web Applications Using Data Mining" (Chen et al., 2002), "The Changing Science of Machine Learning" (Langley, 2011), and "C4.5: Programs for Machine Learning" (Quinlan, 1993). When we delved into the actual algorithms we would use, we also found some interesting literature: "Survey on Multiclass Classification Methods" (Aly, 2005), "Supervised Machine Learning: A Review of Classification Techniques" (Kotsiantis, 2007), and "XGBoost: A Scalable Tree Boosting System" (Chen and Guestrin, 2016).

We also studied different methods of validating the model: "A survey of cross-validation procedures for model selection" (Arlot and Celisse, 2010) and a report on NDCG scoring as a measure of the quality of our predictions (Ravikumar, Tewari, and Yang).

When we first started working on a problem like this, we had to make several decisions based on what we wanted to accomplish: which programming language to use, which machine learning method best suits the problem, and which algorithm is likely to be the most accurate.

We chose Python as our programming language because it has good support for our data set format and a large number of libraries for machine learning applications. Other very viable languages, in our view, are R and Matlab. Our machine learning method is supervised learning, because the data set is designed for it: we have access to a training set with correct results and a test set without results.

The algorithm we use must be fast and must not take up a lot of memory, because we have a lot of data to process. We must also consider that this is a multi-class problem: we want to estimate how likely the user is to want to go to each destination, not just which destination is the most likely. A good fit is the gradient boosted decision tree, which uses several smaller decision trees that produce scores, with each result receiving a different score as the trees are traversed. The specific algorithm chosen is XGBoost, which stands for extreme gradient boosting. It was chosen over other boosted tree implementations because of its speed advantage (Kotsiantis, 2007).
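
The step from per-destination likelihoods to a ranked list of 5 destinations, and the NDCG score applied to that list, can be illustrated with a minimal sketch. It is not the implementation used in this report: the destination labels, probabilities, and the helper names top_k and ndcg_at_k are hypothetical, and it assumes each user has exactly one correct destination with binary relevance, so the ideal DCG is 1 and NDCG equals DCG.

```python
import math


def top_k(probabilities, k=5):
    """Return the k destination labels with the highest predicted probability."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:k]


def ndcg_at_k(ranked, true_label, k=5):
    """NDCG@k for one user, assuming a single correct destination.

    With one relevant item of relevance 1, the ideal DCG is 1, so the
    score is 1 / log2(position + 1) for the 1-based position of the
    correct destination, or 0 if it is not among the first k entries.
    """
    for position, label in enumerate(ranked[:k], start=1):
        if label == true_label:
            return 1.0 / math.log2(position + 1)
    return 0.0


# Hypothetical model output: one probability per destination for each user.
predictions = {
    "user_a": {"US": 0.55, "FR": 0.20, "IT": 0.10, "GB": 0.08, "ES": 0.05, "CA": 0.02},
    "user_b": {"US": 0.40, "FR": 0.25, "IT": 0.15, "GB": 0.10, "ES": 0.06, "CA": 0.04},
}
# Hypothetical true destinations.
truth = {"user_a": "US", "user_b": "IT"}

scores = []
for user, probs in predictions.items():
    ranking = top_k(probs)  # the 5 most likely destinations, best first
    scores.append(ndcg_at_k(ranking, truth[user]))

mean_ndcg = sum(scores) / len(scores)
print(f"mean NDCG@5 = {mean_ndcg:.2f}")  # prints "mean NDCG@5 = 0.75"
```

Here user_a's correct destination is ranked first (score 1.0) and user_b's is ranked third (score 1 / log2(4) = 0.5), which shows why the model needs a likelihood for every destination: a correct answer in second or third place still earns partial credit.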