Price prediction of vinyl records using machine learning algorithms

Written by David Johansson

Paper category

Bachelor Thesis

Subject

Computer Science

Year

2020

Abstract

Genres and artists

When estimating the price of collectibles, factors such as the identity of the artist have been shown to be very important (see chapters 1.2.2 and 1.2.3). To ensure that artists are fairly represented in the data set, all releases of each artist are included, with some restrictions. The same scope rules apply to every artist: all releases listed in the Albums, Singles & EPs, and Compilations sections of the artist's Discogs page are included in the data set. These sections represent the core of the artist's discography. Other items associated with the artist, such as appearances on compilations by multiple artists or unofficial releases, are not included in this data set.

It can be assumed that economic models differ between cultural contexts within the scope of vinyl releases. The goal is therefore to create a fairly uniform data set that is limited to a certain cultural range yet still diverse, so that several segments can ultimately be observed within the whole. On this basis, a relatively broad definition of rock/metal was chosen as the overall scope, with a number of well-defined sub-genres as its subdivisions. Following common classification conventions for rock and metal music, a set of subdivisions (hereinafter strictly referred to as genres) was determined: alternative metal, alternative rock, black metal, classic rock, death metal, doom metal, electronic, heavy metal, punk rock, stoner rock and thrash metal. A selection of artists will be chosen to represent each genre in the data set. This will be done by identifying some of the most typical artists for each genre and then expanding the roster by searching online resources for similar artists.

2.1.2 Attributes

Discogs is the most comprehensive online resource for music releases.
It is not only a source of release information but also a marketplace where users can buy and sell any item in the database. It is structured so that each specific release of each title by each artist has a dedicated page showing the known details of the item and the median price of copies previously sold on the site. The value of a record is usually determined by a combination of many factors, many of which are described on websites that specialize in the value of second-hand records [41]. This resource will be used to create a list of relevant variables. It will then be determined which of these variables can be retrieved from Discogs, whether any additional variables can be obtained from Discogs, and whether any hedonic characteristics can be constructed independently from the data available on Discogs. The median price of each item, which is the dependent variable of the data set, will also be retrieved from Discogs, in USD. During implementation, the impact of each variable on performance will be tested. If a variable shows a generally negative impact on the experimental results, it will be excluded from the final data set.

2.1.3 Data pruning

After all the data has been collected, tests will be performed to determine whether the data set should be pruned to improve performance. First, it will be checked whether artists with a small sample size have a significant influence on the results. A machine learning model will be trained on the data set while excluding artists with fewer than n samples, iterating over values of n from the smallest artist sample size up to 200, and its performance will be measured for each n. The results will be used to determine whether a minimum sample size per artist should be required.
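The minimum-sample test can be sketched roughly as follows. This is an illustrative outline only, not the thesis implementation: the record layout (dictionaries with `artist` and `median_price` keys), the function names, and the `evaluate` callback that would train and score a model are all assumptions made for the example.

```python
from collections import Counter

def drop_small_artists(samples, n):
    """Return only the samples whose artist has at least n samples."""
    counts = Counter(s["artist"] for s in samples)
    return [s for s in samples if counts[s["artist"]] >= n]

def sweep_min_samples(samples, evaluate, max_n=200):
    """Score the pruned data set for each threshold n.

    n runs from the smallest per-artist sample count up to max_n.
    `evaluate` is a hypothetical callback that trains a model on the
    given samples and returns a performance score.
    """
    counts = Counter(s["artist"] for s in samples)
    if not counts:  # nothing to sweep over
        return {}
    scores = {}
    for n in range(min(counts.values()), max_n + 1):
        kept = drop_small_artists(samples, n)
        if not kept:  # every artist has fewer than n samples
            break
        scores[n] = evaluate(kept)
    return scores

# Toy data (invented): artist "A" has three samples, artist "B" has one.
toy = [
    {"artist": "A", "median_price": 10.0},
    {"artist": "A", "median_price": 12.5},
    {"artist": "A", "median_price": 8.0},
    {"artist": "B", "median_price": 30.0},
]
# len() stands in for a real train-and-score function.
toy_scores = sweep_min_samples(toy, evaluate=len)
```

In practice `evaluate` would return an error metric from a trained model, and the resulting scores would be compared across n to decide whether a minimum is worth enforcing.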
Similarly, because the number of samples may differ greatly between artists, a test will be conducted to check the effect of setting an upper limit on the number of samples per artist. Values of n between 50 and 3000 will be iterated over, with each artist's samples trimmed to at most n, to find out whether and where to set the limit for the best results. A test will also be conducted to examine the effect of outliers, in the form of samples whose dependent variable is significantly higher than that of most samples. The goal is to find out whether a relatively small number of samples has a large negative impact on overall performance. This will be done similarly to the tests above: the value of n will be varied iteratively, where n is the upper limit for samples to be included in the data set, and the performance of the model will be measured to find the ideal value of n.
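The two remaining pruning tests could be outlined along the same lines. The sketch below rests on several assumptions that the text does not settle: samples are dictionaries with `artist` and `median_price` keys, the per-artist cap keeps the first n samples in input order, and the outlier limit is read as a ceiling on the median price. The `evaluate` argument is again a hypothetical stand-in for training and scoring a model.

```python
from collections import defaultdict

def cap_per_artist(samples, n):
    """Keep at most n samples per artist, preserving input order.

    Which samples survive the cap is an assumption; the source does
    not say how the trimmed samples are chosen.
    """
    taken = defaultdict(int)
    kept = []
    for s in samples:
        if taken[s["artist"]] < n:
            taken[s["artist"]] += 1
            kept.append(s)
    return kept

def drop_above_price(samples, limit):
    """Drop samples whose median price exceeds limit (assumed reading of n)."""
    return [s for s in samples if s["median_price"] <= limit]

def sweep(samples, prune, evaluate, n_values):
    """Return {n: score} after pruning the samples with each candidate n."""
    return {n: evaluate(prune(samples, n)) for n in n_values}

# Toy data (invented): one artist with a very expensive outlier.
toy = [
    {"artist": "A", "median_price": 10.0},
    {"artist": "A", "median_price": 12.5},
    {"artist": "A", "median_price": 400.0},
    {"artist": "B", "median_price": 30.0},
]
# len() stands in for a real train-and-score function.
cap_scores = sweep(toy, cap_per_artist, len, [1, 2, 3])
price_scores = sweep(toy, drop_above_price, len, [20, 50, 500])
```

Keeping the pruning rule as a parameter of `sweep` lets the same loop serve both tests; only the candidate values of n and the rule itself change.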