Chapter 3 Data

3.1 Sources

      The first data set was collected by IMDb from its users and their user-generated content, and its copyright is owned by IMDb. We consider IMDb a reliable source of rating information.
      We use this source because it is one of the largest movie databases on the web and its ratings come directly from users. In addition to movie ratings, it provides the title, length, release year, associated genres, number of votes, and other information.
      We import this data directly through the ggplot2movies package in R, which is the simplest route. The package was published on 2015-08-25 and is maintained by Hadley Wickham. The dataset contains 58,788 rows and 24 columns, each of character or numeric/integer type. Movie genres (“Action”, “Animation”, “Comedy”, “Drama”, “Documentary”, “Romance”, “Short”) are one-hot encoded, and some movies belong to more than one genre.
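      The import described above can be sketched as follows. This is a minimal illustration rather than the exact code used in the project; it assumes the ggplot2movies package is already installed, and the helper name genre_cols is introduced here only for the example.

```r
# Sketch: load the IMDb data shipped with the ggplot2movies package.
# Assumes the package is installed, e.g. install.packages("ggplot2movies").
movies <- ggplot2movies::movies

dim(movies)            # number of rows and columns
sapply(movies, class)  # type of each column

# The one-hot encoded genre columns named in the text
genre_cols <- c("Action", "Animation", "Comedy", "Drama",
                "Documentary", "Romance", "Short")

# Count the movies flagged with more than one genre
sum(rowSums(movies[, genre_cols]) > 1)
```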


      The second data set is the MovieLens 25M dataset, collected from MovieLens by GroupLens, a research lab in the Department of Computer Science and Engineering at the University of Minnesota. It covers 162,541 users, who were selected at random; every selected user had rated at least 20 movies. Because the dataset covers a large number and variety of movies, we expect results based on it to be more reliable and more convincing.
      This dataset was generated on November 21, 2019. It consists of 6 files, which together contain over 25 million ratings across 62,423 movies, along with their genres. We mainly focus on movies.csv and ratings.csv. We import the files with read_csv from the readr package and combine the two tables by each movie’s unique ID (the “movieId” column).
      There are 25,000,095 rows and 4 columns in ratings.csv, all of numeric/integer type; each row represents one rating of one movie by one user. There are 62,423 rows and 3 columns in movies.csv, of character or numeric/integer type. One movie can have multiple genres, stored in a single column with the genre names separated by “|”, so we first need to split these genres apart (see Section 3.2.2).

3.2 Cleaning / transformation

      Two data sets are used in this project; we discuss them separately.

3.2.1 IMDb Datasets

      The first rows of the original data are shown below; the data are not yet in tidy form.
## # A tibble: 6 × 24
##   title     year length budget rating votes    r1    r2    r3    r4    r5    r6    r7    r8    r9
##   <chr>    <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 $         1971    121     NA    6.4   348   4.5   4.5   4.5   4.5  14.5  24.5  24.5  14.5   4.5
## 2 $1000 a…  1939     71     NA    6      20   0    14.5   4.5  24.5  14.5  14.5  14.5   4.5   4.5
## 3 $21 a D…  1941      7     NA    8.2     5   0     0     0     0     0    24.5   0    44.5  24.5
## 4 $40,000   1996     70     NA    8.2     6  14.5   0     0     0     0     0     0     0    34.5
## 5 $50,000…  1975     71     NA    3.4    17  24.5   4.5   0    14.5  14.5   4.5   0     0     0  
## 6 $pent     2000     91     NA    4.3    45   4.5   4.5   4.5  14.5  14.5  14.5   4.5   4.5  14.5
## # … with 9 more variables: r10 <dbl>, mpaa <chr>, Action <int>, Animation <int>, Comedy <int>,
## #   Drama <int>, Documentary <int>, Romance <int>, Short <int>
      In this data set, each movie genre is stored as its own indicator column. These columns are not really different variables; they are values of a common variable, “genres”. Thus, we use pivot_longer to transform the data. (Only the columns that show the transformation are displayed here.)
## # A tibble: 6 × 3
##   title                  rating genres   
##   <chr>                   <dbl> <chr>    
## 1 $                         6.4 Comedy   
## 2 $                         6.4 Drama    
## 3 $1000 a Touchdown         6   Comedy   
## 4 $21 a Day Once a Month    8.2 Animation
## 5 $40,000                   8.2 Comedy   
## 6 $pent                     4.3 Drama
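      A reshaping step of this kind can be sketched on a small toy table. The table imdb_toy, its values, and the intermediate column name flag are assumptions made for illustration; only pivot_longer itself is named in the text.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the IMDb table (values are illustrative only)
imdb_toy <- tibble(
  title  = c("A", "B", "C"),
  rating = c(6.4, 6.0, 8.2),
  Comedy = c(1L, 1L, 0L),
  Drama  = c(1L, 0L, 0L),
  Short  = c(0L, 0L, 1L)
)

imdb_long <- imdb_toy %>%
  pivot_longer(
    cols      = c(Comedy, Drama, Short),  # the one-hot genre columns
    names_to  = "genres",                 # column names become genre values
    values_to = "flag"                    # 1 if the movie has that genre
  ) %>%
  filter(flag == 1) %>%  # keep only genres the movie actually belongs to
  select(-flag)

imdb_long
```

      Note that in this sketch a movie with no genre flagged would disappear from the long table, since all of its rows are removed by the filter.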
      The columns r1 to r10 describe the distribution of votes: each gives the approximate percentage of users who rated the movie a 1 through a 10, respectively. Since we are exploring the impact of personal interests on movies’ ratings, we work with the overall rating and are not concerned with extreme or biased individual ratings, so we ignore these variables.
      We also consider movies rated by fewer than 200 IMDb users unrepresentative, so we use filter() on the “votes” column to remove them.
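      The two cleaning steps above might look like the following sketch. The toy table and its values are assumptions for illustration; the 200-vote threshold and the use of filter() come from the text, while dropping the r columns with select() is one possible way to ignore them.

```r
library(dplyr)

# Toy stand-in with two of the ten rating-distribution columns
imdb_toy <- tibble(
  title  = c("A", "B", "C"),
  votes  = c(348L, 20L, 200L),
  rating = c(6.4, 6.0, 7.1),
  r1     = c(4.5, 0, 4.5),
  r10    = c(4.5, 14.5, 24.5)
)

imdb_filtered <- imdb_toy %>%
  select(-matches("^r[0-9]+$")) %>%  # drop r1 ... r10, keep "rating"
  filter(votes >= 200)               # remove movies with fewer than 200 votes

imdb_filtered
```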

3.2.2 MovieLens 25M Dataset

      We mainly focus on the files movies.csv and ratings.csv. There are about 25 million rows and 4 columns in ratings.csv; each row represents one rating of one movie by one user. There are 62,423 rows and 3 columns in movies.csv. One movie can have multiple genres, which are stored in a single column with the genre names separated by “|”.
      Since the original data files are too large to upload and plotting every record would lead to overplotting, we process only a random sample drawn with slice_sample().
      The ratings and genres of a movie are in separate files, so we first combine the two data sets by movie ID, which is unique and consistent across files. Each movie receives many scores from different users, so we compute the average rating for each movie.
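      A sampling step with slice_sample() can be sketched as below. The text does not state the sample size, the seed, or which table was sampled, so the toy ratings table, the seed 2021, and n = 200 are all hypothetical choices made for this example.

```r
library(dplyr)

set.seed(2021)  # hypothetical seed, so the sample is reproducible

# Toy ratings table: 100 users each rating 10 movies
ratings_toy <- tibble(
  userId  = rep(1:100, each = 10),
  movieId = rep(1:10, times = 100),
  rating  = sample(seq(0.5, 5, by = 0.5), 1000, replace = TRUE)
)

# Draw 200 rows at random, without replacement (the default)
ratings_sample <- ratings_toy %>%
  slice_sample(n = 200)

nrow(ratings_sample)
```

      Fixing the seed before sampling keeps later summaries and plots stable across runs.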
##   X movieId                         title        genres average_rating
## 1 1  175661 The Hitman's Bodyguard (2017) Action|Comedy       4.291667
## 2 2    4020              Gift, The (2000)      Thriller       3.250000
## 3 3    2865             Sugar Town (1999)        Comedy       3.000000
## 4 4   73232  Girl in the Park, The (2007)         Drama       4.000000
## 5 5    2878             Hell Night (1981)        Horror       2.300000
## 6 6    2348          Sid and Nancy (1986)         Drama       3.478261
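      The combine-and-average step can be sketched on toy tables as follows. The tables and their values are invented for illustration, and the choice of inner_join() is an assumption; the text only says that the two data sets are combined by movie ID and that ratings are averaged per movie.

```r
library(dplyr)

# Toy stand-ins for the movie and rating tables
movies_toy <- tibble(
  movieId = c(1L, 2L, 3L),
  title   = c("Movie A (2000)", "Movie B (1999)", "Movie C (2007)"),
  genres  = c("Action|Comedy", "Thriller", "Drama")
)
ratings_toy <- tibble(
  userId  = c(1L, 2L, 1L, 3L),
  movieId = c(1L, 1L, 2L, 2L),
  rating  = c(4.5, 4.0, 3.0, 3.5)
)

# Average the ratings for each movie
avg_ratings <- ratings_toy %>%
  group_by(movieId) %>%
  summarise(average_rating = mean(rating), .groups = "drop")

# Attach the averages to the movie metadata by movieId
movies_rated <- movies_toy %>%
  inner_join(avg_ratings, by = "movieId")

movies_rated
```

      With an inner join, a movie that has no ratings in the sample (Movie C here) is dropped; a left join would keep it with a missing average instead.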
      As we can see, one movie can have multiple genres, with the genre names separated by “|” in a single column. Since we are exploring the impact of genres on ratings, we split each such movie into several rows, each containing only one genre. This gives the clean data that we work with in the following parts.
## # A tibble: 6 × 5
##       X movieId title                         genres   average_rating
##   <int>   <int> <chr>                         <chr>             <dbl>
## 1     1  175661 The Hitman's Bodyguard (2017) Action             4.29
## 2     1  175661 The Hitman's Bodyguard (2017) Comedy             4.29
## 3     2    4020 Gift, The (2000)              Thriller           3.25
## 4     3    2865 Sugar Town (1999)             Comedy             3   
## 5     4   73232 Girl in the Park, The (2007)  Drama              4   
## 6     5    2878 Hell Night (1981)             Horror             2.3
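      One way to obtain a table of this shape is tidyr’s separate_rows(), sketched below on the first two movies from the output above. The text does not name the function that was actually used, so this choice is an assumption.

```r
library(dplyr)
library(tidyr)

# Two rows taken from the averaged table shown earlier
movies_rated <- tibble(
  movieId        = c(175661L, 4020L),
  title          = c("The Hitman's Bodyguard (2017)", "Gift, The (2000)"),
  genres         = c("Action|Comedy", "Thriller"),
  average_rating = c(4.291667, 3.25)
)

# Split the "|"-separated genres into one row per genre.
# sep is a regular expression, so the pipe must be escaped.
movies_long <- movies_rated %>%
  separate_rows(genres, sep = "\\|")

movies_long
```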

3.3 Missing value analysis

3.3.1 IMDb Datasets

      The columns/variables we used have no missing values.

3.3.2 MovieLens 25M Dataset

      The columns/variables we used have no missing values.
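      A check of this kind can be sketched by counting missing values per column. The toy table and the vector used_cols are assumptions for illustration; in the IMDb rows printed in Section 3.2.1, for example, budget contains NA, so the statement concerns only the columns that enter the analysis.

```r
# Toy table: "budget" has missing values, the other columns do not
df_toy <- data.frame(
  title  = c("A", "B", "C"),
  rating = c(6.4, 6.0, 8.2),
  budget = c(NA, 1000, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(df_toy))
na_counts

# Hypothetical list of the columns used in the analysis
used_cols <- c("title", "rating")

# TRUE when none of the used columns contains a missing value
all(na_counts[used_cols] == 0)
```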