Chapter 2 Proposal

2.1 Research topic

We are interested in the factors that may influence the ratings of a film (e.g. release time, genre, movie duration).

We guess that if a movie is released during the holiday seasons (e.g., Thanksgiving, Christmas) then the ratings might be higher than those released at other times because people have an elevated mood and hence might tolerate the quality of the movie. Also, depending on the year the movie was released, the audience’s judging criteria would also vary. For old movies, technological lag makes people care more about storytelling. Now, people may judge from more aspects.

Considering genres, documentary movies may be easier to obtain a high rating. On the one hand, audiences would easily resonate with this type of film which is more related to their real life. On the other hand, audiences will be more tolerant of filmmaking, for example, they won’t expect to see flashy special effects in a documentary movie.

The movie duration may also influence the ratings. The longer the movie, the easier it is for people to lose their patience. For instance, in a dull movie with a longer running time, people might get bored or even doze off. The audience will question whether the film’s pace drags on too much for a long film. And a short movie usually can’t give a complete background. We wonder if there exists a reasonable duration for movies.

In addition, we can compare the ratings of movies produced in different regions or production companies. Considering all those factors, we are curious about that are there certain features that are important for a film to have high ratings. We’ll try to answer this question by analyzing related datasets.

2.2 Data availability

IMDb Datasets
- Collected by IMDb organization through their users and their user-generated content. Copyright owned by IMDb. We believe IMDb is a reliable source for getting rating information.
- Each dataset is in a gzipped, tsv formatted file in the UTF-8 character set. They can be imported easily by ‘read_tsv’ in the ‘‘readr’ package, which can uncompress automatically for .gz format files. Also, we could import them by using the ggplot2movies package in R directly.
- The data is refreshed daily.
- We consider using this source because it contains up-to-date movie ratings collected directly from the users. In addition to movie ratings, it also provides information such as the title, length, region, language, release year, genres associated with the title, number of votes, etc. Then we’ll try to visualize the relationship between each factor and ratings. For example, we will compare movies with different genres (e.g., whether documentary movies would generally have higher ratings).

MovieLens 25M Dataset
- These datasets were collected from MovieLens by GroupLens. GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota. There were 162541 users selected, and users were selected at random. All selected users had rated at least 20 movies.
- There are 6 different datasets are generated, which contain over 25 million ratings across 62423 movies and their genres. We would mainly focus on movie.csv and rating.csv. We will combine the two tables by movies’ unique ID (using the “movieId” column).
- Because the variety and number of movies in this dataset are large, the results we got from this dataset would be more accurate and more convincing. And we could use this dataset to analyze the relationship between genres and ratings.
- These datasets are ‘CSV’ files, which can be imported easily by ‘read_csv’ in the ‘‘readr’ package.
- These data were created by 162541 users between January 09, 1995 and November 21, 2019. Thus this dataset was last updated and generated on November 21, 2019.
- If we have questions about the dataset, we can contact the team through the website by sending a contact request.