What information is included in the dataset?
The data includes 6,235 entries of movies and television (TV) shows that were available on Netflix US as of 2019.
Attributes are placed into 13 columns:
-
show_id, a unique ID that corresponds to each show;
-
type, which categorizes a show as either a movie or a TV show;
-
show, the name of the show;
-
director, the director of the show;
-
cast, the main actors within the show;
-
country, where the show was produced;
-
date_added, the day when the show was added to the Netflix platform (1/1/2008 – 1/18/2020);
-
release_year, the year when the show was released;
-
mpa_rating, which denotes the show’s age-suitability according to the Motion Picture Association’s rating system;
-
imdb_rating (from 0.0 to 10.0);
-
duration, which records the runtime of a movie in minutes and the number of seasons for TV shows;
-
listed_in, the specific genre(s) under which the show falls (e.g. Kids’ TV, Documentaries, Stand-Up Comedies); and
-
description, a brief overview of the plotline.
What information, events, or phenomena can your dataset illuminate?
By breaking down trends among different attributes, this dataset can illuminate:
-
How and why the ratio of movies to TV shows has changed over time, and whether this trend is replicated across other platforms;
-
The most popular genres over time;
-
Any correlations between MPA / IMDb ratings and genres;
-
Where shows are most commonly produced (by country and region), and whether one type of show (movie or TV show) is produced in a more diverse set of countries;
-
Common words included in show descriptions (textual analysis), which can reveal which topics viewers are most interested in;
-
The time of year that most shows are added, and whether that corresponds to any major events in the film industry (e.g. festivals or award ceremonies);
-
Distribution of movie and TV show ratings; and
-
The relationship between viewer preference or satisfaction (sentiment analysis) and Netflix’s decisions to add or remove shows in different genres.
Together, these results can provide some insight into what factors influence viewer preferences and how those preferences inform Netflix’s decisions to keep or remove films from its catalog.
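As a rough illustration of the first question above, the sketch below counts movies and TV shows by the year they were added. It is not part of our analysis pipeline: the sample rows are made up, and we assume that the type column uses the labels "Movie" and "TV Show" and that date_added follows a month/day/four-digit-year format. Both assumptions should be checked against the actual file.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample rows; real rows could be read with csv.DictReader.
rows = [
    {"type": "Movie", "date_added": "1/18/2020"},
    {"type": "TV Show", "date_added": "12/5/2019"},
    {"type": "Movie", "date_added": "3/2/2019"},
    {"type": "Movie", "date_added": ""},  # missing date, skipped below
]


def counts_by_year(rows, date_format="%m/%d/%Y"):
    """Count shows of each type per year in which they were added.

    The date format is an assumption about how date_added is stored.
    """
    counts = defaultdict(Counter)
    for row in rows:
        raw = row["date_added"].strip()
        if not raw:
            continue  # rows without a date cannot be placed in a year
        year = datetime.strptime(raw, date_format).year
        counts[year][row["type"]] += 1
    return counts


counts = counts_by_year(rows)
for year in sorted(counts):
    # Assumed labels for the type column: "Movie" and "TV Show".
    print(year, counts[year]["Movie"], counts[year]["TV Show"])
```

Dividing the two counts for each year would give the movie-to-TV ratio. Years with no TV shows need to be handled separately to avoid dividing by zero.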
What can't your dataset reveal?
While the dataset contains all the basic information about different shows (e.g. show names, directors, descriptions, and ratings), there is limited focus on user-specific data. The only category that is remotely pertinent is the “country” column, which reveals each show’s location of production. However, consumer data such as age, location, gender, race, and socioeconomic status are not included. Without this information, it is difficult to conduct in-depth analyses of whether and how viewer demographics directly influence viewer preferences.
Moreover, we were hoping to delve deeper into possible ethical concerns surrounding Netflix’s predictive algorithm and interface design, but the dataset contains no user metadata that would allow this. As mentioned in our introduction, scholars have alleged that the platform’s personalized recommendations and infinite scrolling design can encourage binge-watching behaviors, especially among younger children and teenagers who have access to Netflix accounts. It would have been interesting to examine user metadata to see the average amount of time that viewers spend on Netflix, or what time of day or year viewers are most likely to go on Netflix, and then to explore how Netflix uses this information to tailor the platform for higher subscriber retention.
How was the data generated, and by whom?
The raw data was generated from a third-party search engine named Flixable, on which visitors can search and browse a complete list of all the movies and TV shows streaming on Netflix. The dataset was created by Shivam Bansal, a full-stack data scientist from Singapore, who first created it for his own personal analysis and then posted it on Kaggle for public access. (It is unlikely that he was paid to gather the data for analysis.) He lists “regular API calls” as the collection methodology and Flixable as a source, so we assume that he pulled his data from the Flixable site, which sources its information from Netflix itself.
What information is left out of the spreadsheet?
-
User demographics, as explained above;
-
Average user ratings of each show, which would indicate whether such ratings have much bearing on Netflix’s decision to continue a show’s licensing contract;
-
Countries outside of the US that the shows are available in, which could allow us to understand whether international appeal influences Netflix’s licensing decisions;
-
Whether a show is licensed or original;
-
By isolating Netflix’s original shows from the catalog, we could examine each show’s country of production, genre, topic, and language, then compare this data to Netflix’s viewer demographics to see how user statistics impact the platform’s original productions.
-
How long a show has remained on Netflix;
-
The dataset only lists when shows were added to Netflix and doesn’t provide information on whether and when the listed shows were removed. This information could have provided some insight into what customers didn’t enjoy or the average length of time that a show stays on Netflix, which would have helped us understand Netflix’s licensing process and the factors that guide its decisions. In addition, the dataset mostly contains shows added to Netflix between 2016 and January 2020, meaning that any trends in viewer preferences cannot be generalized outside of this time period.
-
The date when a show is due to leave or was removed from Netflix.
-
The catalog only contains shows that were available at the time of data collection, i.e. January 2020. Therefore, shows that were already removed are not included in our analysis. A cumulative dataset of all shows that were once available on the platform would have revealed further trends in how viewer preferences and Netflix’s business decisions have changed over time.
Dataset Ontology
One of the main shortcomings of the data is that it covers only Netflix’s U.S. catalog; the ratings included, such as the Motion Picture Association film rating system, are also American-centric. In our analysis, we saw that most of the shows in the catalog are actually international (i.e. not domestically produced), likely due to Netflix’s aggressive international expansion in recent years and its need to optimize shows for a wide range of audiences. However, we are unable to measure their international performance or how local audiences received them. Generalizations from the analysis are therefore limited to American audiences only.
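For readers who want to reproduce the domestic/international split, here is a minimal sketch of one way to compute it. It is an assumption-laden illustration rather than our exact method: we assume the country column is a comma-separated list, that the home country is written as "United States", and that a co-production listing the United States counts as domestic. The sample values are made up.

```python
def is_international(country_field, home="United States"):
    """Return True if the comma-separated country list omits the home country.

    Assumption: co-productions that include the home country count as domestic.
    """
    countries = [c.strip() for c in country_field.split(",") if c.strip()]
    return bool(countries) and home not in countries


def international_share(country_fields):
    """Share of shows with a known country that are international."""
    known = [f for f in country_fields if f.strip()]
    if not known:
        return 0.0
    return sum(is_international(f) for f in known) / len(known)


# Hypothetical values for the country column, including one missing entry.
sample = [
    "United States",
    "India",
    "United Kingdom, United States",
    "",
    "Japan, France",
]
print(international_share(sample))
```

Counting co-productions as international instead would raise the share, so the definition should be stated alongside any figure reported.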
As mentioned, because the data was collected in January 2020, the dataset is essentially a snapshot of all the shows available at that specific point in time. Any trends that we observe will be biased toward the cultural context of 2020 and our current understanding of the streaming industry’s development. While we can infer explanations for our commentary by referencing additional sources and notable events that developed in the past few years, it is difficult to draw conclusions about trends based on the data itself.
We also noticed two flaws about how the data was organized:
-
A few columns contained more than one entry or type of data, including “director”, “cast”, and “listed_in”. It is possible to find out how frequently each name/category appears by using simple regular expressions, but this requires an additional step of data cleaning. Moreover, preliminary analysis shows that there are 555 levels, or combinations, of countries within the “country” variable, so adding some numerical variables to indicate which broader regions (e.g. Australia, Europe, South America) the listed countries belong to could be more informative.
-
The “duration” variable is perhaps the most inconsistent as it uses minutes as a metric of duration for movies and “seasons” for TV shows. The information would be clearer if categorized into “minutes” (length of film or average length of each episode), “episodes” (number of episodes), and “seasons” (number of seasons), allowing for further analysis on the correlation between the length of each show and viewer preference.
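The two cleaning steps described above could look roughly like the following sketch. It assumes that multi-valued cells are comma-separated and that duration strings look like "90 min" or "2 Seasons"; both formats are our assumptions and should be verified against the file. The helper names and sample rows are hypothetical, and the splitting uses a plain string split rather than a regular expression.

```python
import re
from collections import Counter


def split_values(field):
    """Split a comma-separated cell into trimmed, non-empty values."""
    return [value.strip() for value in field.split(",") if value.strip()]


def count_values(rows, column):
    """Count how often each individual value appears in a multi-valued column."""
    counter = Counter()
    for row in rows:
        counter.update(split_values(row.get(column, "")))
    return counter


# Assumed duration formats: "<n> min" for movies, "<n> Season(s)" for TV shows.
DURATION_RE = re.compile(r"^\s*(\d+)\s*(min|seasons?)\s*$", re.IGNORECASE)


def parse_duration(text):
    """Return (minutes, seasons); the field that does not apply is None."""
    match = DURATION_RE.match(text)
    if match is None:
        return (None, None)  # unrecognized format, left for manual review
    value, unit = int(match.group(1)), match.group(2).lower()
    if unit == "min":
        return (value, None)
    return (None, value)


# Hypothetical sample rows.
rows = [
    {"listed_in": "Documentaries, Stand-Up Comedies", "duration": "90 min"},
    {"listed_in": "Kids' TV", "duration": "2 Seasons"},
    {"listed_in": "Documentaries", "duration": "1 Season"},
]

genre_counts = count_values(rows, "listed_in")
durations = [parse_duration(row["duration"]) for row in rows]
print(genre_counts.most_common(1))
print(durations)
```

Keeping minutes and seasons in separate fields means movie runtimes and TV season counts are never averaged together by accident. Episode counts are not in the dataset, so they cannot be derived this way.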