
The data was mostly straightforward, but it required some cleaning and separation into different data sets. First, we omitted the “NA”s and blank spaces from the dataset. Then, we created vectors for each decade, filtering the appropriate years into the appropriate vector. Using these vectors, we created separate datasets for the song lyrics of different decades. We created a list of new stop words, such as “dont,” “im,” “youre,” “ill,” “gonna,” “aint,” “ive,” “youll,” and “wont.” We then combined this custom list of stop words with the default list of stop words. Then, we removed the stop words from each decade’s dataset. Finally, we separated the lyric words into one word per row.

The data also includes two different variations for artist collaborations. Newer songs include the word “featuring” before featured artists, while older songs just have a space between two artists. However, we did not clean this data variation because it did not affect any of our visualizations or models. In addition, we did not correct spelling errors or derivatives, such as “night” vs. “nite” or “thingll” instead of “thing will,” because they did not significantly affect our results.

First, we created a vector called “Sixties” and included the years of the 1960s, starting with 1965 (the first available year of data). Then, we created a separate dataset called “Sixtieslyrics” and filtered the “songs” data frame for rows that included the years in our “Sixties” vector. Because our “Sixties” vector only includes five years, our data and visualizations are not representative of the entire decade. Then, we created another dataset called “Sixtieswords” in which we separated each word in the lyrics column of our “Sixtieslyrics” dataset into a separate row.

After that, we created a “stopwords” data frame that included both the default list of stop words and a custom list of stop words. Before determining the number of times each word was used in the 1960s, we removed the words in our “stopwords” data frame from the “Sixtieswords” dataset in order to prevent commonly used words such as “the” from appearing in our word count data. We also removed the words in our “stopwords” data frame before creating bigrams and trigrams. This prevents commonly used words such as “is” from appearing in our two-word and three-word phrases.

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 1965-1969 only.

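The preparation steps described in this section (filtering to the 1965-1969 years, splitting lyrics into one word per row, removing default and custom stop words, then counting words and two-word phrases) can be sketched roughly as follows. This is an illustrative sketch in plain Python, not the code used for the analysis: the sample songs, the tiny `DEFAULT_STOP` set, and the helper names `tokenize` and `ngrams` are assumptions made for the example. Only the custom stop words and the year range come from the text.

```python
import re
from collections import Counter

# Hypothetical stand-in for the "songs" data frame: one record per song.
songs = [
    {"year": 1965, "lyrics": "I dont know why the night is so long"},
    {"year": 1969, "lyrics": "The night is young and the love is true"},
    {"year": 1972, "lyrics": "Love is all around"},
]

# The "Sixties" vector: only 1965-1969 are available.
sixties = range(1965, 1970)

# Assumed tiny subset of a default stop word list, plus the custom list from the text.
DEFAULT_STOP = {"i", "the", "is", "and", "so", "why", "all"}
CUSTOM_STOP = {"dont", "im", "youre", "ill", "gonna", "aint", "ive", "youll", "wont"}
stopwords = DEFAULT_STOP | CUSTOM_STOP


def tokenize(lyrics):
    """Lowercase, drop apostrophes, and split into one word per element."""
    cleaned = re.sub(r"['’]", "", lyrics.lower())
    return re.findall(r"[a-z]+", cleaned)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# "Sixtieslyrics": rows whose year is in the Sixties vector and whose lyrics are not blank.
sixties_lyrics = [s for s in songs if s["year"] in sixties and s["lyrics"].strip()]

word_counts = Counter()
bigram_counts = Counter()
for song in sixties_lyrics:
    # Stop words are removed before counting words and before building bigrams.
    kept = [w for w in tokenize(song["lyrics"]) if w not in stopwords]
    word_counts.update(kept)
    bigram_counts.update(ngrams(kept, 2))

# The data behind a "top 15 words" bar chart.
top_words = word_counts.most_common(15)
```

One consequence of removing stop words before forming phrases, as this sketch does, is that a bigram can join two words that were separated by a stop word in the original lyric.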
We explored several hypotheses while analyzing this data set. Specifically, we looked at the differences among word count, average number of unique words per song, top 15 words used, popular two- or three-word phrases, and sentimental words for each decade. We were more interested in learning about trends behind the lyrics rather than trends behind artists or song names. Our null hypothesis is that regardless of the decade, the word count, average number of unique words per song, top 15 words used, top bigrams and trigrams, and proportion of sentimental words will remain the same. Overall, our alternative hypothesis is that the word count, average number of unique words per song, top 15 words used, popular two-word or three-word phrases, and proportion of sentimental words change each decade.

The first hypothesis we are testing is that the top 15 words, top two-word phrases, and top three-word phrases have changed over the decades, with more profanity and fewer sophisticated words with each passing decade. However, we also hypothesize that common themes in songs will stay the same.

The second hypothesis we are testing is that the number of unique words will decrease each decade, indicating that songs are becoming more repetitive.

The third hypothesis we are testing is that the proportions of positive and negative sentiment words differ in each decade, so we can accurately predict a song’s decade with a decision tree based on how similar its own proportion of positive or negative words is to that of a particular decade.
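The second hypothesis depends on the average number of unique words per song in each decade. A minimal sketch of how that metric could be computed is shown below, in plain Python. The sample songs and the helper names `decade_of` and `avg_unique_words_by_decade` are hypothetical. The sketch also assumes that songs are grouped by the decade their year falls in and that unique words are counted before any stop word removal, neither of which is stated in the text.

```python
from collections import defaultdict

# Hypothetical sample records; the real data has one row per charting song.
songs = [
    {"year": 1965, "lyrics": "la la la love"},
    {"year": 1968, "lyrics": "night and day and night"},
    {"year": 1975, "lyrics": "dance dance dance"},
]


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1968 -> 1960."""
    return year // 10 * 10


def avg_unique_words_by_decade(songs):
    """Average count of distinct words per song, grouped by decade."""
    per_decade = defaultdict(list)
    for song in songs:
        words = song["lyrics"].lower().split()
        per_decade[decade_of(song["year"])].append(len(set(words)))
    return {d: sum(counts) / len(counts) for d, counts in sorted(per_decade.items())}


averages = avg_unique_words_by_decade(songs)
```

Under this hypothesis, the averages would be expected to fall from one decade to the next.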
