8 Assignments
8.1 R Markdown
This page will guide you through creating and editing R Markdown documents. This is a useful tool for reporting your analysis (e.g. for homework assignments). Of course, there is also a cheat sheet for R-Markdown and this book contains a comprehensive discussion of the format.
The following video contains a short introduction to the R Markdown format.
Creating a new R Markdown document
In addition to the video, the following text contains a short description of the most important formatting options.
Let’s go through the steps of creating an .Rmd file and outputting the content to an HTML file.
If an R-Markdown file was provided to you, open it with R-Studio and skip to step 4 after adding your answers.
1. Open R-Studio
2. Create a new R-Markdown document
3. Save with appropriate name
    - 3.1. Add your answers
    - 3.2. Save again
4. “Knit” to HTML
5. Hand in appropriate file (ending in `.html`) on learn@WU
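The “Knit” step can also be triggered from the R console instead of the RStudio button. The following is a minimal sketch, assuming the `rmarkdown` package and pandoc (which ships with RStudio) are installed. The file name `assignment1.Rmd` and its content are hypothetical and only serve to make the example self-contained.

```r
# Hypothetical file name; replace it with the name of your own .Rmd file.
rmd_file <- file.path(tempdir(), "assignment1.Rmd")

# Three backticks, built programmatically so they can be written into the file
fence <- strrep("`", 3)

# Write a tiny example document so that the sketch is self-contained
writeLines(c(
  "---",
  "title: \"Assignment 1\"",
  "output: html_document",
  "---",
  "",
  "Some text.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_file)

# Knit to HTML only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
  print(html_file) # path of the .html file that would be handed in
}
```

`rmarkdown::render()` returns the path of the output file, so the result can be checked before uploading it.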
Text and Equations
R-Markdown documents are plain text files that include both text and R-code. Using RStudio they can be converted (‘knitted’) to HTML or PDF files that include both the text and the results of the R-code. In fact this website is written using R-Markdown and RStudio. In order for RStudio to be able to interpret the document you have to use certain characters or combinations of characters when formatting text and including R-code to be evaluated. By default the document starts with the options for the text part. You can change the title, date, author and a few more advanced options.
The default is text mode, meaning that lines in an Rmd document will be interpreted as text, unless specified otherwise.
Headings
Usually you want to include some kind of heading to structure your text. A heading is created using `#` signs. A single `#` creates a first level heading, two `##` a second level and so on.
It is important to note here that the `#` symbol means something different within the code chunks (where it starts a comment) as opposed to outside of them. If you continue to put a `#` in front of all your regular text, it will all be interpreted as a first level heading, making your text very large.
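To illustrate the difference, here is a small sketch of what `#` does inside a code chunk. The variable name `x` is just a placeholder.

```r
# Inside a chunk this line is an R comment, not a heading; it is ignored when the code runs
x <- 1 + 1 # a comment can also follow code on the same line
x
```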
Lists
Bullet point lists are created using `*`, `+` or `-`. Sub-items are created by indenting the item using 4 spaces or 2 tabs.
* First Item
* Second Item
    + first sub-item
        - first sub-sub-item
    + second sub-item
- First Item
- Second Item
    - first sub-item
        - first sub-sub-item
    - second sub-item
Ordered lists can be created using numbers and letters. If you need sub-sub-items use `A)` instead of `A.` on the third level.
1. First item
    a. first sub-item
        A) first sub-sub-item
    b. second sub-item
2. Second item
- First item
    - first sub-item
        - first sub-sub-item
    - second sub-item
- Second item
Text formatting
Text can be formatted in italics (`*italics*`) or bold (`**bold**`). In addition, you can add block quotes with `>`.
> Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.
Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.
R-Code
R-code is contained in so-called “chunks”. These chunks always start with three backticks and `r` in curly braces (`{r}`) and end with three backticks. Optionally, parameters can be added after the `r` to influence how a chunk behaves. Additionally, you can also give each chunk a name. Note that these names have to be unique, otherwise the document will fail to knit.
Global and chunk options
In a newly created document, the first chunk looks as follows:
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
It is added to the document automatically and sets options for all the following chunks. These options can be overridden on a per-chunk basis.
Keep `knitr::opts_chunk$set(echo = TRUE)` to print your code to the document you will hand in. Changing it to `knitr::opts_chunk$set(echo = FALSE)` will hide your code by default. This can be changed on a per-chunk basis.
```{r cars, echo = FALSE}
summary(cars)
plot(dist~speed, cars)
```
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
```{r cars2, echo = TRUE}
summary(cars)
plot(dist~speed, cars)
```
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
A good overview of all available global/chunk options can be found here.
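As an illustration of global options, the setup chunk could set several defaults at once. This is only a sketch: `echo = TRUE` comes from the text above, while `message`, `warning` and `fig.align` are additional knitr options chosen here as an assumption about what is commonly useful for assignments.

```r
library(knitr)

# Set defaults for all following chunks
opts_chunk$set(
  echo = TRUE,         # show the code in the knitted document
  message = FALSE,     # hide messages, e.g. from loading packages
  warning = FALSE,     # hide warnings
  fig.align = "center" # center figures
)

# Inspect a current default
opts_chunk$get("echo")
```

Any of these defaults can still be overridden in the header of an individual chunk, as with `echo = FALSE` in the `cars` chunk.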
LaTeX Math
Writing well formatted mathematical formulas is done the same way as in LaTeX. Math mode is started and ended using `$$`.
$$
f_1(\omega) = \frac{\sigma^2}{2 \pi},\ \omega \in[-\pi, \pi]
$$
\[ f_1(\omega) = \frac{\sigma^2}{2 \pi},\ \omega \in[-\pi, \pi] \]
(for those interested this is the spectral density of white noise)
Including inline mathematical notation is done with a single `$` symbol.
${2\over3}$ of my code is inline.
\({2\over3}\) of my code is inline.
Take a look at this wikibook on Mathematics in LaTeX and this list of Greek letters and mathematical symbols if you are not familiar with LaTeX.
In order to write multi-line equations in the same math environment, use `\\` after every line. In order to insert a space use a single `\` (a backslash followed by a space). To render text inside a math environment use `\text{here is the text}`. In order to align equations start with `\begin{align}` and place an `&` in each line at the point around which it should be aligned. Finally end with `\end{align}`.
$$
\begin{align}
\text{First equation: }\ Y &= X \beta + \epsilon_y,\ \forall X \\
\text{Second equation: }\ X &= Z \gamma + \epsilon_x
\end{align}
$$
\[ \begin{align} \text{First equation: }\ Y &= X \beta + \epsilon_y,\ \forall X \\ \text{Second equation: }\ X &= Z \gamma + \epsilon_x \end{align} \]
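When knitting to PDF rather than HTML, `align` inside `$$` typically causes a LaTeX error, because `align` starts its own math mode. A possible alternative, shown here as a sketch of the same two equations, is the `aligned` environment, which is meant to be used inside math mode and should work for both output formats:

```latex
$$
\begin{aligned}
\text{First equation: }\ Y &= X \beta + \epsilon_y,\ \forall X \\
\text{Second equation: }\ X &= Z \gamma + \epsilon_x
\end{aligned}
$$
```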
Important symbols
Symbol | Code |
---|---|
\(a^{2} + b\) | `a^{2} + b` |
\(a^{2+b}\) | `a^{2+b}` |
\(a_{1}\) | `a_{1}` |
\(a \leq b\) | `a \leq b` |
\(a \geq b\) | `a \geq b` |
\(a \neq b\) | `a \neq b` |
\(a \approx b\) | `a \approx b` |
\(a \in (0,1)\) | `a \in (0,1)` |
\(a \rightarrow \infty\) | `a \rightarrow \infty` |
\(\frac{a}{b}\) | `\frac{a}{b}` |
\(\frac{\partial a}{\partial b}\) | `\frac{\partial a}{\partial b}` |
\(\sqrt{a}\) | `\sqrt{a}` |
\(\sum_{i = 1}^{b} a_i\) | `\sum_{i = 1}^{b} a_i` |
\(\int_{a}^b f(c) dc\) | `\int_{a}^b f(c) dc` |
\(\prod_{i = 0}^b a_i\) | `\prod_{i = 0}^b a_i` |
\(c \left( \sum_{i=1}^b a_i \right)\) | `c \left( \sum_{i=1}^b a_i \right)` |
The `{}` after `_` and `^` are not strictly necessary if there is only one character in the sub-/superscript. However, in order to place multiple characters in the sub-/superscript they are necessary, e.g.
Symbol | Code |
---|---|
\(a^b = a^{b}\) | `a^b = a^{b}` |
\(a^b+c \neq a^{b+c}\) | `a^b+c \neq a^{b+c}` |
\(\sum_i a_i = \sum_{i} a_{i}\) | `\sum_i a_i = \sum_{i} a_{i}` |
\(\sum_{i=1}^{b+c} a_i \neq \sum_i=1^b+c a_i\) | `\sum_{i=1}^{b+c} a_i \neq \sum_i=1^b+c a_i` |
Greek letters
Greek letters are written as a `\` followed by their name (`$\beta$` = \(\beta\)). In order to capitalize them simply capitalize the first letter of the name (`$\Gamma$` = \(\Gamma\)).
8.2 Assignment 1 (Solutions)
8.2.1 Load libraries and data
For your convenience the following code will load the required `tidyverse` library as well as the data. Make sure to convert each of the variables you use for your analysis to the appropriate data types (e.g., `Date`, `factor`).
library(tidyverse)
library(scales)
options(scipen = 999) # disable scientific notation
music_data <- read.csv2("https://raw.githubusercontent.com/WU-RDS/MA2024/main/data/music_data_fin.csv")
str(music_data)
## 'data.frame': 66796 obs. of 31 variables:
## $ isrc : chr "BRRGE1603547" "USUM71808193" "ES5701800181" "ITRSE2000050" ...
## $ artist_id : int 3679 5239 776407 433730 526471 1939 210184 212546 4938 119985 ...
## $ streams : num 11944813 8934097 38835 46766 2930573 ...
## $ weeks_in_charts : int 141 51 1 1 7 226 13 1 64 7 ...
## $ n_regions : int 1 21 1 1 4 8 1 1 5 1 ...
## $ danceability : num 50.9 35.3 68.3 70.4 84.2 35.2 73 55.6 71.9 34.6 ...
## $ energy : num 80.3 75.5 67.6 56.8 57.8 91.1 69.6 24.5 85 43.3 ...
## $ speechiness : num 4 73.3 14.7 26.8 13.8 7.47 35.5 3.05 3.17 6.5 ...
## $ instrumentalness : num 0.05 0 0 0.000253 0 0 0 0 0.02 0 ...
## $ liveness : num 46.3 39 7.26 8.91 22.8 9.95 32.1 9.21 11.4 10.1 ...
## $ valence : num 65.1 43.7 43.4 49.5 19 23.6 58.4 27.6 36.7 76.8 ...
## $ tempo : num 166 191.2 99 91 74.5 ...
## $ song_length : num 3.12 3.23 3.02 3.45 3.95 ...
## $ song_age : num 228.3 144.3 112.3 50.7 58.3 ...
## $ explicit : int 0 0 0 0 0 0 0 0 1 0 ...
## $ n_playlists : int 450 768 48 6 475 20591 6 105 547 688 ...
## $ sp_popularity : int 51 54 32 44 52 81 44 8 59 68 ...
## $ youtube_views : num 145030723 13188411 6116639 0 0 ...
## $ tiktok_counts : int 9740 358700 0 13 515 67300 0 0 653 3807 ...
## $ ins_followers_artist : int 29613108 3693566 623778 81601 11962358 1169284 1948850 39381 9751080 343 ...
## $ monthly_listeners_artist : int 4133393 18367363 888273 143761 15551876 16224250 2683086 1318874 4828847 3088232 ...
## $ playlist_total_reach_artist: int 24286416 143384531 4846378 156521 90841884 80408253 7332603 24302331 8914977 8885252 ...
## $ sp_fans_artist : int 3308630 465412 23846 1294 380204 1651866 214001 10742 435457 1897685 ...
## $ shazam_counts : int 73100 588550 0 0 55482 5281161 0 0 39055 0 ...
## $ artistName : chr "Luan Santana" "Alessia Cara" "Ana Guerra" "Claver Gold feat. Murubutu" ...
## $ trackName : chr "Eu, Você, O Mar e Ela" "Growing Pains" "El Remedio" "Ulisse" ...
## $ release_date : chr "2016-06-20" "2018-06-14" "2018-04-26" "2020-03-31" ...
## $ genre : chr "other" "Pop" "Pop" "HipHop/Rap" ...
## $ label : chr "Independent" "Universal Music" "Universal Music" "Independent" ...
## $ top10 : int 1 0 0 0 0 1 0 0 0 0 ...
## $ expert_rating : chr "excellent" "good" "good" "poor" ...
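The type conversion mentioned above could look roughly like the following sketch. To keep it self-contained it uses a small toy data frame (`toy_data`, a hypothetical stand-in) with column names and example values taken from the `str()` output. The ordering of the `expert_rating` levels is an assumption based only on the three values visible there. The full data may contain further levels, which would have to be added to `levels` to avoid `NA`s.

```r
library(dplyr)

# Toy stand-in for music_data (hypothetical; values taken from the str() output above)
toy_data <- data.frame(
  release_date  = c("2016-06-20", "2018-06-14", "2020-03-31"),
  genre         = c("other", "Pop", "HipHop/Rap"),
  explicit      = c(0L, 0L, 1L),
  expert_rating = c("excellent", "good", "poor")
)

toy_data <- toy_data %>%
  mutate(
    release_date  = as.Date(release_date), # "YYYY-MM-DD" strings to Date
    genre         = as.factor(genre),      # unordered categories
    explicit      = factor(explicit, levels = c(0, 1),
                           labels = c("not explicit", "explicit")),
    # Assumed ordering; extend the levels if the real data has more categories
    expert_rating = factor(expert_rating,
                           levels = c("poor", "good", "excellent"),
                           ordered = TRUE)
  )
str(toy_data)
```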
8.2.2 Task 1
- Determine the most popular song by the artist “BTS”.
- Create a new `data.frame` that only contains songs by “BTS” (Bonus: Also include songs that feature both BTS and other artists, see e.g., “BTS feat. Charli XCX”)
- Save the `data.frame` sorted by success (number of streams) with the most popular songs occurring first.
# provide your code here
# 1. Most popular song by "BTS" (exact match on the artist name)
bts_data <- music_data %>%
  filter(artistName == "BTS") %>%
  arrange(desc(streams)) %>%
  select(artistName, trackName, streams) %>%
  head(1)
bts_data
# 2. Bonus: also match artist names that contain "BTS" (e.g., features)
bts_data <- music_data %>%
  filter(str_detect(artistName, "BTS")) %>%
  arrange(desc(streams)) %>%
  select(artistName, trackName, streams) %>%
  head(1)
bts_data
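The solution above keeps only the top row. To save the full sorted `data.frame` requested in the task, the `head(1)` step can be left out. The following sketch shows the difference between the exact match and the pattern match on a toy data frame. The track names “Song A” to “Song C” and all stream counts are made up for illustration.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical track names and stream counts)
toy_data <- data.frame(
  artistName = c("BTS", "BTS feat. Charli XCX", "Alessia Cara", "BTS"),
  trackName  = c("Song A", "Song B", "Growing Pains", "Song C"),
  streams    = c(500, 900, 700, 1200)
)

# Exact match: only rows where the artist is exactly "BTS"
bts_exact <- toy_data %>%
  filter(artistName == "BTS") %>%
  arrange(desc(streams))

# Pattern match: any artist name containing "BTS", including features
bts_all <- toy_data %>%
  filter(str_detect(artistName, "BTS")) %>%
  arrange(desc(streams))

nrow(bts_exact)      # number of songs by BTS alone
nrow(bts_all)        # number of songs including features
bts_all$trackName[1] # most streamed song in the toy data
```

Note that `str_detect(artistName, "BTS")` matches any name containing these three letters. If that turns out to be too broad for the real data, a stricter pattern such as `"\\bBTS\\b"` could be considered.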
8.2.3 Task 2
Create a new `data.frame` containing the 100 most streamed songs.
# provide your code here
# Sort by streams (descending) and keep the first 100 rows
top100 <- music_data %>%
  arrange(desc(streams)) %>%
  select(artistName, trackName, streams) %>%
  head(100)
top100
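As a possible alternative to `arrange()` plus `head()`, dplyr also offers `slice_max()`. The sketch below compares both on a toy data frame with made-up values. Note that `slice_max()` keeps ties by default and may then return more than `n` rows, so `with_ties = FALSE` is set here to mirror the behaviour of `head()`.

```r
library(dplyr)

# Toy data (hypothetical values) with a tie in the number of streams
toy_data <- data.frame(
  trackName = paste("Song", 1:5),
  streams   = c(10, 50, 30, 50, 20)
)

# Approach used in the solution: sort, then keep the first n rows
top3_head <- toy_data %>%
  arrange(desc(streams)) %>%
  head(3)

# Alternative: select the n rows with the largest values directly
top3_slice <- toy_data %>%
  slice_max(streams, n = 3, with_ties = FALSE)

top3_head
top3_slice
```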
8.2.4 Task 3
- Determine the most popular genres.
- Group the data by genre and calculate the total number of streams within each genre.
- Sort the result to show the most popular genre first.
- Create a bar plot in which the heights of the bars correspond to the total number of streams within a genre (Bonus: order the bars by their height)
# provide your code here
genre_data <- music_data %>%
  group_by(genre) %>%
  summarize(total_streams = sum(streams)) %>%
  arrange(-total_streams)
genre_data
ggplot(genre_data, aes(x = reorder(genre, total_streams), y = total_streams)) +
  geom_bar(stat = "identity") +
  coord_flip() + # optional: makes horizontal bars
  labs(x = "Genre", y = "Total Streams") +
  theme_minimal()
8.2.5 Task 4
- Rank the music labels by their success (total number of streams of all their songs)
- Show the total number of streams as well as the average and the median of all songs by label. (Bonus: Also add the artist and track names and the number of streams of each label’s top song to the result)
# provide your code here
label_data <- music_data %>%
  group_by(label) %>%
  dplyr::summarize(total_streams = sum(streams),
                   avg_streams = mean(streams),
                   med_streams = median(streams)) %>%
  arrange(-total_streams)
label_data
# Bonus: add artist, title and streams of each label's top song
label_data <- music_data %>%
  group_by(label) %>%
  dplyr::summarize(total_streams = sum(streams),
                   avg_streams = mean(streams),
                   med_streams = median(streams),
                   top_song_artist = artistName[which.max(streams)],
                   top_song_title = trackName[which.max(streams)],
                   top_song_streams = max(streams)) %>%
  arrange(-total_streams)
label_data
8.2.6 Task 5
- How do genres differ in terms of song features (audio features + song length + explicitness + song age)?
- Select appropriate summary statistics for each of the variables and highlight the differences between genres using the summary statistics.
- Create an appropriate plot showing the differences of “energy” across genres.
# provide your code here
# Summary statistics for all song features, by genre
plot_data <- music_data %>%
  group_by(genre) %>%
  summarize(across(danceability:explicit, list(
    avg = mean, std.dev = sd, median = median,
    pct_10 = \(x) quantile(x, 0.1),
    pct_90 = \(x) quantile(x, 0.9)
  )))
plot_data
# Boxplot of energy by genre, ordered by median energy
ggplot(music_data, aes(x = fct_reorder(factor(genre), energy, median), y = energy)) +
  geom_boxplot(fill = "steelblue", color = "gray30", alpha = 0.8,
               outlier.color = "firebrick", outlier.alpha = 0.6) +
  labs(title = "Distribution of Song Energy by Genre",
       subtitle = "Genres ordered by median energy level",
       x = NULL, y = "Energy (0 = low, 100 = high)") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray40"),
        axis.title.y = element_text(face = "bold"),
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
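Because `across()` is given a named list of functions, the resulting columns are named by combining the variable name and the function name (for example `energy_avg`). The following sketch shows this naming on a toy data frame with made-up values. Like the solution above, it uses the `\(x)` shorthand, which needs R 4.1 or newer.

```r
library(dplyr)

# Toy data (hypothetical values on the same 0-100 scale as the song features)
toy_data <- data.frame(
  genre   = c("Pop", "Pop", "Rock", "Rock"),
  energy  = c(60, 80, 90, 70),
  valence = c(40, 50, 20, 30)
)

toy_summary <- toy_data %>%
  group_by(genre) %>%
  summarize(across(energy:valence, list(
    avg = mean,
    pct_90 = \(x) quantile(x, 0.9)
  )))

# One column per variable/function combination
names(toy_summary)
toy_summary
```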
8.2.7 Task 6
Visualize the number of songs by label.
# provide your code here
library(scales) # provides comma() for the labels
plot_data <- music_data %>%
  group_by(label) %>%
  summarize(n_songs = n_distinct(isrc)) %>%
  mutate(label = fct_reorder(label, n_songs))
plot_data
ggplot(data = plot_data, aes(x = n_songs, y = label)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = comma(n_songs)), hjust = -0.1, size = 3.5) +
  labs(x = "Number of Songs", y = "Label") +
  theme_minimal() +
  theme(plot.margin = margin(5, 35, 5, 20)) +
  scale_x_continuous(
    labels = comma,
    limits = c(0, max(plot_data$n_songs) * 1.15), # add space for labels
    expand = expansion(mult = c(0, 0.02))
  )
8.2.8 Task 7
Visualize the average monthly artist listeners (`monthly_listeners_artist`) by genre.
# provide your code here
# Average monthly listeners per genre; genre is reordered so that the bars are sorted
plot_data <- music_data %>%
  group_by(genre) %>%
  summarize(avg_m_listeners = mean(monthly_listeners_artist)) %>%
  mutate(genre = fct_reorder(factor(genre), avg_m_listeners))
ggplot(plot_data, aes(y = genre, x = avg_m_listeners)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_text(aes(label = comma(round(avg_m_listeners, 0))),
            hjust = -0.1, size = 3.5, color = "gray20") +
  scale_x_continuous(labels = comma, expand = expansion(mult = c(0, 0.05))) +
  coord_cartesian(clip = "off") +
  labs(title = "Average Monthly Artist Listeners by Genre",
       x = "Average Monthly Listeners", y = NULL) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", size = 14),
        axis.title.x = element_text(face = "bold"),
        axis.text.y = element_text(size = 11),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        plot.margin = margin(5, 50, 5, 10))
8.2.9 Task 8
Create a histogram of the variable “valence”.
# provide your code here
ggplot(music_data, aes(x = valence)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(title = "Distribution of Song Valence",
       subtitle = "Valence measures the musical positiveness conveyed by a track",
       x = "Valence (positiveness of the song)", y = "Number of Songs") +
  scale_x_continuous(labels = number_format(accuracy = 0.1)) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray40"),
        axis.title = element_text(face = "bold"),
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
8.2.10 Task 9
Create a scatter plot showing `youtube_views` and `shazam_counts` (Bonus: add a linear regression line). Interpret the plot briefly.
# provide your code here
ggplot(music_data, aes(x = youtube_views, y = shazam_counts)) +
  geom_point(alpha = 0.7, color = "steelblue", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "darkred", linewidth = 1) +
  scale_x_continuous(labels = comma, name = "YouTube Views") +
  scale_y_continuous(labels = comma, name = "Shazam Counts") +
  labs(title = "Relationship Between YouTube Views and Shazam Counts",
       subtitle = "Each point represents a song; red line shows the fitted linear trend",
       caption = "Source: Internal music data") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray40"),
        axis.title = element_text(face = "bold"),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_line(color = "gray90"),
        plot.caption = element_text(size = 9, color = "gray50"))
The fitted regression line has a positive slope: on average, songs with more YouTube views also have higher Shazam counts. However, the relationship does not appear to be linear.
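One common way to examine such a non-linear relationship between heavily skewed count variables is to plot both on a logarithmic scale. This is not part of the original solution, only a sketch of a possible follow-up. It uses simulated toy data (all values are made up) and `log1p()`, i.e. log(1 + x), because counts of zero occur in the real data and `log(0)` is not finite.

```r
library(ggplot2)

# Simulated toy data (hypothetical): skewed view counts and a noisy, non-linear relationship
set.seed(42)
toy_data <- data.frame(youtube_views = round(rlnorm(500, meanlog = 12, sdlog = 2)))
toy_data$shazam_counts <- round(toy_data$youtube_views^0.6 * rlnorm(500, 0, 0.5))

# Scatter plot on the log(1 + x) scale with a linear fit
p <- ggplot(toy_data, aes(x = log1p(youtube_views), y = log1p(shazam_counts))) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(x = "log(1 + YouTube Views)", y = "log(1 + Shazam Counts)") +
  theme_minimal()
p
```

Whether the transformed relationship is closer to linear in the actual `music_data` would have to be checked by replacing `toy_data` with the real data.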