Kara Belknap & Cassio Monti 2022-11-05
This report contains an Exploratory Data Analysis (EDA) of the lifestyle data channel and a modeling section applying different regression methods that attempt to predict article-sharing trends on the Mashable website.
The objective of this analysis is to provide a comprehensive overview of publication metrics and their relationship with the number of shares that those publications received during the study period. The data were collected from Mashable, one of the largest news websites, and comprise the content of all lifestyle-channel articles published in 2013 and 2014. The full data description can be found here.
These data were originally collected and analyzed by Fernandes et al. (2015), in which the authors performed a classification task comparing several machine learning algorithms. In the present study, a subset of the data used by Fernandes et al. (2015), corresponding to the lifestyle data channel, is used for regression purposes. The response variable is the number of shares that the articles received after publication. In other words, we will try to predict the number of shares that an article will receive before it is published and evaluate the predictions of each selected model on the test set with some common metrics: RMSE (Root Mean Squared Error), R-squared (Coefficient of Determination), and MAE (Mean Absolute Error). To perform the regression, the methods Random Forest, Boosting, Multiple Linear Regression, and LASSO regression will be used. More information about the methods is provided in the corresponding sections.
Some metrics were calculated from the information obtained from the Mashable website. For instance, Latent Dirichlet Allocation (LDA) was applied to the data set to identify the top 5 relevant topics and then measure the closeness of each article to each topic. There are 5 topic-relevance metrics according to LDA:

LDA_00
: Closeness to LDA topic 0

LDA_01
: Closeness to LDA topic 1

LDA_02
: Closeness to LDA topic 2

LDA_03
: Closeness to LDA topic 3

LDA_04
: Closeness to LDA topic 4

Additionally, some quality metrics related to the keywords were calculated and will be used in this analysis. These metrics represent the average number of shares for publications containing the worst, best, and average keywords; the classification of keywords into these groups was made by the authors of the original paper. The keyword metrics are shown below.

kw_avg_min
: Worst keyword (avg. shares)

kw_avg_max
: Best keyword (avg. shares)

kw_avg_avg
: Avg. keyword (avg. shares)

Article content metrics were also used in this study. These are general metrics about the body of a publication that can influence its number of shares. The content metrics are shown below.

num_videos
: Number of videos

n_tokens_content
: Number of words in the content

n_non_stop_unique_tokens
: Rate of unique non-stop words in the content

num_hrefs
: Number of links

num_self_hrefs
: Number of links to other articles published by Mashable

average_token_length
: Average length of the words in the content

These data were collected during 2013 and 2014 on a daily basis. To represent time-dependent information, a binary variable, is_weekend, indicating whether the publication was made on a weekend or a weekday, is used.
Before we can begin our analysis, we must load the following packages:
library(tidyverse)
library(caret)
library(GGally)
library(knitr)
Tidyverse is used for data management and plotting through the dplyr and ggplot2 packages. The caret package is used for data splitting and modeling. GGally is used for correlation and exploratory plots that assist in visualization. The knitr package is used to produce nice-looking tables.
Using the data file OnlineNewsPopularity.csv, we will read in the data and add a new column, dataChannel, corresponding to the data channel under which each article was classified. Note that some rows are unclassified with respect to the six channels of interest; these are labeled other. The rows labeled other were excluded from all reports since they had not been assigned to one of our channels of interest. Once the new column is created, we can easily subset the data using the filter function to create a data set for each data channel. We removed the original data_channel_is_* columns as well as two non-predictive columns, url and timedelta.
# reading in the data set
rawData <- read_csv("../OnlineNewsPopularity.csv")
# creating new variable to have more comprehensive names for data channels.
rawDataChannel <- rawData %>%
  mutate(dataChannel = ifelse(data_channel_is_lifestyle == 1, "lifestyle",
                       ifelse(data_channel_is_entertainment == 1, "entertainment",
                       ifelse(data_channel_is_bus == 1, "bus",
                       ifelse(data_channel_is_socmed == 1, "socmed",
                       ifelse(data_channel_is_tech == 1, "tech",
                       ifelse(data_channel_is_world == 1, "world",
                              "other"))))))) %>%
  select(-data_channel_is_lifestyle, -data_channel_is_entertainment,
         -data_channel_is_bus, -data_channel_is_socmed, -data_channel_is_tech,
         -data_channel_is_world, -url, -timedelta)
# assigning channel data to R objects.
lifestyleData <- rawDataChannel %>%
filter(dataChannel == "lifestyle")
entertainmentData <- rawDataChannel %>%
filter(dataChannel == "entertainment")
busData <- rawDataChannel %>%
filter(dataChannel == "bus")
socmedData <- rawDataChannel %>%
filter(dataChannel == "socmed")
techData <- rawDataChannel %>%
filter(dataChannel == "tech")
worldData <- rawDataChannel %>%
filter(dataChannel == "world")
To select the appropriate data channel based on params$channel, we created a function, selectData, which returns the appropriate data set; the result is assigned to activeData. This is the data set we will use for the remainder of the report.
# function to assign automated calls for the different data channels
selectData <- function(dataChannel) {
if (dataChannel == "lifestyle"){
return(lifestyleData)
}
if (dataChannel == "entertainment"){
return(entertainmentData)
}
if (dataChannel == "bus"){
return(busData)
}
if (dataChannel == "socmed"){
return(socmedData)
}
if (dataChannel == "tech"){
return(techData)
}
if (dataChannel == "world"){
return(worldData)
}
}
# activating corresponding data set.
dataChannelSelect <- params$channel
activeData <- selectData(dataChannelSelect)
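For context, the params$channel mechanism above comes from R Markdown's parameterized reports. The report's YAML header is not shown here, so the sketch below is an assumption of what it would look like, with a hypothetical file name:
# Hypothetical YAML header for this parameterized report:
# ---
# params:
#   channel: "lifestyle"
# ---
# One rendered report per channel (file names are assumptions):
rmarkdown::render("project.Rmd",
                  params = list(channel = "lifestyle"),
                  output_file = "lifestyle.md")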
In this section, we will perform EDA for the data channel lifestyle. The data set is split into training and test sets in a 70/30 proportion, and the data summaries are computed on the training set. To split the data, the function createDataPartition() from the caret package was used with the argument p = 0.7, so that 70% of the data goes into the training split. The function set.seed(555) was used to fix the random seed for reproducibility. The code below shows the creation of the training and test sets.
set.seed(555)
trainIndex <- createDataPartition(activeData$shares, p = 0.7, list = FALSE)
activeTrain <- activeData[trainIndex, ]
activeTest <- activeData[-trainIndex, ]
In this section we perform an important EDA step: outlier detection and cleaning. To accomplish this task, we use the standardized (internally studentized) residuals from a linear regression, obtained with the rstandard() function. Linear models are useful for EDA when their residuals are analyzed: a common rule of thumb flags observations whose standardized residuals lie above 2 or below -2 as potential outliers and recommends deleting them. The code below applies this rule and removes the detected outliers from the training set.
# selecting variables of importance
var_sel = select(activeTrain, starts_with("LDA_"), average_token_length,
                 is_weekend, n_tokens_content, n_non_stop_unique_tokens,
                 num_hrefs, num_self_hrefs, num_videos, kw_avg_min,
                 kw_avg_max, kw_avg_avg)
# fitting an MLR with all important predictors
outlier_mod = lm(activeTrain$shares ~ ., data = var_sel)
# indices of standardized residuals above 2 or below -2
a = which(rstandard(outlier_mod) > 2)
b = which(rstandard(outlier_mod) < -2)
# removing these observations from the training set
activeTrain = activeTrain[-c(a, b), ]
A new object created in this section summarizes publications on weekdays and weekends, creating factor levels to pair with the shares variable. The function ifelse() was used to vectorize the IF-ELSE statements inside mutate(), which creates and appends the new variables to the data set. The function factor() was used to explicitly coerce the days of the week into levels of the newly created categorical variable Day.
# IF-ELSE statements
statsData <- activeTrain %>%
  mutate(Day = ifelse(weekday_is_monday == 1, "Monday",
               ifelse(weekday_is_tuesday == 1, "Tuesday",
               ifelse(weekday_is_wednesday == 1, "Wednesday",
               ifelse(weekday_is_thursday == 1, "Thursday",
               ifelse(weekday_is_friday == 1, "Friday",
               ifelse(weekday_is_saturday == 1, "Saturday",
               ifelse(weekday_is_sunday == 1, "Sunday",
                      "missingdata")))))))) %>%
  mutate(Weekend = ifelse(is_weekend == 1, "Yes", "No"))
# Assigning factor levels
statsData$Day <- factor(statsData$Day,
                        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                   "Friday", "Saturday", "Sunday"))
The following table gives summary statistics for the number of shares of articles in the lifestyle data channel. The summary() function was used to extract these metrics.
summary(activeTrain$shares)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28 1100 1600 2907 3100 24100
The following table gives the average, median, and standard deviation of the number of shares based on whether the post was made on a weekend or a weekday. The data were grouped by the Weekend variable via group_by(), and for each level the sum, average, median, and standard deviation of shares were calculated via the sum(), mean(), median(), and sd() functions inside summarise(). The summary table is shown below.
statsData %>%
group_by(Weekend) %>%
summarise(sumShares = sum(shares), avgShares = mean(shares), medShares = median(shares), sdShares = sd(shares)) %>%
kable(caption = "Statistics for Shares for Weekend or Weekdays")
Weekend | sumShares | avgShares | medShares | sdShares |
---|---|---|---|---|
No | 3237634 | 2769.576 | 1500 | 3420.077 |
Yes | 960117 | 3491.335 | 2100 | 3688.831 |
Statistics for Shares for Weekend or Weekdays
Likewise, this table gives information about the number of shares by day of the week. The same functions were used here, applied to the levels of the variable Day. The maximum (max()) and minimum (min()) number of shares by level of Day were also calculated.
statsData %>%
group_by(Day) %>%
arrange(Day) %>%
summarise(sumShares = sum(shares), avgShares = mean(shares), medShares = median(shares), sdShares = sd(shares), maxShares = max(shares),
minShares = min(shares)) %>%
kable(caption = "Statistics for Shares Across Days of Week")
Day | sumShares | avgShares | medShares | sdShares | maxShares | minShares |
---|---|---|---|---|---|---|
Monday | 619580 | 2964.498 | 1500 | 3655.682 | 24100 | 109 |
Tuesday | 577182 | 2542.652 | 1500 | 2806.972 | 19800 | 93 |
Wednesday | 729469 | 2662.296 | 1500 | 3444.118 | 22300 | 95 |
Thursday | 700320 | 2835.304 | 1500 | 3434.143 | 20800 | 28 |
Friday | 611083 | 2882.467 | 1500 | 3729.982 | 23700 | 127 |
Saturday | 445250 | 3478.516 | 2000 | 3748.858 | 23100 | 446 |
Sunday | 514867 | 3502.497 | 2100 | 3648.567 | 21000 | 613 |
Statistics for Shares Across Days of Week
Next, we will analyze the frequency of publications on each day of the week. The one-way contingency table below presents those frequencies.
table(statsData$Day)
##
## Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## 209 227 274 247 212 128 147
Another discrete analysis performed here is a two-way contingency table based on discretizing the response variable shares into two categories; the function cut() was used to this end. We count publications on weekends versus weekdays against the two levels of the response variable: shares between the minimum and the mean number of shares (left column) and between the mean and the maximum (right column). The table below shows the frequencies; in it, 0 (zero) represents weekdays and 1 (one) represents weekends.
cutoff = cut(activeTrain$shares,
breaks = c(min(activeTrain$shares),
mean(activeTrain$shares),
max(activeTrain$shares)),
labels = c(paste0("(",round(min(activeTrain$shares),2),
", ",round(mean(activeTrain$shares),2),
"]"),
paste0("(",round(mean(activeTrain$shares),2),
", ",round(max(activeTrain$shares),2),
"]"))
)
table(activeTrain$is_weekend, cutoff) %>%
kable(caption = "Frequency of Shares in Weekend vs in Weekdays")
 | (28, 2907.03] | (2907.03, 24100] |
---|---|---|
0 | 880 | 288 |
1 | 179 | 96 |
Frequency of Shares in Weekend vs in Weekdays
An important EDA tool for regression tasks is the correlation matrix. The function cor() is used in this section to return the 10 potential predictor variables most correlated with the response variable shares according to Pearson's correlation coefficient. The code below presents the process of obtaining these variables and their respective correlations with the response. The correlations are clearly small in this case, which may make the modeling process difficult and produce low-quality prediction metrics.
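For reference, the quantity computed by cor() for the response $y$ and a predictor $x$ is Pearson's correlation coefficient:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$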
var_sel = select(activeTrain, starts_with("LDA_"), average_token_length,
                 is_weekend, n_tokens_content, n_non_stop_unique_tokens,
                 num_hrefs, num_self_hrefs, num_videos, kw_avg_min,
                 kw_avg_max, kw_avg_avg)
# correlation matrix
correlation = cor(activeTrain$shares, var_sel)
# sorting the highest correlations
p = sort(abs(correlation), decreasing = T)
# getting column ID
var_id = unlist(lapply(1:10,
function(i) which(abs(correlation) == p[i])))
# collecting variable names
var_cor = colnames(correlation)[var_id]
#combining names with correlations
tbcor = cbind(var_cor, correlation[var_id])
# converting to tibble
tbcor = as_tibble(tbcor)
# updating column names
colnames(tbcor)=c("Variables","Correlation")
# rounding the digits
tbcor$Correlation = round(as.numeric(tbcor$Correlation),3)
# nice printing with kable
kable(tbcor, caption = "Top 10 Response Correlated Variables")
Variables | Correlation |
---|---|
kw_avg_avg | 0.165 |
LDA_03 | 0.117 |
average_token_length | -0.104 |
num_hrefs | 0.098 |
n_non_stop_unique_tokens | -0.093 |
kw_avg_min | 0.089 |
is_weekend | 0.081 |
num_videos | 0.080 |
kw_avg_max | 0.074 |
n_tokens_content | 0.070 |
Top 10 Response Correlated Variables
The variables presenting the highest correlation with the response variable shares are kw_avg_avg, LDA_03, average_token_length, num_hrefs, n_non_stop_unique_tokens, kw_avg_min, is_weekend, num_videos, kw_avg_max, and n_tokens_content. These variables will be studied in more depth via Principal Component Analysis (PCA) to understand the orientation of the most important potential predictors. The code below presents the PCA as part of the EDA. The 10 PCs displayed in the table below are orthogonal combinations of the 10 predictors: PC1 captures the most variation in the data, PC2 the second most, and so on. The coefficients associated with each variable are the loadings, and they indicate the importance of that particular variable to the variance of the 10 predictor variables. Negative values only mean that the weights point in the opposite direction within the same PC. Since PC1 captures the largest variability, the variables with the largest absolute weights in PC1 can be viewed as the most important contributors to the variance of the predictors; these variables are expected to have a large influence in explaining the variance of the response variable. The table below shows these loadings.
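Formally, letting $X$ denote the centered and scaled matrix of the 10 predictors, the first loading vector solves

$$w_1 = \underset{\lVert w \rVert = 1}{\arg\max}\; \operatorname{Var}(Xw),$$

and each subsequent loading vector maximizes the same criterion subject to being orthogonal to the previous ones; the columns of the table below are these vectors.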
id = which(colnames(activeTrain) %in% var_cor)
# PCA
PC = prcomp(activeTrain[,id], center = TRUE, scale = TRUE)
pc_directions=as.data.frame(PC$rotation)
kable(pc_directions, caption="Principal Components for EDA", digits = 3)
 | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
---|---|---|---|---|---|---|---|---|---|---|
n_tokens_content | -0.262 | 0.290 | -0.452 | -0.165 | 0.392 | -0.054 | -0.425 | 0.453 | 0.075 | -0.260 |
n_non_stop_unique_tokens | 0.412 | -0.554 | 0.073 | -0.055 | 0.036 | 0.055 | -0.069 | 0.138 | 0.210 | -0.666 |
num_hrefs | -0.348 | -0.111 | -0.513 | -0.202 | 0.039 | 0.002 | 0.429 | -0.527 | -0.006 | -0.316 |
num_videos | -0.097 | -0.083 | 0.190 | 0.269 | 0.702 | 0.591 | 0.130 | -0.093 | 0.043 | 0.077 |
average_token_length | 0.196 | -0.579 | -0.401 | -0.263 | 0.182 | -0.042 | -0.088 | 0.094 | -0.187 | 0.557 |
kw_avg_min | -0.022 | 0.110 | 0.369 | -0.766 | 0.041 | 0.241 | -0.004 | 0.001 | -0.437 | -0.116 |
kw_avg_max | -0.387 | -0.318 | 0.232 | 0.293 | 0.144 | -0.380 | -0.331 | -0.171 | -0.527 | -0.165 |
kw_avg_avg | -0.448 | -0.244 | 0.331 | -0.315 | 0.040 | -0.170 | -0.161 | -0.099 | 0.657 | 0.179 |
is_weekend | -0.248 | -0.156 | -0.165 | 0.095 | -0.506 | 0.637 | -0.457 | -0.085 | -0.026 | -0.002 |
LDA_03 | -0.425 | -0.240 | 0.057 | 0.081 | -0.195 | 0.074 | 0.512 | 0.659 | -0.109 | -0.018 |
Principal Components for EDA
id2 = which(abs(pc_directions$PC1) %in%
sort(abs(pc_directions$PC1),dec= T)[1:3])
From the table above, the three variables with the largest absolute loadings in PC1 are n_non_stop_unique_tokens, kw_avg_avg, and LDA_03. These can be considered the most important variables in terms of the variance of the predictor variables. Although the prediction metrics are expected to be poor, these variables should show the most influence in explaining the variance of the response shares.
The plot below presents histograms, scatter plots, and correlations in a
bivariate structure of the top 5 variables chosen in the correlation
analysis. Notice the shape of the distributions and the values of the
correlations relative to the response variable shares
.
# bivariate correlation plot
cor_data <- cbind(select(activeTrain,shares),var_sel[,var_id[1:5]])
ggpairs(cor_data)
The biplot below presents PC1 and PC2 from the PCA. The function ggplot() was used to create the plot, and the segments created via geom_segment() were re-scaled so that the variable names are easier to see. The most variation in the data is contained in PC1; hence, the most important variables are oriented approximately along the PC1 axis and, therefore, may be good predictors for the shares response. Likewise for PC2, which contains the second most variability in the data set: the variables oriented approximately along the PC2 axis are the second most important.
pc_df<-data.frame(PC$x)
# plotting PC1 and PC2 for the top 5 variables
# biplot(PC, cex = 1)
ggplot(pc_directions)+
geom_point(data = pc_df, mapping = aes(x=PC1, y=PC2))+
geom_segment(aes(x = 0, y = 0, yend = 50 * PC2, xend = 50 * PC1))+
geom_label(mapping = aes(x = 51 * PC1, y = 51 * PC2, label = row.names(pc_directions)))
The scatter plots below show the different levels of the variables related to the LDA metrics, from 0 to 4, and graph their relationship with the response variable shares. The function ggplot() is used to create the plot frame, and the geom_point(), geom_smooth(), and facet_wrap() functions are used to plot the scatter points, add a smooth LOESS (locally estimated scatterplot smoothing) trend line, and split the data by LDA type, respectively. It is possible to see the behavior of the response variable in relation to each LDA type. A common analysis using scatter plots looks at the pattern shown by the smooth curve fitted to the points: if this curve is flat or constant, parallel to the predictor axis, then the predictor contributes little to explaining the variance of the response variable.
LDA.dat = activeTrain %>%
select(shares, starts_with("LDA")) %>%
pivot_longer(cols = LDA_00:LDA_04, names_to = "LDA", values_to = "values")
# relationship between shares and LDA levels (facet_wrap+smooth)
ggplot(LDA.dat, aes(y = shares, x = values))+
geom_point() + geom_smooth(method = "loess")+ facet_wrap(~LDA)+
labs(x = "LDA Values", y = "Shares", title = "Shares by LDA Types")
The scatter plots below show the different types of variables related to the keyword metrics and graph their relationship with the response variable shares. The function ggplot() is used to create the plot frame, and the geom_point(), geom_smooth(), and facet_wrap() functions are used to plot the scatter points, add a smooth LOESS trend line, and split the data by keyword type, respectively. It is possible to see the behavior of the response variable in relation to each of the 3 keyword metric types.
# relationship between shares and keyword metrics
kw.dat = activeTrain %>%
select(shares, kw_avg_max, kw_avg_avg, kw_avg_min) %>%
pivot_longer(cols = 2:4, names_to = "keyword", values_to = "values")
# relationship between shares and keyword metrics types (facet_wrap+smooth)
ggplot(kw.dat, aes(y = shares, x = values))+
geom_point() + geom_smooth(method = "loess")+ facet_wrap(~keyword)+
labs(x = "Keyword Metric Values", y = "Shares", title = "Shares by Keyword Metric Types")
The scatter plots below show the different types of variables related to the content metrics and graph their relationship with the response variable shares. The function ggplot() is used to create the plot frame, and the geom_point(), geom_smooth(), and facet_wrap() functions are used to plot the scatter points, add a smooth LOESS trend line, and split the data by content type, respectively. It is possible to see the behavior of the response variable in relation to each of the 6 content metric types.
# relationship between shares and content metrics (facet_wrap+smooth)
cont.dat = activeTrain %>%
select(shares, num_videos, n_tokens_content, n_non_stop_unique_tokens,
num_hrefs, num_self_hrefs, average_token_length) %>%
pivot_longer(cols = 2:7, names_to = "content", values_to = "values")
# relationship between shares and content metrics types (facet_wrap+smooth)
ggplot(cont.dat, aes(y = shares, x = values))+
geom_point() + geom_smooth(method = "loess")+ facet_wrap(~content)+
labs(x = "Content Metric Values", y = "Shares", title = "Shares by Content Metric Types")
The following box plot shows the distribution of shares for this data channel. The main bulk of the data lies within the “box”, the region between the first and third quartiles. In some cases, possible outliers distort the box plot, making the “box” look thinner than usual; when this happens, it means that the possible outliers are much larger than the bulk of the data. Outliers usually appear as individual points in the box plot, and the graph below shows this pattern for the response variable shares. However, the data might not actually contain outliers, and the highlighted points may simply be part of the data. This stresses the importance of knowing the subject matter and the data set when performing statistical analysis.
boxSharesGraph <- ggplot(statsData, aes(y = shares))
boxSharesGraph + geom_boxplot() +
ggtitle(paste("Number of Shares for Data Channel:", dataChannelSelect)) +
ylab("Number of Shares") +
xlab("Data Channel")
The following graph shows the number of shares compared to the number of words in the title. The output is colored by the day of the week.
titlewordcountGraph <- ggplot(statsData, aes(x = n_tokens_title, y = shares))
titlewordcountGraph + geom_point(aes(color = Day)) +
ggtitle("Number of Shares vs. Number of Words in Title") +
ylab("Number of Shares") +
xlab("Number of Words in Title")
The following plot shows the number of shares by the rate of positive words in the article. A positive trend would indicate that articles with more positive words are shared more often than articles with negative words.
positivewordrateGraph <- ggplot(statsData, aes(x = rate_positive_words, y = shares))
positivewordrateGraph + geom_point(aes(color = Day)) +
ggtitle("Number of Shares vs. Rate of Positive Words") +
ylab("Number of Shares") +
xlab("Rate of Positive Words")
The following plot shows the total number of shares as related to the parameter title subjectivity. A positive trend would indicate that articles are shared more often when the title is subjective. A negative trend would indicate that articles are shared more often when the title is less subjective.
titleSubjectivityGraph <- ggplot(statsData, aes(x = title_subjectivity, y = shares))
titleSubjectivityGraph + geom_point(aes(color = n_tokens_title)) +
ggtitle("Number of Shares vs. Title Subjectivity") +
ylab("Number of Shares") +
xlab("Title Subjectivity") +
labs(color = "Word Count in Title")
In this section, we will perform regression for prediction purposes on the data channel lifestyle. All models were fitted using 5-fold cross-validation via the train() function from the caret package, and all variables were centered and scaled.
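For reference, with observed shares $y_i$, predictions $\hat{y}_i$, and $n$ test observations, the evaluation metrics used throughout are

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert,$$

while the Rsquared reported by caret's postResample() is the squared correlation $\operatorname{cor}(y, \hat{y})^2$.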
The variables selected below are those described in the introduction of this study and will be used in the modeling section. The function select() was used to subset the corresponding variables from the training and test sets, creating two new objects specifically for the modeling section, dfTrain and dfTest.
dfTrain = activeTrain %>%
  select(shares, starts_with("LDA_"), average_token_length,
         is_weekend, n_tokens_content, n_non_stop_unique_tokens,
         num_hrefs, num_self_hrefs, num_videos, kw_avg_min,
         kw_avg_max, kw_avg_avg)
dfTest = activeTest %>%
  select(shares, starts_with("LDA_"), average_token_length,
         is_weekend, n_tokens_content, n_non_stop_unique_tokens,
         num_hrefs, num_self_hrefs, num_videos, kw_avg_min,
         kw_avg_max, kw_avg_avg)
Linear regression is a modeling technique in which one attempts to model a response variable (in this case shares) with one or more explanatory variables using a straight line. A model with only one explanatory variable is called simple linear regression (SLR): the response is predicted by an intercept plus a regression coefficient multiplied by the value of the explanatory variable. The goal of regression is to estimate the intercept and the regression coefficients, which is done by fitting a line through the data that minimizes the residual sum of squares, i.e., the sum of squared errors (SSE); this is the Least Squares method.
The same concept applies to multiple linear regression (MLR), which has more than one explanatory variable. In this case, the goal is to estimate the intercept and a regression coefficient corresponding to each explanatory variable so as to minimize the sum of squared errors.
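In symbols, with response $y_i$ and explanatory variables $x_{i1}, \ldots, x_{ip}$, the MLR model and the least-squares criterion are

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad \hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2.$$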
In R, MLR is generally done with the function lm. There are also a variety of other methods that fall under the umbrella of MLR; one of them, LASSO regression, is explored later in this analysis.
lm
Here, the linear regression model is fit with the caret package using the method lm. The summary function gives us the regression coefficients.
lmFit = train(shares~., data = dfTrain,
method="lm",
preProcess = c("center","scale"),
trControl = trainControl(method="CV",number=5))
summary(lmFit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5985.9 -1723.3 -1043.9 270.8 20379.0
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2907.0298 89.5076 32.478 < 2e-16 ***
## LDA_00 -33.7647 94.1045 -0.359 0.719798
## LDA_01 -0.7295 91.8764 -0.008 0.993666
## LDA_02 -43.3569 96.5121 -0.449 0.653328
## LDA_03 126.9410 108.2720 1.172 0.241221
## LDA_04 NA NA NA NA
## average_token_length -454.9769 122.0528 -3.728 0.000201 ***
## is_weekend 161.1937 93.7802 1.719 0.085859 .
## n_tokens_content 192.0149 104.0070 1.846 0.065074 .
## n_non_stop_unique_tokens 188.3122 134.7045 1.398 0.162340
## num_hrefs 292.4034 112.4253 2.601 0.009395 **
## num_self_hrefs -50.5621 96.8320 -0.522 0.601637
## num_videos 227.3017 91.3995 2.487 0.012999 *
## kw_avg_min 214.2312 106.6862 2.008 0.044826 *
## kw_avg_max -22.8487 122.3882 -0.187 0.851930
## kw_avg_avg 348.0264 128.9820 2.698 0.007052 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3401 on 1429 degrees of freedom
## Multiple R-squared: 0.05567, Adjusted R-squared: 0.04642
## F-statistic: 6.018 on 14 and 1429 DF, p-value: 1.103e-11
Note that the coefficient for LDA_04 is not estimated (“1 not defined because of singularities”): the five LDA closeness scores sum to one for each article, so LDA_04 is a linear combination of the other four and lm drops it. The following table shows the output training metrics for this linear regression.
lm_out = data.frame(lmFit$results)
kable(lm_out, caption = "Output Training Metrics for Linear Regression",
digits = 3)
intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
TRUE | 3445.242 | 0.034 | 2185.643 | 421.502 | 0.04 | 166.187 |
Output Training Metrics for Linear Regression
The following shows the RMSE, RSquared, and MAE values for the model as it performed on predicting the test set.
metric_lm = postResample(pred = predict(lmFit, newdata = dfTest),
obs = dfTest$shares)
metric_lm
## RMSE Rsquared MAE
## 5.484429e+03 1.384488e-02 2.567296e+03
glmnet
The linear regression chosen for this next model is penalized regression via the LASSO. This method has the particular advantage that its constraint region has corners (a diamond-shaped parameter search space in two dimensions), which allows estimated coefficients to be exactly zero. This follows from the LASSO loss function, whose penalty is the sum of the absolute values of the parameters multiplied by lambda, the penalty term (hyperparameter). Hence, LASSO regression is also a variable selection method. In this application, we test only the prediction capability of LASSO regression. A sequence of values for the regularization parameter lambda, from 0 to 10 by 1 via seq(0, 10, 1), was assigned to the tuneGrid = argument of the train() function from the caret package. The code below presents the estimated coefficients for the best hyperparameter.
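For reference, the LASSO coefficient estimates minimize the penalized least-squares criterion

$$\hat{\beta}^{\mathrm{LASSO}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\lvert \beta_j \rvert \right\},$$

where larger values of $\lambda$ shrink more coefficients exactly to zero.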
LASSO = train(shares~., data = dfTrain,
method="glmnet",
preProcess = c("center","scale"),
tuneGrid = expand.grid(alpha = 1, lambda = seq(0,10,1)),
trControl = trainControl(method="CV",number=5))
coef(LASSO$finalModel, LASSO$bestTune$lambda)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2907.029778
## LDA_00 -21.989521
## LDA_01 .
## LDA_02 -40.175884
## LDA_03 122.220732
## LDA_04 .
## average_token_length -415.136545
## is_weekend 153.703544
## n_tokens_content 166.281015
## n_non_stop_unique_tokens 139.376080
## num_hrefs 266.978243
## num_self_hrefs -32.396689
## num_videos 217.785951
## kw_avg_min 210.706914
## kw_avg_max -0.216876
## kw_avg_avg 333.938941
The best lambda for this model is 10; this value can be seen in the table below, which summarizes all the metrics for the 5-fold cross-validation.
lasso_out = data.frame(LASSO$results)
kable(lasso_out, caption = "Output Training Metrics for LASSO",
digits = 3)
alpha | lambda | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|
1 | 0 | 3439.824 | 0.035 | 2177.647 | 281.958 | 0.039 | 124.165 |
1 | 1 | 3439.824 | 0.035 | 2177.647 | 281.958 | 0.039 | 124.165 |
1 | 2 | 3439.652 | 0.035 | 2177.399 | 281.885 | 0.040 | 124.142 |
1 | 3 | 3439.454 | 0.035 | 2177.104 | 281.796 | 0.040 | 124.100 |
1 | 4 | 3439.276 | 0.035 | 2176.831 | 281.708 | 0.040 | 124.076 |
1 | 5 | 3439.097 | 0.035 | 2176.564 | 281.608 | 0.040 | 124.038 |
1 | 6 | 3438.920 | 0.035 | 2176.296 | 281.514 | 0.040 | 124.003 |
1 | 7 | 3438.688 | 0.036 | 2176.023 | 281.427 | 0.040 | 123.959 |
1 | 8 | 3438.421 | 0.036 | 2175.752 | 281.338 | 0.040 | 123.904 |
1 | 9 | 3438.174 | 0.036 | 2175.498 | 281.232 | 0.040 | 123.831 |
1 | 10 | 3437.909 | 0.036 | 2175.233 | 281.136 | 0.040 | 123.762 |
Output Training Metrics for LASSO
The plot below shows the RMSE by Regularization Parameter (lambda
). It
is easy to see that RMSE is minimized when lambda
= 10.
plot(LASSO)
The validation step for LASSO regression is applied on the test set after predicting the response variable for unseen data. Using the predict() and postResample() functions, the metrics RMSE (Root Mean Squared Error), Rsquared (Coefficient of Determination), and MAE (Mean Absolute Error) are calculated and displayed below.
metric_LASSO = postResample(pred = predict(LASSO, newdata = dfTest),
obs = dfTest$shares)
metric_LASSO
## RMSE Rsquared MAE
## 5.484450e+03 1.355832e-02 2.563645e+03
The next two models, Random Forest and Boosted Tree, are both types of tree-based modeling methods. Generally speaking, in a tree-based modeling method, the predictor space is split into regions, with different predictions for each region. In the case of regression trees where the goal is to predict a continuous response, the mean of observations for a given region is typically used to make the predictions.
To make the predictions, the trees are built using recursive binary splitting: for every candidate split point of each predictor, the residual sum of squares (RSS) of the resulting regions is computed, and the split that minimizes it is chosen. The process is then repeated within each region. Often, trees are grown very large and need to be cut back using cost-complexity pruning; this ensures that the model is not overfit and will predict new data well.
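In symbols, for a candidate split of predictor $x_j$ at point $s$, define the regions $R_1 = \{x : x_j < s\}$ and $R_2 = \{x : x_j \ge s\}$; the pair $(j, s)$ is chosen to minimize

$$\sum_{i:\, x_i \in R_1} (y_i - \bar{y}_{R_1})^2 \;+\; \sum_{i:\, x_i \in R_2} (y_i - \bar{y}_{R_2})^2,$$

where $\bar{y}_{R_m}$ is the mean of the training responses in region $R_m$.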
In this section, we attempt to model the data using a Random Forest model, which is a type of ensemble learning which averages multiple tree models in order to lower the variance of the final model and thus improve our prediction.
In a random forest model, we first create multiple trees from bootstrap samples of the observations. At each split of each tree, only a random subset of the predictors is considered. The predictors are selected randomly to prevent the trees from being correlated: if a random subset were not used (as in the related tree-based method called bagging), the trees would likely all choose the same strong predictor for the first split, and choosing candidate predictors randomly avoids this correlation. The number of predictors considered at each split is specified by mtry; for a regression model, it is generally chosen to be the total number of predictors divided by 3. Once all trees are grown, their predictions are averaged to produce the final prediction.
Random forest models use “out-of-bag” (OOB) error to assess fit, using the observations from the original data set that were not included in a particular bootstrap sample.
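As an aside, the OOB error can be inspected by calling the randomForest engine directly; caret's method = "rf" wraps this same package. This is only a sketch under the assumption that the randomForest package is installed, and it is not part of the report's modeling pipeline:
library(randomForest)
set.seed(555)
# grow 500 trees; mtry = 5 mirrors the upper end of the caret grid below
oobFit <- randomForest(shares ~ ., data = dfTrain, mtry = 5, ntree = 500)
# oobFit$mse holds the OOB mean squared error after each successive tree;
# its square root is an OOB estimate of RMSE that requires no held-out set
sqrt(tail(oobFit$mse, 1))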
For the random forest model, we will use the train function from the caret package, setting mtry to consider 1-5 predictors.
train.control = trainControl(method = "cv", number = 5)
rfFit <- train(shares~.,
data = dfTrain,
method = "rf",
trControl = train.control,
preProcess = c("center","scale"),
tuneGrid = data.frame(mtry = 1:5))
rfFit$bestTune$mtry
## [1] 1
The best mtry
for this particular model was 1.
The following plot shows the RMSE values for each value of the tuning parameter. The objective in random forest modeling is to choose the model with the lowest RMSE.
plot(rfFit)
The following table shows training metrics for the random forest model. Again, the best model is the one that minimizes RMSE.
rf_out = data.frame(rfFit$results)
kable(rf_out, caption = "Output Training Metrics for Random Forest",
digits = 3)
mtry | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
1 | 3440.326 | 0.027 | 2190.962 | 336.718 | 0.020 | 82.815 |
2 | 3473.762 | 0.020 | 2240.135 | 329.290 | 0.011 | 91.406 |
3 | 3484.549 | 0.022 | 2257.626 | 316.541 | 0.014 | 81.308 |
4 | 3501.722 | 0.020 | 2271.472 | 315.905 | 0.012 | 84.558 |
5 | 3512.716 | 0.019 | 2278.493 | 320.238 | 0.011 | 90.148 |
Output Training Metrics for Random Forest
Now we will run a prediction on the test split obtained from the 70/30 split of the data. The table shows the RMSE value for the test set, an indication of how well our model predicts data that were not included when training the original model. We can compare this model against the others to find the one with the lowest RMSE.
RF_pred <- predict(rfFit, newdata = activeTest)
metric_rf = postResample(RF_pred, activeTest$shares)
metric_rf
## RMSE Rsquared MAE
## 5.476735e+03 1.395719e-02 2.570579e+03
In this section the ensemble learning algorithm Boosting will be trained. The boosted tree method is one of the most used tree-based models in data science because its fitting strategy improves sequentially throughout the iterations. Boosting fits single trees (each with d splits) on the training data and produces predictions from that fit. The residuals of these predictions are then used as the response variable for the next single tree, whose predictions update the ensemble, and so on. This process occurs over B iterations, and the predictions are updated during the fitting process at a pace driven by the shrinkage parameter lambda, also called the learning rate. These features allow Boosting to reduce the variance of the predictions and gain precision, often over Random Forest, Bagging, and single trees. The shrinkage parameter will be set to 0.1 and n.minobsinnode to 10. The parameters n.trees and interaction.depth will be chosen by 5-fold cross-validation: the former from a sequence from 25 to 200, counting by 25, and the latter from a sequence from 1 to 4. The code below shows the training and tuning procedure and prints the resulting values of the two hyperparameters.
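In symbols, starting from $\hat{f}(x) = 0$ and residuals $r_i = y_i$, each of the $B$ iterations fits a tree $\hat{b}(x)$ with $d$ splits to the current residuals and updates

$$\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\, \hat{b}(x), \qquad r_i \leftarrow r_i - \lambda\, \hat{b}(x_i),$$

so the ensemble improves slowly at the pace set by the shrinkage parameter $\lambda$.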
tunG = expand.grid(n.trees = seq(25,200,25),
interaction.depth = 1:4,
shrinkage = 0.1,
n.minobsinnode = 10)
gbmFit <- train(shares~.,
data = dfTrain,
method = "gbm",
preProcess = c("center","scale"),
trControl = train.control,
tuneGrid = tunG,
verbose = FALSE
)
gbmFit$bestTune$n.trees
## [1] 25
gbmFit$bestTune$interaction.depth
## [1] 1
The best n.trees and interaction.depth parameters for this model are 25 and 1, respectively. These values can be seen in the table below, which summarizes all the metrics for the 5-fold cross-validation. It is easy to see that these values minimize the RMSE.
gbm_out = data.frame(gbmFit$results)
gbm_out <- gbm_out %>%
arrange(RMSE)
kable(gbm_out, caption = "Output Training Metrics for Boosting",
digits = 3, row.names = FALSE)
shrinkage | interaction.depth | n.minobsinnode | n.trees | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|---|---|
0.1 | 1 | 10 | 25 | 3449.933 | 0.030 | 2174.450 | 368.728 | 0.049 | 135.697 |
0.1 | 2 | 10 | 25 | 3455.866 | 0.027 | 2170.931 | 376.052 | 0.051 | 137.940 |
0.1 | 1 | 10 | 50 | 3457.997 | 0.027 | 2177.234 | 372.453 | 0.042 | 146.910 |
0.1 | 1 | 10 | 75 | 3472.210 | 0.026 | 2199.803 | 364.873 | 0.038 | 145.933 |
0.1 | 3 | 10 | 25 | 3473.521 | 0.018 | 2187.612 | 379.471 | 0.022 | 147.319 |
0.1 | 1 | 10 | 125 | 3477.967 | 0.025 | 2192.704 | 362.301 | 0.034 | 142.334 |
0.1 | 1 | 10 | 100 | 3482.249 | 0.023 | 2180.636 | 360.786 | 0.033 | 134.922 |
0.1 | 1 | 10 | 150 | 3485.774 | 0.022 | 2199.727 | 368.116 | 0.026 | 133.880 |
0.1 | 2 | 10 | 50 | 3492.024 | 0.027 | 2191.440 | 376.442 | 0.050 | 138.930 |
0.1 | 3 | 10 | 50 | 3493.507 | 0.020 | 2199.520 | 352.444 | 0.023 | 121.830 |
0.1 | 2 | 10 | 75 | 3495.792 | 0.023 | 2190.924 | 387.315 | 0.034 | 154.573 |
0.1 | 1 | 10 | 175 | 3496.085 | 0.020 | 2196.355 | 368.271 | 0.027 | 137.891 |
0.1 | 4 | 10 | 25 | 3497.465 | 0.013 | 2187.718 | 362.511 | 0.014 | 108.323 |
0.1 | 3 | 10 | 75 | 3503.285 | 0.018 | 2200.925 | 343.144 | 0.023 | 128.892 |
0.1 | 1 | 10 | 200 | 3505.557 | 0.016 | 2206.767 | 369.438 | 0.019 | 140.029 |
0.1 | 2 | 10 | 100 | 3516.840 | 0.018 | 2201.927 | 365.095 | 0.027 | 128.877 |
0.1 | 2 | 10 | 125 | 3523.975 | 0.019 | 2213.731 | 376.036 | 0.026 | 141.880 |
0.1 | 3 | 10 | 100 | 3532.045 | 0.018 | 2222.869 | 351.136 | 0.024 | 127.556 |
0.1 | 4 | 10 | 50 | 3534.795 | 0.013 | 2210.058 | 353.325 | 0.013 | 131.730 |
0.1 | 2 | 10 | 150 | 3546.924 | 0.016 | 2235.680 | 366.506 | 0.021 | 113.751 |
0.1 | 2 | 10 | 175 | 3554.888 | 0.016 | 2239.687 | 368.040 | 0.020 | 124.827 |
0.1 | 3 | 10 | 125 | 3557.803 | 0.016 | 2239.807 | 353.695 | 0.022 | 132.820 |
0.1 | 2 | 10 | 200 | 3562.372 | 0.015 | 2239.476 | 354.919 | 0.019 | 109.678 |
0.1 | 4 | 10 | 75 | 3565.097 | 0.012 | 2221.548 | 365.663 | 0.013 | 147.885 |
0.1 | 3 | 10 | 150 | 3567.004 | 0.016 | 2250.812 | 338.489 | 0.022 | 126.104 |
0.1 | 4 | 10 | 100 | 3581.380 | 0.011 | 2240.138 | 340.901 | 0.011 | 130.338 |
0.1 | 3 | 10 | 175 | 3606.334 | 0.012 | 2287.999 | 336.079 | 0.017 | 125.706 |
0.1 | 4 | 10 | 125 | 3608.715 | 0.009 | 2260.849 | 353.990 | 0.011 | 139.472 |
0.1 | 3 | 10 | 200 | 3613.193 | 0.012 | 2304.850 | 337.817 | 0.016 | 136.405 |
0.1 | 4 | 10 | 150 | 3641.772 | 0.009 | 2293.797 | 344.355 | 0.012 | 135.185 |
0.1 | 4 | 10 | 175 | 3652.867 | 0.009 | 2314.115 | 333.610 | 0.013 | 132.709 |
0.1 | 4 | 10 | 200 | 3671.079 | 0.008 | 2330.907 | 328.781 | 0.011 | 137.399 |
Output Training Metrics for Boosting
The plot below shows the RMSE by number of boosting iterations, with one line per max tree depth, for the 5-fold CV performed. It is easy to see that RMSE is minimized when n.trees = 25 and interaction.depth = 1.
plot(gbmFit)
The validation step for Boosting is applied on the test set after predicting the response variable for unseen data. Using the predict() and postResample() functions, the metrics RMSE (Root Mean Squared Error), Rsquared (Coefficient of Determination), and MAE (Mean Absolute Error) are calculated and displayed below.
gbm_pred <- predict(gbmFit, newdata = activeTest)
metric_boosting = postResample(gbm_pred, activeTest$shares)
metric_boosting
## RMSE Rsquared MAE
## 5480.7363930 0.0146463 2559.5313252
For the overall comparison of all 4 models created in the previous sections, the test set was used for prediction, and quality-of-fit metrics were calculated from these predictions on unseen data. The code below shows a helper function that returns the name of the best model based on the RMSE values estimated on the test set, followed by the table comparing all 4 models.
bestMethod = function(x){
  # x is a list of postResample() outputs; pick the smallest RMSE
  bestm = which.min(sapply(x, function(m) m["RMSE"]))
  out = switch(bestm,
               "Random Forest",
               "Boosting",
               "LASSO Regression",
               "Multiple Linear Regression")
  return(out)
}
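As a usage note (this call is not part of the original output), the helper expects the metric vectors in the same order as the switch() labels:
# hypothetical call; each metric object is a postResample() result
bestMethod(list(metric_rf, metric_boosting, metric_LASSO, metric_lm))
# given the RMSE values in the table below, this returns "Random Forest"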
tb = data.frame(RF = metric_rf, Boosting = metric_boosting,
LASSO = metric_LASSO, Linear = metric_lm)
kable(tb, caption = "Accuracy Metric by Ensemble Method on Test Set",
digits = 3)
 | RF | Boosting | LASSO | Linear |
---|---|---|---|---|
RMSE | 5476.735 | 5480.736 | 5484.450 | 5484.429 |
Rsquared | 0.014 | 0.015 | 0.014 | 0.014 |
MAE | 2570.579 | 2559.531 | 2563.645 | 2567.296 |
Accuracy Metric by Ensemble Method on Test Set
After comparing the 4 models fit throughout this analysis, the best model was chosen as the one with the minimum RMSE. Therefore, based on the RMSE metric, the best model is Random Forest. The RMSE, coefficient of determination, and MAE metrics for all 4 models can be seen in the table above.
K. Fernandes, P. Vinagre, and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. In Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, Coimbra, Portugal, September 2015.