Prepare, analyze, and visualize data with R

Arithmetic Artist


R is as good as any other language in terms of database interfaces and the ability to integrate with other programming languages and programs. R communicates with MySQL, PostgreSQL, Oracle, and Teradata, among others, with native drivers, and with virtually any database (including Microsoft SQL Server) via ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity).
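Thanks to the DBI abstraction layer, the same code talks to any of these databases; only the driver passed to dbConnect() changes. The sketch below uses RSQLite's in-memory database as a stand-in for a real server (assuming the DBI and RSQLite packages are installed):

```r
# Minimal sketch of R's database interface via DBI; RSQLite's in-memory
# database stands in for MySQL, PostgreSQL, etc. - the DBI calls stay
# identical, only the driver in dbConnect() would change.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # e.g., RMySQL::MySQL() instead
dbWriteTable(con, "mtcars", mtcars)              # push a built-in dataset
result <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
print(result)
dbDisconnect(con)
```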

Various solutions for Big Data applications can integrate R into Hadoop on your own hardware or on Amazon Web Services. R also has interfaces that link it with other programming languages: For example, R can exchange objects with Java via JRI (Java/R Interface) or rJava. The reticulate package lets you run Python code from R, and the Python rpy2 package reverses the process, calling R from Python.

Support for R is growing for other statistical solutions and programs from the business intelligence sector, as well. For example, SPSS, SAS, MicroStrategy, and Power BI provide R interfaces and thus benefit from its wealth of functions.

For people who come from other programming languages (e.g., Python, C++, Java), R takes some getting used to. As mentioned in an earlier example, you do not have to specify type explicitly when creating a new object. R automatically converts data to another type under certain conditions. For example, when two objects of different types are compared, they are converted to the more general type (e.g., 2 == "2" leads to the result TRUE), so for full control, it makes more sense to use the identical() function for comparisons. If you are not aware of this feature, you could see unexpected results.
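A few lines at the console make the coercion rules concrete:

```r
# Implicit coercion in comparisons: both sides are converted to the more
# general type (here, character) before comparing.
2 == "2"              # TRUE - the number 2 becomes the string "2"

# identical() compares type and value, with no implicit conversion:
identical(2, "2")     # FALSE - double vs. character
identical(2L, 2)      # FALSE - integer vs. double

# Explicit conversion restores full control:
as.numeric("2") == 2  # TRUE
```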

Loops slow down code, so instead, R developers should choose a vectorized approach. For example, the apply() functions let you apply the same function to all elements of a vector, matrix, list, and the like. For programming enthusiasts with extensive experience in other languages, this can mean some acclimatization.
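The same computation written three ways shows the shift in style; the fully vectorized form is both the fastest and the most idiomatic R:

```r
x <- 1:100000

# 1. Explicit loop: verbose and comparatively slow
res1 <- numeric(length(x))
for (i in seq_along(x)) res1[i] <- x[i]^2

# 2. sapply(): the loop hidden behind a functional, still element by element
res2 <- sapply(x, function(v) v^2)

# 3. Vectorized: the ^ operator works on the whole vector at once
res3 <- x^2

identical(res1, res3)  # TRUE - same result, different idiom
```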

Last but not least, like many older languages, R has some peculiarities that can only be explained historically. For example, when importing datasets, string variables are stored by default as factor (categorical data with a limited number of values) and not as simple characters (string data). At the time R was created, analysis of categorical variables was more common than analysis of real text, so strings at that time usually reflected categories. Today, therefore, the argument

stringsAsFactors = FALSE

usually has to be passed in when reading strings (e.g., with read.csv()).

No uniform naming scheme has taken hold thus far, either: Function names in the stats package from the core distribution mix dots (t.test()), underscores (get_all_vars()), all-lowercase names (addmargins()), and CamelCase (setNames()).

From Science to Business

R clearly has its roots in the scientific community; it was developed at a university and is still widely used in science. Scientific publications that include new statistical methods often include an R package that implements the method.

In addition to R, of course, scientists also like to use other tools when they prepare and analyze data. Some of these tools offer graphical user interfaces but limited functionality (e.g., SPSS [2], which is very popular in psychology and social sciences). Like SPSS, other tools are subject to a fee, including Stata [3] (econometrics) and SAS [4] (biometrics, clinical research, banking).

Python is widespread in the computer science environment and especially in machine learning, where it leaves hardly anything to be desired, even compared with R. In terms of statistical applications, the main difference between the two languages is the variety of packages available in R. In Python, data science applications are concentrated in a smaller number of more consolidated packages, but more specialized or newer methods are often not available.

Whether you prefer R or Python depends in the end on the technical background of the team. If you have a background in computer science, you might already have knowledge of Python and want to use it in a professional context. People with a mathematical or scientific background, on the other hand, are more likely to have experience with R.

R owes its leap from scientific software into the business world to the proliferation of data. Companies seek to profit from their data and optimize their business processes. Company employees faced with the task of analyzing data often turn to Microsoft Excel. As a very powerful program, Excel can meet many requirements. However, when more complex data processing becomes necessary or statistical models need to be developed, users quickly reach the limits of Excel.

Getting started with R offers more advantages than just the multitude of methods for data preparation and analysis. For example, R Markdown can be used to create reproducible and parameterized reports in PDF, Word, or HTML format. The use of LaTeX elements in R Markdown reports opens countless possibilities, such as the adaptation of reports to a corporate identity [5]; however, no LaTeX knowledge is required for working with R Markdown.

The Shiny package makes it possible to develop interactive web applications [6], which means R can cover tasks that would otherwise require tools such as Tableau [7] or Power BI [8]. However, automating processes with the help of R scripts already offers enormous benefits for many newcomers compared with the repeated execution of the same click sequence.

Regression Analysis Example

The final example reads Google Analytics data from a CSV file that has the number of sessions on the INWT Statistics [9] website since the beginning of 2015 (Figure 4). First, the script filters and prepares the data (Listing 6); then, with the data for 2017 and 2018, the code models the number of sessions, depending on the month and day of the week, and initially withholds the first few weeks of 2019 as test data.

Listing 6

Time Series Analysis

> # Load required packages
> library(dplyr)
> library(ggplot2)
> library(ggthemes)
> # Read data from csv file
> dat <- read.csv("data/Google_Analytics_Sessions.csv", stringsAsFactors = FALSE)
> # First view of the dataset
> str(dat)
'data.frame': 794 obs. of 2 variables:
$ Day : chr "01.01.17" "02.01.17" "03.01.17" "04.01.17" ...
$ Sessions: num 105 304 355 394 356 338 255 302 461 489 ...
> # Convert the date stored as character to a date (with base package)
> dat$Day <- as.Date(dat$Day, format = "%d.%m.%y")
> # Graphical representation of the time series with the built-in graphics package (Figure 5)
> plot(dat$Day, dat$Sessions, type = "l")
> # Data preparation with dplyr
> datPrep <- dat %>%
+   mutate(
+     # Set values below 50 to NA, because this is known to be a measurement error.
+     Sessions = ifelse(Sessions < 50, NA, Sessions),
+     # Determine weekday
+     wday = weekdays(Day),
+     # Determine month
+     month = months(Day)
+   )
> # Split into training and test data
> datPrep$isTrainData <- datPrep$Day < "2019-01-01"
> # Simple linear model. The character variables "wday" and "month" will be
> # automatically interpreted as categorical variables.
> # The model only uses the data up to the end of 2018, i.e. for which the variable
> # isTrainData has the value TRUE
> mod1 <- lm(data = datPrep[datPrep$isTrainData, ], formula = Sessions ~ wday + month)
> # Model summary
> summary(mod1)
Call:
lm(formula = Sessions ~ wday + month, data = datPrep[datPrep$isTrainData, ])
Residuals:
    Min      1Q  Median      3Q     Max
-464.80  -61.88   -6.52   62.38  479.19
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     517.340     18.270  28.316  < 2e-16 ***
wdayThursday    -31.046     16.119  -1.926   0.0545 .
wdayFriday     -116.981     16.122  -7.256 1.08e-12 ***
wdayWednesday    -2.817     16.111  -0.175   0.8612
wdayMonday       -7.527     16.030  -0.470   0.6388
wdaySaturday   -262.736     16.164 -16.255  < 2e-16 ***
wdaySunday     -202.964     16.077 -12.624  < 2e-16 ***
monthAugust     -16.028     20.777  -0.771   0.4407
monthDecember    -6.106     20.768  -0.294   0.7688
monthFebruary   109.349     21.304   5.133 3.72e-07 ***
monthJanuary    133.990     20.770   6.451 2.10e-10 ***
monthJuly       211.030     20.849  10.122  < 2e-16 ***
monthJune       167.145     22.755   7.345 5.85e-13 ***
monthMay         24.528     21.411   1.146   0.2524
monthMarch      -10.467     20.858  -0.502   0.6159
monthNovember    99.135     20.943   4.734 2.68e-06 ***
monthOctober    -27.591     20.770  -1.328   0.1845
monthSeptember   53.475     21.030   2.543   0.0112 *
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 114.2 on 685 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.5528,     Adjusted R-squared:  0.5417
F-statistic: 49.81 on 17 and 685 DF,  p-value: < 2.2e-16
> # Calculation of the values predicted by the model (for the whole period)
> datPrep$pred <- predict(mod1, newdata = datPrep)
> # Graphical representation of true and predicted values with ggplot2
> p <- ggplot(datPrep) +
+   geom_line(aes(x = Day, y = Sessions, linetype = !isTrainData)) +
+   geom_line(aes(x = Day, y = pred, colour = !isTrainData)) +
+   scale_color_manual(values = c("#2B4894", "#cd5364"), limits = c(FALSE, TRUE)) +
+   labs(colour = "Test period", linetype = "Test period")
> p # Equivalent to print(p)
> # Alternative model with splines
> library(mgcv)
> library(lubridate)
> datPrep$dayOfYear <- yday(datPrep$Day)
> datPrep$linTrend <- 1:nrow(datPrep)
> mod2 <- gam(data = datPrep[datPrep$isTrainData, ],
+             formula = Sessions ~
+             wday + linTrend + s(dayOfYear, bs = "cc", k = 20, by = as.factor(wday)))
> summary(mod2)
Family: gaussian
Link function: identity
Formula:
Sessions ~ wday + linTrend + s(dayOfYear, bs = "cc", k = 20,
    by = as.factor(wday))
Parametric coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     532.0423    12.3948  42.925  < 2e-16 ***
wdayThursday    -32.2007    14.1368  -2.278   0.0231 *
wdayFriday     -119.6377    14.1354  -8.464  < 2e-16 ***
wdayWednesday    -4.9203    14.1359  -0.348   0.7279
wdayMonday       -5.8354    14.0639  -0.415   0.6784
wdaySaturday   -265.6973    14.1727 -18.747  < 2e-16 ***
wdaySunday     -204.1019    14.1028 -14.472  < 2e-16 ***
linTrend          0.1254     0.0205   6.118 1.71e-09 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
                                         edf Ref.df     F  p-value
s(dayOfYear):as.factor(wday)Tuesday    15.32     18 6.413  < 2e-16 ***
s(dayOfYear):as.factor(wday)Thursday   12.97     18 3.902 1.89e-10 ***
s(dayOfYear):as.factor(wday)Friday     12.15     18 3.953 4.59e-11 ***
s(dayOfYear):as.factor(wday)Wednesday  13.85     18 5.314 9.31e-15 ***
s(dayOfYear):as.factor(wday)Monday     16.64     18 8.382  < 2e-16 ***
s(dayOfYear):as.factor(wday)Saturday   11.29     18 3.307 3.00e-09 ***
s(dayOfYear):as.factor(wday)Sunday     12.92     18 4.843 1.02e-13 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) =  0.648   Deviance explained = 69.9%
GCV =  11749  Scale est. = 10025     n = 703
> datPrep$pred2 <- predict(mod2, newdata = datPrep)
> ggplot(datPrep) +
+   geom_line(aes(x = Day, y = Sessions, linetype = !isTrainData)) +
+   geom_line(aes(x = Day, y = pred2, colour = !isTrainData)) +
+   scale_color_manual(values = c("#2B4894", "#cd5364"), limits = c(FALSE, TRUE)) +
+   labs(colour = "Test period", linetype = "Test period")
Figure 4: Graphical representation of the temporal course of session numbers on the INWT Statistics website. Some peculiarities can already be identified. All years start with relatively strong traffic after the Christmas lull.

In the Estimate column of the model summary, you can see that Saturdays and Sundays have very few website visits and that June and July are particularly strong months. Interpretation is always relative to the reference categories (Tuesday for the weekday, April for the month), which R sets automatically. All told, the model already explains more than 50 percent of the variation (Multiple R-squared = 0.5528).

The example then computes the values predicted by the model, including the first weeks of 2019, so you can investigate how well the model works with new data that was not used for the estimate. Finally, the actual number of sessions and the number predicted by the model are displayed graphically (Figure 5).

Figure 5: The first model showing the actual number of sessions (black) and the number predicted by a regression model (blue in the model period, red in the forecast period).

The figure clearly shows that the very simple model still has some weaknesses; for example, it does not predict the slump around Christmas and New Year's. A further step would be to check whether adding holiday variables improves the model – visually or by using model measures such as R-squared (ideally computed on test data that was not used to estimate the model).
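Such an out-of-sample check can be sketched with a root mean square error (RMSE) on the withheld test rows. The vectors below are hypothetical stand-ins for the Sessions and prediction columns from Listing 6:

```r
# Hypothetical test-period values; in Listing 6 these would be
# datPrep$Sessions and datPrep$pred for the rows where isTrainData is FALSE.
actual <- c(420, 390, 510, 480, 300)
pred1  <- c(400, 410, 470, 500, 340)  # predictions from a first model
pred2  <- c(430, 385, 505, 470, 310)  # predictions from an alternative model

# RMSE: average prediction error in the original unit (sessions)
rmse <- function(obs, fit) sqrt(mean((obs - fit)^2, na.rm = TRUE))

rmse(actual, pred1)
rmse(actual, pred2)  # the smaller RMSE indicates the better forecast
```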

The model improves significantly if you map the annual trend nonlinearly with a spline rather than the month variable and add a linear trend (Figure 6). Whether the model would deteriorate if the days were combined into "weekday vs. weekend/holiday" instead of being considered individually remains an open question.

Figure 6: The annual trend modeled nonlinearly with a spline already reproduces the data better than the simpler model in Figure 5.
