## Prepare, analyze, and visualize data with R

# Arithmetic Artist

#### Integration

R is as good as any other language in terms of database interfaces and the ability to integrate with other programming languages and programs. R communicates with MySQL, PostgreSQL, Oracle, and Teradata, among others, with native drivers, and with virtually any database (including Microsoft SQL Server) via ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity).

Various solutions for Big Data applications can integrate R into Hadoop on your own hardware or on Amazon Web Services. R also has interfaces that link it with other programming languages: For example, R can exchange objects with Java (JRI (Java/R Interface) or rJava). The *reticulate*
package lets you insert Python code into R, and you can reverse the process with the Python *rpy2*
package.

Support for R is growing for other statistical solutions and programs from the business intelligence sector, as well. For example, SPSS, SAS, MicroStrategy, and Power BI provide R interfaces and benefits from their wealth of functions.

For people who come from other programming languages (e.g., Python, C++, Java), R takes some getting used to. As mentioned in an earlier example, you do not have to specify type explicitly when creating a new object. R automatically converts data to another type under certain conditions. For example, when two objects of different types are compared, they are converted to the more general type (e.g., `2 == "2"`

leads to the result `TRUE`

), so for full control, it makes more sense to use the `identical()`

function for comparisons. If you are not aware of this feature, you could see unexpected results.

Loops slow down code, so instead, R developers should choose a vectorized approach. For example, the `apply()`

functions let you apply the same function to all elements of a vector, matrix, list, and the like. For programming enthusiasts with extensive experience in other languages, this can mean some acclimatization.

Last but not least, like many older languages, R has some peculiarities that can only be explained historically. For example, when importing datasets, string variables are stored by default as `factor`

(categorical data with a limited number of values) and not as simple characters (string data). At the time R was created, analysis of categorical variables was more common than analysis of real text, so strings at that time usually reflected categories. Today, therefore, the argument

stringsAsFactors = FALSE

usually has to be passed in when reading strings (e.g., with `read.csv()`

).

No uniform naming scheme has been able to assert itself thus far, either: Function names of the statistical package from the core distribution contain dots (`t.test()`

), underscores (`get_all_vars()`

), lowercase (`addmargins()`

), and CamelCase (`setNames()`

).

#### From Science to Business

R clearly has its roots in the scientific community; it was developed at a university and is still widely used in science. Scientific publications that include new statistical methods often include an R package that implements the method.

In addition to R, of course, scientists also like to use other tools when they prepare and analyze data. Some of these tools offer graphical user interfaces but limited functionality (e.g., SPSS [2], which is very popular in psychology and social sciences). Like SPSS, other tools are subject to a fee, including Stata [3] (econometrics) and SAS [4] (biometrics, clinical research, banking).

Python is widespread in the computer science environment and especially in machine learning, where it leaves hardly anything to be desired, even compared with R. In terms of statistical applications, the main difference between the two languages is the variety of packages available in R. In Python, data science applications are concentrated in a smaller number of more consolidated packages, but more specialized or newer methods are often not available.

Whether you prefer R or Python depends in the end on the technical background of the team. If you have a background in computer science, you might already have knowledge of Python and want to use it in a professional context. People with a mathematical or scientific background, on the other hand, are more likely to have experience with R.

R owes its leap from scientific software into the business world to the proliferation of data. Companies seek to profit from their data and optimize their business processes. The employees of a company faced with the task of analyzing data, often use Microsoft Excel. As a very powerful program, Excel can also meet many requirements. However, if more complex data processing becomes necessary or statistical models need to be developed, users can easily reach the limits of Excel.

Getting started with R offers more advantages than just the multitude of methods for data preparation and analysis. For example, R Markdown can be used to create reproducible and parameterized reports in PDF, Word, or HTML format. The use of LaTeX elements in R Markdown reports opens countless possibilities, such as the adaptation of reports to a corporate identity [5]; however, no LaTeX knowledge is required for working with R Markdown.

The Shiny package makes it possible to develop interactive web applications [6], which means R can cover tasks that would otherwise require tools such as Tableau [7] or Power BI [8]. However, automating processes with the help of R scripts already offers enormous benefits for many newcomers compared with the repeated execution of the same click sequence.

#### Regression Analysis Example

The final example reads Google Analytics data from a CSV file that has the number of sessions on the INWT Statistics [9] website since the beginning of 2015 (Figure 4). First, the script filters and prepares the data (Listing 6); then, with the data for 2017 and 2018, the code models the number of sessions, depending on the month and day of the week, and initially withholds the first few weeks of 2019 as test data.

Listing 6

Time Series Analysis

> # Load required packages > library(dplyr) > library(ggplot2) > library(ggthemes) > > # Read data from csv file > dat <- read.csv("data/Google_Analytics_Sessions.csv", stringsAsFactors = FALSE) > > # First view of the dataset > str(dat) 'data.frame': 794 obs. of 2 variables: $ Day : chr "01.01.17" "02.01.17" "03.01.17" "04.01.17" ... $ Sessions: num 105 304 355 394 356 338 255 302 461 489 ... > > # Convert the date stored as character to a date (with base package) > dat$Day <- as.Date(dat$Day, format = "%d.%m.%y") > > # Graphical representation of the time series with the built-in graphics package (Figure 5) > plot(dat$Day, dat$Sessions, type = "l") > > # Data preparation with dplyr > datPrep <- dat %>% + mutate( + # Set values below 50 to NA, because this is known to be a measurement error. + Sessuibs = ifelse(Sessions < 50, NA, Sessions), + # Determine weekday + wday = weekdays(Day), + # Determine month + month = months(Day) + ) > > # Split into training and test data > datPrep$isTrainData <- datPrep$Day < "2019-01-01" > > # Simple linear model. The character variables "wday" and "month" will be > # automatically interpreted as categorial variables. > # The model only uses the data up to the end of 2018, i.e. for which the variable > # isTrainData has the value TRUE > mod1 <- lm(data = datPrep[datPrep$isTrainData, ], formula = Meetings ~ wday + month) > > # Model summary > summary(mod1) Call: lm(formula = Sessions ~ wday + month, data = datPrep[datPrep$isTrainData, ]) Residuals: Min 1Q Median 3Q Max -464.80 -61.88 -6.52 62.38 479.19 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 517.340 18.270 28.316 < 2e-16 *** wdayThursday -31.046 16.119 -1.926 0.0545 . wdayFriday -116.981 16.122 -7.256 1.08e-12 *** wdayWednesday -2.817 16.111 -0.175 0.8612 wdayMonday -7.527 16.030 -0.470 0.6388 wdaySaturday -262.736 16.164 -16.255 < 2e-16 *** wdaySunday -202.964 16.077 -12.624 < 2e-16 *** monthAugust -16.028 20.777 -0.771 0.4407 monthDecember -6.106 20.768 -0.294 0.7688 monthFebruary 109.349 21.304 5.133 3.72e-07 *** monthJanuary 133.990 20.770 6.451 2.10e-10 *** monthJuly 211.030 20.849 10.122 < 2e-16 *** monthJune 167.145 22.755 7.345 5.85e-13 *** monthMay 24.528 21.411 1.146 0.2524 monthMarch -10.467 20.858 -0.502 0.6159 monthNovember 99.135 20.943 4.734 2.68e-06 *** monthOctober -27.591 20.770 -1.328 0.1845 monthSeptember 53.475 21.030 2.543 0.0112 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 114.2 on 685 degrees of freedom (27 observations deleted due to missingness) Multiple R-squared: 0.5528, Adjusted R-squared: 0.5417 F-statistic: 49.81 on 17 and 685 DF, p-value: < 2.2e-16 > # Calculation of the values predicted by the model (for the whole period) > datPrep$pred <- predict(mod1, newdata = datPrep) > > # Graphical representation of true and predicted values with ggplot2 > p <- ggplot(datPrep) + + geom_line(aes(x = Day, y = Sessions, linetype = !isTrainData)) + + geom_line(aes(x = Day, y = pred, colour = !isTrainData)) + + scale_color_manual(values = c("#2B4894", "#cd5364"), limits = c(FALSE, TRUE)) + + labs(colour = "Test period", linetype = "Test period") > p # Equivalent to print(p) > > # Alternative model with splines > library(mgcv) > library(lubridate) > datPrep$dayOfYear <- yday(datPrep$Day) > datPrep$linTrend <- 1:nrow(datPrep) > mod2 <- gam(data = datPrep[datPrep$isTrainData, ], + formula = Sessions ~ + wday + linTrend + s(dayOfYear, bs = "cc", k = 20, by = as.factor(wday))) > summary(mod2) Family: gaussian Link function: identity Formula: Sessions ~ wday + linTrend + s(dayOfYear, bs = "cc", k = 20, by = as.factor(wday)) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 532.0423 12.3948 42.925 < 2e-16 *** wdayThursday -32.2007 14.1368 -2.278 0.0231 * wdayFriday -119.6377 14.1354 -8.464 < 2e-16 *** wdayWednesday -4.9203 14.1359 -0.348 0.7279 wdayMonday -5.8354 14.0639 -0.415 0.6784 wdaySaturday -265.6973 14.1727 -18.747 < 2e-16 *** wdaySunday -204.1019 14.1028 -14.472 < 2e-16 *** linTrend 0.1254 0.0205 6.118 1.71e-09 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Ref.df F p-value s(dayOfYear):as.factor(wday)Tuesday 15.32 18 6.413 < 2e-16 *** s(dayOfYear):as.factor(wday)Thursday 12.97 18 3.902 1.89e-10 *** s(dayOfYear):as.factor(wday)Friday 12.15 18 3.953 4.59e-11 *** s(dayOfYear):as.factor(wday)Wednesday 13.85 18 5.314 9.31e-15 *** s(dayOfYear):as.factor(wday)Monday 16.64 18 8.382 < 2e-16 *** s(dayOfYear):as.factor(wday)Saturday 11.29 18 3.307 3.00e-09 *** s(dayOfYear):as.factor(wday)Sunday 12.92 18 4.843 1.02e-13 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.648 Deviance explained = 69.9% GCV = 11749 Scale est. = 10025 n = 703 > datPrep$pred2 <- predict(mod2, newdata = datPrep) > ggplot(datPrep) + + geom_line(aes(x = Day, y = Sessions, linetype = !isTrainData)) + + geom_line(aes(x = Day, y = pred2, colour = !isTrainData)) + + scale_color_manual(values = c("#2B4894", "#cd5364"), limits = c(FALSE, TRUE)) + + labs(colour = "Test period", linetype = "Test period")

In the Estimate column of the model summary, you can see that Saturdays and Sundays have very few website visits and that June and July are particularly strong months. Interpretation is always by comparison with the `Tuesday`

or `April`

reference class, which is set automatically. All told, the model can already explain greater than 50 percent of the variation (Multiple R-Squared = 0.5528).

The example then computes the values predicted by the model, including the first weeks of 2019, so you can investigate how well the model works with new data that was not used for the estimate. Finally, the actual number of sessions and the number predicted by the model are displayed graphically (Figure 5).

The figure clearly shows that the very simple model still has some weaknesses; for example, it does not predict the slump around Christmas and New Years. Another step checks whether adding holiday variables improves the model – visually or by using model measures such as R-squared (ideally on test data that was not used to estimate the model).

The model improves significantly if you map the annual trend nonlinearly with a spline rather than the month variable and add a linear trend (Figure 6). Whether the model will deteriorate if the days are combined into "weekday vs. weekend/holiday" instead of considering them individually is open to question.

## Buy this article as PDF

(incl. VAT)