Basics: Introduction to R and R Commander (Rcmdr) and R functions for basic statistical models

 

(a) Preliminaries

 

Statistics is everywhere! Do you see it even in this news item?

Nanos Poll

  • As of mid-June 2013, Liberals had the support of 34.2% of voters, Conservatives 29.4%, and NDP 25.3%.
  • The article states that Nanos surveyed 816 committed voters and the poll is accurate plus or minus 3.5 percentage points, 19 times out of 20.
  • So, we are 95% confident that the true proportion of Liberal support is somewhere between 30.7% (34.2% - 3.5%) and 37.7% (34.2% + 3.5%). [Technical note: Strictly, the ±3.5 figure applies when the sample proportion is 0.50, where the margin of error is largest; for other proportions it is slightly smaller.] We sometimes write the interval as [30.7,37.7].
  • This is an example of Confidence Intervals which we will discuss in Chapter 9.
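
The margin of error in the poll can be roughly checked in R. This is only a sketch: it assumes the usual normal-approximation formula for a proportion, which the article does not confirm as the pollster's method, and the variable names are illustrative. The worst-case value comes out at about 3.4 points, close to the 3.5 quoted.

```r
# Margin of error for a sample proportion, normal approximation (an assumption).
n     <- 816            # committed voters surveyed, from the article
p_hat <- 0.342          # Liberal support
z     <- qnorm(0.975)   # about 1.96, i.e. "19 times out of 20"

moe_worst <- z * sqrt(0.5 * 0.5 / n)             # margin at p = 0.50
moe_obs   <- z * sqrt(p_hat * (1 - p_hat) / n)   # margin at p = 0.342

# Both margins in percentage points
round(100 * c(worst = moe_worst, observed = moe_obs), 2)

# Interval using the published 3.5-point margin
c(lower = 34.2 - 3.5, upper = 34.2 + 3.5)
```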

a.1 Installations

 

As of 2018-07-27, the current version of R is 3.5.1, and that of Rcmdr is 2.4-4.

 

a.1.1 Software Installation Videos

 

If you want to see and hear videos where I explain how to install the software, please visit the following link:

The Install Videos. These videos show the installations of R and Rcmdr on PC and Mac computers.

 

 

a.1.2 Software Installation Documentation

 

I would advise consulting John Fox's installation instructions for Rcmdr for further details.

 

Important! If you want to save your datasets and other files, please follow these instructions:

 

Instructions for downloading and installing R and Rcmdr on Windows. [Link for downloading R.]

 

Here are the screenshots for the above steps to install R and Rcmdr:

 

First, uninstall earlier versions of R (if applicable):

Screenshots of step-by-step instructions to uninstall an earlier version of R

 

Install R:

Screenshots of step-by-step instructions to install R

 

 

Install Rcmdr (from within R):

Screenshot of step-by-step instructions to install Rcmdr
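
Typed at the R console, the Rcmdr installation step usually comes down to two commands. This is a minimal sketch that assumes an internet connection and a working CRAN mirror; John Fox's installation instructions remain the reference for platform-specific details.

```r
# Install Rcmdr from CRAN only if it is not already installed.
if (!requireNamespace("Rcmdr", quietly = TRUE)) {
  install.packages("Rcmdr")
}

# Loading the package opens the R Commander window.
library(Rcmdr)
```

On first launch, Rcmdr may offer to install additional packages it depends on; accepting is normally the simplest choice.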

 

Note: For R's Mac OS X and Linux/Unix installation instructions, please click here.

 

Note: For Rcmdr's Mac OS X and Linux/Unix installation instructions, please click here.

 

 

***

 

Wolfgang Jank's book Business Analytics for Managers (Use R!) may be useful in Chapters 1 and 12.

 

 

 

(b) R functions

 

The "theory" (i.e., background material) behind the techniques described below will be given after each example.

 

b.1 Graphics in R

 

 

Example: Let's use the Direct marketing data set [Table 2.6 DirectMarketing.csv] (as a .csv file) to plot some amazing graphs via Rcmdr. Graphics obtained from Rcmdr in this dataset are here as a .pdf file.
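
Rcmdr builds its graphs by issuing R commands, so similar plots can be sketched directly in base R. The data frame below is a made-up stand-in, not the course file, and the column names Salary, Gender and AmountSpent are assumptions; with the real file you would read it with read.csv and check names() first.

```r
# Toy stand-in data (hypothetical values and column names).
# With the real file: dm <- read.csv("DirectMarketing.csv"); names(dm)
set.seed(1)
dm <- data.frame(
  Salary = round(runif(50, 20000, 120000)),
  Gender = factor(sample(c("Female", "Male"), 50, replace = TRUE))
)
dm$AmountSpent <- round(0.02 * dm$Salary + rnorm(50, sd = 100))

# Histogram of one numeric variable
h <- hist(dm$AmountSpent, main = "Amount spent", xlab = "Amount spent")

# Side-by-side boxplots of a numeric variable by group
boxplot(AmountSpent ~ Gender, data = dm, ylab = "Amount spent")

# Scatterplot of two numeric variables
plot(AmountSpent ~ Salary, data = dm)
```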

 

 

Exercise: Now use this House price data set [Table 2.1 HousePrices.csv] (as a .csv file) and generate graphics as we did above. Graphics obtained from Rcmdr in this dataset are here.

 

 

The next set of problems can also be solved using MegaStat, as we will do in Chapters 1 to 10 in MFin 604.

 

b.2 Probability calculations

 

Binomial distribution

 

 

Example: Historical records indicate that 40% of all customers who enter a discount store make a purchase. What is the probability that two of the next three customers will make a purchase?

 

This is a binomial problem. The result is 0.288 as shown in this file. This is obtained in Rcmdr via the steps, Distributions > Discrete Distributions > Binomial Distribution > ... Try to replicate the results.
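
The same probability can be checked at the R console with dbinom, the binomial probability function that the Rcmdr menu relies on. A minimal sketch; the variable name is illustrative.

```r
# P(X = 2) for X ~ Binomial(n = 3, p = 0.4)
p_two <- dbinom(2, size = 3, prob = 0.4)
p_two

# The same value from the binomial formula: C(3,2) * 0.4^2 * 0.6^1
choose(3, 2) * 0.4^2 * 0.6
```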

 

 

Exercise: Here is a more challenging problem from the healthcare area involving the testing of a new drug. Find the solution using Rcmdr. (Answer: 0.74)

 

***** 

Normal distribution

 

 

Example: The distribution of the fuel efficiency of a new model car, X, is normal. Assuming that X ~ N(7.13,0.27^2), find the probability Pr(6.75 < X < 7.49). Here is the result from Rcmdr.
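
As a cross-check outside the menus, the probability is the difference of two normal cumulative probabilities, which pnorm computes. A minimal sketch with illustrative variable names; the result should be roughly 0.83.

```r
# X ~ N(mean = 7.13, sd = 0.27); note that pnorm takes the s.d., not the variance.
mu    <- 7.13
sigma <- 0.27

# Pr(6.75 < X < 7.49) = F(7.49) - F(6.75)
p_between <- pnorm(7.49, mean = mu, sd = sigma) - pnorm(6.75, mean = mu, sd = sigma)
p_between
```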

 

 

Exercise: Weekly demand at a grocery store for a particular brand of energy drink cans is N(1000,100^2). How many cans should be stocked so that there is a 5% chance of a stockout? (Answer: 1165 cans)
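
One way to check the stated answer: a 5% chance of a stockout means the stock level is the 95th percentile of weekly demand, which qnorm returns. A minimal sketch; rounding up to a whole can is an assumption about how the answer was obtained.

```r
# 95th percentile of demand, D ~ N(mean = 1000, sd = 100)
stock <- qnorm(0.95, mean = 1000, sd = 100)
stock

# Round up so that the stockout chance does not exceed 5%
ceiling(stock)
```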

 

 

 

b.3 Confidence intervals

 

 

Example: (Population mean) A company's financial health is (usually) measured using its debt-to-equity ratio. A bank has collected n = 15 of its commercial accounts in this file. Let's find the 95% confidence interval (CI) using Rcmdr. (This is done via the t-test command, which includes the CI.) Here is the result.
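
At the console, the t-test command corresponds to the R function t.test, whose output includes the CI. The ratios below are hypothetical values used only for illustration, not the contents of the course file.

```r
# Hypothetical debt-to-equity ratios for 15 accounts (NOT the course data).
ratios <- c(1.31, 1.05, 1.45, 1.21, 1.19, 1.78, 1.37, 1.41,
            1.22, 1.11, 1.46, 1.33, 1.29, 1.32, 1.65)

# One-sample t procedure; the 95% CI is part of the result
res <- t.test(ratios, conf.level = 0.95)
res$conf.int

# The same interval by hand: mean +/- t * s / sqrt(n)
n <- length(ratios)
ci_manual <- mean(ratios) + c(-1, 1) * qt(0.975, df = n - 1) * sd(ratios) / sqrt(n)
ci_manual
```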

 

 

Exercise: Refer to your favourite statistics text for a solved problem on the CI for population mean. Use the steps above to check the solution.

 

Exercise: Let's find a 95% CI for the true average weight of SlimPhone. The population s.d. is known to be 0.6 g. We take a sample of n = 5 and find the sample mean to be 70.12 g. (Answer: [69.594,70.646])
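
Because the population s.d. is known here, the interval uses the normal quantile rather than the t quantile, so t.test does not apply directly. A minimal sketch of the calculation, with illustrative variable names:

```r
# Known-sigma (z) interval: xbar +/- z * sigma / sqrt(n)
xbar  <- 70.12   # sample mean
sigma <- 0.6     # known population s.d.
n     <- 5       # sample size

ci <- xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)
round(ci, 3)
```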

 

*****

 

Example: (Population proportion) The CI for population proportion is easy to obtain. Suppose you poll 1000 people and 340 of them state that they would vote Liberal, if the election were held today. Here is what we do to find a 95% CI:

 

> prop.test(340,1000)

 

1-sample proportions test with continuity correction

data: 340 out of 1000, null probability 0.5
X-squared = 101.761, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3108142 0.3704312

sample estimates:
p
0.34

So, the sample proportion is 0.34, with a 95% CI of [0.3108,0.3704], i.e., a margin of error of about 3 percentage points.
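
The interval can also be pulled out of the prop.test result and compared with the simple textbook formula. This is a sketch: the object names are illustrative, and the "Wald" interval is shown only for comparison, since prop.test itself reports a score interval with continuity correction, so the two differ slightly.

```r
# Extract the CI reported by prop.test
res <- prop.test(340, 1000)
ci  <- as.numeric(res$conf.int)
ci
(ci[2] - ci[1]) / 2    # half-width, i.e. the approximate margin of error

# Textbook (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
p_hat <- 340 / 1000
wald  <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / 1000)
wald
```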

 

 

Exercise: Refer to your favourite statistics text for a solved problem on the CI for population proportion. Use the steps above to check the solution.

 

Note: The material below will not be discussed in the Workshop. But as this material may be useful in the MFin 604 course, it will remain online.

 


 

 

b.4 Hypothesis testing

 

Hypothesis testing in R (with one or two populations) still uses the t.test function described above. We now discuss a problem with two populations.

 

 

Example: The data set concerns the weight losses experienced by dieters using the Atkins diet or the conventional diet. We want to test the hypothesis that there is no difference between the two methods. Here are the results from Rcmdr.
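
For two independent samples, t.test accepts a formula of the form response ~ group. The weight losses below are hypothetical numbers chosen for illustration, not the course data set, and the column names are assumptions. By default R uses the Welch version of the test, which does not assume equal variances.

```r
# Hypothetical weight losses (NOT the course data); column names are assumptions.
loss <- data.frame(
  diet   = factor(rep(c("Atkins", "Conventional"), each = 6)),
  pounds = c(10, 12, 8, 15, 11, 9,
              7,  9, 6, 10,  8, 5)
)

# H0: no difference between the two mean weight losses (two-sided)
res <- t.test(pounds ~ diet, data = loss)
res$p.value     # p-value of the test
res$conf.int    # 95% CI for the difference in means
```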

 

 

Exercise: Refer to your favourite statistics text for a solved problem on hypothesis testing for population mean(s). Use the steps above to check the solution.

 

Exercise: Use the following data values to test the hypothesis that the true mean is 750 vs. the hypothesis that it differs from 750: (801,814,784,836,820). What is the p-value? (Answer: p = .0023; so reject the null)
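
One way to check this exercise at the console is a one-sample t.test with mu set to the hypothesized mean. A minimal sketch with illustrative object names; the two-sided p-value should agree with the stated answer of about .0023.

```r
# Data from the exercise
x <- c(801, 814, 784, 836, 820)

# H0: true mean = 750 vs. H1: true mean differs from 750
res <- t.test(x, mu = 750, alternative = "two.sided")
res$statistic   # t statistic
res$p.value     # two-sided p-value
```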

 

b.5 ANOVA

 

The final example involves analysis of variance (ANOVA) where we want to test the null hypothesis that three or more population means are equal. Once again, Rcmdr solves this problem quite admirably.

 

 

Example: (One-way ANOVA) Let's consider a problem from agricultural testing in an experimental farm. We want to test the null hypothesis that low (L), medium (M) or high (H) fertilizer levels do not differ in their effects on the average yield of a new plant. We have 18 small plots, and on six randomly selected plots we use L, the other six M and the last six H. Here is the data set for this example. Rcmdr uses the R function aov and finds these results.
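
A call to aov for a design like this one might look as follows. The yields are hypothetical numbers for illustration only, not the course data set, and the column names are assumptions; the layout (18 plots, six per fertilizer level) follows the description above.

```r
# Hypothetical yields (NOT the course data): 6 plots per fertilizer level.
farm <- data.frame(
  level = factor(rep(c("L", "M", "H"), each = 6), levels = c("L", "M", "H")),
  yield = c(42, 45, 40, 44, 43, 41,
            48, 50, 47, 52, 49, 51,
            55, 53, 57, 54, 56, 58)
)

# H0: the three population mean yields are equal
fit <- aov(yield ~ level, data = farm)
tab <- summary(fit)[[1]]   # ANOVA table with Df, Sum Sq, Mean Sq, F value, Pr(>F)
tab
```

The first row of the table carries the F-ratio and p-value for the fertilizer factor; the second row is the residual (within-group) variation.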

 

 

Exercise: Refer to your favourite statistics text for a solved problem on one-way ANOVA. Use the steps above to check the solution.

 

Exercise: Use this hypothetical dataset with four samples (A, B, C, and D) to do a one-way ANOVA on the equality of the means. What is the p-value? (Answer: p = 0.00146, so reject the null)

 

*****

 

 

Example: (Two-way ANOVA) Note that the above example is a one-factor ANOVA problem. R can of course deal with multi-factor problems, too. Consider the data in this file where we have two factors: Shelf display height (Bottom, Middle, Top) and shelf display width (Regular, Wide). The numbers in each cell correspond to sample group means (sales) for the last six months from three different stores. This is a 2 x 3 (or, 3 x 2) design.

 

We want to test three hypotheses: (i) There is no interaction between the factors, (ii) Factor 1 (height) has no effect on sales, (iii) Factor 2 (width) has no effect on sales. These results show that (i) little or no interaction exists between height and width (high p-value), (ii) height affects the sales (p-value near zero), (iii) width does not affect sales (high p-value). Here, the p-values are denoted by Pr(>F).
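
In R, a two-factor model with interaction is written with the * operator in the formula, so height * width expands to height + width + height:width. The sales figures below are hypothetical values for a 3 x 2 layout with three stores per cell; they are not the course data, and the column names are assumptions.

```r
# Hypothetical sales (NOT the course data): 3 heights x 2 widths x 3 stores.
# expand.grid varies store fastest, then width, then height.
shelf <- expand.grid(
  store  = 1:3,
  width  = c("Regular", "Wide"),
  height = c("Bottom", "Middle", "Top")
)
shelf$sales <- c(58, 54, 56,  56, 59, 55,   # Bottom: Regular, Wide
                 74, 78, 76,  77, 75, 79,   # Middle: Regular, Wide
                 52, 50, 54,  51, 55, 53)   # Top:    Regular, Wide

# Main effects of height and width plus their interaction
fit <- aov(sales ~ height * width, data = shelf)
tab <- summary(fit)[[1]]
tab
```

The table has one row each for height, width, the height:width interaction, and the residuals, with the p-values in the Pr(>F) column.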

 

 

Exercise: Refer to your favourite statistics text for a solved problem on two-way ANOVA. Use the steps above to check the solution.

 

Exercise: Use this dataset with two factors (gender and machine type, possibly affecting productivity) to test the hypotheses. As before, there will be an F-ratio for each factor and also for the interaction. It may be that there is no difference between Male and Female, or between machines A and B. But if men do better on machine B and women do better on machine A, then there will be interaction. (Answer: In fact, you will notice that there is no difference between the genders or between the machines, but there is a Gender:Machine interaction with a very low p-value.)