Session 1

Basics: Introduction to R and R Commander (Rcmdr) and R functions for basic statistical models

(a) Preliminaries

Statistics is everywhere! Do you see it even in this news item?

Nanos Poll

As of mid-June 2013, Liberals had the support of 34.2% of voters, Conservatives 29.4%, and NDP 25.3%.
The article states that Nanos surveyed 816 committed voters and the poll is accurate plus or minus 3.5 percentage points, 19 times out of 20.
This is an example of a confidence interval: "19 times out of 20" corresponds to a 95% confidence level.
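
The quoted margin of error can be checked with a short calculation. The sketch below assumes simple random sampling and the conservative worst case p = 0.5; neither assumption is stated in the article, so treat the result as a rough check rather than the pollster's actual method.

```r
# Margin of error for a poll of n = 816 voters at 95% confidence.
# Assumptions (not from the article): simple random sample, worst case p = 0.5.
n <- 816
p <- 0.5
z <- qnorm(0.975)                  # about 1.96 for "19 times out of 20"
moe <- z * sqrt(p * (1 - p) / n)   # half-width of the confidence interval
round(100 * moe, 1)                # expressed in percentage points
```

Under these assumptions the result comes out near 3.4 percentage points, close to the 3.5 quoted in the article.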

a.1 Installations

I would advise consulting John Fox's installation instructions for Rcmdr for further details.

Important! If you want to save your datasets and other files, please follow these instructions:

Before downloading any of the datasets, you should create a folder (preferably under "My Documents") and give it a name similar to the dataset.
Save your dataset to the folder you created.
After starting Rcmdr, click "File > Change Working Directory..." and point at the folder you have created for that dataset.
Any files you save from within Rcmdr will then appear in the folder you have created.
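
The same steps can also be typed at the R prompt. This is only a sketch: the folder name is hypothetical, and it is created inside a temporary directory so that the commands run on any machine.

```r
# Typed equivalent of "File > Change Working Directory...".
# The folder name "DirectMarketing" is hypothetical; it lives in a
# temporary directory here so the sketch runs anywhere.
data_dir <- file.path(tempdir(), "DirectMarketing")
dir.create(data_dir, showWarnings = FALSE)

old_dir <- setwd(data_dir)   # setwd() returns the previous working directory
getwd()                      # confirm where files will now be saved

# Anything saved with a relative file name lands in the new folder
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
list.files()

setwd(old_dir)               # restore the original working directory
```

In practice you would replace data_dir with the folder you created under "My Documents".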

Instructions for downloading and installing R and Rcmdr on Windows. [Link for downloading R.]

Here are the screenshots for the steps to install R and Rcmdr:

First, uninstall earlier versions of R (if applicable):

Screenshots of step-by-step instructions to uninstall earlier version of R

Install R:

Screenshots of step-by-step instructions to install R

Note 1: After the "Select Additional Tasks" window, R will install several files on your computer.

Note 2: After the "Completing the R for Windows ..." window, R is installed on your computer. Now go to your desktop, right-click the R icon, and choose "Run as administrator".

Install Rcmdr (from within R):

Screenshot of step-by-step instructions to install Rcmdr
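
For readers who prefer typing to menus, the installation can also be done from the R console. The helper below is hypothetical (it is not part of R or Rcmdr) and simply wraps two standard calls, install.packages and library; it needs an internet connection the first time it is run.

```r
# Hypothetical helper (not part of R or Rcmdr): install Rcmdr only if it
# is missing, then load it.
install_rcmdr <- function() {
  if (!requireNamespace("Rcmdr", quietly = TRUE)) {
    # dependencies = TRUE also fetches the packages Rcmdr relies on
    install.packages("Rcmdr", dependencies = TRUE)
  }
  library(Rcmdr)   # loading the package opens the R Commander window
}

# install_rcmdr()  # uncomment to run
```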

Note: For R's Mac OS X and Linux/Unix installation instructions, please click here.

Note: For Rcmdr's Mac OS X and Linux/Unix installation instructions, please click here.

As of 2016-08-03, the current version of R is 3.3.1, and of Rcmdr 2.3-0.

***

Wolfgang Jank's book Business Analytics for Managers (Use R!) may be useful in Chapters 1 and 12.

(b) R functions

The "theory" (i.e., background material) behind the techniques described below will be given after each example.

b.1 Graphics in R

Example: Let's use the Direct marketing data set [Table 2.6 DirectMarketing.csv] (as a .csv file) to plot some amazing graphs via Rcmdr. Graphics obtained from Rcmdr for this dataset are here as a .pdf file.

R documentation for scatterplotMatrix.
Wikipedia.
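
Rcmdr's graph menus generate ordinary R commands behind the scenes. The sketch below gives a flavour of them using base R only. The data frame is a toy stand-in: its column names and values are made up for illustration and are not taken from the direct marketing file.

```r
# Toy data standing in for the real file; column names and values are
# made up for illustration only.
set.seed(1)
toy <- data.frame(
  Salary   = round(runif(50, 20000, 120000)),
  Catalogs = sample(c(6, 12, 18, 24), 50, replace = TRUE)
)
toy$AmountSpent <- round(0.02 * toy$Salary + 20 * toy$Catalogs +
                         rnorm(50, mean = 0, sd = 300))

# With a real file one would read it instead, for example:
# toy <- read.csv(file.choose())

# Scatterplot matrix in base R; Rcmdr uses scatterplotMatrix() from the
# car package for a richer version of the same display.
pairs(toy, main = "Scatterplot matrix (toy data)")

# Histogram of a single variable
hist(toy$AmountSpent, main = "Histogram of AmountSpent", xlab = "AmountSpent")
```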

Exercise: Now use this House price data set [Table 2.1 HousePrices.csv] (as a .csv file) and generate graphics as we did above. Graphics obtained from Rcmdr for this dataset are here.

The next set of problems can also be solved using MegaStat, as we will do in Chapters 1 to 10 in MFin 604.

b.2 Probability calculations

Binomial distribution

Example: Historical records indicate that 40% of all customers who enter a discount store make a purchase. What is the probability that two of the next three customers will make a purchase?

This is a binomial problem. The result is 0.288 as shown in this file. This is obtained in Rcmdr via the steps, Distributions > Discrete Distributions > Binomial Distribution > ... Try to replicate the results.

R documentation for binomial distribution.
Background material for binomial probabilities. Wikipedia.
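
The Rcmdr menu steps correspond to the R function dbinom. As a minimal sketch, the discount store example with n = 3 customers and purchase probability 0.4 can be reproduced at the console:

```r
# P(X = 2) for X ~ Binomial(n = 3, p = 0.4)
p_two <- dbinom(2, size = 3, prob = 0.4)
p_two                                   # 0.288

# Same value from the formula C(3,2) * 0.4^2 * 0.6^1
choose(3, 2) * 0.4^2 * 0.6

# Full probability table for k = 0, 1, 2, 3 purchases
data.frame(k = 0:3, Pr = dbinom(0:3, size = 3, prob = 0.4))
```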

Exercise: Here is a more challenging problem from the healthcare area involving the testing of a new drug. Find the solution using Rcmdr. (Answer: 0.74)

*****

Normal distribution

Example: The distribution of fuel efficiencies of a new model car, X, is normal. Assuming that X ~ N(7.13,0.27^2), find the probability Pr(6.75 < X < 7.49). Here is the result from Rcmdr.

R documentation for normal distribution.
Background material for normal distribution. Wikipedia.
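
At the console, the probability of an interval under a normal distribution is the difference of two cumulative probabilities from pnorm. A minimal sketch for the fuel efficiency example, with mean 7.13 and standard deviation 0.27:

```r
# Pr(6.75 < X < 7.49) for X ~ N(7.13, 0.27^2)
mu    <- 7.13
sigma <- 0.27
p_between <- pnorm(7.49, mean = mu, sd = sigma) -
             pnorm(6.75, mean = mu, sd = sigma)
p_between   # roughly 0.83
```

The companion function qnorm goes the other way, from a probability to a quantile.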

Exercise: Weekly demand at a grocery store for a particular brand of energy drink cans is N(1000,100^2). How many cans should be stocked so that there is a 5% chance of a stockout? (Answer: 1165 cans)

b.3 Confidence intervals

Example: (Population mean) A company's financial health is (usually) measured using its debt-to-equity ratio. A bank has collected the debt-to-equity ratios of n = 15 of its commercial accounts in this file. Let's find the 95% confidence interval (CI) using Rcmdr. (This is done via the t-test command, which includes the CI.) Here is the result.

R documentation for t.test.

Background material for confidence interval for population mean (normal population). Wikipedia.
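
To see what the t-test command does behind the scenes, the sketch below computes a 95% CI by hand and compares it with the interval reported by t.test. The 15 values are made up for illustration; they are not the bank's data.

```r
# Made-up debt-to-equity ratios for 15 accounts (NOT the bank's data)
ratio <- c(1.2, 0.9, 1.5, 1.1, 1.3, 1.8, 1.4, 1.0,
           1.6, 1.2, 1.3, 1.1, 1.7, 1.4, 1.5)
n <- length(ratio)

# By hand: sample mean +/- t quantile * standard error
half_width <- qt(0.975, df = n - 1) * sd(ratio) / sqrt(n)
ci_manual  <- mean(ratio) + c(-1, 1) * half_width

# The same interval as reported by t.test
ci_ttest <- t.test(ratio, conf.level = 0.95)$conf.int

ci_manual
ci_ttest
```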

Exercise: Refer to your favourite statistics text for a solved problem on the CI for population mean. Use the steps above to check the solution.

Exercise: Let's find a 95% CI for the true average weight of SlimPhone. The population s.d. is known to be 0.6 g. We take a sample of n = 5, and find the sample mean to be 70.12 g. (Answer: [69.594,70.646])
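
Because the population s.d. is taken as known in the SlimPhone exercise, the interval is based on the normal distribution rather than the t distribution, so t.test does not apply directly. A minimal sketch for checking the stated answer from the summary numbers:

```r
# 95% CI for the mean when the population s.d. is known (z-interval)
xbar  <- 70.12   # sample mean
sigma <- 0.6     # known population s.d.
n     <- 5       # sample size

z    <- qnorm(0.975)
ci_z <- xbar + c(-1, 1) * z * sigma / sqrt(n)
round(ci_z, 3)   # 69.594 70.646
```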

*****

Example: (Population proportion) The CI for population proportion is easy to obtain. Suppose you poll 1000 people and 340 of them state that they would vote Liberal, if the election were held today. Here is what we do to find a 95% CI:

> prop.test(340,1000)

1-sample proportions test with continuity correction

data: 340 out of 1000, null probability 0.5
X-squared = 101.761, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3108142 0.3704312
sample estimates:
p
0.34

So, the sample proportion is 0.34, with a 95% CI of [0.3108,0.3704], i.e., a margin of error of about 3 percentage points.

R documentation for prop.test.
Background material for confidence interval for population proportion. Wikipedia.
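
The interval from prop.test is a score (Wilson) interval with continuity correction, so it differs slightly from the textbook formula p ± z·sqrt(p(1 − p)/n). The sketch below extracts the prop.test interval and computes the textbook (Wald) interval for comparison; the variable names are illustrative.

```r
# Interval reported by prop.test (score interval, continuity-corrected)
res      <- prop.test(340, 1000)
ci_score <- as.numeric(res$conf.int)

# Textbook (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
n       <- 1000
p_hat   <- 340 / n
ci_wald <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

ci_score
ci_wald
```

The two intervals agree to about two decimal places for a sample of this size.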

Exercise: Refer to your favourite statistics text for a solved problem on the CI for population proportion. Use the steps above to check the solution.

Note: The material below will not be discussed in the Workshop. But as this material may be useful in the MFin 604 course, it will remain online.

b.4 Hypothesis testing

Hypothesis testing in R (with one or two populations) still uses the t.test function described above. We now discuss a problem with two populations.

Example: The data set concerns the weight losses experienced by dieters using the Atkins diet or the conventional diet. We want to test the hypothesis that there is no difference between the two methods. Here are the results from Rcmdr.

R documentation for t.test.
Background material for hypothesis testing. Wikipedia.
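
A two-sample comparison of this kind can be run at the console with the formula interface of t.test. The data below are made up for illustration and are not the course data set; the column names loss and method are likewise hypothetical.

```r
# Made-up weight losses for two diets (NOT the course data set)
diet <- data.frame(
  loss   = c(12.1, 9.8, 14.3, 11.0, 13.5, 10.7,
             8.2, 7.5, 9.9, 6.8, 10.1, 8.8),
  method = factor(rep(c("Atkins", "Conventional"), each = 6))
)

# H0: the two population means are equal.
# t.test uses the Welch (unequal variances) version by default.
res <- t.test(loss ~ method, data = diet)

res$statistic   # t statistic
res$p.value     # two-sided p-value
res$conf.int    # 95% CI for the difference in means
```

For a single sample, the same function takes the hypothesized mean through its mu argument, e.g. t.test(x, mu = 750).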

Exercise: Refer to your favourite statistics text for a solved problem on hypothesis testing for population mean(s). Use the steps above to check the solution.

Exercise: Use the following data values to test the hypothesis that the true mean is 750 vs. the hypothesis that it differs from 750: (801,814,784,836,820). What is the p-value? (Answer: p = .0023; so reject the null)

b.5 ANOVA

The final example involves analysis of variance (ANOVA) where we want to test the null hypothesis that three or more population means are equal. Once again, Rcmdr solves this problem quite admirably.

Example: (One-way ANOVA) Let's consider a problem from agricultural testing in an experimental farm. We want to test the null hypothesis that low (L), medium (M) or high (H) fertilizer levels do not differ in their effects on the average yield of a new plant. We have 18 small plots, and on six randomly selected plots we use L, the other six M and the last six H. Here is the data set for this example. Rcmdr uses the R function aov and finds these results.

R documentation for aov.
Background material for ANOVA. Wikipedia.
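
The call that Rcmdr issues has roughly the following shape. The yields below are made up for illustration (they are not the course data set), and the column names yield and fertilizer are hypothetical.

```r
# Made-up yields for 18 plots, six per fertilizer level (NOT the course data)
farm <- data.frame(
  yield = c(14, 15, 13, 16, 15, 14,    # low
            18, 17, 19, 18, 20, 17,    # medium
            19, 21, 20, 22, 20, 21),   # high
  fertilizer = factor(rep(c("L", "M", "H"), each = 6),
                      levels = c("L", "M", "H"))
)

# One-way ANOVA: H0 says the three population mean yields are equal
fit <- aov(yield ~ fertilizer, data = farm)
summary(fit)

# The p-value sits in the "Pr(>F)" column of the ANOVA table
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```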

Exercise: Refer to your favourite statistics text for a solved problem on one-way ANOVA. Use the steps above to check the solution.

Exercise: Use this hypothetical dataset with four samples (A, B, C, and D) to do a one-way ANOVA on the equality of the means. What is the p-value? (Answer: p = 0.00146, so reject the null)

*****

Example: (Two-way ANOVA) Note that the above example is a one-factor ANOVA problem. R can of course deal with multi-factor problems, too. Consider the data in this file where we have two factors: Shelf display height (Bottom, Middle, Top) and shelf display width (Regular, Wide). The numbers in each cell correspond to sample group means (sales) for the last six months from three different stores. This is a 2 x 3 (or, 3 x 2) design.

We want to test three hypotheses: (i) There is no interaction between the factors, (ii) Factor 1 (height) has no effect on sales, (iii) Factor 2 (width) has no effect on sales. These results show that (i) little or no interaction exists between height and width (high p-value), (ii) height affects the sales (p-value near zero), (iii) width does not affect sales (high p-value). Here, the p-values are denoted by Pr(>F).

R documentation for aov.
Wikipedia article on two-way ANOVA.
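
In R, a two-factor model with interaction is written with the * operator in the formula. The sketch below uses made-up sales figures laid out like the shelf display design (three heights, two widths, three stores per cell); the numbers and column names are hypothetical and are not taken from the data file.

```r
# Made-up sales for a 3 x 2 design with three stores per cell
# (NOT the course data; values and column names are hypothetical)
shelf <- expand.grid(
  store  = 1:3,
  width  = c("Regular", "Wide"),
  height = c("Bottom", "Middle", "Top")
)
shelf$sales <- c(58, 53, 57,   55, 60, 56,    # Bottom: Regular, Wide
                 73, 78, 75,   76, 74, 79,    # Middle: Regular, Wide
                 52, 49, 54,   54, 51, 50)    # Top:    Regular, Wide

# height * width expands to height + width + height:width,
# so the table has one F test per factor plus one for the interaction
fit2 <- aov(sales ~ height * width, data = shelf)
summary(fit2)
```

The summary table lists one row each for height, width, and the height:width interaction, followed by the residuals.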

Exercise: Refer to your favourite statistics text for a solved problem on two-way ANOVA. Use the steps above to check the solution.

Exercise: Use this dataset with two factors (gender and machine type, possibly affecting productivity) to test the hypotheses. As before, there will be an F-ratio for each factor and also for the interaction. It may be that there is no difference between Male and Female, or between machines A and B. But if men do better on machine B and women do better on machine A, then there will be interaction. (Answer: In fact, you will notice that there is no difference between genders or between machines, but there is a Gender:Machine interaction with a very low p-value.)