Statistics is everywhere! Do you see it even in this news item?
As of 2018-07-27, the current version of R is 3.5.1, and that of Rcmdr is 2.4-4.
If you want to see and hear videos where I explain how to install R and Rcmdr, please visit the following link:
The Install Videos. These videos show the installations of R and Rcmdr on PC and Mac computers.
I would advise consulting John Fox's installation instructions for Rcmdr for further details.
Important! If you want to save your datasets and other files, please follow these instructions:
Instructions for downloading and installing R and Rcmdr on Windows. [Link for downloading R.]
Here are the screenshots for the above steps to install R and Rcmdr:
First, uninstall earlier versions of R (if applicable):
Screenshots of step-by-step instructions to uninstall earlier version of R
Install R:
Screenshots of step-by-step instructions to install R
Install Rcmdr (from within R):
Screenshot of step-by-step instructions to install Rcmdr
Note: For R's Mac OS X and Linux/Unix installation instructions, please click here.
Note: For Rcmdr's Mac OS X and Linux/Unix installation instructions, please click here.
***
Wolfgang Jank's book Business Analytics for Managers (Use R!) may be useful in Chapters 1 and 12.
The "theory" (i.e., background material) behind the techniques described below will be given after each example.
Example: Let's use the Direct marketing data set [Table 2.6 DirectMarketing.csv] (as a .csv file) to plot some amazing graphs via Rcmdr. Graphics obtained from Rcmdr for this data set are here as a .pdf file.
Exercise: Now use this House price data set [Table 2.1 HousePrices.csv] (as a .csv file) and generate graphics as we did above. Graphics obtained from Rcmdr for this data set are here.
Binomial distribution
Example: Historical records indicate that 40% of all customers who enter a discount store make a purchase. What is the probability that two of the next three customers will make a purchase?
This is a binomial problem with n = 3 trials (customers) and success probability p = 0.4. The result is 0.288 as shown in this file. This is obtained in Rcmdr via the steps Distributions > Discrete Distributions > Binomial Distribution > ... Try to replicate the results.
Exercise: Here is a more challenging problem from the healthcare area involving the testing of a new drug. Find the solution using Rcmdr. (Answer: 0.74)
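The same number can be checked directly in the R console. This is a minimal sketch using base R's dbinom function; the variable names are only illustrative and are not part of the Rcmdr output.

```r
# Binomial setup from the example: 3 customers, each buys with probability 0.4
n <- 3
p <- 0.4

# P(exactly 2 of the 3 customers make a purchase)
prob_two <- dbinom(2, size = n, prob = p)

# The same probability from the formula C(3, 2) * p^2 * (1 - p)^1
prob_formula <- choose(n, 2) * p^2 * (1 - p)^(n - 2)

print(prob_two)      # about 0.288, the value quoted above
print(prob_formula)  # agrees with dbinom
```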
*****
Example: Distribution of fuel efficiencies of a new model car, X, is normal. Assuming that X ~ N(7.13,0.27^2) find the probability Pr(6.75 < X < 7.49). Here is the result from Rcmdr.
Exercise: Weekly demand at a grocery store for a particular brand of energy drink cans is N(1000,100^2). How many cans should be stocked so that there is a 5% chance of a stockout? (Answer: 1165 cans)
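Both normal-distribution problems above can also be checked in the R console. This is a sketch with base R's pnorm and qnorm; the variable names are illustrative, and rounding the stock level up to a whole can is an assumption about how the quoted answer was obtained.

```r
# Example: X ~ N(7.13, 0.27^2); find Pr(6.75 < X < 7.49)
mu <- 7.13
sigma <- 0.27
prob_between <- pnorm(7.49, mean = mu, sd = sigma) -
  pnorm(6.75, mean = mu, sd = sigma)
print(prob_between)

# Exercise: demand ~ N(1000, 100^2); a 5% stockout chance means
# stocking at the 95th percentile of demand
stock_level <- qnorm(0.95, mean = 1000, sd = 100)
cans <- ceiling(stock_level)  # round up to a whole number of cans
print(cans)
```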
Example: (Population mean) A company's financial health is (usually) measured using its debt-to-equity ratio. A bank has collected n = 15 of its commercial accounts in this file. Let's find the 95% confidence interval (CI) using Rcmdr. (This is done via the t-test command, which includes the CI.) Here is the result.
Exercise: Refer to your favourite statistics text for a solved problem on the CI for population mean. Use the steps above to check the solution.
Exercise: Let's find a 95% CI for the true average weight of SlimPhone. The population s.d. is known to be 0.6 gr. We take a sample of n = 5, and find the sample mean as 70.12 gr. (Answer: [69.594,70.646])
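Because the population s.d. is known in the SlimPhone exercise, the interval uses the normal quantile rather than the t quantile. A minimal sketch of that calculation in base R (variable names are illustrative):

```r
# Known-sigma (z) confidence interval: x_bar +/- z * sigma / sqrt(n)
x_bar <- 70.12   # sample mean
sigma <- 0.6     # known population s.d.
n <- 5           # sample size

z <- qnorm(0.975)                  # two-sided 95% critical value
half_width <- z * sigma / sqrt(n)  # margin of error
ci <- c(x_bar - half_width, x_bar + half_width)

print(round(ci, 3))
```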
*****
Example: (Population proportion) The CI for population proportion is easy to obtain. Suppose you poll 1000 people and 340 of them state that they would vote Liberal, if the election were held today. Here is what we do to find a 95% CI:
> prop.test(340,1000)
1-sample proportions test with continuity correction
data: 340 out of 1000, null probability 0.5
X-squared = 101.761, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3108142 0.3704312
sample estimates:
p
0.34
So, the sample proportion is 0.34, with a 95% CI of [0.3108,0.3704], i.e., a margin of error of about 3%.
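The pieces of the prop.test result can also be pulled out by name, which makes the "about 3%" margin of error easy to verify. This is a sketch; the comparison with the simple normal-approximation margin is an added illustration, not part of the output above.

```r
# Re-run the test and keep the result object
res <- prop.test(340, 1000)

ci <- as.numeric(res$conf.int)  # lower and upper 95% limits
p_hat <- unname(res$estimate)   # sample proportion, 0.34

# Half the width of the reported interval
half_width <- diff(ci) / 2

# Textbook normal-approximation margin of error, for comparison
moe_normal <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / 1000)

print(ci)
print(half_width)
print(moe_normal)
```

Both margins come out close to 0.03, in line with the interpretation above.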
Exercise: Refer to your favourite statistics text for a solved problem on the CI for population proportion. Use the steps above to check the solution.
Note: The material below will not be discussed in the Workshop. But as this material may be useful in the MFin 604 course, it will remain online.
Hypothesis testing in R (with one or two populations) still uses the t.test function described above. We now discuss a problem with two populations.
Example: The data set concerns the weight losses experienced by dieters using the Atkins diet or the conventional diet. We want to test the hypothesis that there is no difference between the two methods. Here are the results from Rcmdr.
Exercise: Refer to your favourite statistics text for a solved problem on hypothesis testing for population mean(s). Use the steps above to check the solution.
Exercise: Use the following data values to test the hypothesis that the true mean is 750 vs. the hypothesis that it differs from 750: (801,814,784,836,820). What is the p-value? (Answer: p = .0023; so reject the null)
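The exercise with the five data values can be checked directly with t.test in the R console. A minimal sketch (the variable names are illustrative):

```r
# Data values from the exercise
x <- c(801, 814, 784, 836, 820)

# Two-sided one-sample t-test of H0: mu = 750
res <- t.test(x, mu = 750)

t_stat <- unname(res$statistic)  # t statistic on 4 degrees of freedom
p_value <- res$p.value

print(res)
print(round(p_value, 4))
```

The printed result also includes the 95% CI for the mean, which is how the CI in the population mean example was obtained.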
The final example involves analysis of variance (ANOVA) where we want to test the null hypothesis that three or more population means are equal. Once again, Rcmdr solves this problem quite admirably.
Example: (One-way ANOVA) Let's consider a problem from agricultural testing in an experimental farm. We want to test the null hypothesis that low (L), medium (M) or high (H) fertilizer levels do not differ in their effects on the average yield of a new plant. We have 18 small plots; we apply L to six randomly selected plots, M to another six, and H to the remaining six. Here is the data set for this example. Rcmdr uses the R function aov and finds these results.
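For readers who want to see the underlying aov call, here is a sketch of a one-way ANOVA with the same layout (three levels, six plots each). The yields below are simulated placeholder values, NOT the workshop data set, and the names fert, level and yield are hypothetical; substitute the real data to reproduce the Rcmdr results.

```r
# SIMULATED placeholder data: 18 plots, 6 per fertilizer level (not the real data)
set.seed(1)
fert <- data.frame(
  level = factor(rep(c("L", "M", "H"), each = 6), levels = c("L", "M", "H")),
  yield = c(rnorm(6, mean = 10, sd = 2),
            rnorm(6, mean = 12, sd = 2),
            rnorm(6, mean = 15, sd = 2))
)

# One-way ANOVA: does mean yield differ across fertilizer levels?
fit <- aov(yield ~ level, data = fert)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame
p_value <- tab[["Pr(>F)"]][1]     # p-value for the level factor

print(summary(fit))
```

With three groups and 18 plots, the table has 2 degrees of freedom for the factor and 15 for the residuals.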
Exercise: Refer to your favourite statistics text for a solved problem on one-way ANOVA. Use the steps above to check the solution.
Exercise: Use this hypothetical dataset with four samples (A, B, C, and D) to do a one-way ANOVA on the equality of the means. What is the p-value? (Answer: p = 0.00146, so reject the null)
*****
Example: (Two-way ANOVA) Note that the above example is a one-factor ANOVA problem. R can of course deal with multi-factor problems, too. Consider the data in this file where we have two factors: Shelf display height (Bottom, Middle, Top) and shelf display width (Regular, Wide). The numbers in each cell correspond to sample group means (sales) for the last six months from three different stores. This is a 2 x 3 (or, 3 x 2) design.
We want to test three hypotheses: (i) There is no interaction between the factors, (ii) Factor 1 (height) has no effect on sales, (iii) Factor 2 (width) has no effect on sales. These results show that (i) little or no interaction exists between height and width (high p-value), (ii) height affects sales (p-value near zero), (iii) width does not affect sales (high p-value). Here, the p-values are denoted by Pr(>F).
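The three hypotheses correspond to the three non-residual rows of a two-way ANOVA table. Here is a sketch of the aov call for a 3 x 2 design with three stores per cell. The sales figures are simulated placeholder values, NOT the data in the file, and the names shelf, store, height, width and sales are hypothetical.

```r
# SIMULATED placeholder data: 3 heights x 2 widths x 3 stores (not the real data)
set.seed(2)
shelf <- expand.grid(
  store  = 1:3,
  height = c("Bottom", "Middle", "Top"),
  width  = c("Regular", "Wide")
)
shelf$sales <- rnorm(nrow(shelf), mean = 50, sd = 5)

# height * width expands to both main effects plus their interaction
fit2 <- aov(sales ~ height * width, data = shelf)
tab2 <- summary(fit2)[[1]]

# Rows: height, width, height:width, Residuals; p-values are in column Pr(>F)
print(summary(fit2))
```

Because the placeholder sales are pure noise, the p-values here carry no meaning; only the structure of the table matches the results discussed above.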
Exercise: Refer to your favourite statistics text for a solved problem on two-way ANOVA. Use the steps above to check the solution.
Exercise: Use this dataset with two factors (gender and machine type, possibly affecting productivity) to test the hypotheses. As before, there will be an F-ratio for each factor and also for the interaction. It may be that there is no difference between Male and Female, or between machines A and B. But if men do better on machine B and women do better on machine A, then there will be interaction. (Answer: In fact, you will notice that there is no difference between genders or between machines, but there is a Gender:Machine interaction with a very low p-value.)