Question Details
[answered] Module 10 Indicator/ Dummy variable Regressions, and stepwi
The correct answer is Yes-- I just need to understand what numbers are used to come to this conclusion.
Is this relation statistically significant?
Yes or No
Module 10
Indicator/ Dummy variable Regressions, and stepwise procedures
WORKSHEET
Module Content:
Chapter 7 (Readings: Sections 1 to 3): Pages 273 to 293; Chapter 8 (Readings: Sections 1
to 4) Pages 311 to 328. As before, the Text is the primary reading; previews and other material
provided in this worksheet are of supplementary nature. Concepts: Listed below are previewed using these power points.
Qualitative variables
Interactions
Coding of seasonal data with/ without interactions
Introduction to variable screening procedures Tools:
Excel with Data Analysis, KPK Macros, usual Tables and Stepwise macros. Illustration and assessment work:
This assignment is about creating and using Indicator/ Dummy variables as predictors in
MLRs (Text: pages 273 to 284). The number of possible formulations for these predictor
variables multiply manifold as we propose, and include interactions among them. We will look at
variable screening procedures to limit their number, only as a first step toward model-building.
As before, we will use three problems/datasets from the Text for this Worksheet. First:
problem 7.1 (page 285) utilizing the dataset HARRIS7, to run a multiple regression and answer
the questions asked in the Text. Please note that some output generated by SAS is given in the
Text. We use Excel to replicate it, and generate additional error diagnostics on the side. Then we
run a second model looking for possible improvement to the Model, adding gender-interaction
variables. The mock-quiz in this Worksheet is based on these two models. Please use similar
approach to the second problem 7.7 (page 301) to analyze the work orders data at the Texas
Christian University (WORKORD7). ?BUILDING? type will need to be coded using dummy
variables. Additional explanatory variables are NOT required to answer the questions raised in
the text. The preparatory quiz (PQ010) will be based on this second problem. Finally, the third
problem is 7.10 (page 302) using the dataset BIGTEX7 to explore possible gender discrimination
in wages. Only to ensure uniformity, an exact procedure to follow is laid out later in this
Worksheet. The assessment quiz (TQ010) will be based on this problem.
Problem 7.1 uses only one dummy/ logical variable. The data relates to salaries of Harris
Corp. employees (SALARY) by their education (EDUCAT), prior experience (EXPER), service
(MONTHS), and gender (coded as MALES = 1 if male; = 0 if female). [For this coding,
?female? is the base or reference; the coefficient of ?MALES? will thus give us the estimate of
the difference between male and female salaries, considering the effects of the other three
variables (EDUCAT, EXPER, and MONTHS).] First, run the MLR for SALARY using all the
given independent variables, at 5% significance level. The output (given on next page) can be
used to answer all the questions (parts a through e). Normal plot of errors is ensures normality.
Observe the low R-square value. To explore any possible improvement, we will run a second regression including additional variables representing interactions between gender and each of
the other three variables. (Note that this model assumes persistent gender discrimination, even
after hiring.) While running this second MLR using the KPK macros, please check VIF?s for any
multicollinearity. The output for the first part follows:
SUMMARY OUTPUT
Regression Statistics
Multiple R
R Square
Adjusted R Square
Standard Error
Observations 0.71
0.51
0.49
507.42
93 ANOVA
df
4
88
92 SS
23665351.39
22657938.93
46323290.32 Coefficients
3526.42
90.02
1.27
23.41
722.46 Standard
Error
327.73
24.69
0.59
5.20
117.82 Regression
Residual
Total Intercept
EDUCAT
EXPER
MONTHS
MALES MS
5916337.85
257476.58 t Stat
10.76
3.65
2.16
4.50
6.13 The error diagnostics part of the output (partially reproduced) should look as follows:
RESIDUAL OUTPUT PROBABILITY OUTPUT Observation
1
2
3
- Predicted
SALARY
4630.07
4646.30
5315.20
- Residuals
-730.07
-626.30
-1025.20
- Standard
Residuals
-1.47
-1.26
-2.07
- Percentile
0.54
1.61
2.69
- SALARY
3900
4020
4290
- The probability plot below verifies (visually) the normality of errors (straight line trend). SALARY Normal Probability Plot
10000
5000
0
0 20 40 60 80 Sample Percentile 100 120 As was stated earlier, we create the interaction variables, as products of MALES with each of the
other three variables, viz., EDUCAT, EXPER, and MONTHS. The formulae in the worksheet
shown below present a simple way to achieve this. (These relations may be defined in the first
row (#2) and copied easily through the spreadsheet.)
SALARY
3900
4020
- EDUCAT
12
10
- EXPER
0
44
- MONTHS
1
7
- MALES
0
0
- EDXMA
=B2*$E2
=B3*$E3
=B4*$E4 EXXMA
=C2*$E2
=C3*$E3
=C4*$E4 MOXMA
=D2*$E2
=D3*$E3
=D4*$E4 The output from this second MLR (being shown only in part) follows:
SUMMARY
Multiple R
R Square
Adjusted R Square
Standard Error
Observations 0.724554
0.524979
0.48586
508.7995
93 ANOVA
df
Regression
Residual
Total Intercept
EDUCAT
EXPER
MONTHS
MALES
EDXMA
EXXMA
MOXMA 7
85
92 SS
24318752.5
22004537.8
46323290.3 Coefficients
3758.743
69.61014
1.771588
21.20949
-218.322
73.12391
-0.79381
3.838269 Standard
Error
376.023935
28.5216381
0.78138422
6.34393677
842.13325
59.3035951
1.22999108
11.3778491 MS
3474108
258876.9 F
13.41992 t Stat
9.99602
2.440608
2.267243
3.34327
-0.25925
1.233044
-0.64538
0.337346 P-value
5.23E-16
0.016739
0.025913
0.001233
0.796071
0.220959
0.52042
0.736688 VIF
1.505966
1.791457
1.504053
57.49965
53.67531
3.190911
3.935882 A sample of the questions you may be expected to answer (using these and similar outputs) is
given below:
The first five (5) questions in this quiz relate to the model:
SALARY = ?o + ?1* EDUCAT + ?2 *EXPER + ?3 *MONTHS + ?4 *MALES
1. What is the minimum mean SALARY one may expect in Harris Corp.?
A. 3616.44
B. 3526.42
C. 4363.58
D. 5687.89
E. 3578.74
2. Based on the error diagnostics generated, the assumptions related to the errors appear:
A. validated
B. violated.
3. What percentage of variation in salaries is explained by the regression?
A. 71
B. 49
C. 51
D. 100
E. 53
4. What is the difference between starting male and female salaries?
A. 90.02
B. 1.27
C. 23.41
D. 722.46
E. 218 5. The standard error suggests that using this model, the SALARY could be predicted within (?)
??. dollars (rounded to the nearest 100) at 95% confidence, for this data set. (Please fill in the
closest number.)
A. 2000
B. 1500
C. 1000
D. 500
E. 2500
Using both the model outputs, the Interaction variables are to be collectively tested for their
significance.
6. What is the improvement in R2 observed, in using the Interaction model?
A. 0.5%
B. 1.5%
C. 2.5%
D. 5%
E. 48%
7. Collectively, the Interaction variables appear to be statistically ?.. in the model.
A. significant
B. insignificant
8. If we look at just the VIFs, which is the variable causing the highest (the most)
multicollinearity?
A. EDUCAT
B. EXPER
C. MONTHS
D. MALES E. None of these
9. Which two variables do you suspect will exhibit the highest pair-wise correlation?
A. MALES and EDUCAT B. EDUCAT and EDXMA
C. MALES and EDXMA
D. EXXMA and MOXMA E. INTERCEPT and MOXMA.
10. Based on the statistically better model, what salary will you predict for a male with 12 years
of EDUCATion, 6 years of EXPERience, and time from hiring = 15 MONTHS?
A. 3620 B. 3530 C. 4360 D. 5690 E. 10000 Solutions key: 1:B, 2:A, 3:C, 4:D, 5:C, 6:B, 7:B (as none is individually significant, a collective
test is superfluous), 8:D (MALES exhibits the highest VIF), 9:C (as these are the only two
variables that have high VIFs, they can only be correlated to each other: implies males are
consistently lower/ higher educated in the dataset.) 10:D (by substitution into the first equation,
the better statistical choice). Please use this link to hear an explanation of these solutions. Practice work:
For this part, please use WORKORD7. The variable BUILDING type is to be modeled using
Dummy variables (as done with GENDER, in the earlier example). Create three variables (say,
RESID, ATHLT, and ACADM; note that we will be holding Administrative buildings as the
base, with this coding). The three variables viz., RESID, ATHLT, and ACADM could be created
in columns D, E, and F (to stay adjacent to the other independent variables), their values could be
defined using the Excel function ?IF?. The function may be defined in the second row, and
copied through the spreadsheet. When coded in this manner, the data should look as:
DAYS
7
9
3
4
1 HOURS
0.50
6.50
1.00
0.50
1.00 MATERIAL
0
117
6
0
4 RESID
1
0
1
1
1 ATHLT
0
0
0
0
0 ACADM
0
1
0
0
0 BUILDING
1
3
1
1
1 As before, a possible view of the formulae version of the spreadsheet (partial) is given below: DAYS
7
9
3
4 HOURS
0.5
6.5
1
0.5 MATERIAL
0
117
6
0 RESID
=IF($G2=1,1,0)
=IF($G3=1,1,0)
=IF($G4=1,1,0)
=IF($G5=1,1,0) ATHLT
=IF($G2=2,1,0)
=IF($G3=2,1,0)
=IF($G4=2,1,0)
=IF($G5=2,1,0) ACADM
=IF($G2=3,1,0)
=IF($G3=3,1,0)
=IF($G4=3,1,0)
=IF($G5=3,1,0) BUILDING
1
3
1
1 Please perform the MLR for DAYS on all the other variables (Exclude BUILDING as it has been
coded using indicators). Use KPK macros with the VIF option. Part of the output is shown below
for your guidance.
ANOVA
Regression
Residual
Total df
5
66
- SS
1208.657
5520.875 Intercept
HOURS
MATERIAL
RESID
ATHLT
ACADM Coefficients
2.366
0.016
2.647
0.315
- Standard
Error
0.264
2.875
3.215 MS
65.337 t Stat
0.912
3.074
2.183
0.052
0.889 This example will be used for the preparatory quiz, PQ010. Now, KPK macros do not perform
STEPWISE procedures illustrated in Chapter 8. So, we will use the two macros Statpro and Regr
to perform the stepwise selection of variables. (Alternatively, these and more details on them
may be obtained from http://smgpublish.bu.edu/pekoz/statpro.htm). Please save the routines and
open them in Excel; open the dataset MEDICORP8, and follow through add ins > Regressions>
Stepwise etc. choosing default settings where applicable, and get the following output.
Results of stepwise regression for SALES
Step 1 - Entering variable: ADV
Summary measures
Multiple R
0.9003
R-Square
0.8106
Adj R-Square
0.8024
StErr of Est
101.4173
ANOVA Table
Source
Explained
Unexplained df
1
23 SS
1012407.9431
236565.7969 MS
1012407.9431
10285.4694 F
98.4309 p-value
0.0000 Coefficient
-157.3301
2.7721 Std Err
145.1912
0.2794 t-value
-1.0836
9.9212 p-value
0.2898
0.0000 Lower
limit
-457.6810
2.1941 Regression coefficients
Constant
ADV Upper
limit
143.0208
3.3501 Step 2 - Entering variable: BONUS 0.9246
0.8549
0.8418
90.7485 Change
0.0243
0.0443
0.0394
-10.6688 % Change
2.7%
5.5%
4.9%
-10.5% df
2
22 SS
1067797.3181
181176.4219 MS
533898.6591
8235.2919 F
64.8306 p-value
0.0000 Coefficient
-516.4443
2.4732
1.8562 Std Err
189.8757
0.2753
0.7157 t-value
-2.7199
8.9832
2.5934 p-value
0.0125
0.0000
0.0166 Lower
limit
-910.2224
1.9022
0.3719 Summary measures
Multiple R
R-Square
Adj R-Square
StErr of Est
ANOVA Table
Source
Explained
Unexplained Regression coefficients
Constant
ADV
BONUS Upper
limit
-122.6662
3.0441
3.3405 Note that these results match those in Figures 8.7 (Text page 323, using MINITAB) and 8.10
(Text page 326 and 327, using SAS). You are expected to have an appreciation for the
FORWARD SELECTION, BACKWARD ELIMINATION, and their combination viz., the
STEPWISE procedures, though you will not be tested/quizzed on these. Please run the
FORWARD and BACKWARD procedures on this dataset, in order to have a feel for their
workings.
Please have a copy of the results for the MLR of DAYS using WORKORD7 data, with your
answers to problem 7.7, as you proceed to take PQ010. The Lead:
Analysis of variance that we have seen here is a more versatile procedure. In the next Module,
we will extend its application to more general settings: introducing the General Linear Modeling
procedure. Competency assessment:
Please use the data provided in the file named BIGTEX7 (problem 7.10, Text page 302). Run a regression
for SALARY on GENDER. Note that Gender appears to make significant difference to Salary. (Partial
outputs are given here only for your guidance and uniformity of the analysis. Please obtain the full results
on your own.)
ANOVA
df
Regression
Residual
Total Intercept
GENDER 1
45
46 SS
10726240
26580189
37306429 Coefficients
1385.174
955.6594 Standard
Error
160.254
224.2602 MS
10726240
590670.9 F
18.15942 t Stat
8.643616
4.261387 P-value
4.07E-11
0.000102 But this is before accounting for the other (likely, more significant) variables such as Education and
Service life. As Education is correlated to Position (R=0.8, even with ordinal data; verify using
?Correlation? from Data Analysis in Excel: also note that Graduate degree holder does not perform
Manual labor etc.) we need to use only one of EDUCAT and POSITION. So, for the second step, we
code in and use Positions 1- 4 as 4 Dummy variables (Position 5 as base) and along with YEARS and
GENDER run the second regression. This time we note only the Position variables are significant.
ANOVA
df
6
40
46 SS
27459020
9847408
37306429 Coefficients
2955.18
376.3449
36.65237
-1481.11
-1806.03
-1867.45
-1039.47 Standard
Error
533.3785
324.6119
40.13876
493.3294
394.5313
500.9966
366.1611 Regression
Residual
Total Intercept
GENDER
YEARS
pos1
pos2
pos3
pos4 MS
4576503
246185.2 F
18.58968 t Stat
5.540493
1.159369
0.913141
-3.00228
-4.57765
-3.72747
-2.83884 P-value
2.09E-06
0.25318
0.366639
0.004602
4.5E-05
0.000598
0.007081 So, for the third model we use only the 4 significant Position variables.
ANOVA
df
4
42
46 SS
27035243
10271185
37306429 Coefficients
3570
-1920
-1984.29
-2425.35
-1258 Standard
Error
221.157
361.1478
289.5624
251.5866
255.3701 Regression
Residual
Total Intercept
pos1
pos2
pos3
pos4 MS
6758811
244552 F
27.63752 t Stat
16.14238
-5.31638
-6.8527
-9.64023
-4.92618 P-value
1.29E-19
3.79E-06
2.36E-08
3.31E-12
1.35E-05 As we look for the significance of Gender, we run a fourth model with Position variables and GENDER.
ANOVA
df
Regression
Residual
Total Intercept
GENDER
pos1
pos2
pos3
pos4 5
41
46 SS
27253744
10052684
37306429 Coefficients
3335.053
293.6842
-1685.05
-2043.02
-2190.41
-1277.58 Standard
Error
333.136
311.1021
438.986
296.5394
354.122
256.5414 MS
5450749
245187.4 F
22.23095 t Stat
10.01109
0.944012
-3.83851
-6.88955
-6.18546
-4.98001 P-value
1.42E-12
0.350693
0.00042
2.36E-08
2.36E-07
1.2E-05 With these four MLR?s, we should be able to answer the ?Gender Discrimination? question posed in the
problem. The Module?s competency assessment quiz will use this problem and the full regression outputs.
Please note that this Module utilizes/ derives from all previous work, especially the nested models and the
ability to test for the significance of a few selected predictor variables, be such variables quantitative or
qualitative. Please take the module assessment quiz (TQ010):
The quiz covers the module material, and should be taken on the Date/ time-window posted in the
syllabus. Please be warned that this quiz may NOT be identical, but just at the same level of
understanding as PQ010 and the earlier ones. Please take special note that ?Incremental models?
approach is used in testing for GENDER and YEARS in presence of the position variables, together and
individually. This file may be printed using this link.
Solution details:
Answered
QUALITY
Approved
ANSWER RATING
This question was answered on: Sep 18, 2020
Solution~0001013976.zip (25.37 KB)
This attachment is locked
We have a ready expert answer for this paper which you can use for in-depth understanding, research editing or paraphrasing. You can buy it or order for a fresh, original and plagiarism-free copy from our tutoring website www.aceyourhomework.com (Deadline assured. Flexible pricing. TurnItIn Report provided)

Pay using PayPal (No PayPal account Required) or your credit card . All your purchases are securely protected by .
About this Question
STATUSAnswered
QUALITYApproved
DATE ANSWEREDSep 18, 2020
EXPERTTutor
ANSWER RATING
GET INSTANT HELP/h4>
We have top-notch tutors who can do your essay/homework for you at a reasonable cost and then you can simply use that essay as a template to build your own arguments.
You can also use these solutions:
- As a reference for in-depth understanding of the subject.
- As a source of ideas / reasoning for your own research (if properly referenced)
- For editing and paraphrasing (check your institution's definition of plagiarism and recommended paraphrase).
NEW ASSIGNMENT HELP?
Order New Solution. Quick Turnaround
Click on the button below in order to Order for a New, Original and High-Quality Essay Solutions. New orders are original solutions and precise to your writing instruction requirements. Place a New Order using the button below.
WE GUARANTEE, THAT YOUR PAPER WILL BE WRITTEN FROM SCRATCH AND WITHIN YOUR SET DEADLINE.
