## [answered] Module 10 Indicator/ Dummy variable Regressions, and stepwi

The correct answer is Yes-- I just need to understand what numbers are used to come to this conclusion.

Is this relation statistically significant?

Yes or No

Module 10

Indicator/ Dummy variable Regressions, and stepwise procedures

WORKSHEET

Module Content:

Chapter 7 (Readings: Sections 1 to 3): Pages 273 to 293; Chapter 8 (Readings: Sections 1

to 4) Pages 311 to 328. As before, the Text is the primary reading; previews and other material

provided in this worksheet are of supplementary nature. Concepts: Listed below are previewed using these power points.

Qualitative variables

Interactions

Coding of seasonal data with/ without interactions

Introduction to variable screening procedures Tools:

Excel with Data Analysis, KPK Macros, usual Tables and Stepwise macros. Illustration and assessment work:

This assignment is about creating and using Indicator/ Dummy variables as predictors in

MLRs (Text: pages 273 to 284). The number of possible formulations for these predictor

variables multiply manifold as we propose, and include interactions among them. We will look at

variable screening procedures to limit their number, only as a first step toward model-building.

As before, we will use three problems/datasets from the Text for this Worksheet. First:

problem 7.1 (page 285) utilizing the dataset HARRIS7, to run a multiple regression and answer

the questions asked in the Text. Please note that some output generated by SAS is given in the

Text. We use Excel to replicate it, and generate additional error diagnostics on the side. Then we

run a second model looking for possible improvement to the Model, adding gender-interaction

variables. The mock-quiz in this Worksheet is based on these two models. Please use similar

approach to the second problem 7.7 (page 301) to analyze the work orders data at the Texas

Christian University (WORKORD7). ?BUILDING? type will need to be coded using dummy

variables. Additional explanatory variables are NOT required to answer the questions raised in

the text. The preparatory quiz (PQ010) will be based on this second problem. Finally, the third

problem is 7.10 (page 302) using the dataset BIGTEX7 to explore possible gender discrimination

in wages. Only to ensure uniformity, an exact procedure to follow is laid out later in this

Worksheet. The assessment quiz (TQ010) will be based on this problem.

Problem 7.1 uses only one dummy/ logical variable. The data relates to salaries of Harris

Corp. employees (SALARY) by their education (EDUCAT), prior experience (EXPER), service

(MONTHS), and gender (coded as MALES = 1 if male; = 0 if female). [For this coding,

?female? is the base or reference; the coefficient of ?MALES? will thus give us the estimate of

the difference between male and female salaries, considering the effects of the other three

variables (EDUCAT, EXPER, and MONTHS).] First, run the MLR for SALARY using all the

given independent variables, at 5% significance level. The output (given on next page) can be

used to answer all the questions (parts a through e). Normal plot of errors is ensures normality.

Observe the low R-square value. To explore any possible improvement, we will run a second regression including additional variables representing interactions between gender and each of

the other three variables. (Note that this model assumes persistent gender discrimination, even

after hiring.) While running this second MLR using the KPK macros, please check VIF?s for any

multicollinearity. The output for the first part follows:

SUMMARY OUTPUT

Regression Statistics

Multiple R

R Square

Standard Error

Observations 0.71

0.51

0.49

507.42

93 ANOVA

df

4

88

92 SS

23665351.39

22657938.93

46323290.32 Coefficients

3526.42

90.02

1.27

23.41

722.46 Standard

Error

327.73

24.69

0.59

5.20

117.82 Regression

Residual

Total Intercept

EDUCAT

EXPER

MONTHS

MALES MS

5916337.85

257476.58 t Stat

10.76

3.65

2.16

4.50

6.13 The error diagnostics part of the output (partially reproduced) should look as follows:

RESIDUAL OUTPUT PROBABILITY OUTPUT Observation

1

2

3

- Predicted

SALARY

4630.07

4646.30

5315.20

- Residuals

-730.07

-626.30

-1025.20

- Standard

Residuals

-1.47

-1.26

-2.07

- Percentile

0.54

1.61

2.69

- SALARY

3900

4020

4290

- The probability plot below verifies (visually) the normality of errors (straight line trend). SALARY Normal Probability Plot

10000

5000

0

0 20 40 60 80 Sample Percentile 100 120 As was stated earlier, we create the interaction variables, as products of MALES with each of the

other three variables, viz., EDUCAT, EXPER, and MONTHS. The formulae in the worksheet

shown below present a simple way to achieve this. (These relations may be defined in the first

row (#2) and copied easily through the spreadsheet.)

SALARY

3900

4020

- EDUCAT

12

10

- EXPER

0

44

- MONTHS

1

7

- MALES

0

0

- EDXMA

=B2*\$E2

=B3*\$E3

=B4*\$E4 EXXMA

=C2*\$E2

=C3*\$E3

=C4*\$E4 MOXMA

=D2*\$E2

=D3*\$E3

=D4*\$E4 The output from this second MLR (being shown only in part) follows:

SUMMARY

Multiple R

R Square

Standard Error

Observations 0.724554

0.524979

0.48586

508.7995

93 ANOVA

df

Regression

Residual

Total Intercept

EDUCAT

EXPER

MONTHS

MALES

EDXMA

EXXMA

MOXMA 7

85

92 SS

24318752.5

22004537.8

46323290.3 Coefficients

3758.743

69.61014

1.771588

21.20949

-218.322

73.12391

-0.79381

3.838269 Standard

Error

376.023935

28.5216381

0.78138422

6.34393677

842.13325

59.3035951

1.22999108

11.3778491 MS

3474108

258876.9 F

13.41992 t Stat

9.99602

2.440608

2.267243

3.34327

-0.25925

1.233044

-0.64538

0.337346 P-value

5.23E-16

0.016739

0.025913

0.001233

0.796071

0.220959

0.52042

0.736688 VIF

1.505966

1.791457

1.504053

57.49965

53.67531

3.190911

3.935882 A sample of the questions you may be expected to answer (using these and similar outputs) is

given below:

The first five (5) questions in this quiz relate to the model:

SALARY = ?o + ?1* EDUCAT + ?2 *EXPER + ?3 *MONTHS + ?4 *MALES

1. What is the minimum mean SALARY one may expect in Harris Corp.?

A. 3616.44

B. 3526.42

C. 4363.58

D. 5687.89

E. 3578.74

2. Based on the error diagnostics generated, the assumptions related to the errors appear:

A. validated

B. violated.

3. What percentage of variation in salaries is explained by the regression?

A. 71

B. 49

C. 51

D. 100

E. 53

4. What is the difference between starting male and female salaries?

A. 90.02

B. 1.27

C. 23.41

D. 722.46

E. 218 5. The standard error suggests that using this model, the SALARY could be predicted within (?)

??. dollars (rounded to the nearest 100) at 95% confidence, for this data set. (Please fill in the

closest number.)

A. 2000

B. 1500

C. 1000

D. 500

E. 2500

Using both the model outputs, the Interaction variables are to be collectively tested for their

significance.

6. What is the improvement in R2 observed, in using the Interaction model?

A. 0.5%

B. 1.5%

C. 2.5%

D. 5%

E. 48%

7. Collectively, the Interaction variables appear to be statistically ?.. in the model.

A. significant

B. insignificant

8. If we look at just the VIFs, which is the variable causing the highest (the most)

multicollinearity?

A. EDUCAT

B. EXPER

C. MONTHS

D. MALES E. None of these

9. Which two variables do you suspect will exhibit the highest pair-wise correlation?

A. MALES and EDUCAT B. EDUCAT and EDXMA

C. MALES and EDXMA

D. EXXMA and MOXMA E. INTERCEPT and MOXMA.

10. Based on the statistically better model, what salary will you predict for a male with 12 years

of EDUCATion, 6 years of EXPERience, and time from hiring = 15 MONTHS?

A. 3620 B. 3530 C. 4360 D. 5690 E. 10000 Solutions key: 1:B, 2:A, 3:C, 4:D, 5:C, 6:B, 7:B (as none is individually significant, a collective

test is superfluous), 8:D (MALES exhibits the highest VIF), 9:C (as these are the only two

variables that have high VIFs, they can only be correlated to each other: implies males are

consistently lower/ higher educated in the dataset.) 10:D (by substitution into the first equation,

the better statistical choice). Please use this link to hear an explanation of these solutions. Practice work:

For this part, please use WORKORD7. The variable BUILDING type is to be modeled using

Dummy variables (as done with GENDER, in the earlier example). Create three variables (say,

RESID, ATHLT, and ACADM; note that we will be holding Administrative buildings as the

base, with this coding). The three variables viz., RESID, ATHLT, and ACADM could be created

in columns D, E, and F (to stay adjacent to the other independent variables), their values could be

defined using the Excel function ?IF?. The function may be defined in the second row, and

copied through the spreadsheet. When coded in this manner, the data should look as:

DAYS

7

9

3

4

1 HOURS

0.50

6.50

1.00

0.50

1.00 MATERIAL

0

117

6

0

4 RESID

1

0

1

1

1 ATHLT

0

0

0

0

0

1

0

0

0 BUILDING

1

3

1

1

1 As before, a possible view of the formulae version of the spreadsheet (partial) is given below: DAYS

7

9

3

4 HOURS

0.5

6.5

1

0.5 MATERIAL

0

117

6

0 RESID

=IF(\$G2=1,1,0)

=IF(\$G3=1,1,0)

=IF(\$G4=1,1,0)

=IF(\$G5=1,1,0) ATHLT

=IF(\$G2=2,1,0)

=IF(\$G3=2,1,0)

=IF(\$G4=2,1,0)

=IF(\$G2=3,1,0)

=IF(\$G3=3,1,0)

=IF(\$G4=3,1,0)

=IF(\$G5=3,1,0) BUILDING

1

3

1

1 Please perform the MLR for DAYS on all the other variables (Exclude BUILDING as it has been

coded using indicators). Use KPK macros with the VIF option. Part of the output is shown below

ANOVA

Regression

Residual

Total df

5

66

- SS

1208.657

5520.875 Intercept

HOURS

MATERIAL

RESID

ATHLT

2.366

0.016

2.647

0.315

- Standard

Error

0.264

2.875

3.215 MS

65.337 t Stat

0.912

3.074

2.183

0.052

0.889 This example will be used for the preparatory quiz, PQ010. Now, KPK macros do not perform

STEPWISE procedures illustrated in Chapter 8. So, we will use the two macros Statpro and Regr

to perform the stepwise selection of variables. (Alternatively, these and more details on them

may be obtained from http://smgpublish.bu.edu/pekoz/statpro.htm). Please save the routines and

open them in Excel; open the dataset MEDICORP8, and follow through add ins &gt; Regressions&gt;

Stepwise etc. choosing default settings where applicable, and get the following output.

Results of stepwise regression for SALES

Step 1 - Entering variable: ADV

Summary measures

Multiple R

0.9003

R-Square

0.8106

0.8024

StErr of Est

101.4173

ANOVA Table

Source

Explained

Unexplained df

1

23 SS

1012407.9431

236565.7969 MS

1012407.9431

10285.4694 F

98.4309 p-value

0.0000 Coefficient

-157.3301

2.7721 Std Err

145.1912

0.2794 t-value

-1.0836

9.9212 p-value

0.2898

0.0000 Lower

limit

-457.6810

2.1941 Regression coefficients

Constant

limit

143.0208

3.3501 Step 2 - Entering variable: BONUS 0.9246

0.8549

0.8418

90.7485 Change

0.0243

0.0443

0.0394

-10.6688 % Change

2.7%

5.5%

4.9%

-10.5% df

2

22 SS

1067797.3181

181176.4219 MS

533898.6591

8235.2919 F

64.8306 p-value

0.0000 Coefficient

-516.4443

2.4732

1.8562 Std Err

189.8757

0.2753

0.7157 t-value

-2.7199

8.9832

2.5934 p-value

0.0125

0.0000

0.0166 Lower

limit

-910.2224

1.9022

0.3719 Summary measures

Multiple R

R-Square

StErr of Est

ANOVA Table

Source

Explained

Unexplained Regression coefficients

Constant

BONUS Upper

limit

-122.6662

3.0441

3.3405 Note that these results match those in Figures 8.7 (Text page 323, using MINITAB) and 8.10

(Text page 326 and 327, using SAS). You are expected to have an appreciation for the

FORWARD SELECTION, BACKWARD ELIMINATION, and their combination viz., the

STEPWISE procedures, though you will not be tested/quizzed on these. Please run the

FORWARD and BACKWARD procedures on this dataset, in order to have a feel for their

workings.

Please have a copy of the results for the MLR of DAYS using WORKORD7 data, with your

answers to problem 7.7, as you proceed to take PQ010. The Lead:

Analysis of variance that we have seen here is a more versatile procedure. In the next Module,

we will extend its application to more general settings: introducing the General Linear Modeling

procedure. Competency assessment:

Please use the data provided in the file named BIGTEX7 (problem 7.10, Text page 302). Run a regression

for SALARY on GENDER. Note that Gender appears to make significant difference to Salary. (Partial

outputs are given here only for your guidance and uniformity of the analysis. Please obtain the full results

ANOVA

df

Regression

Residual

Total Intercept

GENDER 1

45

46 SS

10726240

26580189

37306429 Coefficients

1385.174

955.6594 Standard

Error

160.254

224.2602 MS

10726240

590670.9 F

18.15942 t Stat

8.643616

4.261387 P-value

4.07E-11

0.000102 But this is before accounting for the other (likely, more significant) variables such as Education and

Service life. As Education is correlated to Position (R=0.8, even with ordinal data; verify using

?Correlation? from Data Analysis in Excel: also note that Graduate degree holder does not perform

Manual labor etc.) we need to use only one of EDUCAT and POSITION. So, for the second step, we

code in and use Positions 1- 4 as 4 Dummy variables (Position 5 as base) and along with YEARS and

GENDER run the second regression. This time we note only the Position variables are significant.

ANOVA

df

6

40

46 SS

27459020

9847408

37306429 Coefficients

2955.18

376.3449

36.65237

-1481.11

-1806.03

-1867.45

-1039.47 Standard

Error

533.3785

324.6119

40.13876

493.3294

394.5313

500.9966

366.1611 Regression

Residual

Total Intercept

GENDER

YEARS

pos1

pos2

pos3

pos4 MS

4576503

246185.2 F

18.58968 t Stat

5.540493

1.159369

0.913141

-3.00228

-4.57765

-3.72747

-2.83884 P-value

2.09E-06

0.25318

0.366639

0.004602

4.5E-05

0.000598

0.007081 So, for the third model we use only the 4 significant Position variables.

ANOVA

df

4

42

46 SS

27035243

10271185

37306429 Coefficients

3570

-1920

-1984.29

-2425.35

-1258 Standard

Error

221.157

361.1478

289.5624

251.5866

255.3701 Regression

Residual

Total Intercept

pos1

pos2

pos3

pos4 MS

6758811

244552 F

27.63752 t Stat

16.14238

-5.31638

-6.8527

-9.64023

-4.92618 P-value

1.29E-19

3.79E-06

2.36E-08

3.31E-12

1.35E-05 As we look for the significance of Gender, we run a fourth model with Position variables and GENDER.

ANOVA

df

Regression

Residual

Total Intercept

GENDER

pos1

pos2

pos3

pos4 5

41

46 SS

27253744

10052684

37306429 Coefficients

3335.053

293.6842

-1685.05

-2043.02

-2190.41

-1277.58 Standard

Error

333.136

311.1021

438.986

296.5394

354.122

256.5414 MS

5450749

245187.4 F

22.23095 t Stat

10.01109

0.944012

-3.83851

-6.88955

-6.18546

-4.98001 P-value

1.42E-12

0.350693

0.00042

2.36E-08

2.36E-07

1.2E-05 With these four MLR?s, we should be able to answer the ?Gender Discrimination? question posed in the

problem. The Module?s competency assessment quiz will use this problem and the full regression outputs.

Please note that this Module utilizes/ derives from all previous work, especially the nested models and the

ability to test for the significance of a few selected predictor variables, be such variables quantitative or

qualitative. Please take the module assessment quiz (TQ010):

The quiz covers the module material, and should be taken on the Date/ time-window posted in the

syllabus. Please be warned that this quiz may NOT be identical, but just at the same level of

understanding as PQ010 and the earlier ones. Please take special note that ?Incremental models?

approach is used in testing for GENDER and YEARS in presence of the position variables, together and

individually. This file may be printed using this link.

Solution details:
STATUS
QUALITY
Approved

This question was answered on: Sep 18, 2020 Solution~0001013976.zip (25.37 KB)

This attachment is locked

We have a ready expert answer for this paper which you can use for in-depth understanding, research editing or paraphrasing. You can buy it or order for a fresh, original and plagiarism-free copy from our tutoring website www.aceyourhomework.com (Deadline assured. Flexible pricing. TurnItIn Report provided)

##### Pay using PayPal (No PayPal account Required) or your credit card . All your purchases are securely protected by .

STATUS

QUALITY

Approved

Sep 18, 2020

EXPERT

Tutor 