

The last three pages of this Assignment 10 document may be helpful.

Note that the regression is statistically significant at the 5% level. What is the difference between the male and female salaries?

430
160
224
1385
955


Module 10

Indicator/Dummy Variable Regressions and Stepwise Procedures

WORKSHEET

Module Content:

Chapter 7 (Readings: Sections 1 to 3): pages 273 to 293; Chapter 8 (Readings: Sections 1 to 4): pages 311 to 328. As before, the Text is the primary reading; previews and other material provided in this worksheet are of a supplementary nature.

Concepts:

The concepts listed below are previewed using these PowerPoints.

Qualitative variables
Interactions
Coding of seasonal data with/without interactions
Introduction to variable screening procedures

Tools:

Excel with Data Analysis, KPK Macros, usual Tables and Stepwise macros.

Illustration and assessment work:

 

This assignment is about creating and using indicator/dummy variables as predictors in MLRs (Text: pages 273 to 284). The number of possible formulations for these predictor variables multiplies manifold once we propose and include interactions among them. We will look at variable screening procedures to limit their number, only as a first step toward model-building.

As before, we will use three problems/datasets from the Text for this Worksheet. The first is problem 7.1 (page 285), which uses the dataset HARRIS7 to run a multiple regression and answer the questions asked in the Text. Please note that some output generated by SAS is given in the Text. We use Excel to replicate it and to generate additional error diagnostics on the side. Then we run a second model, adding gender-interaction variables, to look for possible improvement. The mock quiz in this Worksheet is based on these two models. Please use a similar approach for the second problem, 7.7 (page 301), to analyze the work orders data at Texas Christian University (WORKORD7). The "BUILDING" type will need to be coded using dummy variables. Additional explanatory variables are NOT required to answer the questions raised in the Text. The preparatory quiz (PQ010) will be based on this second problem. Finally, the third problem is 7.10 (page 302), which uses the dataset BIGTEX7 to explore possible gender discrimination in wages. Only to ensure uniformity, an exact procedure to follow is laid out later in this Worksheet. The assessment quiz (TQ010) will be based on this problem.

 

Problem 7.1 uses only one dummy/logical variable. The data relate to salaries of Harris Corp. employees (SALARY) by their education (EDUCAT), prior experience (EXPER), service (MONTHS), and gender (coded as MALES = 1 if male; = 0 if female). [For this coding, "female" is the base or reference; the coefficient of "MALES" will thus give us the estimate of the difference between male and female salaries, after accounting for the effects of the other three variables (EDUCAT, EXPER, and MONTHS).] First, run the MLR for SALARY using all the given independent variables, at the 5% significance level. The output (given on the next page) can be used to answer all the questions (parts a through e). The normal probability plot of the errors is used to check normality.
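The role of the base category can be sketched in a few lines of Python. This is only an illustration, assuming the coefficient estimates from the first Excel output reproduced in this worksheet; the helper name `predict_salary` is hypothetical and not part of the Text or the KPK macros.

```python
# Coefficient estimates copied from the first Excel output in this worksheet
# (SALARY regressed on EDUCAT, EXPER, MONTHS and the dummy MALES).
COEF = {
    "Intercept": 3526.42,
    "EDUCAT": 90.02,
    "EXPER": 1.27,
    "MONTHS": 23.41,
    "MALES": 722.46,
}


def predict_salary(educat, exper, months, males):
    """Fitted SALARY for one employee; `males` is the 0/1 dummy (hypothetical helper)."""
    return (
        COEF["Intercept"]
        + COEF["EDUCAT"] * educat
        + COEF["EXPER"] * exper
        + COEF["MONTHS"] * months
        + COEF["MALES"] * males
    )


# Observation 1 in the worksheet: EDUCAT=12, EXPER=0, MONTHS=1, MALES=0.
obs1 = predict_salary(12, 0, 1, 0)
print(round(obs1, 2))  # 4630.07, the Predicted SALARY listed for observation 1

# Same profile, changing only the dummy: the gap is the MALES coefficient.
gap = predict_salary(12, 0, 1, 1) - predict_salary(12, 0, 1, 0)
print(round(gap, 2))  # 722.46
```

Because the dummy enters additively, the gap is the same for any values of EDUCAT, EXPER, and MONTHS; only the intercept shifts between the two groups.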

 

Observe the low R-square value. To explore any possible improvement, we will run a second regression that includes additional variables representing interactions between gender and each of the other three variables. (Note that this model assumes persistent gender discrimination, even after hiring.) While running this second MLR using the KPK macros, please check the VIFs for any multicollinearity. The output for the first part follows:

 

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.71
R Square             0.51
Adjusted R Square    0.49
Standard Error       507.42
Observations         93

ANOVA
              df    SS             MS
Regression    4     23665351.39    5916337.85
Residual      88    22657938.93    257476.58
Total         92    46323290.32

             Coefficients    Standard Error    t Stat
Intercept    3526.42         327.73            10.76
EDUCAT       90.02           24.69             3.65
EXPER        1.27            0.59              2.16
MONTHS       23.41           5.20              4.50
MALES        722.46          117.82            6.13

The error diagnostics part of the output (partially reproduced) should look as follows:

 

RESIDUAL OUTPUT

Observation    Predicted SALARY    Residuals    Standard Residuals
1              4630.07             -730.07      -1.47
2              4646.30             -626.30      -1.26
3              5315.20             -1025.20     -2.07
-              -                   -            -

PROBABILITY OUTPUT

Percentile    SALARY
0.54          3900
1.61          4020
2.69          4290
-             -

The probability plot below verifies (visually) the normality of errors (straight-line trend).

[Figure: Normal Probability Plot; vertical axis SALARY, horizontal axis Sample Percentile]

As was stated earlier, we create the interaction variables as products of MALES with each of the other three variables, viz., EDUCAT, EXPER, and MONTHS. The formulae in the worksheet shown below present a simple way to achieve this. (These relations may be defined in the first data row (#2) and copied easily through the spreadsheet.)
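For readers who want to check the spreadsheet columns outside Excel, the same products can be sketched in Python. The first two rows are the ones shown in this worksheet; the third (male) row is purely hypothetical, added so that the products are not all zero, and the function name `add_gender_interactions` is an illustrative assumption.

```python
# Two rows copied from the worksheet, plus one hypothetical male row.
rows = [
    {"SALARY": 3900, "EDUCAT": 12, "EXPER": 0, "MONTHS": 1, "MALES": 0},
    {"SALARY": 4020, "EDUCAT": 10, "EXPER": 44, "MONTHS": 7, "MALES": 0},
    {"SALARY": 5000, "EDUCAT": 16, "EXPER": 24, "MONTHS": 10, "MALES": 1},  # hypothetical
]


def add_gender_interactions(row):
    """Return a copy of `row` with the products EDXMA, EXXMA and MOXMA appended."""
    out = dict(row)
    out["EDXMA"] = row["EDUCAT"] * row["MALES"]  # same role as =B2*$E2
    out["EXXMA"] = row["EXPER"] * row["MALES"]   # same role as =C2*$E2
    out["MOXMA"] = row["MONTHS"] * row["MALES"]  # same role as =D2*$E2
    return out


coded = [add_gender_interactions(r) for r in rows]
for r in coded:
    print(r["MALES"], r["EDXMA"], r["EXXMA"], r["MOXMA"])
```

For every female row the three products are zero, and for every male row they equal the original variables. This is also why MALES and EDXMA move together so closely, which matters when the VIFs are examined.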

 

SALARY    EDUCAT    EXPER    MONTHS    MALES    EDXMA      EXXMA      MOXMA
3900      12        0        1         0        =B2*$E2    =C2*$E2    =D2*$E2
4020      10        44       7         0        =B3*$E3    =C3*$E3    =D3*$E3
-         -         -        -         -        =B4*$E4    =C4*$E4    =D4*$E4

The output from this second MLR (shown only in part) follows:

 

SUMMARY

Multiple R           0.724554
R Square             0.524979
Adjusted R Square    0.48586
Standard Error       508.7995
Observations         93

ANOVA
              df    SS            MS          F
Regression    7     24318752.5    3474108     13.41992
Residual      85    22004537.8    258876.9
Total         92    46323290.3

             Coefficients    Standard Error    t Stat      P-value     VIF
Intercept    3758.743        376.023935        9.99602     5.23E-16
EDUCAT       69.61014        28.5216381        2.440608    0.016739    1.505966
EXPER        1.771588        0.78138422        2.267243    0.025913    1.791457
MONTHS       21.20949        6.34393677        3.34327     0.001233    1.504053
MALES        -218.322        842.13325         -0.25925    0.796071    57.49965
EDXMA        73.12391        59.3035951        1.233044    0.220959    53.67531
EXXMA        -0.79381        1.22999108        -0.64538    0.52042     3.190911
MOXMA        3.838269        11.3778491        0.337346    0.736688    3.935882

A sample of the questions you may be expected to answer (using these and similar outputs) is given below:

 

The first five (5) questions in this quiz relate to the model:

SALARY = β0 + β1*EDUCAT + β2*EXPER + β3*MONTHS + β4*MALES

1. What is the minimum mean SALARY one may expect in Harris Corp.?
A. 3616.44
B. 3526.42
C. 4363.58
D. 5687.89
E. 3578.74

2. Based on the error diagnostics generated, the assumptions related to the errors appear:
A. validated
B. violated

3. What percentage of variation in salaries is explained by the regression?
A. 71
B. 49
C. 51
D. 100
E. 53

4. What is the difference between starting male and female salaries?
A. 90.02
B. 1.27
C. 23.41
D. 722.46
E. 218

5. The standard error suggests that, using this model, the SALARY could be predicted within (±) ..... dollars (rounded to the nearest 100) at 95% confidence, for this data set. (Please fill in the closest number.)
A. 2000
B. 1500
C. 1000
D. 500
E. 2500

Using both the model outputs, the Interaction variables are to be collectively tested for their significance.

6. What is the improvement in R-square observed in using the Interaction model?
A. 0.5%
B. 1.5%
C. 2.5%
D. 5%
E. 48%

7. Collectively, the Interaction variables appear to be statistically ..... in the model.
A. significant
B. insignificant

8. If we look at just the VIFs, which is the variable causing the highest (the most) multicollinearity?
A. EDUCAT
B. EXPER
C. MONTHS
D. MALES
E. None of these

9. Which two variables do you suspect will exhibit the highest pair-wise correlation?
A. MALES and EDUCAT
B. EDUCAT and EDXMA
C. MALES and EDXMA
D. EXXMA and MOXMA
E. INTERCEPT and MOXMA

10. Based on the statistically better model, what salary will you predict for a male with 12 years of EDUCATion, 6 years of EXPERience, and time from hiring = 15 MONTHS?
A. 3620
B. 3530
C. 4360
D. 5690
E. 10000

Solutions key: 1:B, 2:A, 3:C, 4:D, 5:C, 6:B, 7:B (none is individually significant, and the partial F computed from the two ANOVA tables, about 0.84 on 3 and 85 df, is far from significant), 8:D (MALES exhibits the highest VIF), 9:C (as these are the only two variables that have high VIFs, they can only be correlated with each other; this implies males are consistently lower/higher educated in the dataset), 10:D (by substitution into the first equation, the better statistical choice). Please use this link to hear an explanation of these solutions.

Practice work:

 

For this part, please use WORKORD7. The variable BUILDING type is to be modeled using dummy variables (as done with gender in the earlier example). Create three variables (say, RESID, ATHLT, and ACADM; note that with this coding we hold Administrative buildings as the base). These three variables could be created in columns D, E, and F (to stay adjacent to the other independent variables), with their values defined using the Excel function "IF". The function may be defined in the second row and copied through the spreadsheet. When coded in this manner, the data should look as follows:

 

DAYS    HOURS    MATERIAL    RESID    ATHLT    ACADM    BUILDING
7       0.50     0           1        0        0        1
9       6.50     117         0        0        1        3
3       1.00     6           1        0        0        1
4       0.50     0           1        0        0        1
1       1.00     4           1        0        0        1

As before, a possible view of the formulae version of the spreadsheet (partial) is given below:

DAYS    HOURS    MATERIAL    RESID             ATHLT             ACADM             BUILDING
7       0.5      0           =IF($G2=1,1,0)    =IF($G2=2,1,0)    =IF($G2=3,1,0)    1
9       6.5      117         =IF($G3=1,1,0)    =IF($G3=2,1,0)    =IF($G3=3,1,0)    3
3       1        6           =IF($G4=1,1,0)    =IF($G4=2,1,0)    =IF($G4=3,1,0)    1
4       0.5      0           =IF($G5=1,1,0)    =IF($G5=2,1,0)    =IF($G5=3,1,0)    1

Please perform the MLR for DAYS on all the other variables (exclude BUILDING, as it has been coded using indicators). Use the KPK macros with the VIF option. Part of the output is shown below for your guidance.

 

ANOVA
              df    SS          MS
Regression    5     1208.657
Residual      66                65.337
Total         -     5520.875

Predictor rows: Intercept, HOURS, MATERIAL, RESID, ATHLT, ACADM
Coefficients (partial): 2.366, 0.016, 2.647, 0.315, -
Standard Error (partial): 0.264, 2.875, 3.215
t Stat (partial): 0.912, 3.074, 2.183, 0.052, 0.889

This example will be used for the preparatory quiz, PQ010.

Now, the KPK macros do not perform the STEPWISE procedures illustrated in Chapter 8. So, we will use the two macros Statpro and Regr to perform the stepwise selection of variables. (Alternatively, these and more details on them may be obtained from http://smgpublish.bu.edu/pekoz/statpro.htm.) Please save the routines and open them in Excel; open the dataset MEDICORP8, and follow through Add-ins > Regressions > Stepwise etc., choosing default settings where applicable, to get the following output.

 

Results of stepwise regression for SALES

Step 1 - Entering variable: ADV

Summary measures
Multiple R      0.9003
R-Square        0.8106
Adj R-Square    0.8024
StErr of Est    101.4173

ANOVA Table
Source         df    SS              MS              F          p-value
Explained      1     1012407.9431    1012407.9431    98.4309    0.0000
Unexplained    23    236565.7969     10285.4694

Regression coefficients
            Coefficient    Std Err     t-value    p-value    Lower limit    Upper limit
Constant    -157.3301      145.1912    -1.0836    0.2898     -457.6810      143.0208
ADV         2.7721         0.2794      9.9212     0.0000     2.1941         3.3501

Step 2 - Entering variable: BONUS

Summary measures               Change      % Change
Multiple R      0.9246         0.0243      2.7%
R-Square        0.8549         0.0443      5.5%
Adj R-Square    0.8418         0.0394      4.9%
StErr of Est    90.7485        -10.6688    -10.5%

ANOVA Table
Source         df    SS              MS             F          p-value
Explained      2     1067797.3181    533898.6591    64.8306    0.0000
Unexplained    22    181176.4219     8235.2919

Regression coefficients
            Coefficient    Std Err     t-value    p-value    Lower limit    Upper limit
Constant    -516.4443      189.8757    -2.7199    0.0125     -910.2224      -122.6662
ADV         2.4732         0.2753      8.9832     0.0000     1.9022         3.0441
BONUS       1.8562         0.7157      2.5934     0.0166     0.3719         3.3405

Note that these results match those in Figures 8.7 (Text page 323, using MINITAB) and 8.10 (Text pages 326 and 327, using SAS). You are expected to have an appreciation for the FORWARD SELECTION, BACKWARD ELIMINATION, and their combination, viz., the STEPWISE procedures, though you will not be tested/quizzed on these. Please run the FORWARD and BACKWARD procedures on this dataset, in order to have a feel for their workings.
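To get a feel for what happens at each step, the decision to bring BONUS into the model at Step 2 can be re-derived from the two ANOVA tables above. The sketch below is an illustration only: the function name `f_to_enter` is hypothetical, and the entry threshold actually applied by the Statpro defaults is not stated in this worksheet.

```python
# Unexplained (residual) sums of squares from the stepwise output above.
SSE_STEP1 = 236565.7969  # ADV only, 23 residual df
SSE_STEP2 = 181176.4219  # ADV and BONUS, 22 residual df


def f_to_enter(sse_before, sse_after, df_resid_after):
    """Partial F for one added variable: drop in SSE over the new model's MSE."""
    mse_after = sse_after / df_resid_after
    return (sse_before - sse_after) / mse_after


f_bonus = f_to_enter(SSE_STEP1, SSE_STEP2, 22)
print(round(f_bonus, 2))         # 6.73
print(round(f_bonus ** 0.5, 3))  # 2.593, the t-value reported for BONUS
```

For a single added variable the partial F equals the square of that variable's t-value, so judging entry by the t-value or p-value of the candidate (0.0166 for BONUS here) leads to the same decision.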

 

Please have a copy of the results for the MLR of DAYS using WORKORD7 data, with your answers to problem 7.7, as you proceed to take PQ010.

The Lead:

Analysis of variance that we have seen here is a more versatile procedure. In the next Module, we will extend its application to more general settings, introducing the General Linear Modeling procedure.

Competency assessment:

Please use the data provided in the file named BIGTEX7 (problem 7.10, Text page 302). Run a regression for SALARY on GENDER. Note that Gender appears to make a significant difference to Salary. (Partial outputs are given here only for your guidance and uniformity of the analysis. Please obtain the full results on your own.)
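When the only predictor is a 0/1 dummy, the least-squares intercept is the mean of the group coded 0 and the slope is the difference between the two group means. The sketch below checks this on a small set of hypothetical numbers (not taken from BIGTEX7); the helper `simple_ols` is an illustrative assumption, not an Excel or KPK routine.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: a 0/1 dummy and a salary for seven employees.
gender = [0, 0, 0, 1, 1, 1, 1]
salary = [1300, 1400, 1500, 2200, 2300, 2400, 2500]

b0, b1 = simple_ols(gender, salary)

mean_group0 = sum(s for g, s in zip(gender, salary) if g == 0) / gender.count(0)
mean_group1 = sum(s for g, s in zip(gender, salary) if g == 1) / gender.count(1)

print(round(b0, 2), round(mean_group0, 2))                # both 1400.0
print(round(b1, 2), round(mean_group1 - mean_group0, 2))  # both 950.0
```

Read the same way, the Intercept in a SALARY-on-GENDER regression is the mean salary of the group coded 0, and the GENDER coefficient is the raw gap between the two group means, before any other variable is taken into account.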

 

ANOVA
              df    SS          MS          F
Regression    1     10726240    10726240    18.15942
Residual      45    26580189    590670.9
Total         46    37306429

             Coefficients    Standard Error    t Stat      P-value
Intercept    1385.174        160.254           8.643616    4.07E-11
GENDER       955.6594        224.2602          4.261387    0.000102

But this is before accounting for the other (likely more significant) variables such as Education and Service life. As Education is correlated with Position (R = 0.8, even with ordinal data; verify using "Correlation" from Data Analysis in Excel; also note that a graduate degree holder does not perform manual labor, etc.), we need to use only one of EDUCAT and POSITION. So, for the second step, we code Positions 1-4 as 4 dummy variables (Position 5 as base) and, along with YEARS and GENDER, run the second regression. This time we note that only the Position variables are significant.
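The position coding described above follows the same pattern as the Excel IF formulae used earlier for BUILDING. A minimal Python sketch, assuming POSITION holds the integer codes 1 to 5 and using the hypothetical helper name `position_dummies`:

```python
def position_dummies(position):
    """Map a POSITION code (1-5) to the four 0/1 indicators pos1..pos4.

    Position 5 is the base category, so it maps to all zeros.
    """
    if position not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected POSITION code: {position!r}")
    return {f"pos{k}": int(position == k) for k in (1, 2, 3, 4)}


print(position_dummies(2))  # {'pos1': 0, 'pos2': 1, 'pos3': 0, 'pos4': 0}
print(position_dummies(5))  # {'pos1': 0, 'pos2': 0, 'pos3': 0, 'pos4': 0}
```

With Position 5 as the base, each pos coefficient in the outputs is read as the difference in mean salary between that position and Position 5, holding the other predictors fixed. Using five dummies together with an intercept would make the columns perfectly collinear, which is why one category is left out.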

 

ANOVA
              df    SS          MS          F
Regression    6     27459020    4576503     18.58968
Residual      40    9847408     246185.2
Total         46    37306429

             Coefficients    Standard Error    t Stat      P-value
Intercept    2955.18         533.3785          5.540493    2.09E-06
GENDER       376.3449        324.6119          1.159369    0.25318
YEARS        36.65237        40.13876          0.913141    0.366639
pos1         -1481.11        493.3294          -3.00228    0.004602
pos2         -1806.03        394.5313          -4.57765    4.5E-05
pos3         -1867.45        500.9966          -3.72747    0.000598
pos4         -1039.47        366.1611          -2.83884    0.007081

So, for the third model we use only the 4 significant Position variables.

 

ANOVA
              df    SS          MS         F
Regression    4     27035243    6758811    27.63752
Residual      42    10271185    244552
Total         46    37306429

             Coefficients    Standard Error    t Stat      P-value
Intercept    3570            221.157           16.14238    1.29E-19
pos1         -1920           361.1478          -5.31638    3.79E-06
pos2         -1984.29        289.5624          -6.8527     2.36E-08
pos3         -2425.35        251.5866          -9.64023    3.31E-12
pos4         -1258           255.3701          -4.92618    1.35E-05

As we look for the significance of Gender, we run a fourth model with the Position variables and GENDER.

 

ANOVA
              df    SS          MS          F
Regression    5     27253744    5450749     22.23095
Residual      41    10052684    245187.4
Total         46    37306429

             Coefficients    Standard Error    t Stat      P-value
Intercept    3335.053        333.136           10.01109    1.42E-12
GENDER       293.6842        311.1021          0.944012    0.350693
pos1         -1685.05        438.986           -3.83851    0.00042
pos2         -2043.02        296.5394          -6.88955    2.36E-08
pos3         -2190.41        354.122           -6.18546    2.36E-07
pos4         -1277.58        256.5414          -4.98001    1.2E-05

With these four MLRs, we should be able to answer the "Gender Discrimination" question posed in the problem. The Module's competency assessment quiz will use this problem and the full regression outputs. Please note that this Module utilizes/derives from all previous work, especially the nested models and the ability to test for the significance of a few selected predictor variables, be such variables quantitative or qualitative.

Please take the module assessment quiz (TQ010):

The quiz covers the module material, and should be taken in the date/time window posted in the syllabus. Please be warned that this quiz may NOT be identical to PQ010 and the earlier ones, but it will be at the same level of understanding. Please take special note that the "Incremental models" approach is used in testing for GENDER and YEARS in the presence of the position variables, together and individually. This file may be printed using this link.
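The incremental (nested) models comparison mentioned above can be sketched from the residual sums of squares in the four BIGTEX7 outputs. This is an illustration under the assumption that the partial outputs reproduced in this worksheet are used as given; the function name `incremental_f` is hypothetical.

```python
def incremental_f(sse_reduced, sse_full, df_dropped, df_resid_full):
    """Partial F for nested models: extra SS explained per dropped term, over the full model's MSE."""
    mse_full = sse_full / df_resid_full
    return ((sse_reduced - sse_full) / df_dropped) / mse_full


# Residual SS taken from the BIGTEX7 outputs in this worksheet.
SSE_POS_ONLY = 10271185         # position dummies only, 42 residual df
SSE_POS_GENDER = 10052684       # positions + GENDER, 41 residual df
SSE_POS_GENDER_YEARS = 9847408  # positions + GENDER + YEARS, 40 residual df

# GENDER and YEARS together, given the position variables.
f_both = incremental_f(SSE_POS_ONLY, SSE_POS_GENDER_YEARS, 2, 40)
# GENDER alone, given the position variables.
f_gender = incremental_f(SSE_POS_ONLY, SSE_POS_GENDER, 1, 41)

print(round(f_both, 2))           # 0.86
print(round(f_gender, 2))         # 0.89
print(round(f_gender ** 0.5, 3))  # 0.944, the t Stat reported for GENDER
```

Both values fall well below the usual 5% critical values from the F table (roughly 3.2 for 2 and 40 df, and roughly 4.1 for 1 and 41 df), in line with the P-value of about 0.35 reported for GENDER in the fourth model. Testing YEARS alone in the same way would need a model with the position variables and YEARS, which is not among the outputs reproduced here.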

 

