Question Details

[answered] A Data-Driven Approach to Predict the Success of Bank Telemarketing

From the following list of articles, write a Summary and Critique of the selected article (maximum two A4 pages in 12-point Times New Roman font).

A critique is not (only) a criticism. A critique is a specific style of essay in which you identify, evaluate, and respond to an author's ideas, both positively and negatively. To learn more about "How to write a critique", explore this link:

http://www.uis.edu/ctl/wp-content/uploads/sites/76/2013/03/Howtocritiqueajournalarticle.pdf

A Data-Driven Approach to Predict the Success of Bank Telemarketing

Sérgio Moro a,*, Paulo Cortez b, Paulo Rita a

a ISCTE - University Institute of Lisbon, 1649-026 Lisboa, Portugal
b ALGORITMI Research Centre, Univ. of Minho, 4800-058 Guimarães, Portugal

Abstract

 

We propose a data mining (DM) approach to predict the success of telemarketing calls for selling bank long-term deposits. A Portuguese retail bank was addressed, with data collected from 2008 to 2013, thus including the effects of the recent financial crisis. We analyzed a large set of 150 features related with bank client, product and social-economic attributes. A semi-automatic feature selection was explored in the modeling phase, performed with the data prior to July 2012, which allowed the selection of a reduced set of 22 features. We also compared four DM models: logistic regression, decision trees (DT), neural network (NN) and support vector machine. Using two metrics, area of the receiver operating characteristic curve (AUC) and area of the LIFT cumulative curve (ALIFT), the four models were tested in an evaluation phase, using the most recent data (after July 2012) and a rolling windows scheme. The NN presented the best results (AUC = 0.8 and ALIFT = 0.7), allowing 79% of the subscribers to be reached by selecting the better-classified half of the clients. Also, two knowledge extraction methods, a sensitivity analysis and a DT, were applied to the NN model and revealed several key attributes (e.g., Euribor rate, direction of the call and bank agent experience). Such knowledge extraction confirmed the obtained model as credible and valuable for telemarketing campaign managers.

Preprint submitted to Elsevier, 19 February 2014

Key words: Bank deposits, Telemarketing, Savings, Classification, Neural Networks, Variable Selection

1 Introduction

Marketing selling campaigns constitute a typical strategy to enhance business. Companies use direct marketing when targeting segments of customers by contacting them to meet a specific goal. Centralizing customer remote interactions in a contact center eases the operational management of campaigns. Such centers allow communicating with customers through various channels, telephone (fixed-line or mobile) being one of the most widely used. Marketing operationalized through a contact center is called telemarketing due to the remoteness characteristic [16]. Contacts can be divided into inbound and outbound, depending on which side triggered the contact (client or contact center), with each case posing different challenges (e.g., outbound calls are often considered more intrusive). Technology enables rethinking marketing by focusing on maximizing customer lifetime value through the evaluation of available information and customer metrics, thus allowing longer and tighter relations to be built in alignment with business demand [28]. Also, it should be stressed that the task of selecting the best set of clients, i.e., those that are more likely to subscribe to a product, is considered NP-hard in [31].

 

Decision support systems (DSS) use information technology to support managerial decision making. There are several DSS sub-fields, such as personal and intelligent DSS. Personal DSS are small-scale systems that support a decision task of one manager, while intelligent DSS use artificial intelligence techniques to support decisions [1]. Another related DSS concept is Business Intelligence (BI), an umbrella term that includes information technologies, such as data warehouses and data mining (DM), to support decision making using business data [32]. DM can play a key role in personal and intelligent DSS, allowing the semi-automatic extraction of explanatory and predictive knowledge from raw data [34]. In particular, classification is the most common DM task [10], in which the goal is to build a data-driven model that learns an unknown underlying function mapping several input variables, which characterize an item (e.g., bank client), to one labeled output target (e.g., type of bank deposit sell: "failure" or "success").

There are several classification models, such as the classical Logistic Regression (LR), decision trees (DT) and the more recent neural networks (NN) and support vector machines (SVM) [13]. LR and DT have the advantage of fitting models that tend to be easily understood by humans, while also providing good predictions in classification tasks. NN and SVM are more flexible (i.e., no a priori restriction is imposed) when compared with classical statistical modeling (e.g., LR) or even DT, presenting learning capabilities that range from linear to complex nonlinear mappings. Due to such flexibility, NN and SVM tend to provide accurate predictions, but the obtained models are difficult for humans to understand. However, these "black box" models can be opened by using a sensitivity analysis, which measures the importance and effect of a particular input on the model output response [7]. When comparing DT, NN and SVM, several studies have shown different classification performances. For instance, SVM provided better results in [6][8], comparable NN and SVM performances were obtained in [5], while DT outperformed NN and SVM in [24]. These differences in performance emphasize the impact of the problem context and provide a strong reason to test several techniques when addressing a problem before choosing one of them [9].

* Corresponding author. E-mail address: scmoro@gmail.com (S. Moro).
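The recommendation to test several techniques before choosing one can be operationalized by scoring every candidate model's validation output with a common metric. A minimal sketch in plain Python of the rank-based AUC (equivalent to the Mann-Whitney U formulation used later for evaluation), with hypothetical labels and model scores standing in for real validation results:

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    example is scored above a randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels and the scores of two candidate models.
y = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3]   # ranks positives highly
model_b = [0.5, 0.6, 0.4, 0.5, 0.5, 0.6, 0.5, 0.4]   # near-random ranking
best = max({"A": model_a, "B": model_b}.items(),
           key=lambda kv: auc(y, kv[1]))[0]
```

An AUC of 1.0 corresponds to perfect discrimination and 0.5 to a random classifier, matching the interpretation given in Section 2.3.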

 

DSS and BI have been applied to banking in numerous domains, such as credit pricing [25]. However, research is rather scarce in the specific area of banking client targeting. For instance, [17] described the potential usefulness of DM techniques in marketing within the Hong Kong banking sector, but no actual data-driven model was tested. The research of [19] identified clients for targeting at a major bank using pseudo-social networks based on relations (money transfers between stakeholders). Their approach offers an interesting alternative to the traditional usage of business characteristics for modeling.

 

In previous work [23], we explored data-driven models for modeling bank telemarketing success. Yet, we only achieved good models when using attributes that are only known at call execution, such as call duration. Thus, while providing interesting information for campaign managers, such models cannot be used for prediction. In what is more closely related with our approach, [15] analyzed how a mass media (e.g., radio and television) marketing campaign could affect the buying of a new bank product. The data was collected from an Iranian bank, with a total of 22427 customers over a six-month period, from January to July of 2006, when the mass media campaign was conducted. It was assumed that all customers who bought the product (7%) were influenced by the marketing campaign. Historical data allowed the extraction of a total of 85 input attributes related with recency, frequency and monetary features and the age of the client. A binary classification task was modeled using a SVM algorithm fed with 26 attributes (after a feature selection step), using 2/3 randomly selected customers for training and 1/3 for testing. The classification accuracy achieved was 81% and, through a Lift analysis [3], such a model could select 79% of the positive responders with just 40% of the customers. While these results are interesting, a robust validation was not conducted: only one holdout run (train/test split) was considered. Also, such a random split does not reflect the temporal dimension that a real prediction system would have to follow, i.e., using past patterns to fit the model in order to issue predictions for future client contacts.
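The criticism of the random split can be made concrete: an out-of-time split keeps every training record strictly before every test record, which is the situation a deployed system faces, while a random split mixes future contacts into the training set. A small sketch with hypothetical timestamped records (month indices and simulated 12% success outcomes are invented for illustration):

```python
import random

random.seed(0)
# Hypothetical contact records: (month index, binary outcome), time-ordered.
records = [(m, random.random() < 0.12) for m in range(1, 61)]

# Out-of-time split: fit on the past, predict the future (no leakage).
cutoff = 48  # e.g. first four years for training, last year for testing
train = [r for r in records if r[0] <= cutoff]
test = [r for r in records if r[0] > cutoff]

# Random split (as in the cited study): future contacts can land in the
# training set, something a real prediction system never observes.
shuffled = records[:]
random.shuffle(shuffled)
rand_train, rand_test = shuffled[:40], shuffled[40:]
```

Only the out-of-time split guarantees that the latest training month precedes the earliest test month.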

 

In this paper, we propose a personal and intelligent DSS that can automatically predict the result of a phone call to sell long-term deposits by using a DM approach. Such a DSS is valuable to assist managers in prioritizing and selecting the next customers to be contacted during bank marketing campaigns, for instance by using a Lift analysis that estimates the probability of success and leaves to managers only the decision on how many customers to contact. As a consequence, the time and costs of such campaigns would be reduced. Also, by performing fewer and more effective phone calls, client stress and intrusiveness would be diminished. The main contributions of this work are:

 

• We focus on feature engineering, which is a key aspect in DM [10], and propose generic social and economic indicators in addition to the more commonly used bank client and product attributes, in a total of 150 analyzed features. In the modeling phase, a semi-automated process (based on business knowledge and a forward method) allowed the original set to be reduced to the 22 relevant features that are used by the DM models.

• We analyze a recent and large dataset (52944 records) from a Portuguese bank. The data were collected from 2008 to 2013, thus including the effects of the global financial crisis that peaked in 2008.

• We compare four DM models (LR, DT, NN and SVM) using a realistic rolling windows evaluation and two classification metrics. We also show how the best model (NN) could benefit the bank telemarketing business.

The paper is organized as follows: Section 2 presents the bank data and DM approach; Section 3 describes the experiments conducted and analyzes the obtained results; finally, conclusions are drawn in Section 4.

2 Materials and Methods

2.1 Bank telemarketing data

This research focuses on targeting through telemarketing phone calls to sell long-term deposits. Within a campaign, the human agents execute phone calls to a list of clients to sell the deposit (outbound) or, if meanwhile the client calls the contact center for any other reason, he is asked to subscribe the deposit (inbound). Thus, the result is a binary unsuccessful or successful contact.

 

This study considers real data collected from a Portuguese retail bank, from May 2008 to June 2013, for a total of 52944 phone contacts. The dataset is unbalanced, as only 6557 (12.38%) records are related with successes. For evaluation purposes, a time-ordered split was initially performed, dividing the records into training (four years) and test data (one year). The training data is used for feature and model selection and includes all contacts executed up to June 2012, in a total of 51651 examples. The test data is used for measuring the prediction capabilities of the selected data-driven model, including the most recent 1293 contacts, from July 2012 to June 2013.

Each record included the output target, the contact outcome ({"failure", "success"}), and candidate input features. These include telemarketing attributes (e.g., call direction), product details (e.g., interest rate offered) and client information (e.g., age). These records were enriched with social and economic influence features (e.g., unemployment variation rate), by gathering external data from the central bank of the Portuguese Republic statistical web site¹. The merging of the two data sources led to a large set of potentially useful features, with a total of 150 attributes, which are scrutinized in Section 2.4.

2.2 Data mining models

In this work, we test four binary classification DM models, as implemented in the rminer package of the R tool [5]: logistic regression (LR), decision trees (DT), neural network (NN) and support vector machine (SVM).

 

The LR is a popular choice (e.g., in credit scoring) that operates a smooth nonlinear logistic transformation over a multiple regression model and allows the estimation of class probabilities [33]: p(c|x_k) = 1/(1 + exp(w_0 + Σ_{i=1}^{M} w_i x_{k,i})), where p(c|x_k) denotes the probability of class c given the k-th input example x_k = (x_{k,1}, ..., x_{k,M}) with M features, and w_i denotes a weight factor, adjusted by the learning algorithm. Due to the additive linear combination of its independent variables (x), the model is easy to interpret. Yet, the model is quite rigid and cannot adequately model complex nonlinear relationships.

The DT is a branching structure that represents a set of rules, distinguishing values in a hierarchical form [2]. This representation can be translated into a set of IF-THEN rules, which are easy for humans to understand.
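The LR class-probability formula above can be written down directly. A minimal sketch with made-up weights and a made-up three-feature client record (the sign convention follows the formula as printed, with the weighted sum inside the exponential; any sign flip is absorbed into the learned weights):

```python
import math

def logistic_prob(x, w0, w):
    """p(c|x_k) = 1 / (1 + exp(w_0 + sum_i w_i * x_{k,i})), as in the text."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

# Hypothetical 3-feature record and weights (for illustration only).
p = logistic_prob([0.5, 1.0, -0.2], w0=0.1, w=[-1.2, 0.4, 0.3])
```

Because the output passes through the logistic function, it is always a valid probability in (0, 1), which is what makes LR directly usable for class-probability estimation.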

 

¹ http://www.bportugal.pt/EstatisticasWeb/Default.aspx?Lang=en-GB

The multilayer perceptron is the most popular NN architecture [14]. We adopt a multilayer perceptron with one hidden layer of H hidden nodes and one output node. The H hyperparameter sets the model learning complexity: a NN with H = 0 is equivalent to the LR model, while a high H value allows the NN to learn complex nonlinear relationships. For a given input x_k, the state of the i-th neuron (s_i) is computed by: s_i = f(w_{i,0} + Σ_{j∈P_i} w_{i,j} · s_j), where P_i represents the set of nodes reaching node i; f is the logistic function; w_{i,j} denotes the weight of the connection between nodes j and i; and s_1 = x_{k,1}, ..., s_M = x_{k,M}. Given that the logistic function is used, the output node automatically produces a probability estimate (∈ [0, 1]). The NN final solution is dependent on the choice of starting weights. As suggested in [13], to solve this issue, the rminer package uses an ensemble of Nr distinct trained networks and outputs the average of the individual predictions [13].
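The ensemble trick can be sketched in a few lines: build Nr one-hidden-layer networks with different random starting weights and average their probability outputs. The weights below are random and untrained, and all sizes are invented for illustration; this only shows the forward pass and the averaging, not the BFGS training rminer performs:

```python
import math, random

def mlp_forward(x, weights_in, weights_out, bias_h, bias_o):
    """One-hidden-layer MLP with logistic activations, as in the text."""
    logistic = lambda z: 1.0 / (1.0 + math.exp(-z))
    hidden = [logistic(b + sum(w * xi for w, xi in zip(ws, x)))
              for ws, b in zip(weights_in, bias_h)]
    return logistic(bias_o + sum(w * h for w, h in zip(weights_out, hidden)))

def ensemble_predict(x, nets):
    """Average the Nr individual probability estimates."""
    return sum(mlp_forward(x, *net) for net in nets) / len(nets)

random.seed(1)
M, H, Nr = 4, 2, 7          # inputs, hidden nodes, ensemble size (Nr = 7)
rand = lambda: random.uniform(-1, 1)
nets = [([[rand() for _ in range(M)] for _ in range(H)],  # input -> hidden
         [rand() for _ in range(H)],                      # hidden -> output
         [rand() for _ in range(H)], rand())              # biases
        for _ in range(Nr)]
p = ensemble_predict([0.2, -0.5, 1.0, 0.3], nets)
```

Averaging over differently initialized networks reduces the dependence of the final prediction on any single set of starting weights.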

 

The SVM classifier [4] transforms the input space x ∈ ℝ^M into a high m-dimensional feature space by using a nonlinear mapping that depends on a kernel. Then, the SVM finds the best linear separating hyperplane, related to a set of support vector points, in the feature space. The rminer package adopts the popular Gaussian kernel [13], which has fewer parameters than other kernels (e.g., polynomial): K(x, x′) = exp(−γ‖x − x′‖²), γ > 0. The probabilistic SVM output is given by [35]: f(x_i) = Σ_{j=1}^{m} y_j α_j K(x_j, x_i) + b and p(i) = 1/(1 + exp(A f(x_i) + B)), where m is the number of support vectors, y_j ∈ {−1, 1} is the output for a binary classification, b and α_j are coefficients of the model, and A and B are determined by solving a regularized maximum likelihood problem.
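The three formulas above (Gaussian kernel, decision function over support vectors, and the sigmoid that maps the decision value to a probability) compose directly. A sketch with invented support vectors and coefficients; in practice the α_j, b, A and B come from SMO training and the maximum-likelihood fit, not from hand-chosen values:

```python
import math

def gaussian_kernel(x, x2, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2), gamma > 0."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x2)))

def svm_decision(x, support, y, alpha, b, gamma):
    """f(x) = sum_j y_j * alpha_j * K(x_j, x) + b over the support vectors."""
    return b + sum(yj * aj * gaussian_kernel(xj, x, gamma)
                   for xj, yj, aj in zip(support, y, alpha))

def platt_prob(f, A, B):
    """p = 1 / (1 + exp(A*f + B)); A, B are fit by regularized max. likelihood."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# Hypothetical support vectors and coefficients (illustration only).
support = [[0.0, 0.0], [1.0, 1.0]]
f = svm_decision([0.5, 0.5], support, y=[1, -1], alpha=[0.8, 0.8],
                 b=0.1, gamma=0.5)
p = platt_prob(f, A=-2.0, B=0.0)
```

Note that K(x, x) = 1 for any x, and the Platt sigmoid turns the unbounded decision value f into a probability in (0, 1).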

 

Before fitting the NN and SVM models, the input data is first standardized to a zero mean and one standard deviation [13]. For DT, rminer adopts the default parameters of the rpart R package, which implements the popular CART algorithm [2]. For the LR and NN learning, rminer uses the efficient BFGS algorithm [22], from the family of quasi-Newton methods, while SVM is trained using sequential minimal optimization (SMO) [26]. The learning capabilities of NN and SVM are affected by the choice of their hyperparameters (H for NN; γ and C, a complexity penalty parameter, for SVM). For setting these values, rminer uses grid search and heuristics [5].
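The standardization step applied before NN and SVM fitting is a simple per-column transform. A minimal stdlib sketch (the sample column is invented; rminer performs the equivalent rescaling internally):

```python
import math

def standardize(column):
    """Rescale a feature column to zero mean and unit standard deviation,
    as applied to the NN and SVM inputs before fitting."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]

z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Standardizing matters here because both the NN weight updates and the Gaussian kernel distances are sensitive to the raw scale of each input.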

 

Complex DM models, such as NN and SVM, often achieve accurate predictive performances. Yet, their increased complexity makes the final data-driven model difficult for humans to understand. To open these black-box models, there are two interesting possibilities: rule extraction and sensitivity analysis. Rule extraction often involves the use of a white-box method (e.g., decision tree) to learn the black-box responses [29]. The sensitivity analysis procedure works by analyzing the responses of a model when a given input is varied through its domain [7]. By analyzing the sensitivity responses, it is possible to measure input relevance and the average impact of a particular input on the model. The former can be shown visually using an input importance bar plot and the latter by plotting the Variable Effect Characteristic (VEC) curve. Opening the black box allows explaining how the model makes its decisions and improves the acceptance of prediction models by domain experts, as shown in [20].

2.3 Evaluation

A class can be assigned from a probabilistic outcome by setting a threshold D, such that event c is true if p(c|x_k) > D. The receiver operating characteristic (ROC) curve shows the performance of a two-class classifier across the range of possible threshold (D) values, plotting one minus the specificity (x-axis) versus the sensitivity (y-axis) [11]. The overall accuracy is given by the area under the curve (AUC = ∫₀¹ ROC dD), measuring the degree of discrimination that can be obtained from a given model. AUC is a popular classification metric [21] that has the advantage of being independent of the class frequency or of specific false positive/negative costs. The ideal method should present an AUC of 1.0, while an AUC of 0.5 denotes a random classifier.

In the domain of marketing, the Lift analysis is popular for assessing the quality of targeting models [3]. Usually, the population is divided into deciles, in decreasing order of their predicted probability of success. A useful Lift cumulative curve is obtained by plotting the population samples (ordered by the deciles, x-axis) versus the cumulative percentage of real responses captured (y-axis). Similarly to the AUC metric, the ideal method should present an area under the LIFT (ALIFT) cumulative curve close to 1.0. A high ALIFT confirms that the predictive model concentrates responders in the top deciles, while an ALIFT of 0.5 corresponds to the performance of a random baseline.

Given that the training data includes a large number of contacts (51651), we adopt the popular and fast holdout method (with R distinct runs) for feature and model selection purposes. Under this holdout scheme, the training data is further divided into training and validation sets by using a random split with 2/3 and 1/3 of the contacts, respectively. The results are aggregated by the average of the R runs, and a Mann-Whitney non-parametric test is used to check statistical significance at the 95% confidence level.

In a real environment, the DSS should be regularly updated as new contact data becomes available. Moreover, client propensity to subscribe a bank product may evolve through time (e.g., with changes in the economic environment). Hence, to achieve a robust predictive evaluation we adopt the more realistic fixed-size (of length W) rolling windows evaluation scheme, which performs several model updates and discards the oldest data [18]. Under this scheme, a training window of W consecutive contacts is used to fit the model and then predictions are made for the next K contacts. Next, we update (i.e., slide) the training window by replacing the oldest K contacts with the K newest contacts (related with the previously predicted contacts, but now assuming that the outcome result is known), in order to perform K new predictions, and so on. For a test set of length L, the total number of model updates (i.e., trainings) is U = L/K. Figure 1 exemplifies the rolling windows evaluation procedure.
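The rolling windows scheme described above can be sketched as an index generator: each update trains on W consecutive contacts and tests on the next K, then slides by K, for U = L/K updates in total. The concrete sizes below are invented for illustration; the paper's actual W, K and L are set in Section 3:

```python
def rolling_windows(n_train_total, L, W, K):
    """Yield (train_range, test_range) index pairs for a fixed-size rolling
    window: fit on W consecutive contacts, predict the next K, then slide
    the window by K. Performs U = L // K model updates in total."""
    start = n_train_total - W   # first window ends where the test data begins
    for u in range(L // K):
        lo = start + u * K
        yield range(lo, lo + W), range(lo + W, lo + W + K)

# Hypothetical sizes: 1000 time-ordered training contacts, a test set of
# L = 300 contacts, window W = 800, prediction step K = 100.
splits = list(rolling_windows(1000, L=300, W=800, K=100))
```

Note how each successive window drops the oldest K contacts and absorbs the K contacts just predicted (whose outcomes are now known), exactly as in the textual description.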

 

Fig. 1. Schematic of the adopted rolling windows evaluation procedure.

2.4 Feature selection

The large number (150) of potentially useful features demanded a stricter choice of relevant attributes. Feature selection is often a key DM step, since it is useful to discard irrelevant inputs, leading to simpler data-driven models that are easier to interpret and that tend to provide better predictive performances [12]. In [34], it is argued that while automatic methods can be useful, the best way is to perform a manual feature selection by using problem domain knowledge, i.e., by having a clear understanding of what the attributes actually mean. In this work, we use a semi-automatic approach for feature selection based on the two steps described below.

 

In the first step, intuitive business knowledge was used to define a set of fourteen questions, which represent certain hypotheses to be tested. Each question (or factor of analysis) is defined in terms of a group of related attributes selected from the original set of 150 features by a bank campaign manager (domain expert). For instance, the question about gender influence (male/female) includes three features, related with the gender of the banking agent, the gender of the client, and the client-agent difference (0 if same sex; 1 otherwise). Table 1 exhibits the analyzed factors and the number of attributes related with each factor, covering a total of 69 features (46% of the original set).

 

In the second step, an automated selection approach is adopted, based on an adapted forward selection method [12]. Given that standard forward selection depends on the sequence of features used, and that the features related with a factor of analysis are highly related, we first apply a simple wrapper selection method that works with a DM model fed with combinations of inputs taken from a single factor. The goal is to identify the most interesting factors and the features attached to such factors. Using only training set data, several DM models are fit, using: each individual feature related to a particular question (i.e., one input) to predict the contact result; and all features related with the same question (e.g., 3 inputs for question #2 about gender influence). Let AUC_q and AUC_{q,i} denote the AUC values, as measured on the validation set, for the model fed with all inputs related with question q and with only the i-th individual feature of question q, respectively. We assume that the business hypothesis is confirmed if at least one of the individually tested attributes achieves an AUC_{q,i} greater than a threshold T1, and if the model with all question-related features returns an AUC_q greater than another threshold T2. When a hypothesis is confirmed, only the m-th feature is selected if AUC_{q,m} > AUC_q or AUC_q − AUC_{q,m} < T3, where AUC_{q,m} = max(AUC_{q,i}). Otherwise, we rank the input relevance of the model with all question-related features in order to select the most relevant ones, such that the sum of input importances is higher than a threshold T4.

Table 1
Analyzed business questions for a successful contact result

Question (factor of analysis) | Number of features
1: Is offered rate relevant? | 5
2: Is gender relevant? | 3
3: Is agent experience relevant? | 3
4: Are social status and stability relevant? | 5
5: Is client-bank relationship relevant? | 11
6: Are bank blocks (triggered to prevent certain operations) relevant? | 6
7: Is phone call context relevant? | 4
8: Are date and time conditions relevant? | 3
9: Are bank profiling indicators relevant? | 7
10: Are social and economic indicators relevant? | 11
11: Are financial assets relevant? | 3
12: Is residence district relevant? | 1
13: Can age be related to products with longer term periods? | 3
14: Are web page hits (for campaigns displayed in bank web sites) relevant? | 4

Number of features after business knowledge selection: 69
Number of features after first feature selection phase: 22
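The first-phase confirmation rule above is a small decision procedure over validation AUC values. A sketch in plain Python with hypothetical AUCs (the threshold defaults mirror the values reported in Section 3; the "rank" fallback stands in for the input-importance ranking, which is not reimplemented here):

```python
def confirm_and_select(auc_q, auc_qi, T1=0.60, T2=0.65, T3=0.01):
    """First-phase rule from the text: a business hypothesis is confirmed
    when some single-feature model beats T1 and the all-features model
    beats T2. If the best single feature is as good as (or nearly as good
    as) the full model, keep only that feature; otherwise fall back to
    ranking inputs by importance."""
    auc_qm = max(auc_qi)
    if auc_qm <= T1 or auc_q <= T2:
        return None                      # hypothesis not confirmed
    if auc_qm > auc_q or auc_q - auc_qm < T3:
        return [auc_qi.index(auc_qm)]    # keep only the m-th feature
    return "rank"                        # rank inputs until their summed importance exceeds T4

# Hypothetical validation AUCs for a three-feature question.
outcome = confirm_and_select(auc_q=0.70, auc_qi=[0.62, 0.58, 0.55])
```

Here the full-question model clearly beats the best single feature (0.70 vs 0.62), so the procedure falls back to importance ranking rather than keeping one feature.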

 

Once a set of confirmed hypotheses and relevant features is achieved, a forward selection method is applied, working on a factor-by-factor basis. A DM model is fed with training set data, using as inputs all relevant features of the first confirmed factor, and the AUC is then computed over the validation set. Then, another DM model is trained with all previous inputs plus the relevant features of the next confirmed factor. If there is an increase in the AUC, the current factor's features are included in the next-step DM model, else they are discarded. This procedure ends when all confirmed factors have been tested for whether they improve the predictive performance in terms of the AUC value.

3 Experiments and Results

3.1 Modeling

All experiments were performed using the rminer package and R tool [5] and conducted on a Linux server with an Intel Xeon 5500 2.27GHz processor. Each DM model related with this section was executed using a total of R = 20 runs. For the feature selection, we adopted the NN model described in Section 2.2 as the base DM model, since preliminary experiments, using only training data, confirmed that NN provided the best AUC and ALIFT results when compared with the other DM methods. Also, these preliminary experiments confirmed that SVM required much more computation than NN, an expected result, since the memory and processing requirements of the SMO algorithm grow much more heavily with the size of the dataset than those of the BFGS algorithm used by the NN. At this stage, we set the number of hidden nodes using the heuristic H = round(M/2) (where M is the number of inputs), which is also adopted by the WEKA tool [34] and tends to provide good classification results [5]. The NN ensemble is composed of Nr = 7 distinct networks, each trained with 100 epochs of the BFGS algorithm.

 

Before executing the feature selection, we fixed the initial phase thresholds to reasonable values: T1 = 0.60 and T2 = 0.65, two AUC values better than the random baseline of 0.5 and such that T2 > T1; T3 = 0.01, the minimum difference of AUC values; and T4 = 60...

 


Solution details:
STATUS: Answered
QUALITY: Approved
DATE ANSWERED: Sep 18, 2020