Problem
I want to analyze my data and determine why customers are leaving. Do you have techniques for customer churn analysis?
Solution
This article explains how to analyze the data using Python and perform customer churn analysis to determine why customers stop using a service.
Requirements
For this example, we will use the telcos.csv from GitHub. Download the sample.

- Python.
- Visual Studio Code or any code editor of your preference.
The Churning Data
You can view your CSV data with any tool of your preference. In this example, we will use the csv-viewer from GitHub:

The data shows the following columns:
- CustomerID
- Gender
- SeniorCitizen
- Partner
- Dependents
- Tenure
- PhoneService
- MultipleLines
- InternetService
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- TechSupport
- StreamingTV
- StreamingMovies
- Contract
- PaperlessBilling
- PaymentMethod
- MonthlyCharges
- TotalCharges
- Churn
Adding Libraries
We must install some libraries in the path where Python is installed.
If you’re using Visual Studio Code, you can move the mouse over the Python kernel to find the path where it is installed when you run the code in the interactive Window.

If you do not know where Python is installed, you can run the following commands:
import sys
print(sys.executable)In the Python’s installation path, run this command to add the needed libraries:
pip install pandas matplotlib seabornThese libraries will be necessary to get charts, extract data, and analyze the information.
Removing Garbage
The first step will be to clean null values. The following code will read the CSV file and detect null values:
import pandas as p
import matplotlib.pyplot as pl
import seaborn as sns
csv_path='c:\\data\\telco.csv'
# read the telco.csv file
dataframe = p.read_csv(csv_path)
# Verify if there are null values
print(dataframe.isnull().sum())The first part of the code is importing the libraries:
import pandas as p
import matplotlib.pyplot as pl
import seaborn as snsNext, we will read the CSV file:
csv_path='c:\\data\\telco.csv'
# read the telco.csv file
dataframe = p.read_csv(csv_path)Finally, we will check if there are null values:
print(dataframe.isnull().sum())If we run the code, it will show that there are no null values:

In the next step, we will remove the non-numeric values.
# Remove the non numeric values
dataframe['TotalCharges'] = p.to_numeric(dataframe['TotalCharges'], errors='coerce')
dataframe = dataframe.dropna(subset=['TotalCharges'])The next line will show the chart with customers who stay and churning ones.

As you can see, there are approximately 5000 users and around 2000 churned users. We will create a similar code to graph other columns.
The following code block contains the complete code. It will show the charts by gender, contract type, internet service, churn by tenure, and monthly charges.
# Show the chart of churn by gender
pl.figure(figsize=(10, 6))
# Count values
sns.countplot(x='Churn', hue='gender', data=dataframe, palette='Blues')
# Add a title
pl.title('Churn distribution by gender')
# Show chart
pl.show()
# Show the chart of churn by contract type
pl.figure(figsize=(10, 6))
# Count values
sns.countplot(x='Churn', hue='Contract', data=dataframe, palette='Greens')
# Add a title
pl.title('Churn distribution by Contract')
# Show chart
pl.show()
# Show the chart of churn by type of internet service
pl.figure(figsize=(10, 6))
# Count values
sns.countplot(x='Churn', hue='InternetService', data=dataframe, palette='Oranges')
# Add a title
pl.title('Churn distribution by Internet services')
# Show chart
pl.show()
# Show the chart of churn by tenure
pl.figure(figsize=(10, 6))
# Count values
sns.histplot(data=dataframe, x='tenure', hue='Churn', multiple='stack', bins=30, palette='Purples')
# Add a title
pl.title('Churn distribution by tenure')
# Show chart
pl.show()
# Show the chart of churn by monthly charges
pl.figure(figsize=(10, 6))
# Count values
sns.histplot(data=dataframe, x='MonthlyCharges', hue='Churn', multiple='stack', bins=30, palette='Reds')
# Add a title
pl.title('Churn distribution by Monthly Charges')
# Show chart
pl.show()Churn Distribution by Gender
The first chart compares the churned and stayed customers. As you can see, there is no relationship between the gender and the churned customers. The churning % is not related to gender.

Churn Distribution by Contract
When we analyze the chart by contract, we notice that the churned users have a month-to-month contract. These are the customers that we need to analyze. These can be temporary customers who travel a lot. We may need to analyze why they buy a month-to-month contract.

Churn Distribution by Internet Service
This data is also important because most of the churned customers use fiber optic.

Churn Distribution by Tenure
The people with lower tenure values have a higher churn number.

Churn Distribution by Monthly Charges
The customers with monthly charges equal to 80 have a slightly higher number of churned customers. However, there is a slight difference across all of the values.

Logistic Regression Analysis
Now, we will analyze which data affects the churning. To do that, we will apply logistic Regression, a method used to predict outcomes. With this method, we will predict which columns affect the churning.
For example, does gender affect the churning? What about the tech support?
To analyze the data, we will import additional libraries first:
# Import the libraries
# This library is used to train the model
from sklearn.model_selection import train_test_split
#Use the logistic regression linear model
from sklearn.linear_model import LogisticRegression
# These libraries are used to check the accuracy of the model and classify
# the columns that affect the churning.
from sklearn.metrics import classification_report, accuracy_score
# This library is used to read data
import pandas as pThe sklearn libraries are used to import logistic regression libraries to train the model and classify the data.
In the next section, we will read the CSV files and remove the values that are not numeric and cannot be used for analysis.
We will also convert the churn data into binary data (0 and 1) to predict and detect the columns that affect it.
csv_path='c:\\data\\telco.csv'
# read the telco.csv file
dataframe = p.read_csv(csv_path)
# Remove the non numeric values
dataframe['TotalCharges'] = p.to_numeric(dataframe['TotalCharges'], errors='coerce')
dataframe = dataframe.dropna(subset=['TotalCharges'])
# Convert the column churn to binary numbers and remove the column
#CustomerID which is not required for the analysis
dataframe['Churn'] = dataframe['Churn'].apply(lambda x: 1 if x == 'Churned' else 0)
dataframe = dataframe.drop(['customerID'], axis=1)The next line converts the categorical values into numeric values for the analysis and removes objective column churn from the other variables:
# convert categorical values into numerical values
dataframe_encoded = p.get_dummies(dataframe, drop_first=True)
# separate the columns to drop and remove the columns
X = dataframe_encoded.drop(['Unnamed: 0', 'Churn'], axis=1)
y = dataframe_encoded['Churn']Also, we will test and train the model:
# Divide the data for testing and train
xtr, xts, ytr, yts = train_test_split(X, y, test_size=0.3, random_state=42)
# train the logistic regression model with 6000 iterations.
model = LogisticRegression(max_iter=6000)
model.fit(xtr, ytr)
# Create prediction with the test data
y_pre = model.predict(xts)
# Evaluate the model
accuracyval = accuracy_score(yts, y_pre)
classif_rep = classification_report(yts, y_pre)In logistic regression, 30% of the data is used for testing and the rest for training (test size=0.3). Random_state equal to 42 is used to create a seed for our test. 6000 is the number of iterations used. In machine learning, we train the data to learn and once the model is trained, we use the data to test, which means to predict the data.
The accuracyval will show a % of the accuracy of the model. The classification report on the other hand generates a report of the evaluation metrics.
print(classif_rep)
#Get the coefficients of the model
coefficients = p.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_[0]
})
# provides the absolute value of the coefficient
coefficients['Abs_Coefficient'] = coefficients['Coefficient'].abs()The previous code shows the names of the variables used in the model and the coefficient contains the values of the coefficients. The coefficients measure the influence of each column or variable in the churning result.
In addition, we got the absolute value of the coefficient.
# Order the columns according to the coefficient number. Arrange the data
coefficients = coefficients.sort_values(by='Abs_Coefficient', ascending=False)
# Show to most important columns related to churn
print('Top columns related to churn:')
#print the coefficients. The coefficient provides the incidence to churn
print(coefficients[['Feature', 'Coefficient']])Finally, we sort the data and show it. If we run the code, we will get the following results:

The report provides a precision of 84% for churning (1) and retention (0). Recall means that the percentage of the churning is identified correctly. F1-Score is a balance between precision and exhaustivity. Support is the number of churning and retention customers.
The macro average is a simple unweighted mean and the weighted average, as the name says, is a weighted mean. The macro average is used if you want a general sense of the model regardless of the distribution.
Finally, we have the results. If the columns related to churn are negative, it means that it has a negative influence.
In this example, a contract of 2 years means that the customers with this type of contract remain using the TELCOS’ services. On the other hand, customers using fiber optic services have a higher likelihood of churning.
The contract of 1 year, online security DSL Internet service, and tech support also helps customer retention.
Next Steps
For more information about analysis services, refer to the following links:
- Introduction to the SQL Server Analysis Services Logistic Regression Data Mining Algorithm
- Introduction to SQL Server Machine Learning Services with Python
- Microsoft Naïve Bayes Data Mining Model in SQL Server Analysis Services

Daniel Calbimonte is a Microsoft Most Valuable Professional, Microsoft Certified Trainer and Microsoft Certified IT Professional for SQL Server. He is an accomplished SSIS author, teacher at IT Academies and has over 10 years of experience as a QE and developer for SQL Server related software. He has worked for the government, oil companies, web sites, magazines and universities around the world. Daniel also regularly speaks at SQL Servers conferences and blogs.
- MSSQLTips Awards: Author of the Year Contender – 2015-2018, 2022, 2023 | Champion (100+ tips) – 2018


