By: ram_admin on Feb 26, 2021
Olist is an e-commerce platform where offline or existing eCommerce sellers can list themselves to grow their sales.
The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing orders from multiple dimensions: from order status, price, payment, and freight performance to customer location, product attributes, and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/long coordinates. Data Source.
We can use the data for a various different type of analyses and a few are:
Before all the cool stuff, we will do the exploratory data analysis. To make sense of the data, the entity-relationship diagram is provided. This will help us see the ways to link different tables.
Read Data
Segmentation - Purchase Behaviors
Data Explorations
Spend Segments
Connect to Google Drive where the data is available.
from google.colab import drive
drive.mount('/content/drive')
Connect with Zip file and print the list of files
import pandas as pd
import zipfile
# Create reference to zipped file/folder
zf = zipfile.ZipFile('/content/drive/MyDrive/Training/Retail/data/Brazilian E-Commerce Public Dataset by Olist.zip')
# Get the name of files
zf.namelist()
olist_customers_dataset = pd.read_csv(zf.open('olist_customers_dataset.csv'))
Now all of the data files are read and pandas data frames are created. The next step will be to explore these datasets. But, if we do analysis across the data frame and features, it may take a lot of time and without a significant outcome. We will finalize the objective first.
We want to build spend segments using the order transactions.
Based on the value and volume of purchases, we want to segment the customers.
One of the simple segmentation structure can be to create categories such as below
We need to think of a definition of inactive customers. Maybe any of the customers who have not spent in the last 3 months.
Lost Customers: Not spend in the last 6 months
Order, Order items, and Customer DataFrames are important to explore.
olist_orders_dataset.head()
olist_order_items_dataset.head()
olist_customers_dataset.head()
We need to get purchase information at a customer level (customer unique id).
import pandas as pd
customer_orders = pd.merge(olist_orders_dataset,
olist_customers_dataset,
on='customer_id',
how='inner')
customer_orders.head()
Now, linking orders to the order value details. Aggregate - products, sellers, products, and value at an order level.
order_items = olist_order_items_dataset.groupby('order_id').agg(
{
'product_id':'count', # count products
'seller_id': pd.Series.nunique, # count sellers
'price': 'sum', # sum of ordare values
'shipping_limit_date': 'count' # transaction counts
}
)
order_items=order_items.reset_index()
order_items.rename(columns = {'product_id':'product_counts',
'seller_id':'seller_counts',
'price': 'order_value',
'shipping_limit_date':'order_count'
}, inplace = True)
order_items.head()
Now, we can link the summary order information with the customer order data frame.
import pandas as pd
customer_orders = pd.merge(customer_orders,order_items, on='order_id',how='inner')
All the relevant information is available in this data frame. We can aggregate the data at a customer level (customer_unique_id) as we want to analyze customer spending behavior.
We may want to consider only the delivered orders.
Keep only delivered orders.
customer_orders =customer_orders[customer_orders.order_status=='delivered']
customer_orders.order_status.value_counts()
customer_orders['order_date']= pd.to_datetime(customer_orders.order_approved_at,format="%Y-%m-%d %H:%M:%S")
customer_orders['order_date'].head()
Aggregate the data at a customer level.
import numpy as np
customer_shopping = customer_orders.groupby('customer_unique_id').agg(
{
'seller_counts': 'sum', # sellers
'order_value': 'sum', # sum of ordare values
'order_count': 'sum' , # Order counts
'product_counts': 'sum', # product counts,
'order_date' : [np.min, np.max]
}
)
customer_shopping=customer_shopping.reset_index()
customer_shopping.head()
Update column names.
customer_shopping.columns = ['customer_unique_id','seller_counts', 'order_value', 'order_count','product_counts','first_order_date','last_order_date']
Distribution of order counts across customers.
import matplotlib.pyplot as plt
# Word Counts
customer_shopping.order_count.hist(bins = 10)
plt.xlabel('Order Counts')
plt.ylabel('# of Customers')
plt.title('Order Count Distributions')
plt.show()
import matplotlib.pyplot as plt
# Word Counts
customer_shopping.order_value.hist(bins = 5)
plt.xlabel('Order Value')
plt.ylabel('# of Customers')
plt.title('Order Value Distributions')
plt.show()
customer_shopping.last_order_date.min()
customer_shopping['last_order_monthyear'] = pd.to_numeric(customer_shopping['last_order_date'].dt.year*100+customer_shopping['last_order_date'].dt.month, downcast='integer')
orders_by_period = customer_shopping['last_order_monthyear']
order_months = customer_shopping.groupby('last_order_monthyear').agg(
{
'customer_unique_id':'count', # customers
'order_value': 'sum', # Value
'order_count': 'sum' # Orders
}
)
order_months=order_months.reset_index()
order_months
import matplotlib.pyplot as plt
customer_shopping.groupby('last_order_monthyear').size().plot(
kind = 'bar',
figsize=(12,12)
)
plt.title('Month & Year',fontsize=25)
Now create features – recency, frequency, and value for defining the segments.
customer_shopping['recency'] = np.where(customer_shopping['last_order_monthyear'] >= 201805, 'Active',
np.where(customer_shopping['last_order_monthyear'] >= 201801, 'Inactive', 'Lapsed'))
customer_shopping['recency'].value_counts()
customer_shopping.order_value.quantile(0.75)
154.7375
customer_shopping['value'] = np.where(customer_shopping['order_value'] >= 150, 'High','Low')
customer_shopping.order_count.quantile(0.75)
customer_shopping['frequency'] = np.where(customer_shopping['order_count'] >= 2, 'High','Low')
Creating active and inactive segment dimensions.
# Active and Inactive Customers
customer_shopping_v1 = customer_shopping[(customer_shopping.recency.isin(['Active', 'Inactive']))]
Defining conditions for defining the groups.
# Define a function
def definesegment(row):
if (row.recency == 'Active'):
if (row.value=='High') & (row.frequency=='High'):
return 'High Value and Frequent Spenders'
elif (row.value=='High') & (row.frequency=='Low'):
return 'High Value but Infrequent Spenders'
elif (row.value=='Low') & (row.frequency=='High'):
return 'Low Value but Frequent Spenders'
else:
return 'Low Value & Infrequent Spenders'
elif (row.recency == 'Inactive'):
if (row.value=='High'):
return 'High Value Spenders'
else:
return 'Low Value Spenders'
else:
return 'Other'
Creating the segments based on the above segment function.
# Create Segments
customer_shopping_v1['segment'] = customer_shopping_v1.apply(definesegment, axis=1)
customer_shopping_v1['segment'].value_counts()
Now, segments are created. We can see the visual of these segments in terms of counts of customers in each of these segments.
import seaborn as sns # Statistical visualisations
# Plotting counts
fig, ax = plt.subplots(1,2, figsize=(16,6))
ax[0].set_title("Active Customers")
ax[1].set_title("Inactive Customers")
sns.countplot(data = customer_shopping_v1[customer_shopping_v1.recency == "Active"],
x="segment",
order = ['Low Value & Infrequent Spenders', 'Low Value but Frequent Spenders',
'High Value but Infrequent Spenders', 'High Value and Frequent Spenders'],
ax=ax[0])
sns.countplot(data = customer_shopping_v1[customer_shopping_v1.recency == "Inactive"],
x="segment",
order = ['Low Value Spenders', 'High Value Spenders'],
ax=ax[1])
plt.setp(ax[0].xaxis.get_majorticklabels(), rotation=90)
plt.setp(ax[:], xlabel='')
plt.setp(ax[:], ylabel='Number of Customers')
plt.show()