Ramgopal Prajapat:

Learnings and Views

Retail Analytics - eCommerce Data EDA and Simple Segmentation

By: ram_admin on Feb 26, 2021

Olist is an e-commerce platform where offline or existing eCommerce sellers can list themselves to grow their sales.

The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing orders from multiple dimensions: from order status, price, payment, and freight performance to customer location, product attributes, and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/long coordinates. Data Source.

We can use the data for a various different type of analyses and a few are:

  • Transactional Segmentations to understanding the various shopping behavior of the customers or group of customers
  • Customer Value Estimation to measure customer value based on the shopping patterns
  • Weekly/Daily Sales Forecast for operational planning
  • Customer Sentiment Analysis based on reviews

Before all the cool stuff, we will do the exploratory data analysis. To make sense of the data, the entity-relationship diagram is provided. This will help us see the ways to link different tables.

Read Data

Segmentation - Purchase Behaviors

Data Explorations

  • Customer Summary
  • Order Counts
  • Order Value
  • Period of Orders

Spend Segments

  • Define Segments
  • Segment Plots



Read Data

Connect to Google Drive where the data is available.

from google.colab import drive


Connect with Zip file and print the list of files

import pandas as pd

import zipfile

# Create reference to zipped file/folder

zf = zipfile.ZipFile('/content/drive/MyDrive/Training/Retail/data/Brazilian E-Commerce Public Dataset by Olist.zip')

# Get the name of files


Read data files

olist_customers_dataset = pd.read_csv(zf.open('olist_customers_dataset.csv'))


Now all of the data files are read and pandas data frames are created. The next step will be to explore these datasets. But, if we do analysis across the data frame and features, it may take a lot of time and without a significant outcome. We will finalize the objective first.

We want to build spend segments using the order transactions.

Segmentation - Purchase Behaviour

Based on the value and volume of purchases, we want to segment the customers.

One of the simple segmentation structure can be to create categories such as below

Active Customers:

  • High Value and Frequent Spenders
  • High Value but Infrequent Spenders
  • Low Value but Frequent Spenders
  • Low Value & Infrequent Spenders

Inactive Customers: High Spenders and Low Spenders

We need to think of a definition of inactive customers. Maybe any of the customers who have not spent in the last 3 months.

Lost Customers: Not spend in the last 6 months

Data Explorations

Order, Order items, and Customer DataFrames are important to explore.







We need to get purchase information at a customer level (customer unique id).

  • Count number of transaction
  • Total purchase value
  • Last transaction date (to define if a customer is active, inactive, or lost)

Combine Order and Customer Data Frames

import pandas as pd

customer_orders = pd.merge(olist_orders_dataset,







Now, linking orders to the order value details. Aggregate - products, sellers, products, and value at an order level.

order_items = olist_order_items_dataset.groupby('order_id').agg(


         'product_id':'count',    # count products

          'seller_id': pd.Series.nunique,  # count sellers

          'price''sum',        # sum of ordare values

         'shipping_limit_date''count'  # transaction counts




order_items.rename(columns = {'product_id':'product_counts',




                              }, inplace = True)


Now, we can link the summary order information with the customer order data frame.

import pandas as pd

customer_orders = pd.merge(customer_orders,order_items, on='order_id',how='inner')

All the relevant information is available in this data frame. We can aggregate the data at a customer level (customer_unique_id) as we want to analyze customer spending behavior.

We may want to consider only the delivered orders.

Keep only delivered orders.

customer_orders =customer_orders[customer_orders.order_status=='delivered'


Create order date

customer_orders['order_date']= pd.to_datetime(customer_orders.order_approved_at,format="%Y-%m-%d %H:%M:%S")



Customer Summary

Aggregate the data at a customer level.

import numpy as np 

customer_shopping = customer_orders.groupby('customer_unique_id').agg(


         'seller_counts''sum',  # sellers

         'order_value''sum',        # sum of ordare values

         'order_count''sum' , # Order counts

         'product_counts''sum'# product counts,

         'order_date' : [np.min, np.max]






Update column names.

customer_shopping.columns = ['customer_unique_id','seller_counts''order_value''order_count','product_counts','first_order_date','last_order_date']

Order Counts

Distribution of order counts across customers.

import matplotlib.pyplot as plt

# Word Counts

customer_shopping.order_count.hist(bins = 10)

plt.xlabel('Order Counts')

plt.ylabel('# of Customers')

plt.title('Order Count Distributions')


Order Value

import matplotlib.pyplot as plt

# Word Counts

customer_shopping.order_value.hist(bins = 5)

plt.xlabel('Order Value')

plt.ylabel('# of Customers')

plt.title('Order Value Distributions')


Period of Orders



customer_shopping['last_order_monthyear'] = pd.to_numeric(customer_shopping['last_order_date'].dt.year*100+customer_shopping['last_order_date'].dt.month, downcast='integer')


orders_by_period = customer_shopping['last_order_monthyear']


order_months = customer_shopping.groupby('last_order_monthyear').agg(


         'customer_unique_id':'count',    # customers

          'order_value''sum',  # Value

         'order_count''sum'  # Orders







import matplotlib.pyplot as plt 


    kind = 'bar',



plt.title('Month & Year',fontsize=25)


Spend Segments

Now create features – recency, frequency, and value for defining the segments.


customer_shopping['recency'] = np.where(customer_shopping['last_order_monthyear'] >= 201805'Active'

                                         np.where(customer_shopping['last_order_monthyear'] >= 201801'Inactive''Lapsed')) 






customer_shopping['value'] = np.where(customer_shopping['order_value'] >= 150'High','Low')




customer_shopping['frequency'] = np.where(customer_shopping['order_count'] >= 2'High','Low')

Define Segments

Creating active and inactive segment dimensions.

# Active and Inactive Customers


customer_shopping_v1 = customer_shopping[(customer_shopping.recency.isin(['Active''Inactive']))]

Defining conditions for defining the groups.

# Define a function

def definesegment(row):  

    if (row.recency == 'Active'):

        if (row.value=='High') & (row.frequency=='High'):

          return 'High Value and Frequent Spenders'

        elif (row.value=='High') & (row.frequency=='Low'):

          return 'High Value but Infrequent Spenders'

        elif (row.value=='Low') & (row.frequency=='High'):

          return 'Low Value but Frequent Spenders'


          return 'Low Value & Infrequent Spenders'

    elif (row.recency == 'Inactive'):

       if (row.value=='High'):

          return 'High Value Spenders'


          return 'Low Value Spenders'


      return 'Other'  


Creating the segments based on the above segment function.

# Create Segments

customer_shopping_v1['segment'] = customer_shopping_v1.apply(definesegment, axis=1)


Segment Plots

Now, segments are created. We can see the visual of these segments in terms of counts of customers in each of these segments.

import seaborn as sns # Statistical visualisations

# Plotting counts

fig, ax = plt.subplots(1,2, figsize=(16,6))

ax[0].set_title("Active Customers")

ax[1].set_title("Inactive Customers")

sns.countplot(data = customer_shopping_v1[customer_shopping_v1.recency == "Active"],


            order = ['Low Value & Infrequent Spenders''Low Value but Frequent Spenders'

                     'High Value but Infrequent Spenders''High Value and Frequent Spenders'],


sns.countplot(data = customer_shopping_v1[customer_shopping_v1.recency == "Inactive"],


            order = ['Low Value Spenders''High Value Spenders'],


plt.setp(ax[0].xaxis.get_majorticklabels(), rotation=90)

plt.setp(ax[:], xlabel='')

plt.setp(ax[:], ylabel='Number of Customers')









Leave a comment