[240401] Python User Cohort Analysis (Monthly/Weekly)

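0. Introduction

- Using data from 2016.09 through 2018.09, I wanted to run user cohort analyses by month and by week

- Months are labeled y-m and weeks y-w; the goal is to follow the cohort flow across the whole period rather than splitting it into separate sub-periods

- The main cohort metric is the number of unique purchasing users, but I also built a revenue column as an additional check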

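1. Data preprocessing

1) Week-number notation: dt.strftime('%V') vs. dt.isocalendar().week vs. dt.strftime('%Y-%w')

- I wanted to label each year's weeks with a week number (1 to 52) and look at churn on a weekly basis

- (1st attempt) dt.strftime('%Y-%w'): the values started at 0, because %w is the day of the week (0 = Sunday), not the week of the year. The result is also a str, so purchase week - first purchase week cannot be computed directly

- (2nd attempt) dt.isocalendar().week: single-digit weeks come out as 1 rather than 01, which caused a sorting issue in the visualization, e.g. 1, 11, 12, ... 2, 21, 22, ...

- (3rd attempt) dt.strftime('%V'): returns the zero-padded ISO week number (01 to 53). I extracted the year and the purchase week separately and combined them in a separate step (see the code further down)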
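To make the difference between the three notations concrete, here is a minimal sketch using the standard library's datetime, which uses the same format directives as pandas' dt.strftime. The sample dates are made up for illustration and are not taken from the dataset.

```python
from datetime import date

# Three sample dates: a Sunday at a year boundary, the Monday after, and a mid-year date
samples = [date(2017, 1, 1), date(2017, 1, 2), date(2018, 3, 15)]

weekday = [d.strftime('%w') for d in samples]       # day of week, 0 = Sunday
iso_week_int = [d.isocalendar()[1] for d in samples]  # ISO week as an integer
iso_week_str = [d.strftime('%V') for d in samples]  # ISO week as a zero-padded string
iso_year = [d.isocalendar()[0] for d in samples]    # ISO year that the week belongs to

print(weekday)
print(iso_week_int)
print(iso_week_str)
print(iso_year)

# Unpadded labels sort lexicographically; zero-padded labels sort chronologically
unpadded = sorted(str(w) for w in [1, 2, 11])
padded = sorted(f'{w:02d}' for w in [1, 2, 11])
print(unpadded, padded)
```

Note that 2017-01-01 falls in ISO week 52 of ISO year 2016, so an ISO week number is safest when paired with the ISO year rather than the calendar year.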

2) Getting the minimum per group: min() vs. transform()
- min(): returns one minimum per group, so the resulting Series has one row per group and can be shorter than the original DataFrame
└ (ex) df_cohort.groupby('customer_unique_id')['order_purchase_timestamp'].min()
- transform(): broadcasts each group's minimum back to every row, so the resulting Series has the same length as the original DataFrame!
└ (ex) df_cohort.groupby('customer_unique_id')['order_purchase_timestamp'].transform('min')
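The length difference between the two calls is easy to see on a toy table. This is a small sketch with hypothetical data; only the column names follow the post.

```python
import pandas as pd

# Toy orders table (hypothetical data): customer 'a' has two orders, 'b' has one
orders = pd.DataFrame({
    'customer_unique_id': ['a', 'a', 'b'],
    'order_purchase_timestamp': pd.to_datetime(['2017-03-10', '2017-01-05', '2017-02-20']),
})

grouped = orders.groupby('customer_unique_id')['order_purchase_timestamp']

first_per_customer = grouped.min()        # one row per customer, indexed by customer id
first_per_row = grouped.transform('min')  # one row per order, aligned with `orders`

print(len(first_per_customer), len(first_per_row))

# Because the index matches the original rows, the result can be assigned as a new column
orders['first_order'] = first_per_row
print(orders)
```

This alignment is why transform('min') is the convenient choice when the first purchase date has to sit next to every order row.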

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Build the DataFrame used for the cohort analysis
# (df_merged is assumed to exist already, with order_purchase_timestamp as a datetime column)
df_cohort = df_merged[['customer_unique_id', 'revenue', 'order_purchase_timestamp']].copy()

# Purchase date: year-month / year / week columns
# %V is the ISO week number, so it is paired with the ISO year instead of dt.year
# (e.g. 2017-01-01 belongs to ISO week 52 of 2016)
purchase_ts = df_cohort['order_purchase_timestamp']
df_cohort['purchase_order_m'] = purchase_ts.dt.strftime('%Y-%m')
df_cohort['purchase_order_y'] = purchase_ts.dt.isocalendar().year.astype(int)
df_cohort['purchase_order_w'] = purchase_ts.dt.strftime('%V')

# Each user's first purchase date: year-month / year / week columns
first_ts = df_cohort.groupby('customer_unique_id')['order_purchase_timestamp'].transform('min')
df_cohort['first_order_m'] = first_ts.dt.strftime('%Y-%m')
df_cohort['first_order_y'] = first_ts.dt.isocalendar().year.astype(int)
df_cohort['first_order_w'] = first_ts.dt.strftime('%V')
df_cohort

2. Monthly cohort analysis

- For the monthly view, the gap between months is computed with to_period('M')
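As a quick illustration of the interval calculation, subtracting two monthly Period objects gives a month offset whose size can be read from .n. The two months below are arbitrary examples.

```python
import pandas as pd

first_month = pd.Period('2017-11', freq='M')
order_month = pd.Period('2018-02', freq='M')

# Subtracting two monthly Periods yields a month offset; .n is its integer size
month_diff = (order_month - first_month).n
print(month_diff)  # 3
```

The subtraction crosses the year boundary correctly, which is what makes Period handier than subtracting month numbers by hand.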

# Build the monthly analysis tables
# Unique purchasing users per (first purchase month, purchase month)
co_user_cnt = df_cohort.groupby(['first_order_m', 'purchase_order_m'])['customer_unique_id'].nunique().reset_index()
co_user_cnt = co_user_cnt.rename({'customer_unique_id': 'Total_users'}, axis= 1)
co_user_cnt

# Revenue per (first purchase month, purchase month)
co_revenue_sum = df_cohort.groupby(['first_order_m', 'purchase_order_m'])['revenue'].sum().reset_index()
co_revenue_sum = co_revenue_sum.rename({'revenue': 'amount_revenue'}, axis= 1)
co_revenue_sum

# Merge the monthly user counts and revenue
co_m = pd.merge(co_user_cnt, co_revenue_sum, on=['first_order_m', 'purchase_order_m'], how='inner')

# Add the monthly cohort period: months elapsed since the first purchase month
first_period = pd.to_datetime(co_m['first_order_m'], format='%Y-%m').dt.to_period('M')
order_period = pd.to_datetime(co_m['purchase_order_m'], format='%Y-%m').dt.to_period('M')
# Subtracting Period values gives month offsets; .n extracts the integer number of months
co_m['cohort_period'] = (order_period - first_period).apply(lambda offset: offset.n)
co_m

# Build the pivot table (rows: first purchase month, columns: months since first purchase)
cohort_pivot_m = co_m.pivot_table(index = 'first_order_m',
                                    columns = 'cohort_period',
                                    values = 'Total_users')

# Convert to ratios: divide each row by its period-0 cohort size
cohort_pivot_m_ratio = cohort_pivot_m.div(cohort_pivot_m[0], axis=0)

# Drop the first 3 cohort rows and the period-0 column (always 1.0)
cohort_pivot_m_ratio_slice = cohort_pivot_m_ratio.iloc[3:, 1:]

# Draw the heatmap
plt.rcParams['figure.figsize'] = (12, 8)
sns.heatmap(cohort_pivot_m_ratio_slice, annot = True, fmt = '.2f')
plt.yticks(rotation = 0)
plt.show()
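The row-wise normalization used for the ratio table can be checked on a tiny pivot. The numbers below are hypothetical and only meant to show how div with axis=0 behaves.

```python
import pandas as pd

# Hypothetical user counts: rows are cohorts, columns are months since first purchase
pivot = pd.DataFrame(
    {0: [100, 200], 1: [10, 30], 2: [5, None]},
    index=['2017-01', '2017-02'],
)

# axis=0 aligns the divisor (the period-0 column) with the row index,
# so every row is divided by its own cohort size
retention = pivot.div(pivot[0], axis=0)
print(retention)
```

Cells with no observations stay NaN after the division, so they show up as blanks in the heatmap instead of as zeros.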

# Cohort analysis by purchase amount
cohort_pivot_m_revenue = co_m.pivot_table(index = 'first_order_m',
                                    columns = 'cohort_period',
                                    values = 'amount_revenue')

# Normalize: divide each row's revenue by that cohort's period-0 user count,
# i.e. revenue per initial cohort user (not a share of period-0 revenue)
cohort_pivot_m_revenue_ratio = cohort_pivot_m_revenue.div(cohort_pivot_m[0], axis=0)

# Drop the first 3 cohort rows and the period-0 column
cohort_pivot_m_revenue_ratio_slice = cohort_pivot_m_revenue_ratio.iloc[3:, 1:]

# Draw the heatmap
plt.rcParams['figure.figsize'] = (12, 8)
sns.heatmap(cohort_pivot_m_revenue_ratio_slice, annot = True, fmt = '.2f')
plt.yticks(rotation = 0)
plt.show()

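3. Weekly cohort analysis

- The year is taken into account: if the first purchase is in week 1 of 2017 and the next purchase is in week 2 of 2018, the calculation is written so that the gap comes out as 53 weeks

- It was also important to shape the values so that they sort in order: 2017 week 1 to week 52, then 2018 week 1 onward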
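As a cross-check on the week-gap logic, here is an alternative sketch that works from the dates themselves instead of from (year, week) pairs. The helper names iso_week_start and weeks_between are hypothetical, not part of the post's code. It snaps each date to the Monday of its ISO week and divides the day difference by 7, so it does not need to assume 52 weeks per year.

```python
from datetime import date, timedelta

def iso_week_start(d: date) -> date:
    """Return the Monday of the ISO week that contains d."""
    return d - timedelta(days=d.weekday())

def weeks_between(first: date, later: date) -> int:
    """Whole weeks between the ISO weeks containing the two dates."""
    return (iso_week_start(later) - iso_week_start(first)).days // 7

# 2017 week 1 -> 2018 week 2: the 53-week gap described above
gap_2017_2018 = weeks_between(date(2017, 1, 2), date(2018, 1, 8))
print(gap_2017_2018)  # 53

# ISO year 2020 has 53 weeks: 2020 week 53 -> 2021 week 1 is 1 week apart,
# whereas a fixed 52-weeks-per-year formula (1 + 52*1 - 53) would give 0
gap_2020_2021 = weeks_between(date(2020, 12, 28), date(2021, 1, 4))
print(gap_2020_2021)  # 1
```

For the 2016 to 2018 range used here, every ISO year has 52 weeks, so a 52-weeks-per-year rule gives the same result; the date-based version only matters if the data later includes a 53-week year such as 2020.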

# Build the weekly analysis tables
# Unique purchasing users per (first purchase year/week, purchase year/week)
co_user_cnt_w = df_cohort.groupby(['first_order_y', 'first_order_w', 'purchase_order_y', 'purchase_order_w'])['customer_unique_id'].nunique().reset_index()
co_user_cnt_w = co_user_cnt_w.rename({'customer_unique_id': 'Total_users'}, axis= 1)
co_user_cnt_w

# Revenue per (first purchase year/week, purchase year/week)
co_revenue_sum_w = df_cohort.groupby(['first_order_y', 'first_order_w', 'purchase_order_y', 'purchase_order_w'])['revenue'].sum().reset_index()
co_revenue_sum_w = co_revenue_sum_w.rename({'revenue': 'amount_revenue'}, axis= 1)
co_revenue_sum_w

# Merge the weekly user counts and revenue
co_w = pd.merge(co_user_cnt_w, co_revenue_sum_w, on=['first_order_y', 'first_order_w', 'purchase_order_y', 'purchase_order_w'], how='inner')
co_w


# Add the weekly cohort period: weeks elapsed since the first purchase week
# Each year of difference counts as 52 weeks, which holds for 2016-2018
# (none of those ISO years has a 53rd week)
year_gap = co_w['purchase_order_y'] - co_w['first_order_y']
week_gap = co_w['purchase_order_w'].astype(int) - co_w['first_order_w'].astype(int)

# 'YYYY-WW' labels; the zero-padded week keeps them in chronological order when sorted as strings
co_w['first_order_yw'] = co_w['first_order_y'].astype(str) + '-' + co_w['first_order_w']
co_w['purchase_order_yw'] = co_w['purchase_order_y'].astype(str) + '-' + co_w['purchase_order_w']
co_w['cohort_period'] = year_gap * 52 + week_gap

co_w.sort_values(by=['first_order_y','first_order_w'] , ascending=True)

# Build the pivot table
cohort_pivot = co_w.pivot_table(index = 'first_order_yw',
                                    columns = 'cohort_period',
                                    values = 'Total_users')
cohort_pivot

# Draw the heatmap (dropping the first 4 cohort rows and the period-0 column)
cohort_pivot_slice = cohort_pivot.iloc[4:, 1:]
cohort_pivot_slice

plt.rcParams['figure.figsize'] = (12, 8)
sns.heatmap(cohort_pivot_slice) # annot = True, fmt = '.2f'
plt.yticks(rotation = 0)
plt.show()

# Look at a specific range with a slice
cohort_pivot_slice2 = cohort_pivot.iloc[56:,1:35]
cohort_pivot_slice2

plt.rcParams['figure.figsize'] = (12, 8)
sns.heatmap(cohort_pivot_slice2) # annot = True, fmt = '.2f'
plt.yticks(rotation = 0)
plt.show()

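4. Conclusions

- Return visits were most concentrated within one month of the first purchase

- The average customer purchase cycle came out to about 86 days, so a promotion that encourages a return visit could be planned for the point just past one month

- I plan to flesh out the interpretation of the cohort part after analyzing other status data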
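The post does not show how the roughly 86-day purchase cycle was derived. One possible way to compute such a figure, sketched here on hypothetical data and not necessarily the method used above, is to average the gaps between each customer's consecutive orders.

```python
import pandas as pd

# Hypothetical orders: customer 'a' buys three times, 'b' only once
orders = pd.DataFrame({
    'customer_unique_id': ['a', 'a', 'a', 'b'],
    'order_purchase_timestamp': pd.to_datetime(
        ['2017-01-01', '2017-03-02', '2017-06-30', '2017-02-10']),
})

orders = orders.sort_values(['customer_unique_id', 'order_purchase_timestamp'])

# Days since the same customer's previous order (NaN for each customer's first order)
gap_days = (orders.groupby('customer_unique_id')['order_purchase_timestamp']
                  .diff()
                  .dt.days)

# Mean over all repeat-purchase gaps; one-time buyers contribute nothing
avg_cycle_days = gap_days.mean()
print(avg_cycle_days)
```

With this definition, customers who bought only once are excluded, so the average describes repeat buyers only.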
