[Python] Data Aggregation

RainIron 2021. 5. 9. 14:49

1. Import

import pandas as pd
import numpy as np

2. Groupby

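* DataFrame.groupby(column to group by)

Just as a SQL GROUP BY is paired with aggregate functions such as AVG, SUM, and COUNT, a pandas group must be followed by the aggregate function you want to see.

=> groupby() itself returns a GroupBy object; applying an aggregate to a grouped column returns a Series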

emp = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                   'name': ['smith', 'kali', 'timo', 'echo', 'shco'],
                   'deptno': [10, 10, 20, 20, 50],
                   'salary': [1000, 2000, 4000, 5000, 10000]})
                   
deptno_salary = emp['salary'].groupby(emp['deptno'])
# <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000021B6E1D7B80>

# Total salary per department
deptno_salary_sum = deptno_salary.sum()
'''
deptno
10     3000
20     9000
50    10000
Name: salary, dtype: int64
'''

# Average salary per department (mean() returns floats)
deptno_salary_mean = deptno_salary.mean()
deptno_salary_mean
'''
deptno
10     1500.0
20     4500.0
50    10000.0
Name: salary, dtype: float64
'''
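
When several aggregates are needed for the same groups, calling them one at a time gets repetitive. The sketch below assumes the same emp table and uses agg() to compute them in one pass; the name dept_stats is only illustrative and not part of the original post.

```python
import pandas as pd

emp = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                    'name': ['smith', 'kali', 'timo', 'echo', 'shco'],
                    'deptno': [10, 10, 20, 20, 50],
                    'salary': [1000, 2000, 4000, 5000, 10000]})

# Group by the column name directly, then compute several aggregates at once.
# dept_stats is a hypothetical name; the result is a DataFrame with one
# column per aggregate and one row per deptno.
dept_stats = emp.groupby('deptno')['salary'].agg(['sum', 'mean', 'count'])
print(dept_stats)
```

Passing the column name to groupby() is equivalent to passing emp['deptno'] as in the code above, and is usually easier to read.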

* add_prefix(): prepends a string to each index label

# Use add_prefix() to relabel the index
deptno_salary_sum = deptno_salary.sum().add_prefix('sum_')
'''
deptno
sum_10     3000
sum_20     9000
sum_50    10000
Name: salary, dtype: int64
'''
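
Note that add_prefix() touches different labels depending on the object: index labels on a Series, column labels on a DataFrame. A minimal sketch of the difference, assuming a trimmed-down emp table (sum_series and sum_frame are illustrative names):

```python
import pandas as pd

emp = pd.DataFrame({'deptno': [10, 10, 20, 20, 50],
                    'salary': [1000, 2000, 4000, 5000, 10000]})

# Series: the prefix is applied to the index labels (the group keys)
sum_series = emp.groupby('deptno')['salary'].sum().add_prefix('sum_')

# DataFrame: the prefix is applied to the column labels instead
sum_frame = emp.groupby('deptno')[['salary']].sum().add_prefix('sum_')

print(list(sum_series.index))
print(list(sum_frame.columns))
```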

3. Merge

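* pandas.merge(left object, right object, left_on, right_on, left_index, right_index)

left_on and right_on name the columns to join on; left_index and right_index are booleans that tell merge to join on that side's index instead of a column.

Here the DataFrame side is joined on a column name and the Series side is joined on its index.

# Join (merge) the per-department average salary onto the existing emp table
# emp (DataFrame) -> left_on
# deptno_salary_mean (Series) -> right_index
# A DataFrame and a Series can be joined!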
join = pd.merge(emp, deptno_salary_mean, left_on = 'deptno', right_index = True)

# rename()
# axis = 1: the mapping refers to column names
# inplace = True: apply the change to join itself instead of returning a copy
join.rename({'salary_x': 'salary', 'salary_y': 'dept_sal_mean'}, axis = 1, inplace = True)
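
The salary_x / salary_y suffixes appear because both sides carry a column named salary. One way to avoid the cleanup step, sketched here under the assumption that the Series can be renamed before the merge (dept_sal_mean and joined are illustrative names):

```python
import pandas as pd

emp = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                    'name': ['smith', 'kali', 'timo', 'echo', 'shco'],
                    'deptno': [10, 10, 20, 20, 50],
                    'salary': [1000, 2000, 4000, 5000, 10000]})

# Name the Series up front so the merged column needs no suffix cleanup
dept_sal_mean = emp.groupby('deptno')['salary'].mean().rename('dept_sal_mean')

# Left side joins on the deptno column, right side on its index (the group keys)
joined = pd.merge(emp, dept_sal_mean, left_on='deptno', right_index=True)
print(joined)
```

Each employee row now carries the average salary of its own department, which makes comparisons such as salary - dept_sal_mean a single column operation.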

4. Date_Range

* pandas.date_range(start, end, periods (number of dates to generate), freq (spacing between dates)) => returns DatetimeIndex

# freq = 'M' means month end; pandas 2.2+ deprecates this alias in favor of 'ME'
months = pd.date_range('2020-01', periods = 12, freq = 'M')
'''
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'],
              dtype='datetime64[ns]', freq='M')
'''
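
date_range() can be driven either by an end date or by a count of periods. A small sketch of both forms, using the 'D' (daily) and 'MS' (month start) aliases; the variable names are illustrative:

```python
import pandas as pd

# start + end + freq: every calendar day in the closed range
days = pd.date_range(start='2020-01-01', end='2020-01-05', freq='D')

# start + periods + freq: a fixed number of dates, here month starts
month_starts = pd.date_range(start='2020-01-01', periods=3, freq='MS')

print(days)
print(month_starts)
```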

Example) Monthly production

sales = pd.DataFrame({'month': months,
              'pen': np.random.randint(0, 10, 12),
              'ball': np.random.randint(0, 10, 12),
              'car': np.random.randint(0, 10, 12)})

# Convert the month column from datetimes to 'YYYY-MM' strings
sales['month'] = sales['month'].dt.strftime('%Y-%m')
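
Putting the monthly example together end to end, a minimal sketch: it assumes a seeded NumPy generator so the random values are reproducible and uses month-start dates ('MS'); neither choice comes from the original post, and totals is an illustrative name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility

months = pd.date_range('2020-01-01', periods=12, freq='MS')
sales = pd.DataFrame({'month': months,
                      'pen': rng.integers(0, 10, 12),
                      'ball': rng.integers(0, 10, 12),
                      'car': rng.integers(0, 10, 12)})

# Format the datetime column as 'YYYY-MM' strings
sales['month'] = sales['month'].dt.strftime('%Y-%m')

# Total units per product over the year
totals = sales[['pen', 'ball', 'car']].sum()
print(sales.head())
print(totals)
```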
