2017-05-16 3 views
0

pandas DataFrame: add a column whose values depend on multiple conditions on other columns

I have a DataFrame called 'adjusted'. I want to add a column 'stage' whose values depend on conditions involving 'fyear' and 'conm'.

fyear conm    indadjsg 
1 1999 1-800-FLOWERS.COM 26.646086 
2 2000 1-800-FLOWERS.COM 22.727175 
3 2001 1-800-FLOWERS.COM 7.312014 
4 2002 1-800-FLOWERS.COM 4.948308 
5 2003 1-800-FLOWERS.COM 6.278798 
23 1996 ABERCROMBIE & FITCH -CL A 34.831691 
24 1997 ABERCROMBIE & FITCH -CL A 48.053137 
25 1998 ABERCROMBIE & FITCH -CL A 48.918326 
26 1999 ABERCROMBIE & FITCH -CL A 46.956456 
27 2000 ABERCROMBIE & FITCH -CL A 33.91436 
28 2001 ABERCROMBIE & FITCH -CL A 67.23423 
29 2002 ABERCROMBIE & FITCH -CL A 99.09342 
11929 2006 CLIFTON BANCORP INC 0.236418 
11930 2007 CLIFTON BANCORP INC -1.366626 
11931 2008 CLIFTON BANCORP INC 8.564019 
11932 2009 CLIFTON BANCORP INC -4.966110 
11933 2010 CLIFTON BANCORP INC -4.359552 
11934 2011 CLIFTON BANCORP INC -16.313852 
11935 2012 CLIFTON BANCORP INC -18.193550 
11936 2013 CLIFTON BANCORP INC -10.126603 
11937 2014 CLIFTON BANCORP INC 4.718584 
11938 2015 CLIFTON BANCORP INC -11.889065 
11940 2015 CLIPPER REALTY INC 70.945767 
11941 2016 CLIPPER REALTY INC 3.776001 
11980 2014 CM FINANCE INC 205.894048 
11981 2015 CM FINANCE INC 68.518555 
121247 2009 VCA INC -5.552030 
121248 2010 VCA INC -3.357275 
121249 2011 VCA INC -0.930798 
121250 2012 VCA INC 5.974914 
121256 2007 VIASPACE INC -50.966869 
121257 2008 VIASPACE INC 149.957403 
121258 2009 VIASPACE INC 197.776855 
121259 2010 VIASPACE INC -25.201733 
121260 2011 VIASPACE INC 77.082624 
121261 2012 VIASPACE INC 78.034233 
121266 2005 YASHENG GROUP -3.728098 
121267 2006 YASHENG GROUP -2.233927 
121268 2007 YASHENG GROUP 0.349349 
121279 2009 YUHE INTERNATIONAL INC 27.995324 
121280 2010 YUHE INTERNATIONAL INC 34.375630 

1) If the number of fyear entries for a company is less than or equal to 5, I want to fill stage with 'start':

byyr = adjusted.groupby(by=['conm'])['fyear'] 
dfbyyr =byyr.count().to_frame() 
start = dfbyyr[dfbyyr['fyear'] <= 5] 

           fyear 
    conm     
    1-800-FLOWERS.COM   5 
    ABERCROMBIE & FITCH -CL A 7 
    CLIFTON BANCORP INC  10 
    CLIPPER REALTY INC   2 
    CM FINANCE INC    2 
    VCA INC      4 
    VIASPACE INC    6 
    YASHENG GROUP    3 
    YUHE INTERNATIONAL INC  2 
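
As a rough sketch of how per-company counts like these could be turned into a row-level condition, the snippet below uses `isin` on the company names whose count is at most 5. The toy data and the names `counts`, `start_companies`, and `is_start` are assumptions made for illustration; they are not part of the original question.

```python
import pandas as pd

# Toy data (hypothetical, much smaller than the asker's DataFrame).
adjusted = pd.DataFrame({
    "conm": ["A", "A", "B", "B", "B", "B", "B", "B"],
    "fyear": [2001, 2002, 2001, 2002, 2003, 2004, 2005, 2006],
})

# Number of fyear rows per company, as in the question.
counts = adjusted.groupby("conm")["fyear"].count()

# Names of the companies with at most 5 years of data.
start_companies = counts[counts <= 5].index

# Row-level boolean mask: True where the row belongs to such a company.
is_start = adjusted["conm"].isin(start_companies)
print(is_start.tolist())
```

The mask has one entry per row of `adjusted`, so it could be passed to `adjusted.loc[is_start, 'stage']` to assign 'start'.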
2) For the rest of the data (companies not filled with 'start'), I want to fill in other values. I calculated the mean indadjsg for each company:

mask2 = adjusted.groupby(by=['conm'])['indadjsg'] 
countsg = mask2.mean().to_frame().reset_index() 
c = countsg.dropna() 

The DataFrame 'c':

conm    indadjsg 
0 1-800-FLOWERS.COM 3.291539 
1 ABERCROMBIE & FITCH -CL A 105.335324 
2 CLIFTON BANCORP INC 22.920683 
3 CLIPPER REALTY INC 36.784677 
4 CM FINANCE INC 1.605919 
5 VCA INC 3.116871 
6 VIASPACE INC -106.153789 
7 YASHENG GROUP -2.676296 
8 YUHE INTERNATIONAL INC 12.306557 

The conditions I want to apply are as follows:

 indadjsg < 0, 'decline' 
0 <= indadjsg <= 15, 'revival' 
15< indadjsg <= 100, 'mature' 
100 < indadjsg, 'growth'
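
One possible way to express these four conditions exactly as written (including the closed edges at 0 and 15) is `np.select`. This is only a sketch: the `means` Series holds made-up values chosen to land on each boundary, and the "unknown" default label is an assumption for values that match no condition, such as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical per-company means, chosen to hit every boundary.
means = pd.Series([-2.0, 0.0, 15.0, 15.1, 100.0, 100.5],
                  index=["a", "b", "c", "d", "e", "f"])

# The four conditions from the question, in the same order.
conditions = [
    means < 0,
    (means >= 0) & (means <= 15),
    (means > 15) & (means <= 100),
    means > 100,
]
choices = ["decline", "revival", "mature", "growth"]

# The first matching condition wins; rows matching none get the default.
stage = pd.Series(np.select(conditions, choices, default="unknown"),
                  index=means.index)
print(stage.tolist())
```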

The final DataFrame I want to create looks like this:

fyear conm    indadjsg stage 
1 1999 1-800-FLOWERS.COM 26.646086 start 
2 2000 1-800-FLOWERS.COM 22.727175 start 
3 2001 1-800-FLOWERS.COM 7.312014 start 
4 2002 1-800-FLOWERS.COM 4.948308 start 
5 2003 1-800-FLOWERS.COM 6.278798 start 
23 1996 ABERCROMBIE & FITCH -CL A 34.831691 growth 
24 1997 ABERCROMBIE & FITCH -CL A 48.053137 growth  
25 1998 ABERCROMBIE & FITCH -CL A 48.918326 growth  
26 1999 ABERCROMBIE & FITCH -CL A 46.956456 growth 
27 2000 ABERCROMBIE & FITCH -CL A 33.91436 growth 
28 2001 ABERCROMBIE & FITCH -CL A 67.23423 growth 
29 2002 ABERCROMBIE & FITCH -CL A 99.09342 growth 
11929 2006 CLIFTON BANCORP INC 0.236418  mature 
11930 2007 CLIFTON BANCORP INC -1.366626  mature 
11931 2008 CLIFTON BANCORP INC 8.564019  mature 
11932 2009 CLIFTON BANCORP INC -4.966110  mature 
11933 2010 CLIFTON BANCORP INC -4.359552  mature 
11934 2011 CLIFTON BANCORP INC -16.313852  mature 
11935 2012 CLIFTON BANCORP INC -18.193550  mature 
11936 2013 CLIFTON BANCORP INC -10.126603  mature 
11937 2014 CLIFTON BANCORP INC 4.718584  mature 
11938 2015 CLIFTON BANCORP INC -11.889065  mature 
11940 2015 CLIPPER REALTY INC 70.945767  start 
11941 2016 CLIPPER REALTY INC 3.776001  start 
11980 2014 CM FINANCE INC 205.894048 start 
11981 2015 CM FINANCE INC 68.518555  start 
121247 2009 VCA INC -5.552030    start 
121248 2010 VCA INC -3.357275    start 
121249 2011 VCA INC -0.930798    start 
121250 2012 VCA INC 5.974914    start 
121256 2007 VIASPACE INC -50.966869 decline 
121257 2008 VIASPACE INC 149.957403 decline 
121258 2009 VIASPACE INC 197.776855 decline 
121259 2010 VIASPACE INC -25.201733 decline 
121260 2011 VIASPACE INC 77.082624  decline 
121261 2012 VIASPACE INC 78.034233  decline 
121266 2005 YASHENG GROUP -3.728098  start 
121267 2006 YASHENG GROUP -2.233927  start 
121268 2007 YASHENG GROUP 0.349349   start 
121279 2009 YUHE INTERNATIONAL INC 27.995324 start 
121280 2010 YUHE INTERNATIONAL INC 34.375630 start 

Is there a way to do this all at once? I can think of creating separate columns and merging them. Could you help me think of an efficient approach? Thanks in advance.

Answers

1

There is a way to compute the stage column using a groupby/transform operation (see the classify function, listed at the end of this page), but it requires calling a custom Python function once for each group. With many groups, that tends to be inefficient.

In general, you get better performance by replacing many Python function calls with vectorized operations on a whole (large) DataFrame or on large columns of a DataFrame.

So if there are many groups (i.e. many distinct conm values), it is probably better to go with your first idea: compute the stage for each company, then merge the result back into adjusted. Here is one way to do it; the merge is done via the call to join:

import numpy as np
import pandas as pd
adjusted = pd.DataFrame({'conm': ['1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIPPER REALTY INC', 'CLIPPER REALTY INC', 'CM FINANCE INC', 'CM FINANCE INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'YASHENG GROUP', 'YASHENG GROUP', 'YASHENG GROUP', 'YUHE INTERNATIONAL INC', 'YUHE INTERNATIONAL INC'], 'fyear': [1999, 2000, 2001, 2002, 2003, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2015, 2016, 2014, 2015, 2009, 2010, 2011, 2012, 2007, 2008, 2009, 2010, 2011, 2012, 2005, 2006, 2007, 2009, 2010], 'indadjsg': [26.646085999999997, 22.727175, 7.312014, 4.948308, 6.278798, 34.831691, 48.053137, 48.918326, 46.956456, 33.914359999999995, 67.23423000000001, 99.09342, 0.236418, -1.3666260000000001, 8.564019, -4.96611, -4.359552, -16.313852, -18.19355, -10.126603, 4.718584, -11.889064999999999, 70.945767, 3.7760010000000004, 205.894048, 68.518555, -5.55203, -3.357275, -0.9307979999999999, 5.974914, -50.966869, 149.957403, 197.776855, -25.201732999999997, 77.082624, 78.034233, -3.728098, -2.233927, 0.34934899999999997, 27.995324, 34.37563]}, index=[1, 2, 3, 4, 5, 23, 24, 25, 26, 27, 28, 29, 11929, 11930, 11931, 11932, 11933, 11934, 11935, 11936, 11937, 11938, 11940, 11941, 11980, 11981, 121247, 121248, 121249, 121250, 121256, 121257, 121258, 121259, 121260, 121261, 121266, 121267, 121268, 121279, 121280])

grouped = adjusted.groupby(by=['conm'])
# bin each company's mean indadjsg; labels=False returns integer bin codes
stage = pd.cut(grouped['indadjsg'].mean(), bins=[-np.inf,0,15,100,np.inf], labels=False)
stage.name = 'stage'
labels = np.array(['decline', 'revival', 'mature', 'growth'])
# merge the per-company codes back onto the rows, then map codes to names
adjusted = adjusted.join(stage, on='conm')
adjusted['stage'] = labels[adjusted['stage']]
# companies with at most 5 rows are overwritten with 'start'
mask = (grouped['fyear'].transform('count') <= 5)
adjusted.loc[mask, 'stage'] = 'start'
print(adjusted)

yields

       conm fyear indadjsg stage 
1    1-800-FLOWERS.COM 1999 26.646086 start 
2    1-800-FLOWERS.COM 2000 22.727175 start 
3    1-800-FLOWERS.COM 2001 7.312014 start 
4    1-800-FLOWERS.COM 2002 4.948308 start 
5    1-800-FLOWERS.COM 2003 6.278798 start 
23  ABERCROMBIE & FITCH -CL A 1996 34.831691 mature 
24  ABERCROMBIE & FITCH -CL A 1997 48.053137 mature 
25  ABERCROMBIE & FITCH -CL A 1998 48.918326 mature 
26  ABERCROMBIE & FITCH -CL A 1999 46.956456 mature 
27  ABERCROMBIE & FITCH -CL A 2000 33.914360 mature 
28  ABERCROMBIE & FITCH -CL A 2001 67.234230 mature 
29  ABERCROMBIE & FITCH -CL A 2002 99.093420 mature 
11929   CLIFTON BANCORP INC 2006 0.236418 decline 
11930   CLIFTON BANCORP INC 2007 -1.366626 decline 
11931   CLIFTON BANCORP INC 2008 8.564019 decline 
11932   CLIFTON BANCORP INC 2009 -4.966110 decline 
11933   CLIFTON BANCORP INC 2010 -4.359552 decline 
11934   CLIFTON BANCORP INC 2011 -16.313852 decline 
11935   CLIFTON BANCORP INC 2012 -18.193550 decline 
11936   CLIFTON BANCORP INC 2013 -10.126603 decline 
11937   CLIFTON BANCORP INC 2014 4.718584 decline 
11938   CLIFTON BANCORP INC 2015 -11.889065 decline 
11940   CLIPPER REALTY INC 2015 70.945767 start 
11941   CLIPPER REALTY INC 2016 3.776001 start 
11980    CM FINANCE INC 2014 205.894048 start 
11981    CM FINANCE INC 2015 68.518555 start 
121247     VCA INC 2009 -5.552030 start 
121248     VCA INC 2010 -3.357275 start 
121249     VCA INC 2011 -0.930798 start 
121250     VCA INC 2012 5.974914 start 
121256    VIASPACE INC 2007 -50.966869 mature 
121257    VIASPACE INC 2008 149.957403 mature 
121258    VIASPACE INC 2009 197.776855 mature 
121259    VIASPACE INC 2010 -25.201733 mature 
121260    VIASPACE INC 2011 77.082624 mature 
121261    VIASPACE INC 2012 78.034233 mature 
121266    YASHENG GROUP 2005 -3.728098 start 
121267    YASHENG GROUP 2006 -2.233927 start 
121268    YASHENG GROUP 2007 0.349349 start 
121279  YUHE INTERNATIONAL INC 2009 27.995324 start 
121280  YUHE INTERNATIONAL INC 2010 34.375630 start 

Here is an alternative which is slower when there are many groups (but perhaps faster if there are few groups).

You could compute the stage column using a groupby/transform operation with the custom Python function classify. classify is called once for each group, i.e. once for each value of conm. How a mean that falls exactly on a bin edge is classified by pd.cut depends on its right parameter (right=True or right=False).
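
The effect of pd.cut's `right` parameter on values that fall exactly on a bin edge can be sketched as follows. The three test values are chosen for illustration only; the bins and labels are the ones used in this answer.

```python
import numpy as np
import pandas as pd

# Values sitting exactly on the interior bin edges.
values = pd.Series([0.0, 15.0, 100.0])
bins = [-np.inf, 0, 15, 100, np.inf]
labels = ["decline", "revival", "mature", "growth"]

# right=True (the default): each bin includes its right edge, e.g. (0, 15].
right_closed = pd.cut(values, bins=bins, labels=labels, right=True)

# right=False: each bin includes its left edge, e.g. [0, 15).
left_closed = pd.cut(values, bins=bins, labels=labels, right=False)

print(right_closed.astype(str).tolist())  # ['decline', 'revival', 'mature']
print(left_closed.astype(str).tolist())   # ['revival', 'mature', 'growth']
```

Neither setting reproduces the asker's rules at every edge: the question places both 0 and 15 in 'revival', so with right=True only a mean of exactly 0 would be labeled differently ('decline' instead of 'revival').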

3

I think you can do this with pd.cut and np.where, or with plain boolean masks as below. Starting with this DataFrame:

Adjusted: 
     fyear conm      indadjsg  
0  1999 1-800-FLOWERS.COM   26.646086    
1  2000 1-800-FLOWERS.COM   22.727175    
2  2001 1-800-FLOWERS.COM   7.312014    
3  2002 1-800-FLOWERS.COM   4.948308    
4  2003 1-800-FLOWERS.COM   6.278798    
5  1996 ABERCROMBIE & FITCH -CL A 34.831691    
6  1997 ABERCROMBIE & FITCH -CL A 48.053137    
... 
35  2012 VIASPACE INC    78.034233    
36  2005 YASHENG GROUP    -3.728098    
37  2006 YASHENG GROUP    -2.233927    
38  2007 YASHENG GROUP    0.349349    
39  2009 YUHE INTERNATIONAL INC  27.995324    
40  2010 YUHE INTERNATIONAL INC  34.375630    

This code isn't particularly clever, but it is quite straightforward:

# add an empty "stage" column
adjusted['stage'] = ''

# compute the per-company mean once, broadcast to every row
g = adjusted.groupby(by='conm')
mean_sg = g['indadjsg'].transform('mean')

# create boolean masks for each stage classification
decline = mean_sg < 0
revival = (mean_sg >= 0) & (mean_sg <= 15)
mature = (mean_sg > 15) & (mean_sg <= 100)
growth = mean_sg > 100
start = g['fyear'].transform('count') <= 5

adjusted.loc[decline, 'stage'] = 'decline'
adjusted.loc[revival, 'stage'] = 'revival'
adjusted.loc[mature, 'stage'] = 'mature'
adjusted.loc[growth, 'stage'] = 'growth'

# set 'start' classification last so it overwrites
# the classification set based on the mean indadjsg
adjusted.loc[start, 'stage'] = 'start'

The output is:

fyear conm      indadjsg stage 
0 1999 1-800-FLOWERS.COM   26.646086 start 
1 2000 1-800-FLOWERS.COM   22.727175 start 
2 2001 1-800-FLOWERS.COM   7.312014 start 
3 2002 1-800-FLOWERS.COM   4.948308 start 
4 2003 1-800-FLOWERS.COM   6.278798 start 
5 1996 ABERCROMBIE & FITCH -CL A 34.831691 mature 
6 1997 ABERCROMBIE & FITCH -CL A 48.053137 mature 
7 1998 ABERCROMBIE & FITCH -CL A 48.918326 mature 
8 1999 ABERCROMBIE & FITCH -CL A 46.956456 mature 
9 2000 ABERCROMBIE & FITCH -CL A 33.914360 mature 
10 2001 ABERCROMBIE & FITCH -CL A 67.234230 mature 
11 2002 ABERCROMBIE & FITCH -CL A 99.093420 mature 
12 2006 CLIFTON BANCORP INC   0.236418 decline 
13 2007 CLIFTON BANCORP INC   -1.366626 decline 
14 2008 CLIFTON BANCORP INC   8.564019 decline 
15 2009 CLIFTON BANCORP INC   -4.966110 decline 
16 2010 CLIFTON BANCORP INC   -4.359552 decline 
17 2011 CLIFTON BANCORP INC   -16.313852 decline 
18 2012 CLIFTON BANCORP INC   -18.193550 decline 
19 2013 CLIFTON BANCORP INC   -10.126603 decline 
20 2014 CLIFTON BANCORP INC   4.718584 decline 
21 2015 CLIFTON BANCORP INC   -11.889065 decline 
22 2015 CLIPPER REALTY INC   70.945767 start 
23 2016 CLIPPER REALTY INC   3.776001 start 
24 2014 CM FINANCE INC    205.894048 start 
25 2015 CM FINANCE INC    68.518555 start 
26 2009 VCA INC      -5.552030 start 
27 2010 VCA INC      -3.357275 start 
28 2011 VCA INC      -0.930798 start 
29 2012 VCA INC      5.974914 start 
30 2007 VIASPACE INC    -50.966869 mature 
31 2008 VIASPACE INC    149.957403 mature 
32 2009 VIASPACE INC    197.776855 mature 
33 2010 VIASPACE INC    -25.201733 mature 
34 2011 VIASPACE INC    77.082624 mature 
35 2012 VIASPACE INC    78.034233 mature 
36 2005 YASHENG GROUP    -3.728098 start 
37 2006 YASHENG GROUP    -2.233927 start 
38 2007 YASHENG GROUP    0.349349 start 
39 2009 YUHE INTERNATIONAL INC  27.995324 start 
40 2010 YUHE INTERNATIONAL INC  34.375630 start    
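
The masks in this answer depend on `transform` returning one value per row rather than one per group. A minimal sketch of that difference, using a small made-up DataFrame (`df`, `per_company`, and `per_row` are illustrative names, not from the answer):

```python
import pandas as pd

# Toy data (hypothetical).
df = pd.DataFrame({
    "conm": ["A", "A", "B", "B", "B"],
    "indadjsg": [10.0, 20.0, -3.0, 0.0, 3.0],
})
g = df.groupby("conm")["indadjsg"]

# Aggregation: one value per company, indexed by conm.
per_company = g.mean()

# Transform: the group mean repeated on every row, indexed like df.
per_row = g.transform("mean")

print(per_company.to_dict())
print(per_row.tolist())
```

Because `per_row` shares its index with `df`, a comparison such as `per_row < 0` can be used directly as a boolean mask in `df.loc[...]`.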
+0

Sorry for the duplicate answer, @unutbu. We answered at the same time, so our answers take slightly different routes to the same solution. –

+0

No need to apologize. Seeing alternative approaches can be very useful. – unutbu

0

To start with, pd.cut requires you to specify the edges of the bins:

adjusted # copied text from your example 
Out[86]: 
    fyear    conm indadjsg 
0 1999 1-800-FLOWERS.COM 26.64609 
1 2000 1-800-FLOWERS.COM 22.72717 
2 2001 1-800-FLOWERS.COM 7.31201 
3 2002 1-800-FLOWERS.COM 4.94831 
4 2003 1-800-FLOWERS.COM 6.27880 
5 1996  ABERCROMBIE 34.83169 
6 1997  ABERCROMBIE 48.05314 
7 1998  ABERCROMBIE 48.91833 
8 1999  ABERCROMBIE 46.95646 
9 2000  ABERCROMBIE 33.91436 
10 2001  ABERCROMBIE 67.23423 
11 2002  ABERCROMBIE 99.09342 
.. ...    ...  ... 
25 2015     CM 68.51856 
26 2009    VCA -5.55203 
27 2010    VCA -3.35728 
28 2011    VCA -0.93080 
29 2012    VCA 5.97491 
30 2007   VIASPACE -50.96687 
31 2008   VIASPACE 149.95740 
32 2009   VIASPACE 197.77686 
33 2010   VIASPACE -25.20173 
34 2011   VIASPACE 77.08262 
35 2012   VIASPACE 78.03423 
36 2005   YASHENG -3.72810 

byyr = adjusted.groupby(by='conm')['fyear'].count().to_frame() 
start = byyr.fyear[adjusted.conm] 

indadjsg = adjusted.groupby(by='conm')['indadjsg'].mean().to_frame() 
px = indadjsg.indadjsg[adjusted.conm] 
categories = pd.cut(px.values.reshape((len(px),)), 
        bins= [-np.inf, 0, 15, 100, np.inf], 
        labels=['decline', 'revival', 'mature', 'growth']) 

adjusted.loc[:, 'stage'] = np.where(start <= 5, 'start', categories) 

adjusted # result 
Out[130]: 
    fyear    conm indadjsg stage 
0 1999 1-800-FLOWERS.COM 26.64609 start 
1 2000 1-800-FLOWERS.COM 22.72717 start 
2 2001 1-800-FLOWERS.COM 7.31201 start 
3 2002 1-800-FLOWERS.COM 4.94831 start 
4 2003 1-800-FLOWERS.COM 6.27880 start 
5 1996  ABERCROMBIE 34.83169 mature 
6 1997  ABERCROMBIE 48.05314 mature 
7 1998  ABERCROMBIE 48.91833 mature 
8 1999  ABERCROMBIE 46.95646 mature 
9 2000  ABERCROMBIE 33.91436 mature 
10 2001  ABERCROMBIE 67.23423 mature 
11 2002  ABERCROMBIE 99.09342 mature 
.. ...    ...  ...  ... 
25 2015     CM 68.51856 start 
26 2009    VCA -5.55203 start 
27 2010    VCA -3.35728 start 
28 2011    VCA -0.93080 start 
29 2012    VCA 5.97491 start 
30 2007   VIASPACE -50.96687 mature 
31 2008   VIASPACE 149.95740 mature 
32 2009   VIASPACE 197.77686 mature 
33 2010   VIASPACE -25.20173 mature 
34 2011   VIASPACE 77.08262 mature 
35 2012   VIASPACE 78.03423 mature 
36 2005   YASHENG -3.72810 start 
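
A variant of the same pd.cut plus np.where idea, sketched here under the assumption that `Series.map` is an acceptable substitute for the label-based lookups above. The toy data and the names `n_years` and `mean_sg` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): company A has 2 years, company B has 6.
adjusted = pd.DataFrame({
    "conm": ["A"] * 2 + ["B"] * 6,
    "fyear": [2001, 2002] + list(range(2001, 2007)),
    "indadjsg": [50.0, 60.0, 20.0, 30.0, 40.0, 20.0, 30.0, 40.0],
})
grouped = adjusted.groupby("conm")

# Look up each row's company-level count and mean by company name.
n_years = adjusted["conm"].map(grouped["fyear"].count())
mean_sg = adjusted["conm"].map(grouped["indadjsg"].mean())

# Bin the row-aligned means, then let 'start' take precedence.
categories = pd.cut(mean_sg, bins=[-np.inf, 0, 15, 100, np.inf],
                    labels=["decline", "revival", "mature", "growth"])
adjusted["stage"] = np.where(n_years <= 5, "start", categories.astype(str))
print(adjusted["stage"].tolist())
```

Both lookups keep the original row index of `adjusted`, so no reshaping is needed before calling np.where.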

The classify function for the groupby/transform alternative described in the first answer:

import bisect 
def classify(grp, grid=[0,15,100,np.inf], 
      labels=['decline', 'revival', 'mature', 'growth']): 
    return 'start' if len(grp) <= 5 else labels[bisect.bisect_left(grid, grp.mean())] 

grouped = adjusted.groupby(by=['conm']) 
adjusted['stage'] = grouped['indadjsg'].transform(classify) 
print(adjusted)
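
To see how `bisect.bisect_left` maps a mean onto the labels in classify, here is a small standalone sketch using the same grid and labels. The sample means are made up to cover each edge; `results` is an illustrative name.

```python
import bisect

grid = [0, 15, 100, float("inf")]
labels = ["decline", "revival", "mature", "growth"]

# Sample means (hypothetical), including values exactly on the edges.
samples = [-1.0, 0.0, 0.5, 15.0, 15.5, 100.0, 250.0]

# bisect_left returns the index of the first grid entry >= the value,
# so a value equal to an edge falls into the lower of the two stages.
results = [labels[bisect.bisect_left(grid, m)] for m in samples]

for m, label in zip(samples, results):
    print(m, "->", label)
```

With this grid, a mean of exactly 0 is labeled 'decline', which matches pd.cut with right=True but differs from the question's rule that 0 counts as 'revival'.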