Python 数据分析之 pandas 进阶(二) - V2EX
V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
推荐学习书目
Learn Python the Hard Way
Python Sites
PyPI - Python Package Index
http://diveintopython.org/toc/index.html
Pocoo
值得关注的项目
PyPy
Celery
Jinja2
Read the Docs
gevent
pyenv
virtualenv
Stackless Python
Beautiful Soup
结巴中文分词
Green Unicorn
Sentry
Shovel
Pyflakes
pytest
Python 编程
pep8 Checker
Styles
PEP 8
Google Python Style Guide
Code Style from The Hitchhiker's Guide
raquant
V2EX    Python

Python 数据分析之 pandas 进阶(二)

  •  
  •   raquant 2017-03-14 18:57:51 +08:00 1937 次点击
    这是一个创建于 3211 天前的主题,其中的信息可能已经有所发展或是发生改变。

    python 数据分析之 pandas 进阶(二)

    六、分组

    对于“ group by ”操作,我们通常是指以下一个或多个操作步骤:

    ( Splitting )按照一些规则将数据分为不同的组 ( Applying )对于每组数据分别执行一个函数 ( Combining )将结果组合刀一个数据结构中 将要处理的数组是:

    df = pd.DataFrame({ 'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C': np.random.randn(8), 'D': np.random.randn(8) }) df A B C D 0 foo one 0.961295 -0.281012 1 bar one 0.901454 0.621284 2 foo two -0.584834 0.919414 3 bar three 1.259104 -1.012103 4 foo two 0.153107 1.108028 5 bar two 0.115963 1.333981 6 foo one 1.421895 -1.456916 7 foo three -2.103125 -1.757291 

    1 、分组并对每个分组执行 sum 函数:

    df.groupby('A').sum() C D A bar 2.276522 0.943161 foo -0.151661 -1.467777 

    2 、通过多个列进行分组形成一个层次索引,然后执行函数:

    df.groupby(['A', 'B']).sum() C D A B bar one 0.901454 0.621284 three 1.259104 -1.012103 two 0.115963 1.333981 foo one 2.383191 -1.737928 three -2.103125 -1.757291 two -0.431727 2.027441 

    七、 Reshaping

    Stack

    tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']])) tuples [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')] 
    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B']) df2 = df[:4] df2 A B first second bar one -0.907306 -0.009961 two 0.905177 -2.877961 baz one -0.356070 -0.373447 two -1.496644 -1.958782 
    stacked = df2.stack() stacked first second bar one A -0.907306 B -0.009961 two A 0.905177 B -2.877961 baz one A -0.356070 B -0.373447 two A -1.496644 B -1.958782 dtype: float64 
    stacked.unstack() A B first second bar one -0.907306 -0.009961 two 0.905177 -2.877961 baz one -0.356070 -0.373447 two -1.496644 -1.958782 
    stacked.unstack(1) second one two first bar A -0.907306 0.905177 B -0.009961 -2.877961 baz A -0.356070 -1.496644 B -0.373447 -1.958782 

    八、相关操作

    要处理的数组为:

    df A B C D F 2013-01-01 0.000000 0.000000 0.135704 5 NaN 2013-01-02 0.139027 1.683491 -1.031190 5 1 2013-01-03 -0.596279 -1.211098 1.169525 5 2 2013-01-04 0.367213 -0.020313 2.169802 5 3 2013-01-05 0.224122 1.003625 -0.488250 5 4 2013-01-06 0.186073 -0.537019 -0.252442 5 5 

    (一)、统计

    1 、执行描述性统计:

    df.mean() A 0.053359 B 0.153115 C 0.283858 D 5.000000 F 3.000000 dtype: float64 

    2 、在其他轴上进行相同的操作:

    df.mean(1) 2013-01-01 1.283926 2013-01-02 1.358266 2013-01-03 1.272430 2013-01-04 2.103341 2013-01-05 1.947899 2013-01-06 1.879322 Freq: D, dtype: float64 

    3 、对于拥有不同维度,需要对齐的对象进行操作, pandas 会自动的沿着指定的维度进行广播

    dates s = pd.Series([1,3,4,np.nan,6,8], index=dates).shift(2) s DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') 2013-01-01 NaN 2013-01-02 NaN 2013-01-03 1 2013-01-04 3 2013-01-05 4 2013-01-06 NaN Freq: D, dtype: float64 

    (二)、 Apply

    对数据应用函数:

    df.apply(np.cumsum) A B C D F 2013-01-01 0.000000 0.000000 0.135704 5 NaN 2013-01-02 0.139027 1.683491 -0.895486 10 1 2013-01-03 -0.457252 0.472393 0.274039 15 3 2013-01-04 -0.090039 0.452081 2.443841 20 6 2013-01-05 0.134084 1.455706 1.955591 25 10 2013-01-06 0.320156 0.918687 1.703149 30 15 
    df.apply(lambda x: x.max() - x.min()) A 0.963492 B 2.894589 C 3.200992 D 0.000000 F 4.000000 dtype: float64 

    (三)、字符串方法

    Series 对象在其 str 属性中配备了一组字符串处理方法,可以很容易的应用到数组中的每个元素。

    s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) s.str.lower() 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object 

    九、时间序列

    1 、时区表示:

    rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D') ts = pd.Series(np.random.randn(len(rng)), rng) ts 2012-03-06 -0.932261 2012-03-07 -1.405305 2012-03-08 0.809844 2012-03-09 -0.481539 2012-03-10 -0.489847 Freq: D, dtype: float64 
    ts_utc = ts.tz_localize('UTC') ts_utc 2012-03-06 00:00:00+00:00 -0.932261 2012-03-07 00:00:00+00:00 -1.405305 2012-03-08 00:00:00+00:00 0.809844 2012-03-09 00:00:00+00:00 -0.481539 2012-03-10 00:00:00+00:00 -0.489847 Freq: D, dtype: float64 

    2 、时区转换

    ts_utc.tz_convert('US/Eastern') 2012-03-05 19:00:00-05:00 -0.932261 2012-03-06 19:00:00-05:00 -1.405305 2012-03-07 19:00:00-05:00 0.809844 2012-03-08 19:00:00-05:00 -0.481539 2012-03-09 19:00:00-05:00 -0.489847 Freq: D, dtype: float64 

    3 、时区跨度转换

    rng = pd.date_range('1/1/2012', periods=5, freq='M') ts = pd.Series(np.random.randn(len(rng)), index=rng) ps = ts.to_period() ts ps ps.to_timestamp() 2012-01-31 0.932519 2012-02-29 0.247016 2012-03-31 -0.946069 2012-04-30 0.267513 2012-05-31 -0.554343 Freq: M, dtype: float64 2012-01 0.932519 2012-02 0.247016 2012-03 -0.946069 2012-04 0.267513 2012-05 -0.554343 Freq: M, dtype: float64 2012-01-01 0.932519 2012-02-01 0.247016 
    ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000)) ts = ts.cumsum() ts2012-03-01 -0.946069 2012-04-01 0.267513 2012-05-01 -0.554343 Freq: MS, dtype: float64 

    十、画图

    ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000)) ts = ts.cumsum() ts 

    十一、 Categorical

    从 0.15 版本开始, pandas 可以在 DataFrame 中支持 Categorical 类型的数据

    df = pd.DataFrame({ 'id':[1,2,3,4,5,6], 'raw_grade':['a','b','b','a','a','e'] }) df id raw_grade 0 1 a 1 2 b 2 4 a 4 5 a 5 6 e 

    1 、将原始的 grade 转换为 Categorical 数据类型:

    df['grade'] = df['raw_grade'].astype('category', ordered=True) df['grade'] 0 a 1 b 2 b 3 a 4 a 5 e Name: grade, dtype: category Categories (3, object): [a < b < e] 

    2 、将 Categorical 类型数据重命名为更有意义的名称:

    df['grade'].cat.categories = ['very good', 'good', 'very bad'] 

    3 、对类别进行重新排序,增加缺失的类别:

    df['grade'] = df['grade'].cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good']) df['grade'] 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad < bad < medium < good < very good] 

    4 、排序是按照 Categorical 的顺序进行的而不是按照字典顺序进行:

    df.sort('grade') id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very good 

    5 、对 Categorical 列进行排序时存在空的类别:

    df.groupby("grade").size() grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64 

    以上代码不想自己试一试吗?

    镭矿 raquant提供 jupyter 在线练习学习 python 的机会,无需安装 python 即可运行 python 程序。

    mingyun
        1
    mingyun  
       2017-03-14 23:06:59 +08:00
    pandas 进阶(一)呢
    raquant
        2
    raquant  
    OP
       2017-03-15 12:07:58 +08:00
    t/347398#reply0

    在这里呀:)
    关于     帮助文档     自助推广系统     博客     API     FAQ     Solana     3527 人在线   最高记录 6679       Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 28ms UTC 00:55 PVG 08:55 LAX 16:55 JFK 19:55
    Do have faith in what you're doing.
    ubao msn snddm index pchome yahoo rakuten mypaper meadowduck bidyahoo youbao zxmzxm asda bnvcg cvbfg dfscv mmhjk xxddc yybgb zznbn ccubao uaitu acv GXCV ET GDG YH FG BCVB FJFH CBRE CBC GDG ET54 WRWR RWER WREW WRWER RWER SDG EW SF DSFSF fbbs ubao fhd dfg ewr dg df ewwr ewwr et ruyut utut dfg fgd gdfgt etg dfgt dfgd ert4 gd fgg wr 235 wer3 we vsdf sdf gdf ert xcv sdf rwer hfd dfg cvb rwf afb dfh jgh bmn lgh rty gfds cxv xcv xcs vdas fdf fgd cv sdf tert sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf shasha9178 shasha9178 shasha9178 shasha9178 shasha9178 liflif2 liflif2 liflif2 liflif2 liflif2 liblib3 liblib3 liblib3 liblib3 liblib3 zhazha444 zhazha444 zhazha444 zhazha444 zhazha444 dende5 dende denden denden2 denden21 fenfen9 fenf619 fen619 fenfe9 fe619 sdf sdf sdf sdf sdf zhazh90 zhazh0 zhaa50 zha90 zh590 zho zhoz zhozh zhozho zhozho2 lislis lls95 lili95 lils5 liss9 sdf0ty987 sdft876 sdft9876 sdf09876 sd0t9876 sdf0ty98 sdf0976 sdf0ty986 sdf0ty96 sdf0t76 sdf0876 df0ty98 sf0t876 sd0ty76 sdy76 sdf76 sdf0t76 sdf0ty9 sdf0ty98 sdf0ty987 sdf0ty98 sdf6676 sdf876 sd876 sd876 sdf6 sdf6 sdf9876 sdf0t sdf06 sdf0ty9776 sdf0ty9776 sdf0ty76 sdf8876 sdf0t sd6 sdf06 s688876 sd688 sdf86