Python data analysis with pandas, advanced (part 1) - V2EX
Python data analysis with pandas, advanced (part 1)

  •   raquant 2017-03-14 15:50:30 +08:00 2370 views


    Import the modules used in this post:

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    We can widen the output display so the results are easier to read:

    pd.set_option('display.width', 200) 

    I. Creating objects

    1. Pass a list to create a Series; pandas creates an integer index by default:

    s = pd.Series([1,3,5,np.nan,6,8])
    s
    0     1
    1     3
    2     5
    3   NaN
    4     6
    5     8
    dtype: float64
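A small sketch of what happens behind that output, using only numpy and pandas: because the list contains np.nan, the values are stored as float64, and the default index simply counts from 0.

```python
import numpy as np
import pandas as pd

# A list with one missing value; np.nan forces a floating-point dtype
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.dtype)           # dtype of the stored values
print(list(s.index))     # the default integer index
print(s.isna().sum())    # how many entries are missing
```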

    2. Pass a numpy array, a datetime index and column labels to create a DataFrame:

    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
    dates
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    df
                       A         B         C         D
    2013-01-01 -1.857957 -0.297110  0.135704  0.199878
    2013-01-02  0.139027  1.683491 -1.031190  1.447487
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528
    2013-01-06  0.186073 -0.537019 -0.252442  0.530238

    3. Pass a dict of objects that can be converted to something series-like to create a DataFrame:

    df2 = pd.DataFrame({'A': 1.,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(['test', 'train', 'test', 'train']),
                        'F': 'foo'})
    df2

    4. Check the data type of each column:

    df2.dtypes
    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object
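Here is a trimmed-down sketch of the dict constructor (only some of the columns above are kept, which is my own simplification). It shows that scalar values are repeated for every row, while the Series and the array decide the number of rows.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.0,                                                  # scalar, repeated on every row
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),  # fixes the index 0..3
    'D': np.array([3] * 4, dtype='int32'),                     # must match that length
    'F': 'foo',                                                # scalar string, also repeated
})

print(df2.shape)
print(df2.dtypes)
```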

    5. Tab completion automatically recognizes all attributes as well as the custom column names.

    II. Viewing data

    1. View the first and last rows of the DataFrame:

    df.head()
                       A         B         C         D
    2013-01-01 -1.857957 -0.297110  0.135704  0.199878
    2013-01-02  0.139027  1.683491 -1.031190  1.447487
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528
    df.tail(3)
                       A         B         C         D
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528
    2013-01-06  0.186073 -0.537019 -0.252442  0.530238

    2. Display the index and the columns (the underlying numpy data is available through df.values):

    df.index
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    df.columns
    Index(['A', 'B', 'C', 'D'], dtype='object')

    3. describe() gives a quick statistical summary of the data:

    df.describe()
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean  -0.256300  0.103596  0.283858  0.158536
    std    0.854686  1.060269  1.181208  0.973309
    min   -1.857957 -1.211098 -1.031190 -1.295228
    25%   -0.412452 -0.477042 -0.429298 -0.395927
    50%    0.162550 -0.158711 -0.058369  0.365058
    75%    0.214610  0.747641  0.911070  0.630084
    max    0.367213  1.683491  2.169802  1.447487
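To see where the rows of describe() come from, here is a hedged sketch that rebuilds a frame of the same shape and compares two rows with the matching DataFrame methods. The fixed seed is my own assumption, added so the run is repeatable; the numbers will not match the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)   # fixed seed: an assumption, not in the original post
df = pd.DataFrame(rng.randn(6, 4),
                  index=pd.date_range('20130101', periods=6),
                  columns=list('ABCD'))

summary = df.describe()
print(list(summary.index))                             # the statistics describe() reports
print(np.allclose(summary.loc['mean'], df.mean()))     # 'mean' row vs df.mean()
print(np.allclose(summary.loc['50%'], df.median()))    # '50%' row vs the median
```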

    4. Transpose the data:

    df.T
       2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
    A   -1.857957    0.139027   -0.596279    0.367213    0.224122    0.186073
    B   -0.297110    1.683491   -1.211098   -0.020313    1.003625   -0.537019
    C    0.135704   -1.031190    1.169525    2.169802   -0.488250   -0.252442
    D    0.199878    1.447487    0.663366   -1.295228   -0.594528    0.530238

    5. Sort along an axis:

    df.sort_index(axis=1, ascending=False)
                       D         C         B         A
    2013-01-01  0.199878  0.135704 -0.297110 -1.857957
    2013-01-02  1.447487 -1.031190  1.683491  0.139027
    2013-01-03  0.663366  1.169525 -1.211098 -0.596279
    2013-01-04 -1.295228  2.169802 -0.020313  0.367213
    2013-01-05 -0.594528 -0.488250  1.003625  0.224122
    2013-01-06  0.530238 -0.252442 -0.537019  0.186073

    6. Sort by values:

    df.sort_values(by='B')   # the older df.sort(columns='B') has been removed from pandas
                       A         B         C         D
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366
    2013-01-06  0.186073 -0.537019 -0.252442  0.530238
    2013-01-01 -1.857957 -0.297110  0.135704  0.199878
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528
    2013-01-02  0.139027  1.683491 -1.031190  1.447487
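On current pandas, sorting by a column goes through sort_values. A minimal sketch with a tiny made-up frame (the data and the x/y/z labels are mine, chosen so the order is easy to check by eye):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0.5, -1.2, 0.1]}, index=list('xyz'))

by_b = df.sort_values(by='B')                         # ascending by column B
by_b_desc = df.sort_values(by='B', ascending=False)   # descending by column B

print(by_b)
print(by_b_desc)
print(df)   # the original frame keeps its order; sort_values returns a new object
```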

    III. Selecting data

    This is the DataFrame we will work with:

    df
                       A         B         C         D
    2013-01-01 -1.857957 -0.297110  0.135704  0.199878
    2013-01-02  0.139027  1.683491 -1.031190  1.447487
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528
    2013-01-06  0.186073 -0.537019 -0.252442  0.530238

    1. Getting data

    (1) Select a single column; this returns a Series:

    df['A']
    2013-01-01   -1.857957
    2013-01-02    0.139027
    2013-01-03   -0.596279
    2013-01-04    0.367213
    2013-01-05    0.224122
    2013-01-06    0.186073
    Freq: D, Name: A, dtype: float64

    (2) Select with [], which slices the rows:

    df[0:3]
                       A         B         C         D
    2013-01-01 -1.857957 -0.297110  0.135704  0.199878
    2013-01-02  0.139027  1.683491 -1.031190  1.447487
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366

    2. Selection by label

    (1) Use a label to get a cross-section:

    df.loc[dates[0]]
    A   -1.857957
    B   -0.297110
    C    0.135704
    D    0.199878
    Name: 2013-01-01 00:00:00, dtype: float64

    (2) Select on multiple axes by label:

    df.loc[:, ['A', 'B']]
                       A         B
    2013-01-01 -1.857957 -0.297110
    2013-01-02  0.139027  1.683491
    2013-01-03 -0.596279 -1.211098
    2013-01-04  0.367213 -0.020313
    2013-01-05  0.224122  1.003625
    2013-01-06  0.186073 -0.537019

    (3) Slice by label (both endpoints are included):

    df.loc['20130102':'20130104', ['A','B']]
                       A         B
    2013-01-02  0.139027  1.683491
    2013-01-03 -0.596279 -1.211098
    2013-01-04  0.367213 -0.020313
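One detail that is easy to trip over: a label slice with .loc includes its end label, while a positional slice with .iloc stops before the end position. A quick sketch, with np.arange standing in for the random data so the frame is deterministic (that substitution is my own):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_label = df.loc['20130102':'20130104', ['A', 'B']]   # end label is included
by_position = df.iloc[1:3, 0:2]                        # end position is excluded

print(len(by_label), by_label.index[-1])
print(len(by_position), by_position.index[-1])
```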

    (4) Reduce the dimensions of the returned object:

    df.loc['20130102', ['A','B']]
    A    0.139027
    B    1.683491
    Name: 2013-01-02 00:00:00, dtype: float64

    (5) Get a scalar:

    df.loc[dates[0], 'A']
    -1.8579571971312099

    3. Selection by position

    (1) Pass an integer to select by position (this selects a row):

    df.iloc[3]
    A    0.367213
    B   -0.020313
    C    2.169802
    D   -1.295228
    Name: 2013-01-04 00:00:00, dtype: float64

    (2) Slice with integers:

    df.iloc[3:5, 0:2]
                       A         B
    2013-01-04  0.367213 -0.020313
    2013-01-05  0.224122  1.003625

    (3) Pass lists of positions:

    df.iloc[[1,2,4], [0,2]]
                       A         C
    2013-01-02  0.139027 -1.031190
    2013-01-03 -0.596279  1.169525
    2013-01-05  0.224122 -0.488250

    (4) Slice rows:

    df.iloc[1:3, :]
                       A         B         C         D
    2013-01-02  0.139027  1.683491 -1.031190  1.447487
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366

    (5) Get a specific value:

    df.iloc[1,1]
    1.6834910794696132
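For a single cell there are four accessors that reach the same value: .loc and .iloc, plus the scalar-only .at and .iat. A small sketch on a deterministic frame (the np.arange data is my stand-in for the random values):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df.iloc[1, 1]         # second row, second column
by_label = df.loc[dates[1], 'B']    # the same cell, addressed by label
fast_position = df.iat[1, 1]        # scalar-only access by position
fast_label = df.at[dates[1], 'B']   # scalar-only access by label

print(by_position, by_label, fast_position, fast_label)
```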

    4. Boolean indexing

    (1) Use the values of a single column to select data:

    df[df.A > 0]
                       A         B         C         D
    2013-01-02  0.139027  1.683491 -1.031190  1.447487
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528
    2013-01-06  0.186073 -0.537019 -0.252442  0.530238

    (2) Select data with a where operation:

    df[df > 0]
                       A         B         C         D
    2013-01-01       NaN       NaN  0.135704  0.199878
    2013-01-02  0.139027  1.683491       NaN  1.447487
    2013-01-03       NaN       NaN  1.169525  0.663366
    2013-01-04  0.367213       NaN  2.169802       NaN
    2013-01-05  0.224122  1.003625       NaN       NaN
    2013-01-06  0.186073       NaN       NaN  0.530238
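Indexing with a Boolean DataFrame keeps the shape and puts NaN wherever the condition is False, which is what DataFrame.where does. A sketch on a two-by-two frame (values are made up for readability):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-1.0, 2.0], 'B': [3.0, -4.0]})

masked = df[df > 0]               # NaN where the condition is False
same = df.where(df > 0)           # the explicit spelling of the same operation
filled = df.where(df > 0, 0.0)    # where() can also substitute another value

print(masked)
print(masked.equals(same))
print(filled)
```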

    (3) Filter with the isin() method:

    df2 = df.copy()
    df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    df2
                       A         B         C         D      E
    2013-01-01 -1.857957 -0.297110  0.135704  0.199878    one
    2013-01-02  0.139027  1.683491 -1.031190  1.447487    one
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366    two
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228  three
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528   four
    2013-01-06  0.186073 -0.537019 -0.252442  0.530238  three
    df2[df2['E'].isin(['two', 'four'])]
                       A         B         C         D     E
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366   two
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528  four
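isin() returns a Boolean Series, so it can be stored, reused, and negated with ~ to get the complementary rows. A short sketch with made-up data:

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3, 4],
                    'E': ['one', 'two', 'three', 'four']})

mask = df2['E'].isin(['two', 'four'])   # True where E is one of the listed values
kept = df2[mask]                        # rows that match
dropped = df2[~mask]                    # rows that do not match

print(kept)
print(dropped)
```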

    5. Setting

    (1) Set a new column (the data is aligned on the index automatically):

    s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
    s1
    2013-01-02    1
    2013-01-03    2
    2013-01-04    3
    2013-01-05    4
    2013-01-06    5
    2013-01-07    6
    Freq: D, dtype: int64
    df['F'] = s1
    df
                       A         B         C         D   F
    2013-01-01 -1.857957 -0.297110  0.135704  0.199878 NaN
    2013-01-02  0.139027  1.683491 -1.031190  1.447487   1
    2013-01-03 -0.596279 -1.211098  1.169525  0.663366   2
    2013-01-04  0.367213 -0.020313  2.169802 -1.295228   3
    2013-01-05  0.224122  1.003625 -0.488250 -0.594528   4
    2013-01-06  0.186073 -0.537019 -0.252442  0.530238   5
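Note that s1 starts one day later than df. Assigning it as a column aligns on the index: the first row of df gets NaN, and the 2013-01-07 value has no matching row, so it is dropped. A sketch with a single-column stand-in for df (my simplification):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': np.zeros(6)}, index=dates)

# s1 covers 2013-01-02 .. 2013-01-07, shifted one day relative to df
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1

print(df)
```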

    (2) Set new values:

    df.at[dates[0],'A'] = 0                     # set a value by label
    df.iat[0,1] = 0                             # set a value by position
    df.loc[:, 'D'] = np.array([5] * len(df))    # set a whole column from a numpy array
    df
                       A         B         C  D   F
    2013-01-01  0.000000  0.000000  0.135704  5 NaN
    2013-01-02  0.139027  1.683491 -1.031190  5   1
    2013-01-03 -0.596279 -1.211098  1.169525  5   2
    2013-01-04  0.367213 -0.020313  2.169802  5   3
    2013-01-05  0.224122  1.003625 -0.488250  5   4
    2013-01-06  0.186073 -0.537019 -0.252442  5   5

    IV. Handling missing values

    pandas uses np.nan to represent missing values, and by default they are left out of computations. The DataFrame we work with is:

    df
                       A         B         C  D   F
    2013-01-01  0.000000  0.000000  0.135704  5 NaN
    2013-01-02  0.139027  1.683491 -1.031190  5   1
    2013-01-03 -0.596279 -1.211098  1.169525  5   2
    2013-01-04  0.367213 -0.020313  2.169802  5   3
    2013-01-05  0.224122  1.003625 -0.488250  5   4
    2013-01-06  0.186073 -0.537019 -0.252442  5   5
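"Left out of computations" can be checked directly on a small Series: reductions such as mean() skip NaN unless skipna=False is passed. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

default_mean = s.mean()              # NaN is skipped: (1 + 3) / 2
strict_mean = s.mean(skipna=False)   # NaN propagates into the result
non_missing = s.count()              # count() only counts non-missing values

print(default_mean, strict_mean, non_missing)
```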

    1. reindex() lets you change, add or delete the index on a specified axis; it returns a copy of the original data:

    df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
    df1.loc[dates[0]:dates[1], 'E'] = 1
    df1
                       A         B         C  D   F   E
    2013-01-01  0.000000  0.000000  0.135704  5 NaN   1
    2013-01-02  0.139027  1.683491 -1.031190  5   1   1
    2013-01-03 -0.596279 -1.211098  1.169525  5   2 NaN
    2013-01-04  0.367213 -0.020313  2.169802  5   3 NaN
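The "returns a copy" part matters: changes made to the reindexed frame do not touch the original. A sketch with a single-column stand-in for df (my simplification):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': np.arange(6.0)}, index=dates)

# Keep the first four rows and add a new, initially empty column 'E'
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1

print(df1)
print(df.columns.tolist())   # the original frame has no 'E' column
```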

    2. Drop the rows that contain missing values:

    df1.dropna(how='any')
                       A         B        C  D  F  E
    2013-01-02  0.139027  1.683491 -1.03119  5  1  1

    3. Fill in missing values:

    df1.fillna(value=5)
                       A         B         C  D  F  E
    2013-01-01  0.000000  0.000000  0.135704  5  5  1
    2013-01-02  0.139027  1.683491 -1.031190  5  1  1
    2013-01-03 -0.596279 -1.211098  1.169525  5  2  5
    2013-01-04  0.367213 -0.020313  2.169802  5  3  5
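Both dropna() and fillna() return new objects and leave df1 alone. The sketch below also contrasts how='any' with how='all' on a small two-column frame (the data is made up):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'F': [np.nan, 1.0, 2.0],
                    'E': [1.0, 1.0, np.nan]})

any_dropped = df1.dropna(how='any')   # drop rows with at least one NaN
all_dropped = df1.dropna(how='all')   # drop rows only if every value is NaN
filled = df1.fillna(value=5)          # replace each NaN with 5

print(any_dropped)
print(all_dropped)
print(filled)
```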

    4. Get a Boolean mask that marks where values are missing:

    pd.isnull(df1)
                    A      B      C      D      F      E
    2013-01-01  False  False  False  False   True  False
    2013-01-02  False  False  False  False  False  False
    2013-01-03  False  False  False  False  False   True
    2013-01-04  False  False  False  False  False   True

    V. Merging

    pandas offers many ways to easily combine Series, DataFrame and Panel objects with various kinds of set logic (Panel only exists in older pandas versions; it was removed in 0.25).

    1. Concat

    df = pd.DataFrame(np.random.randn(10, 4))
    df
              0         1         2         3
    0  0.680581  1.918851  0.521201 -0.389951
    1  0.724157  2.282989  0.648427 -0.827308
    2  2.437781  0.232518  1.066197 -0.233117
    3  0.038747  3.174875 -1.384120  0.322864
    4 -0.835962  1.015841  0.042094 -1.903701
    5  0.095194  1.926612  0.512825  0.786349
    6 -1.098231 -0.669381 -0.623124 -0.411114
    7 -1.229527 -0.738026  0.453683 -2.037488
    8 -0.499546 -0.816864 -0.395079 -0.320400
    9  0.850367  1.047287 -1.205815 -1.287821
    pieces = [df[:3], df[3:7], df[7:]]   # break it into pieces
    pieces
    [          0         1         2         3
     0  0.680581  1.918851  0.521201 -0.389951
     1  0.724157  2.282989  0.648427 -0.827308
     2  2.437781  0.232518  1.066197 -0.233117,
               0         1         2         3
     3  0.038747  3.174875 -1.384120  0.322864
     4 -0.835962  1.015841  0.042094 -1.903701
     5  0.095194  1.926612  0.512825  0.786349
     6 -1.098231 -0.669381 -0.623124 -0.411114,
               0         1         2         3
     7 -1.229527 -0.738026  0.453683 -2.037488
     8 -0.499546 -0.816864 -0.395079 -0.320400
     9  0.850367  1.047287 -1.205815 -1.287821]
    pd.concat(pieces)   # joins the pieces back into the original 10-row DataFrame
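The pieces only become one frame again once they go through pd.concat. Here is a hedged sketch that checks the round trip; the fixed seed is my own addition so the run is repeatable, and the values will differ from the ones printed above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)   # fixed seed: an assumption, not in the original post
df = pd.DataFrame(rng.randn(10, 4))

pieces = [df[:3], df[3:7], df[7:]]   # break the frame into three row blocks
rebuilt = pd.concat(pieces)          # stack the blocks back together, row-wise

print(rebuilt.shape)
print(np.array_equal(rebuilt.values, df.values))   # same values in the same order
```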

    2. Append: attach a row to a DataFrame

    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
    df
              A         B         C         D
    0 -0.923050 -1.798683 -0.543700  0.983715
    1 -0.031082  1.069746 -0.761914  0.142136
    2  0.178376 -0.984427  0.270601  0.737754
    3 -0.882595  0.057637 -1.027661 -1.829378
    4  0.570082  0.210366  0.805305 -1.233238
    5  0.442322  0.709155 -0.304849  0.885378
    6 -0.218852  0.052263  0.467727  0.832747
    7  0.516890  0.005642 -0.990794 -1.624444
    s = df.iloc[3]
    # df.append(s, ignore_index=True) on older pandas; DataFrame.append was removed in pandas 2.0
    pd.concat([df, s.to_frame().T], ignore_index=True)
              A         B         C         D
    0 -0.923050 -1.798683 -0.543700  0.983715
    1 -0.031082  1.069746 -0.761914  0.142136
    2  0.178376 -0.984427  0.270601  0.737754
    3 -0.882595  0.057637 -1.027661 -1.829378
    4  0.570082  0.210366  0.805305 -1.233238
    5  0.442322  0.709155 -0.304849  0.885378
    6 -0.218852  0.052263  0.467727  0.832747
    7  0.516890  0.005642 -0.990794 -1.624444
    8 -0.882595  0.057637 -1.027661 -1.829378
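On pandas 2.0 and later DataFrame.append no longer exists, so one way to attach a row is to turn the Series into a one-row frame and hand it to pd.concat. A sketch on a small made-up frame, showing that ignore_index=True renumbers the rows:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})

s = df.iloc[1]           # one row, as a Series indexed by column name
row = s.to_frame().T     # back to a one-row DataFrame with columns A and B
result = pd.concat([df, row], ignore_index=True)   # new row gets the next integer label

print(result)
```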

    Don't you want to try the code above for yourself?

    镭矿 raquant offers online Jupyter practice for learning Python; you can run Python programs without installing Python.

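    3 replies    2017-03-16 09:57:42 +08:00

    #1  iam36  2017-03-15 13:53:59 +08:00
    Thanks for sharing. Could you do a dedicated walkthrough of the basic theory behind this kind of analysis and combine it with the examples above? If you publish a book I'll buy it right away :)

    #2  raquant (OP)  2017-03-15 22:22:38 +08:00
    @iam36 No time to write a book, haha. It's a good suggestion though, I'll think about it. Thanks.

    #3  iam36  2017-03-16 09:57:42 +08:00
    @raquant Posts or blog entries would work too. Let me become your loyal fan.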