Python Data Analysis in Practice: Titanic Statistics

2017-11-12

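Source data download: https://www.kaggle.com/c/titanic/data

Note that downloading may require creating an account, or you can sign in directly with a Google account.

The data can also be downloaded from the attachment to this post.

Original article reference: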

http://nbviewer.ipython.org/github/jmportilla/Udemy-notes/blob/master/Intro%20to%20Data%20Projects%20-%20Titanic.ipynb

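First, of course, come the library imports.

I use Anaconda on Windows, so pandas, scipy, numpy, and matplotlib come bundled and only seaborn needs to be installed. Note that seaborn does not support Python 2.6.

Install seaborn with the following command:

        C:\Users\Ye>conda install seaborn

Or run the following command from the path shown:

        C:\Anaconda\Scripts>pip install seaborn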

Start ipython notebook.

Anaconda download: https://www.continuum.io/downloads

Anaconda can also be installed on Linux, or you can install the libraries directly with pip; for pip installation, see my post http://youerning.blog.51cto.com/10513771/1711008

Below are the libraries I used and their versions:

        import matplotlib.pyplot as plt
        import pandas as pd
        import sys
        import seaborn as sns
        import matplotlib

        print('Python version ' + sys.version)
        print('Pandas version ' + pd.__version__)
        print('Seaborn version' + sns.__version__)
        print('Matplotlib version' + matplotlib.__version__)

Python version 2.7.9 |Anaconda 2.2.0 (64-bit)| (default, Dec 18 2014, 16:57:52) [MSC v.1500 64 bit (AMD64)]

Pandas version 0.15.2

Seaborn version0.6.0

Matplotlib version1.4.3

OK, showtime.

        ### First import the modules
        import pandas as pd
        from pandas import Series, DataFrame
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns

        ### Display plots directly in the ipython notebook
        %matplotlib inline

Read in the prepared data file:

        titanic_df = pd.read_csv("C:\\train.csv")

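Take a quick look at the data structure. head() shows the first 5 rows by default; pass a number in the parentheses to see more:

        titanic_df.head()

You can also use info() to see the non-null count and dtype of each column:

        titanic_df.info()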

Int64Index: 891 entries, 0 to 890

Data columns (total 12 columns):

PassengerId 891 non-null int64

Survived 891 non-null int64

Pclass 891 non-null int64

Name 891 non-null object

Sex 891 non-null object

Age 714 non-null float64

SibSp 891 non-null int64

Parch 891 non-null int64

Ticket 891 non-null object

Fare 891 non-null float64

Cabin 204 non-null object

Embarked 889 non-null object

dtypes: float64(2), int64(5), object(5)

memory usage: 90.5+ KB
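The info() output above implies gaps in Age, Cabin, and Embarked (714, 204, and 889 non-null values out of 891). As a minimal sketch, the gaps can also be counted directly with isnull().sum(). The tiny `sample_df` below is invented illustration data, not rows from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for titanic_df (values are invented)
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

# Number of missing entries in each column
missing = sample_df.isnull().sum()
print(missing)
```

On the real data the same call would be `titanic_df.isnull().sum()`.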

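# Simple count of men versus women. We pass titanic_df as data and use the Sex column as the X axis. kind can be one of {point, bar, count, box, violin, strip}, six options in all; we choose count. Some versions apparently do not require kind="count".

        sns.factorplot('Sex', data=titanic_df, kind="count")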

# To go into more detail, we put Pclass on the X axis and count men and women within each class:

        sns.factorplot('Pclass', data=titanic_df, kind="count", hue="Sex")
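The numbers behind a count plot like the one above can also be tabulated without plotting. The sketch below uses pandas.crosstab on a small invented frame; `sample_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature stand-in for titanic_df (values are invented)
sample_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["male", "female", "male", "male", "male", "female"],
})

# Rows are passenger classes, columns are sexes, cells are passenger counts
counts = pd.crosstab(sample_df["Pclass"], sample_df["Sex"])
print(counts)
```

Combinations that never occur (here, a female passenger in class 2) show up as 0 rather than being dropped.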

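We can also split passengers into male, female, and child by adding a new column to the original dataset.

Define a function that classifies each passenger as male, female, or child:

        def male_famle_child(passenger):
            age, sex = passenger
            if age < 16:
                return "Child"
            else:
                return sex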

### Add a new column, "Person"

        titanic_df["Person"] = titanic_df[["Age", "Sex"]].apply(male_famle_child, axis=1)
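A possible vectorized alternative to the row-by-row apply, sketched on invented data: Series.where keeps the Sex value wherever the condition is true and substitutes "Child" elsewhere. Note that a missing age compares as False against 16, so passengers with unknown age keep their sex label, which matches what the function-based version does. `sample_df` is a hypothetical stand-in for titanic_df.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for titanic_df (values are invented)
sample_df = pd.DataFrame({
    "Age": [4.0, 30.0, np.nan, 15.0, 16.0],
    "Sex": ["male", "female", "male", "female", "male"],
})

# True only for known ages below 16; NaN < 16 evaluates to False
is_child = sample_df["Age"] < 16

# Keep Sex where the passenger is not a child, otherwise write "Child"
sample_df["Person"] = sample_df["Sex"].where(~is_child, "Child")
print(sample_df)
```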

Again show the counts of men, women, and children within each Pclass:

        sns.factorplot("Pclass", data=titanic_df, hue="Person", kind="count")

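Take a quick look at the age distribution, splitting the age range into 70 bins (the default is 10). You can of course make the bins finer or coarser.

        titanic_df['Age'].hist(bins=70)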
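To make the bins parameter concrete, here is a small sketch using numpy.histogram, which splits the range between the smallest and largest value into equal-width intervals and counts the values in each. The ages are invented, and the missing value is dropped explicitly before counting.

```python
import numpy as np
import pandas as pd

# Invented ages, including one missing value
ages = pd.Series([2.0, 15.0, 22.0, 30.0, 38.0, 45.0, 80.0, np.nan])

# Two equal-width bins between min (2) and max (80): [2, 41) and [41, 80]
counts, edges = np.histogram(ages.dropna().values, bins=2)
print(counts)
print(edges)
```

With bins=70 the same range would be cut into 70 intervals, giving a finer but noisier picture.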

Check the mean age:

        titanic_df["Age"].mean()

29.69911764705882

Count the values in the "Person" column:

        titanic_df["Person"].value_counts()

male 537

female 271

Child 83

dtype: int64
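If shares are more useful than raw counts, value_counts accepts normalize=True. A minimal sketch on an invented Series (the four labels below are illustration data, not the real column):

```python
import pandas as pd

# Invented labels standing in for titanic_df["Person"]
person = pd.Series(["male", "male", "female", "Child"])

# Fraction of rows carrying each label; the fractions sum to 1
shares = person.value_counts(normalize=True)
print(shares)
```

Applied to the counts above, male would come out at 537 / 891, roughly 60%.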

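Plot how each category is distributed across ages, using kernel density estimation.

Note: for kernel density estimation, see http://www.lifelaf.com/blog/?p=723

Note: hue is a third dimension besides row and col; each level of the hue variable is drawn in its own color.

palette is the color palette.

        ### Create the plot with FacetGrid, using the "Sex" column as the hue levels;
        ### aspect=4 makes each facet 4 times as wide as it is tall
        fig = sns.FacetGrid(titanic_df, hue="Sex", aspect=4)

        ### Use map to draw a kde for each level, with Age on the X axis
        fig.map(sns.kdeplot, "Age", shade=True)

        ### Get the maximum age
        oldest = titanic_df["Age"].max()

        ### Set the x-axis range from 0 to oldest
        fig.set(xlim=(0, oldest))

        ### Add the legend
        fig.add_legend()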

 
        fig = sns.FacetGrid(titanic_df, hue="Person", aspect=4)
        fig.map(sns.kdeplot, "Age", shade=True)
        oldest = titanic_df["Age"].max()
        fig.set(xlim=(0, oldest))
        fig.add_legend()

 
        fig = sns.FacetGrid(titanic_df, hue="Pclass", aspect=4)
        fig.map(sns.kdeplot, "Age", shade=True)
        oldest = titanic_df["Age"].max()
        fig.set(xlim=(0, oldest))
        fig.add_legend()
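For readers unfamiliar with kernel density estimation, the idea behind the kde curves can be sketched in a few lines of numpy: place a small Gaussian bump on every observation and average the bumps. This is an illustrative sketch, not seaborn's implementation; the function name `gaussian_kde_sketch`, the fixed bandwidth of 5.0, and the ages are all assumptions, and seaborn chooses its own bandwidth.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the scaled distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Sum the bumps and normalize so the curve integrates to 1
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

ages = [4.0, 22.0, 25.0, 30.0, 58.0]       # invented ages
grid = np.linspace(-40.0, 120.0, 1601)      # evaluation points, step 0.1
density = gaussian_kde_sketch(ages, grid, bandwidth=5.0)

# Riemann-sum check that the estimated density integrates to about 1
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A larger bandwidth gives a smoother curve and a smaller one gives a bumpier curve, which plays a role similar to the bin count in a histogram.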

The plots above look gorgeous, don't they!

Count passengers by cabin deck

        ## First get the cabin value of each passenger
        deck = titanic_df["Cabin"].dropna()   ## drop the NaN values
        deck.head()

1 C85

3 C123

6 E46

10 G6

11 C103

Name: Cabin, dtype: object

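As shown above, the first character of each cabin value identifies the deck, so we can use it to count the passengers on each deck.

        levels = []
        for level in deck:
            levels.append(level[0])
        cabin_df = DataFrame(levels)
        cabin_df.columns = ["Cabin"]    ### name the column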
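As a possible shortcut for the loop above, pandas string methods can take the first character of every value in one step. The sketch below uses the five cabin values shown in the deck.head() output; building `deck` by hand here is only for illustration.

```python
import pandas as pd

# The five cabin values from the deck.head() output above
deck = pd.Series(["C85", "C123", "E46", "G6", "C103"], name="Cabin")

# .str[0] takes the first character of each string; to_frame keeps the name "Cabin"
cabin_df = deck.str[0].to_frame()

# Passengers per deck letter
level_counts = cabin_df["Cabin"].value_counts()
print(level_counts)
```

One difference from the loop: this version keeps the index of `deck`, whereas building a DataFrame from a list starts a fresh 0..n-1 index.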

## Take the Cabin column of the cabin_df dataset, use the winter_d palette, and plot with kind="count"

There are many palette colors to choose from; see the matplotlib site: http://matplotlib.org/users/colormaps.html

        sns.factorplot("Cabin", data=cabin_df, palette="winter_d", kind="count")

Because deck T has so few passengers, we drop it:

        cabin_df = cabin_df[cabin_df.Cabin != "T"]

Then generate the plot:

        sns.factorplot("Cabin", data=cabin_df, palette="summer", kind="count")

Count passengers by port of embarkation

        sns.factorplot("Embarked", data=titanic_df, hue="Pclass",
                       x_order=["C", "Q", "S"], kind="count")

Count passengers traveling alone versus with family

        ### Create the Alone column
        titanic_df["Alone"] = titanic_df.SibSp + titanic_df.Parch
        titanic_df["Alone"]

0 1

1 1

...

876 0

877 0

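### As shown above, any value greater than 0 means the passenger has siblings/spouses or parents/children aboard

        ### So replace the numbers in the Alone column with "With Family" or "Alone"
        titanic_df.loc[titanic_df["Alone"] > 0, "Alone"] = "With Family"
        titanic_df.loc[titanic_df["Alone"] == 0, "Alone"] = "Alone"
        titanic_df.head()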
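A possible alternative to relabeling the column in two steps is to build the labels in one pass with numpy.where, which avoids writing strings into an integer column. This is a sketch on invented data; `sample_df` and `family_size` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for titanic_df (values are invented)
sample_df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Number of relatives aboard for each passenger
family_size = sample_df["SibSp"] + sample_df["Parch"]

# Label each passenger in a single vectorized step
sample_df["Alone"] = np.where(family_size > 0, "With Family", "Alone")
print(sample_df)
```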

Count passengers by the Alone column:

        sns.factorplot("Alone", data=titanic_df, hue="Pclass", palette="Blues", kind="count")

Count survivors and non-survivors

        ## Map the 0 and 1 in the Survived column to "no" and "yes"
        ## (did not survive / survived) in a new Survivor column
        titanic_df["Survivor"] = titanic_df.Survived.map({0: "no", 1: "yes"})

        sns.factorplot("Survivor", data=titanic_df, palette="Set1", kind="count")
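One detail of Series.map worth knowing when mapping with a dict: any value that is not a key in the dict becomes NaN instead of raising an error. A small sketch on invented data, where the value 2 is deliberately invalid:

```python
import pandas as pd

# Invented 0/1 flags; the trailing 2 is deliberately not in the mapping
survived = pd.Series([0, 1, 1, 0, 2])

# Dict-based mapping; unmapped values turn into NaN
survivor = survived.map({0: "no", 1: "yes"})
print(survivor)
```

Checking `survivor.isnull().sum()` after a mapping is a cheap way to confirm that every code was covered.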
       
 
    

   
 

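I did not fully understand the part below, so I will not go into depth.

        # Note: factorplot takes the X-axis variable as its first argument and the Y-axis variable as its second
        sns.factorplot("Pclass", "Survived", data=titanic_df, x_order=[1, 2, 3])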
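One reading of the call above, offered as an assumption about this seaborn version rather than something the original states: with a numeric Y variable and no kind given, factorplot draws the mean of Y for each X value, so the plot would show the survival rate per class. The same numbers can be sketched with a groupby on invented data (`sample_df` is hypothetical):

```python
import pandas as pd

# Hypothetical miniature stand-in for titanic_df (values are invented)
sample_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the share of 1s, i.e. the survival rate per class
rate = sample_df.groupby("Pclass")["Survived"].mean()
print(rate)
```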

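Postscript: this article is mainly a translation of the video content of Python for data Analysis. I also worked around a few pitfalls and computed some statistics on the existing data. What gets counted is not that important; what matters is how to do the counting and how to draw good-looking statistical plots. I am a beginner too, so let's encourage each other. I hope to share a beginner's series on advancing in data analysis; a follow-up post on stocks should come later, so stay tuned ^_^

Reposted from youerning's 51CTO blog; original link: http://blog.51cto.com/youerning/1711371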
