[!NOTE] 注意
本章大篇幅引用Pyhon for Data Analysis,Third Edition,Wes McKinney,仅作为学习交流使用。

本章默认进行如下操作:

import numpy as np

import pandas as pd

Getting Started with pandas

pandas Data Structures

Series

The string representation of a Series displayed interactively
shows the index on the left and the values on the right.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
In [14]: obj = pd.Series([4, 7, -5, 3])

In [15]: obj
Out[15]:
0 4
1 7
2 -5
3 3
dtype: int64

In [16]: obj.array
Out[16]:
<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [17]: obj.index
Out[17]: RangeIndex(start=0, stop=4, step=1)

In [18]: obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

In [19]: obj2
Out[19]:
d 4
b 7
a -5
c 3
dtype: int64

In [20]: obj2.index
Out[20]: Index(['d', 'b', 'a', 'c'], dtype='object')

# 通过字典创建Series
In [30]: sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [31]: obj3 = pd.Series(sdata)

In [32]: obj3
Out[32]:
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64

Using NumPy functions or NumPy-like operations,
such as filtering with a Boolean array, scalar
multiplication, or applying math functions
, will preserve the index-value link.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
In [24]: obj2[obj2 > 0]
Out[24]:
d 6
b 7
c 3
dtype: int64

In [25]: obj2 * 2
Out[25]:
d 12
b 14
a -10
c 6
dtype: int64

In [26]: import numpy as np

In [27]: np.exp(obj2)
Out[27]:
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64

DataFrame

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
"year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [50]: frame
Out[50]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2

# 返回前五行
In [51]: frame.head()
Out[51]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9

# 返回后五行
In [52]: frame.tail()
Out[52]:
state year pop
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2

The del keyword will delete columns like with a dictionary.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
In [68]: frame2["eastern"] = frame2["state"] == "Ohio"

In [69]: frame2
Out[69]:
year state pop debt eastern
0 2000 Ohio 1.5 NaN True
1 2001 Ohio 1.7 NaN True
2 2002 Ohio 3.6 -1.2 True
3 2001 Nevada 2.4 NaN False
4 2002 Nevada 2.9 -1.5 False
5 2003 Nevada 3.2 -1.7 False

In [70]: del frame2["eastern"]

In [71]: frame2.columns
Out[71]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

Possible data inputs to the DataFrame constructor

Type Notes
2D ndarray A matrix of data, passing optional row and column labels
Dictionary of arrays, lists, or tuples Each sequence becomes a column in the DataFrame; all sequences must be the same length
NumPy structured/record array Treated as the “dictionary of arrays” case
Dictionary of Series Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed
Dictionary of dictionaries Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case
List of dictionaries or Series Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels
List of lists or tuples Treated as the “2D ndarray” case
Another DataFrame The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray Like the “2D ndarray” case except masked values are missing in the DataFrame result

Some Index methods and properties

Method/Property Description
append() Concatenate with additional Index objects, producing a new Index
difference() Compute set difference as an Index
intersection() Compute set intersection
union() Compute set union
isin() Compute Boolean array indicating whether each value is contained in the passed collection
delete() Compute new Index with element at Index i deleted
drop() Compute new Index by deleting passed values
insert() Compute new Index by inserting element at Index i
is_monotonic Returns True if each element is greater than or equal to the previous element
is_unique Returns True if the Index has no duplicate values
unique() Compute the array of unique values in the Index

Essential Functionality

Reindex

Dropping Entries from an Axis

DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters:

  • labels
    single label or list-like
    Index labels to drop.

  • axis
    {0 or ‘index’}
    Unused. Parameter needed for compatibility with DataFrame.

  • index
    single label or list-like
    Redundant for application on Series, but ‘index’ can be used instead of ‘labels’.

  • columns
    single label or list-like
    No change is made to the Series; use ‘index’ or ‘labels’ instead.

  • level
    int or level name, optional
    For MultiIndex, level for which the labels will be removed.

  • inplace
    bool, default False
    If True, do operation inplace and return None.

  • errors
    {‘ignore’, ‘raise’}, default ‘raise’
    If ‘ignore’, suppress error and only existing labels are dropped.

Series.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters:

  • labels
    array-like, optional
    New labels/index to conform the axis specified by ‘axis’ to.

  • index
    array-like, optional
    New labels for the index. Preferably an Index object to avoid duplicating data.

  • columns
    array-like, optional
    New labels for the columns. Preferably an Index object to avoid duplicating data.

  • axis
    int or str, optional
    Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).

  • method
    {None, ‘backfill’/‘bfill’, ‘pad’/‘ffill’, ‘nearest’}
    Method to use for filling holes in reindexed DataFrame.

    • None (default): Don’t fill gaps.
    • pad / ffill: Propagate last valid observation forward to next valid.
    • backfill / bfill: Use next valid observation to fill gap.
    • nearest: Use nearest valid observations to fill gap.
  • copy
    bool, default True
    Return a new object, even if the passed indexes are the same.

  • level
    int or name
    Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value
    scalar, default np.nan
    Value to use for missing values. Defaults to NaN, but can be any compatible value.

  • limit
    int, default None
    Maximum number of consecutive elements to forward or backward fill.

  • tolerance
    optional
    Maximum distance between original and new labels for inexact matches. Can be scalar or list-like, applying variable tolerance per element.

Indexing, Selection, and Filtering

Since loc operator indexes exclusively with labels, there is also an iloc operator that indexes exclusively with integers to work consistently whether or not the index contains integers.

Indexing options with DataFrame

Arithmetic and Data Alignment

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
In [182]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])

In [183]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
.....: index=["a", "c", "e", "f", "g"])

In [184]: s1
Out[184]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64

In [185]: s2
Out[185]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64

In [186]: s1 + s2
Out[186]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64

In [187]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
.....: index=["Ohio", "Texas", "Colorado"])

In [188]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
.....: index=["Utah", "Ohio", "Texas", "Oregon"])

In [189]: df1
Out[189]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0

In [190]: df2
Out[190]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0

In [191]: df1 + df2
Out[191]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

In [192]: df1 = pd.DataFrame({"A": [1, 2]})

In [193]: df2 = pd.DataFrame({"B": [3, 4]})

In [194]: df1
Out[194]:
A
0 1
1 2

In [195]: df2
Out[195]:
B
0 3
1 4

In [196]: df1 + df2
Out[196]:
A B
0 NaN NaN
1 NaN NaN

In [197]: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
.....: columns=list("abcd"))

In [198]: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
.....: columns=list("abcde"))

# Here is an example where we set a particular value to NA (null) by assigning np.nan to it
In [199]: df2.loc[1, "b"] = np.nan

In [200]: df1
Out[200]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0

# Adding these results in missing values in the locations that don’t overlap
In [201]: df2
Out[201]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0

In [202]: df1 + df2
Out[202]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
Method Description
add, radd Methods for addition (+)
sub, rsub Methods for subtraction (-)
div, rdiv Methods for division (/)
floordiv, rfloordiv Methods for floor division (//)
mul, rmul Methods for multiplication (*)
pow, rpow Methods for exponentiation (**)

Sorting and Ranking

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
# To sort lexicographically by row or column label, use the sort_index method, which returns a new, sorted object
In [234]: obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])

In [235]: obj
Out[235]:
d 0
a 1
b 2
c 3
dtype: int64

In [236]: obj.sort_index()
Out[236]:
a 1
b 2
c 3
d 0
dtype: int64

# With a DataFrame, you can sort by index on either axis
In [237]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
.....: index=["three", "one"],
.....: columns=["d", "a", "b", "c"])

In [238]: frame
Out[238]:
d a b c
three 0 1 2 3
one 4 5 6 7

In [239]: frame.sort_index()
Out[239]:
d a b c
one 4 5 6 7
three 0 1 2 3

In [240]: frame.sort_index(axis="columns")
Out[240]:
a b c d
three 1 2 3 0
one 5 6 7 4

# The data is sorted in ascending order by default but can be sorted in descending order, too.
In [241]: frame.sort_index(axis="columns", ascending=False)
Out[241]:
d c b a
three 0 3 2 1
one 4 7 6 5

In [242]: obj = pd.Series([4, 7, -3, 2])

# To sort a Series by its values, use its sort_values method
In [243]: obj.sort_values()
Out[243]:
2 -3
3 2
0 4
1 7
dtype: int64

In [244]: obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [245]: obj.sort_values()
Out[245]:
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64

# Missing values can be sorted to the start instead by using the na_position option.
In [246]: obj.sort_values(na_position="first")
Out[246]:
1 NaN
3 NaN
4 -3.0
5 2.0
0 4.0
2 7.0
dtype: float64

In [247]: frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

In [248]: frame
Out[248]:
b a
0 4 0
1 7 1
2 -3 0
3 2 1

In [249]: frame.sort_values("b")
Out[249]:
b a
2 -3 0
3 2 1
0 4 0
1 7 1

In [250]: frame.sort_values(["a", "b"])
Out[250]:
b a
2 -3 0
0 4 0
3 2 1
1 7 1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
In [251]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [252]: obj.rank()
Out[252]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64

In [253]: obj.rank(method="first")
Out[253]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64

In [254]: obj.rank(ascending=False)
Out[254]:
0 1.5
1 7.0
2 1.5
3 3.5
4 5.0
5 6.0
6 3.5
dtype: float64

In [255]: frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
.....: "c": [-2, 5, 8, -2.5]})

In [256]: frame
Out[256]:
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5

In [257]: frame.rank(axis="columns")
Out[257]:
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0

Table: Tie-breaking methods with rank

Method Description
"average" Default: assign the average rank to each entry in the equal group
"min" Use the minimum rank for the whole group
"max" Use the maximum rank for the whole group
"first" Assign ranks in the order the values appear in the data
"dense" Like method=“min”, but ranks always increase by 1 between groups rather than the number of equal elements in a group

Summarizing and Computing Descriptive Statistics

Table:Descriptive and summary statistics

Method Description
count Number of non-NA values
describe Compute set of summary statistics
min, max Compute minimum and maximum values
argmin, argmax Compute index locations (integers) at which minimum or maximum value is obtained; not available on DataFrame objects
idxmin, idxmax Compute index labels at which minimum or maximum value is obtained
quantile Compute sample quantile ranging from 0 to 1 (default: 0.5)
sum Sum of values
mean Mean of values
median Arithmetic median (50% quantile) of values
mad Mean absolute deviation from mean value
prod Product of all values
var Sample variance of values
std Sample standard deviation of values
skew Sample skewness (third moment) of values
kurt Sample kurtosis (fourth moment) of values
cumsum Cumulative sum of values
cummin, cummax Cumulative minimum or maximum of values, respectively
cumprod Cumulative product of values
diff Compute first arithmetic difference (useful for time series)
pct_change Compute percent changes

Table: Options for reduction methods

Method Description
axis Axis to reduce over; “index” for DataFrame’s rows and “columns” for columns
skipna Exclude missing values; True by default
level Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)