[!NOTE] 注意
本章大篇幅引用Pyhon for Data Analysis,Third Edition,Wes McKinney，仅作为学习交流使用。

本章默认进行如下操作：

import numpy as np

import pandas as pd

Getting Started with pandas

pandas Data Structures

Series

The string representation of a Series displayed interactively
shows the index on the left and the values on the right.

In [14]: obj = pd.Series([4, 7, -5, 3])

In [15]: obj
Out[15]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [16]: obj.array
Out[16]: 
<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [17]: obj.index
Out[17]: RangeIndex(start=0, stop=4, step=1)

In [18]: obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

In [19]: obj2
Out[19]: 
d    4
b    7
a   -5
c    3
dtype: int64

In [20]: obj2.index
Out[20]: Index(['d', 'b', 'a', 'c'], dtype='object')

# 通过字典创建Series
In [30]: sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [31]: obj3 = pd.Series(sdata)

In [32]: obj3
Out[32]: 
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Using NumPy functions or NumPy-like operations,
such as filtering with a Boolean array, scalar
multiplication, or applying math functions
, will preserve the index-value link.

In [24]: obj2[obj2 > 0]
Out[24]: 
d    6
b    7
c    3
dtype: int64

In [25]: obj2 * 2
Out[25]: 
d    12
b    14
a   -10
c     6
dtype: int64

In [26]: import numpy as np

In [27]: np.exp(obj2)
Out[27]: 
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

DataFrame

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [50]: frame
Out[50]: 
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

# 返回前五行
In [51]: frame.head()
Out[51]: 
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

# 返回后五行
In [52]: frame.tail()
Out[52]: 
    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

The del keyword will delete columns like with a dictionary.

In [68]: frame2["eastern"] = frame2["state"] == "Ohio"

In [69]: frame2
Out[69]: 
   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7   NaN     True
2  2002    Ohio  3.6  -1.2     True
3  2001  Nevada  2.4   NaN    False
4  2002  Nevada  2.9  -1.5    False
5  2003  Nevada  3.2  -1.7    False

In [70]: del frame2["eastern"]

In [71]: frame2.columns
Out[71]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

Possible data inputs to the DataFrame constructor

Type	Notes
2D ndarray	A matrix of data, passing optional row and column labels
Dictionary of arrays, lists, or tuples	Each sequence becomes a column in the DataFrame; all sequences must be the same length
NumPy structured/record array	Treated as the “dictionary of arrays” case
Dictionary of Series	Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed
Dictionary of dictionaries	Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case
List of dictionaries or Series	Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels
List of lists or tuples	Treated as the “2D ndarray” case
Another DataFrame	The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray	Like the “2D ndarray” case except masked values are missing in the DataFrame result

Some Index methods and properties

Method/Property	Description
append()	Concatenate with additional Index objects, producing a new Index
difference()	Compute set difference as an Index
intersection()	Compute set intersection
union()	Compute set union
isin()	Compute Boolean array indicating whether each value is contained in the passed collection
delete()	Compute new Index with element at Index i deleted
drop()	Compute new Index by deleting passed values
insert()	Compute new Index by inserting element at Index i
is_monotonic	Returns True if each element is greater than or equal to the previous element
is_unique	Returns True if the Index has no duplicate values
unique()	Compute the array of unique values in the Index

Essential Functionality

Reindex

Dropping Entries from an Axis

DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters:

labels
single label or list-like
Index labels to drop.
axis
{0 or ‘index’}
Unused. Parameter needed for compatibility with DataFrame.
index
single label or list-like
Redundant for application on Series, but ‘index’ can be used instead of ‘labels’.
columns
single label or list-like
No change is made to the Series; use ‘index’ or ‘labels’ instead.
level
int or level name, optional
For MultiIndex, level for which the labels will be removed.
inplace
bool, default False
If True, do operation inplace and return None.
errors
{‘ignore’, ‘raise’}, default ‘raise’
If ‘ignore’, suppress error and only existing labels are dropped.

Series.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters:

labels
array-like, optional
New labels/index to conform the axis specified by ‘axis’ to.
index
array-like, optional
New labels for the index. Preferably an Index object to avoid duplicating data.
columns
array-like, optional
New labels for the columns. Preferably an Index object to avoid duplicating data.
axis
int or str, optional
Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).
method
{None, ‘backfill’/‘bfill’, ‘pad’/‘ffill’, ‘nearest’}
Method to use for filling holes in reindexed DataFrame.
- None (default): Don’t fill gaps.
- pad / ffill: Propagate last valid observation forward to next valid.
- backfill / bfill: Use next valid observation to fill gap.
- nearest: Use nearest valid observations to fill gap.
copy
bool, default True
Return a new object, even if the passed indexes are the same.
level
int or name
Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value
scalar, default np.nan
Value to use for missing values. Defaults to NaN, but can be any compatible value.
limit
int, default None
Maximum number of consecutive elements to forward or backward fill.
tolerance
optional
Maximum distance between original and new labels for inexact matches. Can be scalar or list-like, applying variable tolerance per element.

Indexing, Selection, and Filtering

Since loc operator indexes exclusively with labels, there is also an iloc operator that indexes exclusively with integers to work consistently whether or not the index contains integers.

Indexing options with DataFrame

Arithmetic and Data Alignment

In [182]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])

In [183]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
   .....:                index=["a", "c", "e", "f", "g"])

In [184]: s1
Out[184]: 
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [185]: s2
Out[185]: 
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [186]: s1 + s2
Out[186]: 
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [187]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
   .....:                    index=["Ohio", "Texas", "Colorado"])

In [188]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
   .....:                    index=["Utah", "Ohio", "Texas", "Oregon"])

In [189]: df1
Out[189]: 
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [190]: df2
Out[190]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [191]: df1 + df2
Out[191]: 
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

In [192]: df1 = pd.DataFrame({"A": [1, 2]})

In [193]: df2 = pd.DataFrame({"B": [3, 4]})

In [194]: df1
Out[194]: 
   A
0  1
1  2

In [195]: df2
Out[195]: 
   B
0  3
1  4

In [196]: df1 + df2
Out[196]: 
    A   B
0 NaN NaN
1 NaN NaN

In [197]: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
   .....:                    columns=list("abcd"))

In [198]: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
   .....:                    columns=list("abcde"))

# Here is an example where we set a particular value to NA (null) by assigning np.nan to it
In [199]: df2.loc[1, "b"] = np.nan

In [200]: df1
Out[200]: 
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

# Adding these results in missing values in the locations that don’t overlap
In [201]: df2
Out[201]: 
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

In [202]: df1 + df2
Out[202]: 
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

Method	Description
`add`, `radd`	Methods for addition (+)
`sub`, `rsub`	Methods for subtraction (-)
`div`, `rdiv`	Methods for division (/)
`floordiv`, `rfloordiv`	Methods for floor division (//)
`mul`, `rmul`	Methods for multiplication (*)
`pow`, `rpow`	Methods for exponentiation (**)

Sorting and Ranking

# To sort lexicographically by row or column label, use the sort_index method, which returns a new, sorted object
In [234]: obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])

In [235]: obj
Out[235]: 
d    0
a    1
b    2
c    3
dtype: int64

In [236]: obj.sort_index()
Out[236]: 
a    1
b    2
c    3
d    0
dtype: int64

# With a DataFrame, you can sort by index on either axis
In [237]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
   .....:                      index=["three", "one"],
   .....:                      columns=["d", "a", "b", "c"])

In [238]: frame
Out[238]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [239]: frame.sort_index()
Out[239]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [240]: frame.sort_index(axis="columns")
Out[240]: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

# The data is sorted in ascending order by default but can be sorted in descending order, too.
In [241]: frame.sort_index(axis="columns", ascending=False)
Out[241]: 
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

In [242]: obj = pd.Series([4, 7, -3, 2])

# To sort a Series by its values, use its sort_values method
In [243]: obj.sort_values()
Out[243]: 
2   -3
3    2
0    4
1    7
dtype: int64

In [244]: obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [245]: obj.sort_values()
Out[245]: 
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

# Missing values can be sorted to the start instead by using the na_position option.
In [246]: obj.sort_values(na_position="first")
Out[246]: 
1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

In [247]: frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

In [248]: frame
Out[248]: 
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

In [249]: frame.sort_values("b")
Out[249]: 
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1

In [250]: frame.sort_values(["a", "b"])
Out[250]: 
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1

In [251]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [252]: obj.rank()
Out[252]: 
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [253]: obj.rank(method="first")
Out[253]: 
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [254]: obj.rank(ascending=False)
Out[254]: 
0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64

In [255]: frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
   .....:                       "c": [-2, 5, 8, -2.5]})

In [256]: frame
Out[256]: 
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5

In [257]: frame.rank(axis="columns")
Out[257]: 
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0

Table: Tie-breaking methods with rank

Method	Description
`"average"`	Default: assign the average rank to each entry in the equal group
`"min"`	Use the minimum rank for the whole group
`"max"`	Use the maximum rank for the whole group
`"first"`	Assign ranks in the order the values appear in the data
`"dense"`	Like method=“min”, but ranks always increase by 1 between groups rather than the number of equal elements in a group

Summarizing and Computing Descriptive Statistics

Table:Descriptive and summary statistics

Method	Description
`count`	Number of non-NA values
`describe`	Compute set of summary statistics
`min`, `max`	Compute minimum and maximum values
`argmin`, `argmax`	Compute index locations (integers) at which minimum or maximum value is obtained; not available on DataFrame objects
`idxmin`, `idxmax`	Compute index labels at which minimum or maximum value is obtained
`quantile`	Compute sample quantile ranging from 0 to 1 (default: 0.5)
`sum`	Sum of values
`mean`	Mean of values
`median`	Arithmetic median (50% quantile) of values
`mad`	Mean absolute deviation from mean value
`prod`	Product of all values
`var`	Sample variance of values
`std`	Sample standard deviation of values
`skew`	Sample skewness (third moment) of values
`kurt`	Sample kurtosis (fourth moment) of values
`cumsum`	Cumulative sum of values
`cummin`, `cummax`	Cumulative minimum or maximum of values, respectively
`cumprod`	Cumulative product of values
`diff`	Compute first arithmetic difference (useful for time series)
`pct_change`	Compute percent changes

Table: Options for reduction methods

Method	Description
`axis`	Axis to reduce over; “index” for DataFrame’s rows and “columns” for columns
`skipna`	Exclude missing values; True by default
`level`	Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)