ML#0.2 Getting Started with pandas
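[!NOTE] Note
This chapter quotes extensively from Python for Data Analysis, Third Edition, by Wes McKinney, and is intended for study and discussion only.
This chapter assumes the following imports throughout: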
import numpy as np
import pandas as pd
Getting Started with pandas
pandas Data Structures
Series
The string representation of a Series displayed interactively
shows the index on the left and the values on the right.
In [14]: obj = pd.Series([4, 7, -5, 3])
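As a minimal sketch of the display described above, the snippet below builds the same Series and inspects its two halves separately. No index is passed, so the default integer index is used.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# With no explicit index, pandas assigns a default integer index 0..N-1.
print(obj)             # index on the left, values on the right
print(obj.index)       # the labels
print(obj.to_numpy())  # the underlying values as a NumPy array
```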
Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array,
scalar multiplication, or applying math functions, will preserve the index-value link.
In [24]: obj2[obj2 > 0]
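The sketch below illustrates how the three kinds of operation keep labels attached to values. The definition of obj2 is not shown in this section, so the one used here is an assumption.

```python
import numpy as np
import pandas as pd

# Assumed definition of obj2 (it is not shown in this section).
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

positive = obj2[obj2 > 0]  # Boolean filtering keeps the labels of matching rows
doubled = obj2 * 2         # scalar multiplication keeps every label
exps = np.exp(obj2)        # a NumPy ufunc returns a Series with the same index

print(positive)
print(doubled)
print(exps)
```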
DataFrame
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
The del keyword deletes a column, just as it deletes a key from a dictionary.
In [68]: frame2["eastern"] = frame2["state"] == "Ohio"
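A small sketch of adding and then deleting a column follows. The frame here is hypothetical illustration data, not the frame2 object from the text.

```python
import pandas as pd

# Hypothetical data for illustration only.
frame = pd.DataFrame({"state": ["Ohio", "Nevada", "Ohio"],
                      "year": [2000, 2001, 2002]})

# Assigning to a new column name creates the column.
frame["eastern"] = frame["state"] == "Ohio"
print(frame.columns.tolist())

# del removes the column from the DataFrame in place.
del frame["eastern"]
print(frame.columns.tolist())
```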
Possible data inputs to the DataFrame constructor
| Type | Notes |
|---|---|
| 2D ndarray | A matrix of data, passing optional row and column labels |
| Dictionary of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length |
| NumPy structured/record array | Treated as the “dictionary of arrays” case |
| Dictionary of Series | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed |
| Dictionary of dictionaries | Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case |
| List of dictionaries or Series | Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels |
| List of lists or tuples | Treated as the “2D ndarray” case |
| Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the “2D ndarray” case except masked values are missing in the DataFrame result |
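Two of the rows in the table above can be sketched as follows: a dictionary of Series (indexes are unioned into the row index) and a list of dictionaries (keys are unioned into the column labels). The data is made up for illustration.

```python
import pandas as pd

# Dictionary of Series: each Series becomes a column and the Series
# indexes are unioned to form the row index.
by_column = pd.DataFrame({
    "a": pd.Series([1, 2], index=["x", "y"]),
    "b": pd.Series([3, 4], index=["y", "z"]),
})

# List of dictionaries: each dictionary becomes a row and the keys
# are unioned to form the column labels.
by_row = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

print(by_column)  # missing combinations show up as NaN
print(by_row)
```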
Some Index methods and properties
| Method/Property | Description |
|---|---|
| append() | Concatenate with additional Index objects, producing a new Index |
| difference() | Compute set difference as an Index |
| intersection() | Compute set intersection |
| union() | Compute set union |
| isin() | Compute Boolean array indicating whether each value is contained in the passed collection |
| delete() | Compute new Index with element at index i deleted |
| drop() | Compute new Index by deleting passed values |
| insert() | Compute new Index by inserting element at index i |
| is_monotonic_increasing | Returns True if each element is greater than or equal to the previous element (named is_monotonic before pandas 2.0) |
| is_unique | Returns True if the Index has no duplicate values |
| unique() | Compute the array of unique values in the Index |
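A few of the set-like methods from the table can be tried on two small Index objects. This is only a sketch with made-up labels.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

print(left.union(right))         # labels found in either Index
print(left.intersection(right))  # labels found in both
print(left.difference(right))    # labels found only in left
print(left.isin(["a", "c"]))     # Boolean array, one entry per label
print(left.insert(1, "z"))       # returns a new Index; left is unchanged
print(left.is_unique)            # True because there are no duplicates
```

Index objects are immutable, so each of these calls returns a new object rather than modifying left or right.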
Essential Functionality
Reindex
Dropping Entries from an Axis
Series.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Parameters:
- labels (single label or list-like): Index labels to drop.
- axis ({0 or 'index'}): Unused. Parameter needed for compatibility with DataFrame.
- index (single label or list-like): Redundant for application on Series, but 'index' can be used instead of 'labels'.
- columns (single label or list-like): No change is made to the Series; use 'index' or 'labels' instead.
- level (int or level name, optional): For MultiIndex, level for which the labels will be removed.
- inplace (bool, default False): If True, do operation inplace and return None.
- errors ({'ignore', 'raise'}, default 'raise'): If 'ignore', suppress error and only existing labels are dropped.
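The following sketch exercises the main drop parameters on a Series and, for comparison, on a DataFrame, where index and columns select the axis. All data and labels are hypothetical.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5.0), index=["a", "b", "c", "d", "e"])
frame = pd.DataFrame(np.arange(6).reshape(3, 2),
                     index=["r1", "r2", "r3"],
                     columns=["one", "two"])

dropped = ser.drop(["d", "c"])            # returns a new Series without "d" and "c"
no_r2 = frame.drop(index="r2")            # drop a row by label
no_two = frame.drop(columns=["two"])      # drop a column by label
safe = ser.drop("zzz", errors="ignore")   # missing label is silently skipped

print(dropped)
print(no_r2)
print(no_two)
print(safe)
```

Because inplace defaults to False, ser and frame themselves are left unchanged by every call above.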
DataFrame.reindex(labels=None, *, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)
Parameters:
- labels (array-like, optional): New labels/index to conform the axis specified by 'axis' to.
- index (array-like, optional): New labels for the index. Preferably an Index object to avoid duplicating data.
- columns (array-like, optional): New labels for the columns. Preferably an Index object to avoid duplicating data.
- axis (int or str, optional): Axis to target. Can be either the axis name ('index', 'columns') or number (0, 1).
- method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}): Method to use for filling holes in the reindexed DataFrame.
  - None (default): Don't fill gaps.
  - pad / ffill: Propagate last valid observation forward to next valid.
  - backfill / bfill: Use next valid observation to fill gap.
  - nearest: Use nearest valid observations to fill gap.
- copy (bool, default True): Return a new object, even if the passed indexes are the same.
- level (int or name): Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value (scalar, default np.nan): Value to use for missing values. Defaults to NaN, but can be any compatible value.
- limit (int, default None): Maximum number of consecutive elements to forward or backward fill.
- tolerance (optional): Maximum distance between original and new labels for inexact matches. Can be scalar or list-like, applying variable tolerance per element.
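As a rough illustration of the reindexing parameters listed above (fill_value and method in particular), the sketch below reindexes two small Series. The data is made up for the example.

```python
import pandas as pd

obj = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Labels missing from the original index get NaN by default.
reordered = obj.reindex(["c", "a", "d"])

# fill_value replaces the default NaN for newly introduced labels.
zero_filled = obj.reindex(["c", "a", "d"], fill_value=0.0)

# method="ffill" carries the last valid observation forward into the gaps.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
ffilled = colors.reindex(list(range(6)), method="ffill")

print(reordered)
print(zero_filled)
print(ffilled)
```

Note that method-based filling requires the original index to be monotonic, which is why the colors example uses a sorted integer index.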
Indexing, Selection, and Filtering
Because the loc operator indexes exclusively with labels, pandas also provides an iloc operator that indexes exclusively with integer positions, so it behaves consistently whether or not the index itself contains integers.
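The difference matters most when the index holds integers that do not match positions. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions.
ser = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = ser.loc[0]      # the value whose label is 0
by_position = ser.iloc[0]  # the value in the first position
print(by_label, by_position)

# Slicing differs too: loc slices include the end label, iloc slices do not.
letters = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(letters.loc["a":"b"])  # two rows: "a" and "b"
print(letters.iloc[0:1])     # one row: position 0 only
```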
Indexing options with DataFrame
Arithmetic and Data Alignment
In [182]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
| Method | Description |
|---|---|
| add, radd | Methods for addition (+) |
| sub, rsub | Methods for subtraction (-) |
| div, rdiv | Methods for division (/) |
| floordiv, rfloordiv | Methods for floor division (//) |
| mul, rmul | Methods for multiplication (*) |
| pow, rpow | Methods for exponentiation (**) |
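The sketch below shows label alignment with the + operator and how the method forms accept a fill_value. s1 matches the Series defined above; s2 is not shown in this section, so its definition here is an assumption.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
# Assumed second operand (not shown in this section).
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

# Labels present in only one operand produce NaN in the result.
total = s1 + s2

# The method form can substitute a value for the missing side instead.
filled = s1.add(s2, fill_value=0)

# The r-prefixed methods flip the operands: s1.rdiv(1) computes 1 / s1.
reciprocal = s1.rdiv(1)

print(total)
print(filled)
print(reciprocal)
```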
Sorting and Ranking
# To sort lexicographically by row or column label, use the sort_index method, which returns a new, sorted object
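A short sketch of sort_index on a Series and on both axes of a DataFrame, using made-up labels:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
frame = pd.DataFrame(np.arange(8).reshape(2, 4),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])

sorted_obj = obj.sort_index()                                  # sort by label
sorted_rows = frame.sort_index()                               # sort row labels
sorted_cols = frame.sort_index(axis="columns", ascending=False)  # sort column labels, descending

print(sorted_obj)
print(sorted_rows)
print(sorted_cols)
```

The original objects keep their order, since sort_index returns a new object.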
In [251]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
Table: Tie-breaking methods with rank
| Method | Description |
|---|---|
| "average" | Default: assign the average rank to each entry in the equal group |
| "min" | Use the minimum rank for the whole group |
| "max" | Use the maximum rank for the whole group |
| "first" | Assign ranks in the order the values appear in the data |
| "dense" | Like method="min", but ranks always increase by 1 between groups rather than the number of equal elements in a group |
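To compare the tie-breaking methods side by side, the sketch below ranks the same obj Series with each method and collects the results in one DataFrame.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method; the values 7 and 4 each appear twice.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

For the tied 7s, "average" gives 6.5 to both, "min" gives 6, "max" gives 7, "first" gives 6 and 7 in order of appearance, and "dense" gives 5 because dense ranks never skip a number.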
Summarizing and Computing Descriptive Statistics
Table: Descriptive and summary statistics
| Method | Description |
|---|---|
| count | Number of non-NA values |
| describe | Compute set of summary statistics |
| min, max | Compute minimum and maximum values |
| argmin, argmax | Compute index locations (integers) at which minimum or maximum value is obtained; not available on DataFrame objects |
| idxmin, idxmax | Compute index labels at which minimum or maximum value is obtained |
| quantile | Compute sample quantile ranging from 0 to 1 (default: 0.5) |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50% quantile) of values |
| mad | Mean absolute deviation from mean value (deprecated in pandas 1.5 and removed in 2.0) |
| prod | Product of all values |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness (third moment) of values |
| kurt | Sample kurtosis (fourth moment) of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum or maximum of values, respectively |
| cumprod | Cumulative product of values |
| diff | Compute first arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |
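A handful of the methods in the table can be tried on a small frame that contains missing values. The data below is hypothetical and chosen only to show how NA values are skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values.
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

counts = df.count()    # non-NA values per column
totals = df.sum()      # column sums, skipping NaN
top = df.idxmax()      # index label of each column's maximum
running = df.cumsum()  # cumulative sums; NaN entries stay NaN
print(counts, totals, top, running, sep="\n\n")
print(df.describe())   # several summary statistics at once

# diff and pct_change on a clean Series.
s = pd.Series([1.0, 2.0, 4.0])
print(s.diff())        # first differences
print(s.pct_change())  # fractional change from the previous element
```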
Table: Options for reduction methods
| Argument | Description |
|---|---|
| axis | Axis to reduce over; "index" for DataFrame's rows and "columns" for columns |
| skipna | Exclude missing values; True by default |
| level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex); removed from reduction methods in pandas 2.0, where groupby(level=...) is used instead |
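The effect of axis and skipna can be seen on a two-row frame with one missing value. This is a minimal sketch with made-up data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan], "two": [3.0, 4.0]}, index=["a", "b"])

down_rows = df.sum(axis="index")                 # one result per column
across = df.sum(axis="columns")                  # one result per row, NaN skipped
strict = df.sum(axis="columns", skipna=False)    # NaN propagates into the result

print(down_rows)
print(across)
print(strict)
```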
