DataFrame

You can think of a DataFrame as a spreadsheet or SQL table. It is the most commonly used pandas object. Rows are referred to as axis 0 and columns as axis 1, and each axis can have multiple levels (which is not common).

Creating a DataFrame

The example below creates the sample DataFrame used for the examples on this page.

import pandas as pd
df=pd.DataFrame(data=[['apple','fruit',85,3], # Natural line breaking allowed for code within parentheses()
                      ['orange','fruit',68,1],
                      ['banana','fruit',69,2],
                      ['beetroot','vegetable',68,1],
                      ['carrot','vegetable',45,2]],
                columns=['name','type','health_index','cost_index'])
df # prints the dataframe

Result:

       name       type  health_index  cost_index
0     apple      fruit            85           3
1    orange      fruit            68           1
2    banana      fruit            69           2
3  beetroot  vegetable            68           1
4    carrot  vegetable            45           2

Datatypes & conversions

Available datatypes

object (strings, lists, dicts, tuples and other complex objects), int64, float64, datetime64, bool

#string
s='hello world!' #s[0:2] returns 'he', s[:-1] returns 'hello world'

#list
a=[1,4,5,1,3,6] # a[0] gets first element; a[-1] gets last element. List elements can be added or removed with a.append(elem) or a.remove(elem)

#dict
d={'movie_name':'star wars', 'rating':5} # d['movie_name'] gets 'star wars'. Dict elements can be added or removed using d['release_year']=1977 or del d['rating']

#tuple
t=('apple','iphone','7 plus') # similar to a list, t[0] gets the first element. Tuples are immutable: values cannot be added or changed

#complex objects (combination of two or more of above)
c={'product':('apple','iphone','7 plus'),'features':['bluetooth','wifi']}

Get data types for each column in a data frame

df.dtypes # gives the data type of each column in the dataframe df
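
For the sample df above, this shows:

name            object
type            object
health_index     int64
cost_index       int64
dtype: object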

Convert int column to float

You can convert each column independently using the Series.astype method

df['mycol'] = df.mycol.astype(float)

Convert string/object to numeric

Pandas auto-detects data types. A numeric column with nulls or a few string values is usually treated as object. The code below replaces those values with 0 and converts the column to int:

import numpy as np

df.ID = (pd.to_numeric(df.ID, errors='coerce')  # coerce turns non-numeric values into NaN
         .fillna(0)                             # replaces NaN with 0
         .astype(np.int64))                     # type conversion

If there are no issues in your column, you can just run pd.to_numeric as shown below. It converts to float if you have decimal values, or int otherwise.

df['mycol'] = pd.to_numeric(df.mycol)

Other conversions

dict(pairs) # converts an iterable of key/value pairs (e.g. a zip or list of 2-tuples) to a dict
list(tup) # converts tuple to list
tuple(lst) # converts list to tuple

Select columns

Simple select

df[col1] # Returns series

df[[col1]] # Returns dataframe with 1 col
df.loc[:,[col1]] # Same as above; : refers to all rows
df.iloc[:,[0]] # Returns dataframe with 1st col

df[[col1,col2]] # Returns dataframe with 2 cols
df.loc[:,[col1,col2]] # Same as above
df.iloc[:,0:3] # Returns dataframe with first 3 columns (0,1,2)

More creative selects from a list here: https://teamtreehouse.com/community/how-do-i-get-the-first-4-and-last-4-items-of-a-list

Select from multi-level index using tuples

The tuples () can have 2 or more levels

df[[('label','count'),('label','mean')]] # selects 2 columns (first under tree label->count and second under label->mean)

Creating multi-level index using tuples

The tuples () can have 2 or more levels

pd.MultiIndex.from_tuples([('Parent1', 'Child1'), ('Parent1', 'Child2')], names=['Parent', 'Child'])

Alternatively, you can construct lists and convert them to tuples using zip:

arrays = [['Parent1', 'Parent1'],
          ['Child1', 'Child2']]
tuples = list(zip(*arrays))
pd.MultiIndex.from_tuples(tuples, names=['Parent', 'Child'])
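
The resulting index can then be attached to a dataframe. A minimal sketch (df2 and its values are made up for illustration):

idx = pd.MultiIndex.from_tuples(tuples, names=['Parent', 'Child'])
df2 = pd.DataFrame({'val': [1, 2]}, index=idx) # or attach to an existing frame: df.index = idx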

Programmatically select columns

df[[col for col in df if 'prob' in col]] # selects all columns containing 'prob' in column name
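
Two related built-in alternatives (both standard pandas methods):

df.filter(like='prob') # same selection using the filter method
df.select_dtypes(include='number') # select columns by dtype instead of name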

Get distribution of values in a column

df.health_index.value_counts() # record count for each val
df.health_index.value_counts(normalize=True) # % distribution
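
For the sample df, value_counts() returns the count per value (the order of tied counts may vary):

68    2
85    1
69    1
45    1
Name: health_index, dtype: int64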

Filter rows

Filters internally use boolean masking. You can run the expression inside loc[] on its own to see the mask series.

df.loc[df.col1.isin(['apple','orange']),:] # Returns all records where col1 is 'apple' or 'orange'
df.loc[~df.col1.isin(['apple','orange']),:]  # Returns all records where col1 is not 'apple' or 'orange'
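
Multiple conditions can be combined with & (and) and | (or); wrap each condition in parentheses. On the sample df:

df.loc[(df.type == 'fruit') & (df.health_index > 70), :] # returns only the 'apple' row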

Sort

Sort rows

df.sort_values('trips_per_user') #Sorts rows based on values in trips_per_user
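
To sort on multiple columns with mixed directions, pass lists:

df.sort_values(['type', 'health_index'], ascending=[True, False]) # type A-Z, then health_index high to low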

Sort columns

df.sort_index(axis=1,level=0)  # Sorts columns by their names. axis 1 refers to columns. level 0 is the highest/default level

Group by

Group by example

groupby returns a GroupBy object containing (group name, group dataframe) pairs. It can be iterated through manually or summarized using the functions below. NULL values (NaN, NaT) are ignored and do not form a group.

df.groupby('email_domain')
df.groupby(column_list)

Tip: You can also group by a field which is not in the selector (even from another dataframe, as long as the indexes match up)

df['cost_index'].groupby(df.type).sum()

Tip: Group by fields are added to the index of the resulting dataframe. Use reset_index() to move them back to columns

df.groupby(column_list).sum().reset_index()
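
For example, on the sample df:

df.groupby('type')['cost_index'].sum().reset_index() # fruit: 6, vegetable: 3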

Aggregate functions

Apply a common operation on all non-grouped columns. If an aggregate function cannot be applied to a column (e.g. strings), these nuisance columns are silently ignored.

.size() # count
.nunique() # distinct count
.sum() # supports prod, min, max, mean, median, first, last, std, var
.head(5) # returns the first 5 rows of each group, in dataframe order. also supports tail(n), nlargest(n), nsmallest(n)
.nth([4,-1]) # returns the 5th element from the top (position 4, zero-based) and the last element of each group

apply various operations on each of the columns

.agg({'col1':['mean','count'],'col2':['var']})

% of total within groups

df['health_index_%oftotal']=df.health_index/df.groupby('type')['health_index'].transform('sum') # denominator is the group total (see Transform functions below)

Transform functions

The transform function is similar to a window aggregate function over the entire group: it aggregates within each group and broadcasts the result back onto the original rows, preserving the index.

df.groupby('type')['health_index'].transform('sum') # supports all aggregate functions like min, max, mean, std, var

Group Filters (SQL:having)

df.groupby('B').filter(lambda x: len(x['C']) > 2) # keeps only groups with more than 2 records. Refer Lambda functions
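
On the sample df, the same pattern keeps only groups with more than 2 records:

df.groupby('type').filter(lambda x: len(x) > 2) # keeps the 3 'fruit' rows; the 2 'vegetable' rows are dropped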

Window aggregate functions

Overall

Applying a math function on a series aggregates over the entire series. When applied after a group by, it returns an aggregate value per group.

df.health_index.sum()

We could use this to divide individual values and create % of total columns

df['health_index_%oftotal']=df.health_index/df.health_index.sum()

Simple Cumulative

df.cost_index.cumsum() # similar functions: cummin, cummax, cumprod

Expanding Window

Another way of achieving cumulative sum:

df.cost_index.expanding().sum()

Rolling Window

Similar to expanding, a rolling window is also supported. Here is how to do a rolling sum of the current record and the previous (n-1) records:

df.cost_index.rolling(2).sum()

Lead/Lag

df.cost_index.shift(1) # Lag: gets the value from the previous record
df.cost_index.shift(-2) # Lead: gets the value from 2 records ahead

Resample

Used for time series grouping, to create bins of equal time intervals

df_re.groupby('group').resample('1D').sum() # requires a datetime index; one bin per day within each group
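
A minimal self-contained sketch (df_events and its values are made up for illustration):

df_events = pd.DataFrame({'amount': [10, 20, 5, 7]},
                         index=pd.to_datetime(['2021-01-01 09:00', '2021-01-01 17:00',
                                               '2021-01-02 08:30', '2021-01-03 12:00']))
df_events.resample('1D').sum() # one row per day: 30, 5, 7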

To check that window aggregate functions are working as expected, validate them alongside your original dataset like below:

pd.concat([df,df.cost_index.rolling(2,min_periods=1).mean()], axis=1)

Note: The closest equivalent of SQL's rows between n preceding and n following is rolling(2*n+1, center=True), which centers the window on the current row.

Joins

Simple join

train_X.join(train[y_col]) # joins on index. Returns dataframe with all columns from train_X & y_col from train

Concat columns

Concat on axis 1 is like a join based on index:

pd.concat([df1, df4], axis=1)

SQL like join

SQL like join can be achieved with merge:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
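
For example (df_left, df_right and the key column are hypothetical):

pd.merge(df_left, df_right, how='left', on='key') # SQL: df_left LEFT JOIN df_right USING (key)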

Aggregate and join back to original dataset

df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r') # aggregates column A to month level and adds that to each of the records in df and is returned

Formatting

Numeric formating

df.style.format({'B': '{:0<4.0f}', 'C': '{:.2f}', 'D': '{:+.2%}'})
# {:+.2%} is for % with 2 decimals and an explicit +/- sign
# {:.2f} is for 2 decimals
# {:0<4.0f} left-aligns and pads with zeros on the right to a width of 4, with no decimals

Numeric format guides: here

Gradient Background

import seaborn as sns

cm = sns.light_palette('green', as_cmap=True)
df.style.background_gradient(cmap=cm, subset=['B', 'C', 'D']) # applies a heatmap to columns B, C, D

Bars

df.style.bar(subset=['B']) # applies bars to column B

Nesting multiple styles

df.style.format({'B': '{:0<4.0f}', 'D': '{:+.2f}'}).bar(subset=['B'])

More styles: here

Transform

Swap level & Axis

df.swaplevel(i=0,j=1,axis=1) # swaps first 2 levels (0 & 1) on columns (axis 1)
df.swaplevel(i=0,j=1,axis=0) # swaps first 2 levels on the row index

Map

df['housing'].map({'yes': 1, 'no': 0}) # changes values 'yes' to 1 and 'no' to 0

Apply

using lambda functions for column transformations (the lambda is applied element-wise, one value at a time)

df['delayed'] = df['arr_delay'].apply(lambda x: x > 0)

using a lambda function across columns with axis=1 (each row is passed to the lambda as a series)

df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1) # my_test is any user-defined function

Flatten multi-level column index

df.columns=df.columns.get_level_values(0)
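
Note that get_level_values(0) can produce duplicate column names if level 0 repeats. Another common approach is to join the level names (assumes all labels are strings):

df.columns = ['_'.join(col) for col in df.columns] # ('label','count') -> 'label_count'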

Pivot index levels to column

This is like a pivot table operation. It creates N columns based on N unique values in the row index. If you need to pivot on a non-index column, first add it to the index using df.set_index(col, append=True).

df=df.unstack() # Unstacks the right-most index level. To be specific, use the level number or name, e.g. df.unstack(0) or df.unstack('Year')

# Vice versa: to pivot a column level to row index
df=df.stack()
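
A worked example on the sample df, pivoting the 'type' group level to columns:

df.groupby(['type', 'name'])['health_index'].sum().unstack('type') # 'type' values become columns, NaN where no match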

Binning based on value (cut)

cut splits continuous data into equi-distant bins. The bins get skewed when there are outliers in the data.

pd.cut([1,2,3,4,5,6,7,8], 4, labels=False)
# Output: array([0, 0, 1, 1, 2, 2, 3, 3])

pd.cut([1,2,3,4,5,6,7,20], 4, labels=False)
# Output: array([0, 0, 0, 0, 0, 1, 1, 3])
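
You can also pass explicit bin edges instead of a bin count. On the sample df:

pd.cut(df.health_index, bins=[0, 50, 70, 100], labels=['low', 'mid', 'high']) # 45 -> low, 68/69 -> mid, 85 -> high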

Binning based on records (qcut)

qcut splits any data into equi-sized bins (an equal number of records per bin)

pd.qcut([1,2,3,4,5,6,7,20], 4, labels=False)
# Output: array([0, 0, 1, 1, 2, 2, 3, 3])

this can be used in a group by to get decile level metrics

df.groupby(pd.qcut(df.col1,10, labels=False)).mean()

Add/Remove rows & columns

Add Rows

Adding 1 row manually

df.append({'name':'banana', 'type':'fruit', 'health_index':88, 'cost_index':98},ignore_index=True) # New index sequence will be created when ignore_index is True. Else will use original index and could result in duplicate indexes

Adding rows from another dataframe with matching columns (new columns will be created for non-matching columns; missing columns will be NULLs):

df.append(new_df)
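
Note: DataFrame.append was removed in pandas 2.0; pd.concat gives the same result:

pd.concat([df, new_df], ignore_index=True)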

Add columns

to add a column manually

df.assign(new_col=[1,2,3,4,5]) # assign takes keyword arguments and returns a new dataframe

# Alternate approach
df['new_col']=[1,2,3,4,5]

# Add column dynamically creating a column based on a logic on existing column
df['new_col']=[1 if x>0 else 0 for x in df.col]
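
The same conditional column can also be built with numpy's where, a vectorized alternative to the list comprehension:

import numpy as np
df['new_col'] = np.where(df.col > 0, 1, 0) # vectorized if/else over the column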

to create a new data frame with additional column

pd.concat([df,df.cost_index/2], axis=1) # this can also be used to create a dataframe from 2 or more series

Remove rows

Drop using index

df.drop([0,1]) # removes rows with index 0 & 1

Remove columns

drop using column names

df.drop('cost_index', axis=1)

Remove a level from multi-level index

drop the highest level from multi-level column index

df.columns=df.columns.droplevel()

drop a level from multi-level row index

df.index = df.index.droplevel(level=0)

Visualizations

Simple charts

df.cost_index.plot(kind='bar') # kinds: barh, hist, box, kde, density, area, pie (scatter & hexbin need a dataframe with x and y)
df.plot(subplots=True, figsize=(6, 6), layout=(3, -1)) # Layout -1 is to infer the number of columns based on data

Histogram

%matplotlib inline
df.hist(bins=100, figsize=(40,30))

Boxplot for each group

%matplotlib inline
df.groupby('type').boxplot()

Scatter Matrix

from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')

Bootstrap Plot

from pandas.plotting import bootstrap_plot
bootstrap_plot(data, size=50, samples=500, color='grey')

More visualizations: here

Nesting them all

Use \ at the end of a line to continue on the next line

data[['email_domain', y_col]]\
    .groupby('email_domain')\
    .agg({'label':['mean','count']})\
    .sort_values([('label','count'),('label','mean')], ascending=False)\
    .style\
    .format({'B': '{:0<4.0f}', 'D': '{:+.2f}'})\
    .bar(subset=['B'])