https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png

Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Pandas I

Description: This notebook describes how to:

  • Create a Pandas Series or DataFrame

  • Accessing data rows, columns, elements using .loc and .iloc

  • Creating filters using boolean operators

  • Changing data in rows, columns, and elements

This is the first notebook in a series on learning to use Pandas.

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Knowledge Required:

Knowledge Recommended:

Completion Time: 75 minutes

Data Format: CSV (.csv)

Libraries Used: Pandas

Research Pipeline: None


When to use Pandas

Pandas is a Python data analysis and manipulation library. When it comes to viewing and manipulating data, most people are familiar with commercial spreadsheet software, such as Microsoft Excel or Google Sheets. While spreadsheet software and Pandas can accomplish similar tasks, each has significant advantages depending on the use-case.

Advantages of Spreadsheet Software

  • Point and click

  • Easier to learn

  • Great for small datasets (<10,000 rows)

  • Better for browsing data

Advantages of Pandas

  • More powerful data manipulation with Python

  • Can work with large datasets (millions of rows)

  • Faster for complicated manipulations

  • Better for cleaning and/or pre-processing data

  • Can automate workflows in a larger data pipeline

In short, spreadsheet software is better for browsing small datasets and making moderate adjustments. Pandas is better for automating data cleaning processes that require large or complex data manipulation.

Pandas can interpret a wide variety of data sources, including Excel files, CSV files, and Python objects like lists and dictionaries. Pandas converts these into two fundamental objects:

  • Data Series- a single column of data

  • DataFrame- a table of data containing multiple columns and rows

Pandas Series

We can think of a Series as a single column of data. A DataFrame then is made by combining Series objects side-by-side into a table that has both height and width. Let’s create a Series based on the world’s ten most-populated countries according to Wikipedia.

Population (in millions)

1,404

1,366

330

269

220

211

206

169

146

127

We will put these population numbers into a Pandas Series.

# import pandas, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd

To create our Series, we pass a list into the Series method:

variable_name = pd.Series([1, 2, 3])

# Create a data series in Pandas
worldpop = pd.Series([1404, 1366, 330, 269, 220, 211, 206, 169, 146, 127])

# Give our series a name
worldpop.name = 'World Population (In Millions)'
print(worldpop)

Underneath the Series is a dtype which describes the way the data is stored in the Series. Here we see int64, denoting the data is a 64-bit integer.

.iloc[] Integer Location Selection

To the left of each Series is an index number. This index number is very similar to a Python list index; it can help us reference a particular row for data retrieval. Also, like a Python list, the index to a Series begins with 0. We can retrieve individual elements in a Series using the .iloc attribute, which stands for “index location.”

# Return the 4th element in our series
worldpop.iloc[3]
# Return a slice of elements in our series
# This slice will not include element 4
worldpop.iloc[2:4]

By default, our Series has a numerical index like a Python list, but we can also give each row an identifier (like a key within a Python dictionary). We do this by using:

series_name.index = [name_1, name_2, name_3]

# Rename the index to use names instead of numerical indexes
worldpop.index = [
    'China',
    'India',
    'United States',
    'Indonesia',
    'Pakistan',
    'Brazil',
    'Nigeria',
    'Bangladesh',
    'Russia',
    'Mexico'
]

worldpop

.loc[] Location Selection

Now we can also reference each element by its index name, very similar to how we can supply a key to a dictionary to get a value. We use the .loc attribute.

# Return the series value for Nigeria
worldpop.loc['Nigeria']

Instead of a value, we can return a new series by supplying a list. This will return the value with the index names as well.

# Return a new series containing only Nigeria
# Note that we use two sets of brackets

worldpop.loc[['Nigeria']]
# Return a series value for Indonesia and Mexico
worldpop.loc[['Indonesia', 'Mexico']]
# Return a slice from Nigeria to Russia
# This slice will include the final element!
# This behavior is different than a list slice

worldpop.loc['Nigeria':'Russia']

A Series is like an ordered dictionary. In fact, we can create a Series out of a list (where the index will automatically be numerical starting at 0) or a dictionary (where the keys are the index).

# Creating a Series from a dictionary
# Based on most populous cities in the world according to Wikipedia

worldcitiespop = pd.Series({
    'Tokyo': 37,
    'Delhi': 28,
    'Shanghai': 25,
    'São Paulo': 21,
    'Mexico City': 21,
    'Cairo': 20,
    'Mumbai': 19,
    'Beijing': 19,
    'Dhaka': 19,
    'Osaka': 19,
}, name='World City Populations (In Millions)')

#Return the series
worldcitiespop

Boolean Expressions

We have seen already how we can select a particular value in a series by using an index name or number. We can also select particular values using Boolean expressions. An expression will evaluate to a Truth Table.

# Which countries have populations greater than 200 million?
worldpop > 200

Instead of evaluating to a Truth Table, we can also evaluate to a smaller series by putting the expression into .loc[].

# Evaluate worldpop for `worldpop > 200`
worldpop.loc[worldpop > 200]

Note that we have not changed the values of worldpop but only evaluated the expression. worldpop remains the same.

worldpop

If we wanted to store the evaluation, we would need to use an assignment statement, either for worldpop or a new variable.

# If we wanted to save this to a new series variable
new_series = worldpop[worldpop > 200]

new_series

Pandas uses | to represent or operations. It uses & to represent and operations. We can also use ~ for negation.

Pandas Operator

Boolean

Requires

&

and

All required to True

|

or

If any are True

~

not

The opposite

worldpop.loc[(worldpop > 500) | (worldpop < 250)]

Modifying a Series

We can use an initialization statement to change a value in our Series. The syntax is very similar to changing an item value in a list.

# Change the population of China to 1500
worldpop.loc['China'] = 1500
worldpop
# Change the population of several countries based on an expression
worldpop.loc[worldpop < 300] = 25
worldpop

Summary of Pandas Series

  • A Series is a single column of data that may contain a Name and Index

  • Use .iloc to select a row by index number

  • Use .loc to select a row by index name

  • Use an initialization statement to change values

  • Boolean operators include & (and), | (or), ~ (negation)

Pandas DataFrame

If a Series is like a column of data, a DataFrame is like a table connecting multiple columns together. DataFrames can contain thousands or millions of rows and columns. When working with DataFrames, we are usually using a dataset that has been compiled by someone else. Often the data will be in the form of a CSV or Excel file.

We can import a .csv file with .read_csv() method, passing in the csv location. We can also supply an index column name with index_col.

import pandas as pd

# Create a DataFrame `df` from the CSV file 'sample2.csv'
df = pd.read_csv('data/sample2.csv', index_col='Username')

Exploring DataFrame Contents

Now that we have a DataFrame called df, we need to learn a little more about its contents. The first step is usually to explore the DataFrame’s attributes. Attributes are properties of the dataset (not functions), so they do not have parentheses () after them.

Attribute

Reveals

.shape

The number of rows and columns

.info

The shape plus the first and last 5 rows

.columns

The name of each column

.rows

The name of each row

# Use `.shape` to find rows and columns in the DataFrame
df.shape
# Use `.info` to find the shape plus the first and last five rows of the DataFrame
df.info
# Use `.columns` to find the name of each column (if they are named)
df.columns

We can use .index attribute to discover the name for each row in our DataFrame. We set the index column to Username, but Identifier would also make sense. If no column is chosen, a numeric index is created starting at 0.

# Use `.index` to list the rows of our DataFrame
df.index

Preview with .head() and .tail()

We can also use the .head() and .tail methods to get a preview of our DataFrame.

# Use `.head()` to see the first five lines
# Pass an integer into .head() to see a different number of lines
df.head()
# Use `.tail()` to see the last five lines
# Pass an integer into .tail() to see a different number lines
df.tail()

Display More Rows or Columns

By default, Pandas limits the number of rows and columns to display. If desired, we can increase or decrease the number to display. If your DataFrame has limited number of rows or columns, you may wish to show all of them.

# Show all columns
# Set `None` to an integer to show a set number
pd.set_option('display.max_columns', None)

# Show all rows
# Set `None` to an integer to show a set number
# Be careful if your dataset is thousands of lines long!
pd.set_option('display.max_rows', None)

Change Column Names

If we wanted to change the column names, one option is to modify the original data file. We can also change the column names in the DataFrame.

# Updating all column names at once
df.columns = ['email', 'Identifier', 'First name', 'Last name']
df
# Updating a single column name
df.rename(columns={'email': 'Login email'}, inplace=True)
df

By default, inplace=False which means that Pandas will output what the change would look like but no make changes to the dataframe. It is a preview of the changes. This feature is intentional to make sure the user does not accidentally make a permanent change. There is no undo! Always keep a backup of your file and do not write changes over the original file unless you are sure they are correct.

Passing inplace=True tells Pandas to make the change immediately without any preview.

Reset the Index

When we created the dataframe, we used the index_col attribute to set the index column to the Username column.

df = pd.read_csv('data/sample2.csv', index_col='Username')

We could reset the index to a numerical index starting at 0 using the .reset_index() method.

# Reset the Index for the DataFrame to integers
# creating a new column
# Passing `inplace=True` makes the change immediately
df.reset_index()

For many operations that will alter a DataFrame, such as .reset_index, the changes will be previewed unless a inplace=True parameter is passed. This allows users to preview changes to the data before implementing them in a permanent fashion. Of course, you should always work on a copy of your data in case a manipulation goes awry.

# Confirm index has not been changed
df
# Make the change to reset the index
df.reset_index(inplace=True)
# Print the index, now changed
df
# Change the index back to `Username`
df.set_index('Username', inplace=True)
df

Sorting the Index

We can sort the index by using sort_index().

# Sort the DataFrame by ascending order
df.sort_index()
# Sort by descending order
df.sort_index(ascending=False)

.loc[] and .iloc[] Selection

Like Series, DataFrames can use the .iloc[] and .loc[] methods for selection. To select a particular element, we need to supply a row and a column.

# View our DataFrame for reference
df
# Return the value for the specified row and column
df.iloc[6, 3]
# Return the value for the specified row and column
df.loc['booker12', 'First name']
# Select an entire row
df.loc['redtree333', :]

Technically, we could also use: df.loc['redtree333'] for the same result, but including the , : makes our row and column selections explicit, where the : is basically a slice that includes the whole column. Using a : is required if we want to select an entire column using .loc[] since the row selection comes before the column selection.

# Select an entire column
df.loc[:, 'Login email']

Of course, we can use the : to make a slice using .loc[] or .loc.

# Slicing rows and columns using `.iloc`
df.iloc[0:3, 1:4]

Note that .iloc[] slicing is not inclusive of the final value, similar to a Python list. On the other hand, .loc[] slicing is inclusive. The reason for this difference is that it would make the code confusing since we would need to include whatever name is after the name we want to include.

# Slicing rows and columns using `.loc`
df.loc['booker12':'french999', 'Login email':'First name']

Boolean Expressions

We can also use Boolean expressions to select based on the contents of the elements. We can use these expressions to create filters for selecting particular rows or columns.

Pandas Operator

Boolean

Requires

&

and

All required to True

|

or

If any are True

~

not

The opposite

df
# Return a Truth Table for the `Identifier` column
# Where the Identifier is more than 4000
df.loc[:, 'Identifier'] > 4000
# Preview every row where the Identifier is more than 4000
id_filter = (df.loc[:, 'Identifier'] > 4000)
df.loc[id_filter, :]

# Alternatively, the whole expression can be written out
# But this can be a little more difficult to read
# In this case, it is a good idea to include parentheses
# To make clear the row filter is one expression
#df.loc[(df.loc[:, 'Identifier'] > 4000), :]
# Preview every row with Last name not "Smith"
name_filter = df.loc[:, 'Last name'] == 'Smith'
df.loc[name_filter, :]
# Select the row with `First Name` of Jamie
# And last name of `Smith`
name_filter = (df.loc[:, 'Last name'] == 'Smith') & (df.loc[:, 'First name'] == 'Jamie')
df.loc[name_filter, :]
# Find every row with Last Name not `Smith`
name_filter = (df.loc[:, 'Last name'] == 'Smith')
df.loc[~name_filter, :]

# Or alternatively
#name_filter = (df.loc[:, 'Last name'] != 'Smith')
#df.loc[name_filter, :]

Modifying a DataFrame

A single element can be changed with an initialization statement.

# Change a value using `.loc[]`
df.loc['jenkins46', 'First name'] = 'Mark'
df

We can also use filters for more powerful manipulation.

# Create a string filter that checks for email addresses containing
# 'example.com'. For missing (na) elements, output `False` instead of NaN.
email_filt = df['Login email'].str.contains('example.com', na=False)
email_filt
# Re-Initialize `df` without the users with no email address
df = df[email_filt]
df

Dropping Rows Without Data

There is also a .dropna() method specifically for dropping rows without data

# Recreate the DataFrame `df` from the CSV file 'sample2.csv'
df = pd.read_csv('data/sample2.csv', index_col='Username')
df # Confirm the NaN fields have returned
# Remove all rows without a `Login email` using `.dropna()`
df = df.dropna(subset=['Login email'])
df # Confirm the fields were dropped

Summary of Pandas DataFrames

  • A DataFrame has multiple rows and columns

  • Use attributes along with .head() and .tail() to explore the DataFrame

  • Use .iloc and .loc to select an column, row, or element

  • Use inplace=True to confirm certain manipulations

  • Filters and Boolean Operators can be powerful selectors

  • Use an initialization statement to change one or many elements

  • Drop rows without data using the .dropna() method