
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Exploring Metadata and Pre-Processing

Description of methods in this notebook: This notebook shows how to explore and pre-process the metadata of a dataset using Pandas.

The following processes are described:

  • Importing a CSV file containing the metadata for a given dataset ID

  • Creating a Pandas dataframe to view the metadata

  • Pre-processing your dataset by filtering out unwanted texts

  • Exporting a list of relevant IDs to a CSV file

  • Visualizing the metadata of your pre-processed dataset by the number of documents/year and pages/year

Use Case: For Learners (Detailed explanation, not ideal for researchers)

A Research Version of this notebook is also available.

Difficulty: Intermediate

Completion time: 45 minutes

Knowledge Required: Python basics

Knowledge Recommended: Pandas basics

Data Format: CSV file

Libraries Used: constellate, pandas

Research Pipeline: None


Import your dataset

We’ll use the constellate client to retrieve the metadata for a dataset. The get_metadata method downloads the metadata as a CSV file.

Enter a dataset ID in the next code cell.

If you don’t have a dataset ID, you can use the sample dataset ID below (Shakespeare Quarterly, 1950–present) or create a new dataset in the Constellate builder environment.

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, import the constellate client and call its get_metadata method, passing dataset_id as the argument.

# Import the `constellate` client
import constellate

# Pull in the metadata CSV for our dataset
# The .get_metadata() method downloads the CSV file for our metadata
# to the /data folder and returns a string for the file name and location
# dataset_metadata will be a string containing that file name and location
dataset_metadata = constellate.get_metadata(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
# dataset_metadata = constellate.download(dataset_id, 'metadata')

We are ready to import pandas for our analysis and create a dataframe. We will use the read_csv() method to create our dataframe from the CSV file.

# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv(dataset_metadata)

We can confirm the size of our dataset using the len() function on our dataframe.

original_document_count = len(df)
print(f'Total original documents: {original_document_count}')

Now let’s take a look at the data in our dataframe df. We will set pandas to show all columns using set_option() then get a preview using head().

# Set the pandas option to show all columns
# Setting None shows all columns
# To show fewer columns, replace None with an integer
pd.set_option("display.max_columns", None)

# Set the maximum number of displayed rows to 50
pd.set_option("display.max_rows", 50)

# Show the first five rows of our dataframe
# To show a different number of preview rows
# Pass an integer into the .head()
df.head() 

Here are descriptions for the metadata types found in each column:

  • id: a unique item ID (in JSTOR, this is a stable URL)

  • title: the title for the item

  • isPartOf: the larger work that holds this title (for example, a journal title)

  • publicationYear: the year of publication

  • doi: the digital object identifier for an item

  • docType: the type of document (for example, article or book)

  • provider: the source or provider of the dataset

  • datePublished: the publication date in yyyy-mm-dd format

  • issueNumber: the issue number for a journal publication

  • volumeNumber: the volume number for a journal publication

  • url: a URL for the item and/or the item’s metadata

  • creator: the author or authors of the item

  • publisher: the publisher for the item

  • language: the language or languages of the item (eng is the ISO 639 code for English)

  • pageStart: the first page number of the print version

  • pageEnd: the last page number of the print version

  • placeOfPublication: the city of the publisher

  • wordCount: the number of words in the item

  • pageCount: the number of print pages in the item

  • outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)
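
Before filtering, it can help to survey the dataframe: which columns have missing values and which values recur. Here is an optional sketch using standard pandas methods (the language column name comes from the table above):

# Show column names, non-null counts, and dtypes
df.info()

# See which languages appear in the dataset, and how often
df['language'].value_counts()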


Filtering out columns using Pandas

If there are any columns you would like to drop from your analysis, you can drop them with:

df = df.drop(['column_name1', 'column_name2', ...], axis=1)

# Drop each of these named columns
# axis=1 specifies we are dropping columns
# axis=0 would specify to drop rows
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished'], axis=1)

# Show the first five rows of our updated dataframe
df.head()

Filtering out rows with Pandas

Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let’s examine the first and last ten rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.

# Preview the first ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.head(10)

# Preview the last ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.tail(10)
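
A quick way to surface such patterns is to count how often each title repeats, since paratextual items (like the 'Review Article' texts removed below) often share a title. An optional check using pandas' value_counts:

# Count the most common titles in the dataset
# Repeated titles often indicate paratextual items
df['title'].value_counts().head(10)
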
# We create a function to report how many documents were removed.

def texts_report(pre_count):
    """Prints out a report of:
    1. How many documents were removed
    2. The total original number of documents
    3. The total current number of documents
    """
    
    removed_count = pre_count - len(df)
    print(f'{removed_count} texts were removed.')
    print(f'Total original documents: {original_document_count}')
    print(f'Total current documents: {len(df)}')

Remove all rows without data for a particular column

For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is creator.
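
Before dropping anything, you can count how many rows would be affected. A minimal optional check:

# Count how many rows have no value in the 'creator' column
df['creator'].isna().sum()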

# Remove all texts without an author

print('Removing texts without authors...')
initial_count = len(df)
df = df.dropna(subset=['creator']) # Drop each row that has no value in the 'creator' column

# Report the number of texts removed
texts_report(initial_count)

Remove rows based on the content of a particular column

We can also remove texts depending on whether we do (or do not) want a particular value in a column. Here are a few examples.

# Remove all items with a particular title
# Change title to desired column
# Change `Review Article` to your undesired title
title_to_remove = 'Review Article'

# Removing texts
print(f'Removing texts with title "{title_to_remove}"...')
initial_count = len(df)
df = df[df.title != title_to_remove]

# Report the number of texts removed
texts_report(initial_count)

# Keep only items with a particular language
# Change language to desired column
# Change 'eng' to your desired language
language = 'eng'

# Removing texts
print(f'Removing texts not in "{language}" language...')
initial_count = len(df)
df = df[df.language == language] # Change to another language code for other languages

# Report the number of texts removed
texts_report(initial_count)

# Remove all items with fewer than 1500 words
# Change wordCount to desired column
# Change `min_word_count` to your desired minimum
min_word_count = 1500

# Removing texts
print(f'Removing texts with fewer than {min_word_count} words...')
initial_count = len(df)
df = df[df.wordCount >= min_word_count] # Keep only items with at least min_word_count words

# Report the number of texts removed
texts_report(initial_count)
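
The examples above each target a single value. To remove several values at once, you can use pandas' isin() method with a negation. A minimal sketch; the list of titles here is a hypothetical placeholder:

# Remove every row whose title appears in a list of unwanted titles
# These titles are placeholders; substitute your own
titles_to_remove = ['Review Article', 'Front Matter', 'Back Matter']

print('Removing texts with unwanted titles...')
initial_count = len(df)
df = df[~df['title'].isin(titles_to_remove)]

# Report the number of texts removed
texts_report(initial_count)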

Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.

# Preview the first 50 rows of your dataset
# If all the items look good, move to the next step.
df.head(50)

Saving a list of IDs to a CSV file

# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
# index=False leaves the dataframe index out of the file
df["id"].to_csv('data/pre-processed_' + dataset_id + '.csv', index=False)

Download the “pre-processed_###.csv” file (where ### is the dataset_id) for future analysis. You can use this file in combination with the dataset ID to automatically filter your texts and reduce the processing time of your analyses.
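
As a sketch of that later workflow (assuming the file path used above), you could reload the saved IDs and use them to filter a freshly retrieved metadata dataframe with isin():

# Load the saved list of relevant IDs
filtered_ids = pd.read_csv('data/pre-processed_' + dataset_id + '.csv')['id']

# Retrieve the metadata again and keep only the saved IDs
# `fresh_df` is an illustrative name for the reloaded dataframe
fresh_df = pd.read_csv(constellate.get_metadata(dataset_id))
fresh_df = fresh_df[fresh_df['id'].isin(filtered_ids)]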


Visualizing the Pre-Processed Data

# For displaying plots
%matplotlib inline
# Group the data by publication year and plot the count of ids as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); 

# Read more about Pandas dataframe plotting here: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

Now let’s look at the total page counts by year.

# Group the data by publication year and plot the sum of page counts as a bar chart
df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);
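
If you want to keep a chart outside the notebook, the plot call returns a matplotlib Axes object whose figure can be saved. A minimal sketch; the filename is arbitrary:

# Capture the Axes object returned by the plot call
ax = df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(
    title='Pages by year', figsize=(20, 5), fontsize=12)

# Save the figure to a PNG file next to the notebook
ax.get_figure().savefig('pages_by_year.png', bbox_inches='tight')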