
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email

Exploring Metadata and Pre-Processing for Researchers

Description of methods in this notebook: This notebook helps researchers generate a list of IDs and export them into a CSV file. If this is your first time doing this process, you should first see Exploring Metadata and Pre-Processing For Learners.

The code below is a starting point for:

  • Importing a CSV file containing the metadata for a given dataset ID

  • Creating a Pandas dataframe to view the metadata

  • Pre-processing your dataset by filtering out unwanted texts

  • Exporting a list of relevant IDs to a CSV file

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Take me to the Learning Version of this notebook ->

Difficulty: Intermediate

Completion time: 5 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: CSV file

Libraries Used: tdm_client, pandas

Research Pipeline: None

Define Dataset ID

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present

dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
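Dataset IDs are UUID-formatted strings, so a quick sanity check with Python's standard `uuid` module can catch copy-paste errors before any metadata is requested. This is an optional sketch, not part of the notebook's required steps; `is_valid_dataset_id` is a hypothetical helper name.

```python
import uuid

dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

def is_valid_dataset_id(value):
    """Return True if `value` parses as a UUID (the format of dataset IDs)."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False

print(is_valid_dataset_id(dataset_id))  # True for the default dataset ID
```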

Import Dataset and Load into Dataframe

# Import Metadata for the dataset and load it into a dataframe

import tdm_client
import pandas as pd

# Pull in our dataset metadata CSV using the tdm_client
dataset_metadata = tdm_client.get_metadata(dataset_id)

# Create our dataframe
df = pd.read_csv(dataset_metadata)

# Confirm dataset ingest
original_document_count = len(df)
print('Total documents in dataframe: ', original_document_count)

Metadata Type by Column Name

Here are descriptions for the metadata types found in each column:

id: a unique item ID (In JSTOR, this is a stable URL)

title: the title for the item

isPartOf: the larger work that holds this title (for example, a journal title)

publicationYear: the year of publication

doi: the digital object identifier for an item

docType: the type of document (for example, article or book)

provider: the source or provider of the dataset

datePublished: the publication date in yyyy-mm-dd format

issueNumber: the issue number for a journal publication

volumeNumber: the volume number for a journal publication

url: a URL for the item and/or the item's metadata

creator: the author or authors of the item

publisher: the publisher for the item

language: the language or languages of the item (eng is the ISO 639 code for English)

pageStart: the first page number of the print version

pageEnd: the last page number of the print version

placeOfPublication: the city of the publisher

wordCount: the number of words in the item

pageCount: the number of print pages in the item

outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)
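Not every dataset populates every documented column. A quick membership check shows which columns are present before filtering; this is a sketch using a synthetic stand-in dataframe, with the column list drawn from the metadata columns referenced by the code cells in this notebook.

```python
import pandas as pd

# Synthetic stand-in for a metadata dataframe
df = pd.DataFrame({'id': ['http://www.jstor.org/stable/1'],
                   'title': ['Sample'],
                   'wordCount': [4200]})

documented = ['id', 'title', 'isPartOf', 'publicationYear', 'doi', 'docType',
              'provider', 'datePublished', 'issueNumber', 'volumeNumber',
              'url', 'creator', 'publisher', 'language', 'pageStart',
              'pageEnd', 'placeOfPublication', 'wordCount', 'pageCount',
              'outputFormat']

present = [col for col in documented if col in df.columns]
missing = [col for col in documented if col not in df.columns]
print('Present:', present)
print('Missing:', missing)
```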

Choose Columns to Drop

# Set the pandas option to show all columns
pd.set_option("display.max_columns", None)

# Preview dataframe
df.head()

# Drop each of these named columns
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished'], axis=1)
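If a dataset happens to lack one of the named columns, `drop` raises a `KeyError`. Passing pandas' `errors='ignore'` option makes the step tolerant of missing labels, as in this sketch with a synthetic dataframe:

```python
import pandas as pd

df = pd.DataFrame({'id': [1], 'title': ['A'], 'pageStart': [10]})

# 'outputFormat', 'pageEnd', and 'datePublished' are absent here;
# errors='ignore' silently skips labels that are not in the dataframe
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished'],
             axis=1, errors='ignore')
print(list(df.columns))  # ['id', 'title']
```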

Choose Rows to Drop

# Remove all texts without an author
print('Dropping rows without authors...', end='')
before_count = len(df)
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'

total_dropped = before_count - len(df)
print(f'{total_dropped} rows dropped.')
# Remove all texts with a particular title
print('Dropping rows with title "Review Article"...', end='')
before_count = len(df)
df = df[df.title != 'Review Article'] # Change `Review Article` to your desired title

total_dropped = before_count - len(df)
print(f'{total_dropped} rows dropped.')
# Remove all items with fewer than 3000 words
print('Dropping rows with fewer than 3000 words...', end='')
before_count = len(df)
df = df[df.wordCount > 3000] # Change `3000` to your desired number

total_dropped = before_count - len(df)
print(f'{total_dropped} rows dropped.')
# Final total
print(f'Cleaning complete\nOriginal document count: {original_document_count}\nFinal document count: {len(df)}')
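The three filtering steps above repeat the same count-and-report pattern. They could be factored into a small helper; `filter_and_report` is a hypothetical name, and the dataframe below is a synthetic example, not real dataset metadata.

```python
import pandas as pd

def filter_and_report(df, mask, description):
    """Keep rows where `mask` is True and report how many were dropped."""
    before = len(df)
    filtered = df[mask]
    print(f'{description}: {before - len(filtered)} rows dropped.')
    return filtered

# Demonstration on a small synthetic dataframe
df = pd.DataFrame({
    'creator': ['A. Author', None, 'B. Author'],
    'title': ['Study', 'Review Article', 'Essay'],
    'wordCount': [5000, 4000, 1200],
})

df = filter_and_report(df, df['creator'].notna(), 'Rows without authors')
df = filter_and_report(df, df.title != 'Review Article', 'Rows titled "Review Article"')
df = filter_and_report(df, df.wordCount > 3000, 'Rows under 3000 words')
print(len(df))  # 1 row survives all three filters
```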

Saving a list of IDs to a CSV file

# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
output = f'data/pre-processed_{dataset_id}.csv'
df['id'].to_csv(output, index=False)

print(f'Pre-Processed CSV written to {output}')
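A self-contained check of the export step, reading the CSV back to confirm the IDs round-trip. This sketch uses a temporary directory and synthetic IDs rather than the notebook's `data/` path:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'id': ['http://www.jstor.org/stable/1',
                          'http://www.jstor.org/stable/2']})

with tempfile.TemporaryDirectory() as tmp:
    output = os.path.join(tmp, 'pre-processed_example.csv')
    df['id'].to_csv(output, index=False)  # write just the "id" column
    ids = pd.read_csv(output)             # read it back to verify
    print(f'{len(ids)} IDs written to {output}')
```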