
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Exploring Metadata and Pre-Processing for Researchers

Description of methods in this notebook: This notebook helps researchers generate a list of IDs and export them into a CSV file. If this is your first time doing this process, you should first see Exploring Metadata and Pre-Processing For Learners.

The code below is a starting point for:

  • Importing a CSV file containing the metadata for a given dataset ID

  • Creating a Pandas dataframe to view the metadata

  • Pre-processing your dataset by filtering out unwanted texts

  • Exporting a list of relevant IDs to a CSV file

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Take me to the Learning Version of this notebook ->

Difficulty: Intermediate

Completion time: 5 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: CSV file

Libraries Used: tdm_client, pandas

Research Pipeline: None


Define Dataset ID

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present

dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Import Dataset and Load into Dataframe

# Import Metadata for the dataset and load it into a dataframe

import tdm_client
import pandas as pd

# Pull in the metadata CSV for our dataset
dataset_metadata = tdm_client.get_metadata(dataset_id)

# Create our dataframe
df = pd.read_csv(dataset_metadata)

# Confirm dataset ingest
original_document_count = len(df)
print('Total documents in dataframe: ', original_document_count)

Metadata Type by Column Name

Here are descriptions for the metadata types found in each column:

id: a unique item ID (in JSTOR, this is a stable URL)

title: the title for the item

isPartOf: the larger work that holds this title (for example, a journal title)

publicationYear: the year of publication

doi: the digital object identifier for an item

docType: the type of document (for example, article or book)

provider: the source or provider of the dataset

datePublished: the publication date in yyyy-mm-dd format

issueNumber: the issue number for a journal publication

volumeNumber: the volume number for a journal publication

url: a URL for the item and/or the item's metadata

creator: the author or authors of the item

publisher: the publisher for the item

language: the language or languages of the item (eng is the ISO 639 code for English)

pageStart: the first page number of the print version

pageEnd: the last page number of the print version

placeOfPublication: the city of the publisher

wordCount: the number of words in the item

pageCount: the number of print pages in the item

outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)
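Before dropping columns or rows, it can help to confirm which of these columns your dataset actually includes and how many values are missing in each. The sketch below uses a small toy dataframe (not a real Constellate dataset) to illustrate the checks, which work the same way on the `df` created above.

```python
import pandas as pd

# Toy metadata dataframe standing in for the real `df`
toy = pd.DataFrame({
    "id": ["http://www.jstor.org/stable/1", "http://www.jstor.org/stable/2"],
    "title": ["A Study", "Review Article"],
    "creator": ["Jane Doe", None],
    "wordCount": [5200, 800],
})

print(toy.columns.tolist())  # which metadata columns this dataset includes
print(toy.isna().sum())      # count of missing values in each column
```

Running `df.columns.tolist()` and `df.isna().sum()` on your own dataframe shows which of the filters below will have any effect.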

Choose Columns to Drop

# Set the pandas option to show all columns
pd.set_option("display.max_columns", None)
# Preview dataframe
df.head()
# Drop each of these named columns
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished'], axis=1)

Choose Rows to Drop

# Remove all texts without an author
print('Dropping rows without authors...', end='')
before_count = len(df)
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'

total_dropped = before_count - len(df)
print(f'{total_dropped} rows dropped.')
# Remove all texts with a particular title
print('Dropping rows with title "Review Article"...', end='')
before_count = len(df)
df = df[df.title != 'Review Article'] # Change `Review Article` to your desired title

total_dropped = before_count - len(df)
print(f'{total_dropped} rows dropped.')
# Remove all items with 3000 words or fewer
print('Dropping rows with 3000 words or fewer...', end='')
before_count = len(df)
df = df[df.wordCount > 3000] # Change `3000` to your desired minimum word count

total_dropped = before_count - len(df)
print(f'{total_dropped} rows dropped.')
# Final total
print(f'Cleaning complete\nOriginal document count: {original_document_count}\nFinal document count: {len(df)}')

Saving a list of IDs to a CSV file

# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
from pathlib import Path
Path('data').mkdir(exist_ok=True) # Make sure the output folder exists
output = f'data/pre-processed_{dataset_id}.csv'

df["id"].to_csv(output, index=False) # index=False keeps the file to a single column of IDs
print(f'Pre-Processed CSV written to {output}')
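As a quick sanity check, you can read the exported file back in and confirm it round-trips. The sketch below is self-contained, using a toy dataframe and a temporary directory rather than the real `df` and `data/` folder.

```python
import os
import tempfile
import pandas as pd

# Toy dataframe standing in for the cleaned `df`
toy = pd.DataFrame({"id": ["doc-1", "doc-2", "doc-3"]})

with tempfile.TemporaryDirectory() as tmp:
    output = os.path.join(tmp, "pre-processed_example.csv")
    toy["id"].to_csv(output, index=False)      # index=False keeps the file to one column
    ids = pd.read_csv(output)["id"].tolist()   # read the IDs back in

print(ids)  # → ['doc-1', 'doc-2', 'doc-3']
```

The list read back should match the `id` column you exported; if it does not, check that `index=False` was passed to `to_csv`.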