
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Finding Significant Words Using TF-IDF

Description: Discover the significant words in a corpus using Gensim TF-IDF. The following code is included:

  • Filtering based on a pre-processed ID list

  • Filtering based on a stop words list

  • Token cleaning

  • Computing TF-IDF using Gensim

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Take me to the Learning Version of this notebook ->

Difficulty: Intermediate

Completion time: 5-10 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)

Libraries Used:

  • pandas to load a preprocessing list

  • csv to load a custom stopwords list

  • gensim to compute the TF-IDF calculations

  • NLTK to create a stopwords list (if no list is supplied)

Research Pipeline:

  1. Build a dataset

  2. Create a “Pre-Processing CSV” with Exploring Metadata (Optional)

  3. Create a “Custom Stopwords List” with Creating a Stopwords List (Optional)

  4. Complete the TF-IDF analysis with this notebook


Import Raw Dataset

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

# Pull in the dataset that matches `dataset_id`
# in the form of a gzipped JSON lines file.
import tdm_client
dataset_file = tdm_client.get_dataset(dataset_id)
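
As an optional check, you can read a single record from the dataset to see its structure before any processing. This sketch only assumes the `id` and `unigramCount` fields that the rest of this notebook relies on; other metadata fields vary by dataset.

# Peek at the first document to see the available fields
first_document = next(iter(tdm_client.dataset_reader(dataset_file)))
print(list(first_document.keys()))
print(first_document['id'])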

Load Pre-Processing Filter (Optional)

If you completed pre-processing with the “Exploring Metadata and Pre-processing” notebook, you can use your CSV file of dataset IDs to automatically filter the dataset.

# Import a pre-processed CSV file of filtered dataset IDs.
# If you do not have a pre-processed CSV file, the analysis
# will run on the full dataset and may take longer to complete.
import pandas as pd
import os

pre_processed_file_name = f'data/pre-processed_{dataset_id}.csv'

if os.path.exists(pre_processed_file_name):
    df = pd.read_csv(pre_processed_file_name)
    filtered_id_list = df["id"].tolist()
    use_filtered_list = True
    print('Pre-Processed CSV found. Successfully read in ' + str(len(df)) + ' documents.')
else: 
    use_filtered_list = False
    print('No pre-processed CSV file found. Full dataset will be used.')

Load Stop Words List (Optional)

The default stop words list is the NLTK English stop words list. You can also create a custom stopwords CSV with the “Creating a Stopwords List” notebook.

# Load a custom data/stop_words.csv if available
# Otherwise, load the nltk stopwords list in English

# Create an empty Python list to hold the stopwords
stop_words = []

# The filename of the custom data/stop_words.csv file
stopwords_list_filename = 'data/stop_words.csv'

if os.path.exists(stopwords_list_filename):
    import csv
    with open(stopwords_list_filename, 'r') as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')
else:
    # Load the NLTK stopwords list
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print('NLTK stopwords list loaded')
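
Note that the NLTK branch above assumes the stopwords corpus has already been downloaded to your environment; if it has not, the import raises a `LookupError`. A minimal one-time setup sketch:

# Download the NLTK stopwords corpus (only needed once per environment)
import nltk
nltk.download('stopwords')

You can also append corpus-specific terms to `stop_words` at this point; any words you add here are filtered out by the cleaning function below.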

Define a Unigram Cleaning Function

By default, this function will:

  • Lowercase all tokens

  • Remove tokens in stopwords list

  • Remove tokens with fewer than 4 characters

  • Remove tokens with non-alphabetic characters

# Define a function that will process individual tokens

def process_token(token):
    token = token.lower()
    if token in stop_words: # Remove stop tokens
        return
    if len(token) < 4: # Remove short tokens
        return
    if not(token.isalpha()): # Remove tokens with non-alphabetic characters
        return
    return token
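
A quick sanity check of the cleaning rules. The sample tokens below are illustrative only; the expected results in the comments assume the NLTK English stopwords list.

# Each sample token triggers (or passes) one of the cleaning rules
print(process_token('The'))         # None: stop word after lowercasing
print(process_token('act'))         # None: fewer than 4 characters
print(process_token('1623'))        # None: contains non-alphabetic characters
print(process_token('Shakespeare')) # 'shakespeare': kept and lowercased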

Process and Collect Unigrams

# Collecting the unigrams and processing them into `documents`

limit = 500 # Change the number of documents analyzed. Set to `None` to process all documents.
#limit = None
n = 0
documents = []
document_ids = []
    
for document in tdm_client.dataset_reader(dataset_file):
    processed_document = []
    document_id = document['id']
    if use_filtered_list is True:
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        processed_document.append(clean_gram)
    if len(processed_document) > 0:
        # Keep `document_ids` aligned with `documents` for per-document lookups later
        document_ids.append(document_id)
        documents.append(processed_document)
    n += 1
    if (limit is not None) and (n >= limit):
        break
print('Unigrams collected and processed.')
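
As an optional sanity check (assuming at least one document was kept), confirm that `documents` is a list of cleaned token lists aligned with `document_ids`:

# `documents` holds one list of cleaned tokens per document,
# in the same order as `document_ids`
print(len(documents), 'documents collected')
print(document_ids[0], documents[0][:10])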

Compute “Term Frequency-Inverse Document Frequency” using Gensim

import gensim

# Create the gensim dictionary
dictionary = gensim.corpora.Dictionary(documents)

# Create the bag of words corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Create the gensim TF-IDF model
model = gensim.models.TfidfModel(bow_corpus)

# Create TF-IDF scores for the bag of words corpus using the model
corpus_tfidf = model[bow_corpus]

print('TF-IDF scores calculated.')
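
With Gensim's default settings, a term's raw weight in a document is its count multiplied by log2(number of documents / number of documents containing the term), and each document vector is then L2-normalized. The sketch below reproduces that calculation by hand for the first document so you can compare it against the model's output; Gensim additionally drops terms whose weight rounds to zero, so the two dictionaries may differ by those entries.

from math import log2, sqrt

# Recompute TF-IDF for the first document by hand
first_doc_bow = bow_corpus[0]
num_docs = dictionary.num_docs

# Raw weight: term frequency * log2(N / document frequency)
raw_weights = {
    term_id: count * log2(num_docs / dictionary.dfs[term_id])
    for term_id, count in first_doc_bow
}

# L2-normalize the document vector (Gensim's default `normalize=True`)
norm = sqrt(sum(weight ** 2 for weight in raw_weights.values()))
manual_scores = {term_id: weight / norm for term_id, weight in raw_weights.items()}

print(dict(model[first_doc_bow]))
print(manual_scores)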

Display the Highest TF-IDF Scores in the Corpus

# Gather each term's highest TF-IDF score across all documents

td = {}
for doc in corpus_tfidf:
    for _id, value in doc:
        term = dictionary.get(_id)
        if value > td.get(term, 0):
            td[term] = value

sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

# Print the top 25 terms in the entire corpus
for term, weight in sorted_td[:25]: 
    print(term, weight)
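
If you want to keep these ranked terms for later work, a minimal sketch that writes them out with pandas (the output path `data/top_terms.csv` is an assumption; any writable path works):

# Save the ranked corpus-level terms and scores to a CSV file
top_terms_df = pd.DataFrame(sorted_td, columns=['term', 'tfidf'])
top_terms_df.to_csv('data/top_terms.csv', index=False)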

Display the Highest TF-IDF Score in each Document

# Print the ID, most significant word, and TF-IDF score
# for each of the first several documents

for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(document_ids[n], dictionary.get(word_id), score)
    if n >= 10:
        break
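
To collect the most significant word for every document, not just the few printed above, into a single table, a sketch along these lines may help (it reuses the same `max` logic):

# Gather each document's top-scoring term into a DataFrame
rows = []
for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    rows.append({'id': document_ids[n], 'term': dictionary.get(word_id), 'tfidf': score})

top_per_document = pd.DataFrame(rows)
print(top_per_document.head())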