Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email

Latent Dirichlet Allocation (LDA) Topic Modeling

Description: This notebook demonstrates how to do topic modeling. The following processes are described:

  • Using the constellate client to retrieve a dataset

  • Filtering based on a pre-processed ID list

  • Filtering based on a stop words list

  • Cleaning the tokens in the dataset

  • Creating a gensim dictionary

  • Creating a gensim bag of words corpus

  • Computing a topic list using gensim

  • Visualizing the topic list with pyLDAvis

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Intermediate

Completion time: 30 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)

Libraries Used:

  • constellate client to collect, unzip, and read our dataset

  • pandas to load a preprocessing list

  • csv to load a custom stopwords list

  • gensim to accomplish the topic modeling

  • NLTK to create a stopwords list (if no list is supplied)

  • pyLDAvis to visualize our topic model

Research Pipeline

  1. Build a dataset

  2. Create a “Pre-Processing CSV” with Exploring Metadata (Optional)

  3. Create a “Custom Stopwords List” with Creating a Stopwords List (Optional)

  4. Complete the Topic Modeling analysis with this notebook

What is Topic Modeling?

Topic modeling is a machine learning technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.

Topic modeling is an unsupervised clustering technique for text. We give the machine a series of texts, and it attempts to cluster them into a given number of topics. There is also a supervised counterpart called topic classification, in which we supply the machine with examples of pre-labeled topics and then see whether it can identify new texts based on those examples.

Topic modeling is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. Topic Classification, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it.

Import your dataset

We’ll use the constellate client library to automatically retrieve the dataset in the JSON Lines (.jsonl) file format.

Enter a dataset ID in the next code cell.

If you don’t have a dataset ID, you can use the default sample dataset ID already supplied in the next code cell.

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Independent Voices
# Independent Voices is an open access digital collection of alternative press newspapers, magazines and journals,
# drawn from the special collections of participating libraries. These periodicals were produced by feminists, 
# dissident GIs, campus radicals, Native Americans, anti-war activists, Black Power advocates, Hispanics, 
# LGBT activists, the extreme right-wing press and alternative literary magazines 
# during the latter half of the 20th century.

dataset_id = "ac017df6-06b0-d415-7235-98fb729dac4e"

Next, import the constellate client and pass dataset_id as an argument to its get_dataset method.

# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at:
# Then use the `constellate.download()` method shown below.
#dataset_file = constellate.download(dataset_id, 'jsonl')

Apply Pre-Processing Filters (if available)

If you completed pre-processing with the “Exploring Metadata and Pre-processing” notebook, you can use your CSV file of dataset IDs to automatically filter the dataset. Your pre-processed CSV file must be in the root folder.

# Import a pre-processed CSV file of filtered dataset IDs.
# If you do not have a pre-processed CSV file, the analysis
# will run on the full dataset and may take longer to complete.
import pandas as pd
import os

pre_processed_file_name = f'data/pre-processed_{dataset_id}.csv'

if os.path.exists(pre_processed_file_name):
    df = pd.read_csv(pre_processed_file_name)
    filtered_id_list = df["id"].tolist()
    use_filtered_list = True
    print('Pre-Processed CSV found. Successfully read in ' + str(len(df)) + ' documents.')
else:
    use_filtered_list = False
    print('No pre-processed CSV file found. Full dataset will be used.')

Load Stopwords List

If you created a stopwords list in the stopwords notebook, we will import it here. (You can always modify the CSV file to add or remove words, then reload the list.) Otherwise, we’ll load the NLTK stopwords list automatically.

# Load a custom data/stop_words.csv if available
# Otherwise, load the nltk stopwords list in English

# The filename of the custom data/stop_words.csv file
stopwords_list_filename = 'data/stop_words.csv'

if os.path.exists(stopwords_list_filename):
    import csv
    with open(stopwords_list_filename, 'r') as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')
else:
    # Load the NLTK stopwords list
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print('NLTK stopwords list loaded')
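Whichever list was loaded, you can also extend it in code before running the cells below. This is a self-contained sketch; the short starting list and the extra words are hypothetical stand-ins for your own stop_words list and domain-specific terms:

```python
# Assume `stop_words` was loaded by the cell above; a short list
# stands in for it here so this sketch runs on its own.
stop_words = ['a', 'an', 'the', 'of']

# Add domain-specific words (hypothetical examples) to the list,
# skipping any that are already present.
extra_stop_words = ['also', 'said', 'would', 'the']
for word in extra_stop_words:
    if word not in stop_words:
        stop_words.append(word)

print(f'{len(stop_words)} stopwords in list')
```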

Define a Function to Process Tokens

Next, we create a short function to clean up our tokens.

def process_token(token):
    token = token.lower()
    if token in stop_words:
        return None
    if len(token) < 4:
        return None
    if not(token.isalpha()):
        return None
    return token
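To see what the cleaning rules do, here is a self-contained copy of the function tried on a few sample inputs (the three-word stopwords list is illustrative; the notebook uses the full list loaded above):

```python
# A self-contained copy of the cleaning function for demonstration;
# the real notebook version uses the full stop_words list loaded above.
stop_words = ['the', 'and', 'that']

def process_token(token):
    token = token.lower()
    if token in stop_words:      # drop stopwords
        return None
    if len(token) < 4:           # drop very short tokens
        return None
    if not(token.isalpha()):     # drop tokens with digits or punctuation
        return None
    return token

print(process_token('The'))      # stopword -> None
print(process_token('cat'))      # too short -> None
print(process_token('1960s'))    # not alphabetic -> None
print(process_token('Protest'))  # kept and lowercased -> protest
```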
# Limit to n documents. Set to None to use all documents.

limit = 5000

n = 0
documents = []
for document in constellate.dataset_reader(dataset_file):
    processed_document = []
    document_id = document["id"]
    if use_filtered_list is True:
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        processed_document += [clean_gram] * count # Add the unigram as many times as it was counted
    if len(processed_document) > 0:
        documents.append(processed_document)
    if n % 1000 == 0:
        print(f'Unigrams collected for {n} documents...')
    n += 1
    if (limit is not None) and (n >= limit):
        break
print(f'All unigrams collected for {n} documents.')

Build a gensim dictionary corpus and then train the model. More information about parameters can be found at the Gensim LDA Model page.

import gensim
dictionary = gensim.corpora.Dictionary(documents)
doc_count = len(documents)
num_topics = 10 # Change the number of topics
passes = 8 # The number of passes the model runs
# Remove terms that appear in less than 50 documents
# and terms that occur in more than 90% of documents.
dictionary.filter_extremes(no_below=50, no_above=0.9)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
# Train the LDA model
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    num_topics=num_topics,
    id2word=dictionary,
    passes=passes
)

Display a List of Topics

Print the most significant terms, as determined by the model, for each topic.

for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        word = dictionary.id2token[wid]
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))

Visualize the Topic Distances on a Flat Plane

Visualize the model using pyLDAvis. This visualization can take a while to generate depending on the size of your dataset.

import pyLDAvis.gensim
# Prepare the visualization data (this can take a while)
p = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)
# Display the visualization in the notebook
pyLDAvis.display(p)

# Export this visualization as an HTML file
# An internet connection is still required to view the HTML
pyLDAvis.save_html(p, 'my_visualization.html')