
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email tdm@ithaka.org.


Tokenizing Text Files

Description: You may have text files and metadata that you want to tokenize into ngrams with Python.

This notebook takes as input:

  • Plain text files (.txt) in a folder

  • A metadata CSV file called ‘metadata.csv’

and outputs a single JSONL file containing the unigrams, bigrams, trigrams, full text, and metadata.

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Intermediate

Completion time: 10-15 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: .txt, .csv, .jsonl

Libraries Used:

  • json

  • collections

  • pandas

Research Pipeline:

  1. Scan documents

  2. OCR files

  3. Clean up texts

  4. Tokenize text files (this notebook)


Import Libraries

from collections import Counter
import gzip
import json
import os

import pandas as pd

Download and inspect sample files

For purposes of this tutorial, we will download a set of sample files from Project Gutenberg using a helper function from the tdm_client.

from tdm_client import download_gutenberg_sample

text_file_directory = download_gutenberg_sample()

You now have sample text files and a CSV of metadata in your data directory.

You can list the contents of this directory with this command.

!ls -lt ~/data/gutenberg-sample

You can see the first 20 lines of a sample file with this command.

!head -n 20 ~/data/gutenberg-sample/205-0.txt

Define a tokenizing function

def constellate_ngrams(text, n=1):
    # Define a Counter object to hold our ngrams.
    c = Counter()
    # Replace line breaks with spaces so that words split across
    # lines are not joined together.
    t = text.replace("\r", " ").replace("\n", " ")
    # Convert the text to a list of words (split() collapses repeated spaces).
    words = t.split()
    # Slice the words into overlapping ngrams of length n.
    for grams in zip(*[words[i:] for i in range(n)]):
        g = " ".join(grams)
        c[g] += 1
    return c
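
If the zip expression looks opaque, here is a small illustration, using a made-up sentence, of how offsetting the word list by 0 to n-1 positions and zipping the slices together produces the ngrams.

words = "the quick brown fox jumps".split()

# For n=2, this is equivalent to zip(words[0:], words[1:]), which pairs
# each word with the one that follows it:
# ("the", "quick"), ("quick", "brown"), ("brown", "fox"), ("fox", "jumps")
for grams in zip(*[words[i:] for i in range(2)]):
    print(" ".join(grams))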

Tokenize a text

Let’s tokenize one of the sample files using our function.

# Read in one of the texts. See note about file paths.
with open(f"{text_file_directory}{os.sep}205-0.txt") as input_file:
    text = input_file.read()
    
unigrams = constellate_ngrams(text)
unigrams.most_common(10)

You can create bigrams or trigrams (or higher-order ngrams) by changing the n keyword argument passed to the function.

bigrams = constellate_ngrams(text, n=2)
bigrams.most_common(10)

Creating a Constellate JSONL file

For your analysis, you may want to create files that conform to the same data specification as the files provided by Constellate. The following steps show you how to load metadata and the raw text, create ngrams and output a JSONL (JSON lines) file that matches, in format, what you download from the Constellate web application.

df = pd.read_csv(text_file_directory + os.sep + "metadata.csv")

df.head()

Loop through the dataframe and print out some of the metadata.

for item in df.itertuples():
    print(item.title, item.author, item.url)

Now convert the metadata to the Constellate document schema by mapping the column names from the source CSV to the corresponding Constellate schema attributes.

# Create a list to hold our documents.
documents = []

for item in df.itertuples():
    document = {
        "id": item.url,
        "title": item.title,
        "creator": [item.author],
        "docType": "book",
        "publicationYear": item.published,
        "url": item.url,
        "language": [item.language]
    }
    documents.append(document)

Now that we have our metadata stored in a list, let’s revise our loop to also read in the full text of each document and generate ngrams.

# Create a list to hold our documents.
documents = []

for item in df.itertuples():
    document = {
        "id": item.url,
        "title": item.title,
        # A document can have authors/creators, so map as a list.
        "creator": [item.author],
        "docType": "book",
        "publicationYear": item.published,
        "url": item.url,
        # A document can have multiple languages, so map as a list.
        "language": [item.language]
    }
    # Read in the full text
    with open(text_file_directory + os.sep + item.file) as text_file:
        text = text_file.read()
    
    # Split the text into pages. See note below.
    document["fullText"] = text.split("\n\n\n")
    # Generate ngrams
    document["unigramCount"] = constellate_ngrams(text, n=1)
    document["bigramCount"] = constellate_ngrams(text, n=2)
    document["trigramCount"] = constellate_ngrams(text, n=3)
    # Add our document to the list of documents
    documents.append(document)
    print(f"{item.title} procesed")

Inspect the first document and print some of the metadata and content.

first_doc = documents[0]
print(first_doc["title"], first_doc["publicationYear"])

Print the twenty-five most common trigrams.

for term, count in first_doc["trigramCount"].most_common(25):
    print(term, count)

Print the first 500 characters of “page” 20.

print(first_doc["fullText"][20][:500])

Generate a Constellate gzip file

You may now want to create a gzip file so that it matches what you have downloaded from Constellate. You can then use the dataset_reader that’s part of the tdm_client to read it back in.

output_file = text_file_directory + os.sep + "sample_gutenberg_dataset.json.gzip"

with gzip.open(output_file, "wb") as handle:
    for doc in documents:
        # Convert the document to a string and add the line separator
        raw = json.dumps(doc) + "\n"
        handle.write(raw.encode())

Now use the dataset_reader to read your file back in and verify it is what we expect.

from tdm_client import dataset_reader

for doc in dataset_reader(output_file):
    print(doc["title"], doc["creator"], doc["publicationYear"])
    # See note about assert
    assert(doc["unigramCount"] is not None)
    assert(doc["fullText"] is not None)

Notes

  • File paths - on Unix-based systems (including Linux and macOS), the components of a file path are separated with a /. On Windows, the separator is a \. Python provides os.sep, which holds the correct separator for the current system, so this notebook runs correctly on multiple operating systems.

  • Pagination - the plain text files from Project Gutenberg aren’t paginated. Here we are using a simple rule of thumb: if there are three consecutive line breaks, treat this as a page break. This is unlikely to work well across all Project Gutenberg content but should be sufficient for demonstration purposes. You may be curious about more sophisticated attempts to format Project Gutenberg books, such as chapterize by Jonathan Reeve.

  • assert - Python’s assert statement can be a quick and useful way to validate your logic. An assert statement raises an AssertionError, stopping the program, whenever its expression is false. So in this usage, we are checking that each of our documents has a fullText and a unigramCount attribute; a short illustration follows below.
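
As a minimal illustration (using a made-up doc dictionary, not one of the Gutenberg documents), the first assert below passes silently while the second raises an AssertionError and stops execution.

doc = {"title": "A sample document", "unigramCount": None}

# Passes silently: the expression is true.
assert doc["title"] is not None

# Raises an AssertionError and stops the program: the expression is false.
assert doc["unigramCount"] is not None

Keep in mind that assertions are skipped entirely when Python is run with the -O flag, so they are best suited to development-time sanity checks rather than production validation.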