Converting JSON to TSV using Python streaming

Software Engineering 4964 views

Earlier today, a friend asked my advice on how to convert a JavaScript Object Notation (JSON) file to tab-separated values (TSV) in Python. As with most things in software development, there are many ways to accomplish this, some more Pythonic than others.

I thought about it, and decided to illustrate how to do this map/reduce style (sans reducer) by streaming to STDOUT.

The Data

To start off with, he was dealing with a pretty simple JSON format for closed captioning data. There's a "cc" root key with an array of items containing: duration, content, and a timestamp. The derived schema looks like this:

{
  "$id": "http://example.com/example.json",
  "type": "object",
  "definitions": {},
  "$schema": "http://json-schema.org/draft-07/schema#",
  "properties": {
    "cc": {
      "$id": "/properties/cc",
      "type": "array",
      "items": {
        "$id": "/properties/cc/items",
        "type": "object",
        "properties": {
          "duration": {
            "$id": "/properties/cc/items/properties/duration",
            "type": "integer",
            "title": "The Duration Schema ",
            "default": 0,
            "examples": [
              2000
            ]
          },
          "content": {
            "$id": "/properties/cc/items/properties/content",
            "type": "string",
            "title": "The Content Schema ",
            "default": "",
            "examples": [
              "Roger Bingham>> We are in New York City"
            ]
          },
          "ts": {
            "$id": "/properties/cc/items/properties/ts",
            "type": "integer",
            "title": "The Ts Schema ",
            "default": 0,
            "examples": [
              1434349680000
            ]
          }
        }
      }
    }
  }
}

Since I couldn't use his actual data here, I scoured Google trying to find another example of closed captioning JSON data. That proved elusive, so I converted the sample data from w3's WebVTT Introduction to JSON. This was apparently the beginning of an audio interview between Roger Bingham and Neil deGrasse Tyson.

Here's the JSON data:

{
  "cc": [
    {
      "duration": 2000,
      "content": "Roger Bingham>> We are in New York City",
      "ts": 1434349680000
    },
    {
      "duration": 3000,
      "content": "Roger Bingham>> We’re actually at the Lucern Hotel, just down the street",
      "ts": 1434349683000
    },
    {
      "duration": 2000,
      "content": "Roger Bingham>> from the American Museum of Natural History",
      "ts": 1434349685000
    },
    {
      "duration": 2000,
      "content": "Roger Bingham>> And with me is Neil deGrasse Tyson",
      "ts": 1434349687000
    },
    {
      "duration": 2000,
      "content": "Roger Bingham>> Astrophysicist, Director of the Hayden Planetarium",
      "ts": 1434349689000
    },
    {
      "duration": 2000,
      "content": "Roger Bingham>> at the AMNH.",
      "ts": 1434349691000
    },
    {
      "duration": 2000,
      "content": "Roger Bingham>> Thank you for walking down here.",
      "ts": 1434349693000
    },
    {
      "duration": 3000,
      "content": "Roger Bingham>> And I want to do a follow-up on the last conversation we did.",
      "ts": 1434349696000
    },
    {
      "duration": 1500,
      "content": "Roger Bingham>> When we e-mailed—",
      "ts": 1434349697500
    },
    {
      "duration": 2000,
      "content": "Neil deGrasse Tyson>> Didn’t we talk about enough in that conversation?",
      "ts": 1434349699500
    },
    {
      "duration": 3500,
      "content": "Roger Bingham>> No! No no no no; 'cos 'cos obviously 'cos",
      "ts": 1434349703000
    },
    {
      "duration": 1000,
      "content": "Neil deGrasse Tyson>> Laughs",
      "ts": 1434349704000
    },
    {
      "duration": 3500,
      "content": "Roger Bingham>> You know I’m so excited my glasses are falling off here.",
      "ts": 1434349707500
    }
  ]
}

This will be pretty easy to read and flatten into TSV.

The Code

Normally, you might think to craft a class which knows how to load and read the specific JSON file, and maybe code or another class to do the writing to a TSV file. However, in this case, since I'm writing this in a map/reduce style, and streaming the data to STDOUT, I only really need the mapper (CcJsonMapper) and not a TsvWriter class or code. The mapper will map the JSON file to STDOUT as TSV. Here's how it will be used:

python cc_json_mapper.py data.json > data.tsv

So, basically, I'm calling the cc_json_mapper.py Python script, passing in the filename of the JSON file, and redirecting it's output to a file.

Here's the source code for cc_json_mapper.py:

import json
import re
from datetime import datetime as dt

TAB = "\t"


class CcJsonMapper:
    """
    An example class of how to convert JSON data to tab-separated by streaming to STDOUT.
    """

    def __init__(self, filename, headers=['duration', 'content', 'ts']):
        """
        Constructs a new CcJsonMapper class.
        :param filename: The fully-qualified filename of the JSON file to load.
        :param headers: The headers to include in the tab-separated output
        """
        self.filename = filename
        self.headers = headers

    @staticmethod
    def normalize_content(s):
        """
        Normalizes string content. Modify this method to add any newly discovered normalization rules.
        :param s: The input string to normalize
        :return: The normalized string

        Example:

        >>> CcJsonMapper.normalize_content("This is an example.")
        'This is an example.'
        """
        result = re.sub("\t|\r|\n", ' ', s)  # Replace tabs, carriage returns, and newlines with a single space
        result = re.sub("<.*?>", '', result)  # Remove any html tags
        # Any other normalization rules should go here

        return result

    @staticmethod
    def ts_to_datetime(ts):
        """
        Converts a JavaScript timestamp value to a Python Datetime
        :param ts: The JavaScript timestamp value
        :return: A Python Datetime

        Example:

        >>> CcJsonMapper.ts_to_datetime(1520620096685)
        datetime.datetime(2018, 3, 9, 13, 28, 16, 685000)
        """
        return dt.fromtimestamp(ts / 1000.0)

    def map(self):
        """
        Opens the JSON file, loops through the cc array of dictionaries, normalizes them, and prints them to STDOUT tab-separated
        """
        # Load the JSON data
        data = json.load(open(self.filename))

        # Print the headers if present, and if there's data
        if self.headers and len(data):
            print(TAB.join(self.headers))

        # Loop through the cc array of dictionaries
        for item in data['cc']:
            # Normalize the data
            duration = str(item['duration'])
            content = CcJsonMapper.normalize_content(item['content'])
            ts = str(CcJsonMapper.ts_to_datetime(item['ts']))

            # Print to STDOUT tab-separated
            print(TAB.join([duration, content, ts]))


if __name__ == '__main__':
    import sys

    mapper = CcJsonMapper(sys.argv[1])  # The first argument is the filename
    mapper.map()

Full Source Code

ejstembler/py-json-to-tsv

More Python Articles