Converting JSON to TSV using Python streaming
Earlier today, a friend asked my advice on how to convert a JavaScript Object Notation (JSON) file to tab-separated values (TSV) in Python. As with most things in software development, there are many ways to accomplish this, some more Pythonic than others.
I thought about it, and decided to illustrate how to do this map/reduce style (sans reducer) by streaming to STDOUT.
The Data
To start off with, he was dealing with a pretty simple JSON format for closed captioning data. There's a "cc" root key with an array of items containing: duration, content, and a timestamp. The derived schema looks like this:
{
"$id": "http://example.com/example.json",
"type": "object",
"definitions": {},
"$schema": "http://json-schema.org/draft-07/schema#",
"properties": {
"cc": {
"$id": "/properties/cc",
"type": "array",
"items": {
"$id": "/properties/cc/items",
"type": "object",
"properties": {
"duration": {
"$id": "/properties/cc/items/properties/duration",
"type": "integer",
"title": "The Duration Schema ",
"default": 0,
"examples": [
2000
]
},
"content": {
"$id": "/properties/cc/items/properties/content",
"type": "string",
"title": "The Content Schema ",
"default": "",
"examples": [
"Roger Bingham>> We are in New York City"
]
},
"ts": {
"$id": "/properties/cc/items/properties/ts",
"type": "integer",
"title": "The Ts Schema ",
"default": 0,
"examples": [
1434349680000
]
}
}
}
}
}
}
Since I couldn't use his actual data here, I scoured Google trying to find another example of closed captioning JSON data. That proved elusive, so I converted the sample data from w3's WebVTT Introduction to JSON. This was apparently the beginning of an audio interview between Roger Bingham and Neil deGrasse Tyson.
Here's the JSON data:
{
"cc": [
{
"duration": 2000,
"content": "Roger Bingham>> We are in New York City",
"ts": 1434349680000
},
{
"duration": 3000,
"content": "Roger Bingham>> We’re actually at the Lucern Hotel, just down the street",
"ts": 1434349683000
},
{
"duration": 2000,
"content": "Roger Bingham>> from the American Museum of Natural History",
"ts": 1434349685000
},
{
"duration": 2000,
"content": "Roger Bingham>> And with me is Neil deGrasse Tyson",
"ts": 1434349687000
},
{
"duration": 2000,
"content": "Roger Bingham>> Astrophysicist, Director of the Hayden Planetarium",
"ts": 1434349689000
},
{
"duration": 2000,
"content": "Roger Bingham>> at the AMNH.",
"ts": 1434349691000
},
{
"duration": 2000,
"content": "Roger Bingham>> Thank you for walking down here.",
"ts": 1434349693000
},
{
"duration": 3000,
"content": "Roger Bingham>> And I want to do a follow-up on the last conversation we did.",
"ts": 1434349696000
},
{
"duration": 1500,
"content": "Roger Bingham>> When we e-mailed—",
"ts": 1434349697500
},
{
"duration": 2000,
"content": "Neil deGrasse Tyson>> Didn’t we talk about enough in that conversation?",
"ts": 1434349699500
},
{
"duration": 3500,
"content": "Roger Bingham>> No! No no no no; 'cos 'cos obviously 'cos",
"ts": 1434349703000
},
{
"duration": 1000,
"content": "Neil deGrasse Tyson>> Laughs",
"ts": 1434349704000
},
{
"duration": 3500,
"content": "Roger Bingham>> You know I’m so excited my glasses are falling off here.",
"ts": 1434349707500
}
]
}
This will be pretty easy to read and flatten into TSV.
The Code
Normally, you might think to craft a class which knows how to load and read the specific JSON file, and maybe code or another class to do the writing to a TSV file. However, in this case, since I'm writing this in a map/reduce style, and streaming the data to STDOUT, I only really need the mapper (CcJsonMapper) and not a TsvWriter class or code. The mapper will map the JSON file to STDOUT as TSV. Here's how it will be used:
python cc_json_mapper.py data.json > data.tsv
So, basically, I'm calling the cc_json_mapper.py Python script, passing in the filename of the JSON file, and redirecting it's output to a file.
Here's the source code for cc_json_mapper.py:
import json
import re
from datetime import datetime as dt
TAB = "\t"
class CcJsonMapper:
"""
An example class of how to convert JSON data to tab-separated by streaming to STDOUT.
"""
def __init__(self, filename, headers=['duration', 'content', 'ts']):
"""
Constructs a new CcJsonMapper class.
:param filename: The fully-qualified filename of the JSON file to load.
:param headers: The headers to include in the tab-separated output
"""
self.filename = filename
self.headers = headers
@staticmethod
def normalize_content(s):
"""
Normalizes string content. Modify this method to add any newly discovered normalization rules.
:param s: The input string to normalize
:return: The normalized string
Example:
>>> CcJsonMapper.normalize_content("This is an example.")
'This is an example.'
"""
result = re.sub("\t|\r|\n", ' ', s) # Replace tabs, carriage returns, and newlines with a single space
result = re.sub("<.*?>", '', result) # Remove any html tags
# Any other normalization rules should go here
return result
@staticmethod
def ts_to_datetime(ts):
"""
Converts a JavaScript timestamp value to a Python Datetime
:param ts: The JavaScript timestamp value
:return: A Python Datetime
Example:
>>> CcJsonMapper.ts_to_datetime(1520620096685)
datetime.datetime(2018, 3, 9, 13, 28, 16, 685000)
"""
return dt.fromtimestamp(ts / 1000.0)
def map(self):
"""
Opens the JSON file, loops through the cc array of dictionaries, normalizes them, and prints them to STDOUT tab-separated
"""
# Load the JSON data
data = json.load(open(self.filename))
# Print the headers if present, and if there's data
if self.headers and len(data):
print(TAB.join(self.headers))
# Loop through the cc array of dictionaries
for item in data['cc']:
# Normalize the data
duration = str(item['duration'])
content = CcJsonMapper.normalize_content(item['content'])
ts = str(CcJsonMapper.ts_to_datetime(item['ts']))
# Print to STDOUT tab-separated
print(TAB.join([duration, content, ts]))
if __name__ == '__main__':
import sys
mapper = CcJsonMapper(sys.argv[1]) # The first argument is the filename
mapper.map()
Full Source Code
ejstembler/py-json-to-tsvMore Python Articles
- Apache Airflow Tips
- Advice for Python Beginners
- Built a touch screen Raspberry Pi 3 to interface with my car’s OBD II via Bluetooth
- Lane Detection using Python
- Sending an email attachment using Python
- Web Scraping with Python for a Friend
- Different ways of coding a bar chart
- PyQueryableList: LINQ's Queryable List for Python
- IronPython WinForms Example: HTML Encoder
- Replace Text within a Word document using IronPython
- Converting a Word document to text using IronPython
- Porting C# to IronPython – Example 1
- Fixing the PyAverager sample on ADC
- IronPython revisited