<a href="https://colab.research.google.com/github/gmihaila/character-mining/blob/developer/doc/json_tsv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parse transcripts to **.tsv** file

## Parse all transcripts for a more seamless experience

Using the **.json** files from each season, create a master file that contains all transcripts in an easier-to-work-with format.

The notebook creates **friends_transcripts.tsv**, which contains all seasons and episodes.

This is a sample of the **.tsv** file:

<br>

| |season_id|episode_id|scene_id|utterance_id|speaker|tokens|transcript|
|:-|:-|:-|:-|:-|:-|:-|:-|
|0|s01|e01|c01|u001|Monica Geller|[[There, 's, nothing, to, tell, !], [He, 's, j...|There's nothing to tell! He's just some guy I ...|
|1|s01|e01|c01|u002|Joey Tribbiani|[[C'mon, ,, you, 're, going, out, with, the, g...|C'mon, you're going out with the guy! There's ...|
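The flattening idea can be sketched on a small in-memory example before touching the real files. The toy season below only mirrors the keys the notebook reads (`episodes`, `scenes`, `utterances`, `speakers`, `tokens`, `transcript`); its IDs, speaker name, and text are invented for illustration, and `flatten_season` is a hypothetical helper, not part of the dataset's tooling. The compound ID format (`s01_e01_c01_u001`) is an assumption inferred from the way the notebook splits IDs on `_`.

```python
import pandas as pd

# Toy season mirroring the keys the notebook reads; IDs and text are invented.
toy_season = {
    "season_id": "s01",
    "episodes": [{
        "episode_id": "s01_e01",
        "scenes": [{
            "scene_id": "s01_e01_c01",
            "utterances": [
                {
                    "utterance_id": "s01_e01_c01_u001",
                    "speakers": ["Speaker A"],
                    "tokens": [["Hello", "there", "."]],
                    "transcript": "Hello there.",
                },
                {
                    # an utterance with no speaker listed
                    "utterance_id": "s01_e01_c01_u002",
                    "speakers": [],
                    "tokens": [],
                    "transcript": "",
                },
            ],
        }],
    }],
}


def flatten_season(season):
    """Yield one flat row (a dict) per utterance in a nested season dict."""
    for episode in season["episodes"]:
        for scene in episode["scenes"]:
            for utterance in scene["utterances"]:
                speakers = utterance["speakers"]
                yield {
                    "season_id": season["season_id"],
                    # keep only the last part of each compound ID
                    "episode_id": episode["episode_id"].split("_")[-1],
                    "scene_id": scene["scene_id"].split("_")[-1],
                    "utterance_id": utterance["utterance_id"].split("_")[-1],
                    # first listed speaker, or 'unknown' when the list is empty
                    "speaker": speakers[0] if speakers else "unknown",
                    "tokens": utterance["tokens"],
                    "transcript": utterance["transcript"],
                }


toy_df = pd.DataFrame(list(flatten_season(toy_season)))
print(toy_df[["season_id", "episode_id", "scene_id", "utterance_id", "speaker"]])
```

Building a list of row dictionaries is one option; the notebook instead fills a dictionary of column lists, which produces the same data frame.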

# Imports

In [2]:
import requests
import json
import pandas as pd
from tqdm.notebook import tqdm

# Parse Transcripts and Save

In [3]:
# column-oriented container: one list per output column
friends_data = dict(season_id=[],
                    episode_id=[],
                    scene_id=[],
                    utterance_id=[],
                    speaker=[],
                    tokens=[],
                    transcript=[]
                    )

# loop through each season
print('Loading seasons...')
for season_index in tqdm(range(1, 11)):
  season_index = '%02d' % season_index  # zero-pad to two digits: 1 -> '01'
  # url of json file
  json_url = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_%s.json'%season_index
  # download the season file and stop on an HTTP error
  response = requests.get(json_url)
  response.raise_for_status()
  # parse the season from the JSON body
  season = response.json()
  # get season id
  season_id = season['season_id']

  # read each episode
  for episode in season['episodes']:
    episode_id = episode['episode_id']

    # read each scene
    for scene in episode['scenes']:
      scene_id = scene['scene_id']

      # read each utterance
      for utterance in scene['utterances']:
        utterance_id = utterance['utterance_id']
        speaker = utterance['speakers'][0] if utterance['speakers'] else 'unknown'
        friends_data['season_id'].append(season_id)
        friends_data['episode_id'].append(episode_id.split('_')[-1])
        friends_data['scene_id'].append(scene_id.split('_')[-1])
        friends_data['utterance_id'].append(utterance_id.split('_')[-1])
        friends_data['speaker'].append(speaker)
        friends_data['tokens'].append(utterance['tokens'])
        friends_data['transcript'].append(utterance['transcript'])

# convert dictionary to data frame
friends_df = pd.DataFrame(friends_data)

# save data frame to .tsv
friends_df.to_csv('friends_transcripts.tsv', sep='\t', index=False)

print('File saved in `friends_transcripts.tsv` !')
# show sample
friends_df.head()

Loading seasons...



File saved in `/content/friends_transcripts.tsv` !


| |season_id|episode_id|scene_id|utterance_id|speaker|tokens|transcript|
|:-|:-|:-|:-|:-|:-|:-|:-|
|0|s01|e01|c01|u001|Monica Geller|[[There, 's, nothing, to, tell, !], [He, 's, j...|There's nothing to tell! He's just some guy I ...|
|1|s01|e01|c01|u002|Joey Tribbiani|[[C'mon, ,, you, 're, going, out, with, the, g...|C'mon, you're going out with the guy! There's ...|
|2|s01|e01|c01|u003|Chandler Bing|[[All, right, Joey, ,, be, nice, .], [So, does...|All right Joey, be nice. So does he have a hum...|
|3|s01|e01|c01|u004|Phoebe Buffay|[[Wait, ,, does, he, eat, chalk, ?]]|Wait, does he eat chalk?|
|4|s01|e01|c01|u005|unknown|[]| |
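Because `to_csv` writes each `tokens` list as its string representation, the column comes back as plain text when the file is read again. The sketch below shows one way to restore the nested lists on load; `load_transcripts` is a hypothetical helper name, and it assumes the file was written by the cell above (tab-separated, no index column).

```python
import ast

import pandas as pd


def parse_tokens(cell):
    """Turn the stored string form of a token list back into nested lists."""
    # An empty cell is treated as "no tokens" rather than a parse error.
    return ast.literal_eval(cell) if cell else []


def load_transcripts(path):
    """Read the .tsv back into a data frame with `tokens` as real lists."""
    return pd.read_csv(
        path,
        sep='\t',
        converters={'tokens': parse_tokens},
        # keep empty transcripts as '' instead of converting them to NaN
        keep_default_na=False,
    )
```

Calling `load_transcripts('friends_transcripts.tsv')` after running the notebook should then give a data frame whose `tokens` entries are lists of token lists, one inner list per sentence, as in the sample above.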
