# Advanced Text Analysis with Visualization
This notebook shows how word meanings shift over the course of a text, segment by segment. We use the `bokeh` library for interactive visualization.

The main inspiration is the visualization of the Royal Society Corpus:

http://corpora.ids-mannheim.de/diaviz/royalsociety.html

Presentation at the CLARIN conference:

https://www.clarin.eu/sites/default/files/clarin2019_keynote_teich.pdf

LREC Paper:

Fischer, S., Knappen, J., Menzel, K., & Teich, E. (2020). The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study. LREC. https://www.aclweb.org/anthology/2020.lrec-1.99/

In [1]:
!pip install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3098440 sha256=79c9a66ed6c737c47528a6c7e6b29df25abe567acda09af11d10efb63c4f317f
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee9

In [2]:
# This takes a long time
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.cs.300.bin.gz
!gunzip cc.cs.300.bin.gz

--2021-05-31 10:37:54--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.cs.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4502843070 (4.2G) [application/octet-stream]
Saving to: ‘cc.cs.300.bin.gz’


2021-05-31 10:41:08 (22.1 MB/s) - ‘cc.cs.300.bin.gz’ saved [4502843070/4502843070]



In [2]:
!pip install bokeh --user

Collecting bokeh
  Downloading bokeh-2.3.2.tar.gz (10.7 MB)
Building wheels for collected packages: bokeh
  Building wheel for bokeh (setup.py) ... done
  Created wheel for bokeh: filename=bokeh-2.3.2-py3-none-any.whl size=11334264 sha256=92ab79c8d9812dbfe4be88c4edace12903aa21a4343f1ada19395e69ba74138c
  Stored in directory: /home/zuzana/.cache/pip/wheels/1b/a5/90/f38b6cd80a8276a7203765295ad3da078aa24ff8006096ae49
Successfully built bokeh
Installing collected packages: bokeh
Successfully installed bokeh-2.3.2


In [3]:
import pandas as pd

import numpy as np
import requests
import json
import fasttext

import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
from bokeh.layouts import column
from bokeh.models import CustomJS, ColumnDataSource, Slider, LabelSet
from bokeh.palettes import Turbo256 

# fetch and clear the document
from bokeh.io import curdoc

from collections import Counter

from sklearn.manifold import TSNE

In [4]:
# The poem is Czech; the file is assumed to be UTF-8, so state the encoding explicitly.
with open('../resources/maj.txt', encoding='utf-8') as f:  # modify the path if needed
    text = f.read()

In [None]:
# This takes a long time.
%time embeddings = fasttext.load_model('cc.cs.300.bin')



CPU times: user 3.39 s, sys: 7.63 s, total: 11 s
Wall time: 2min 19s


# Get the data and transform it to the right form
This notebook works with the text split into individual segments (usually stanzas), which are separated by blank lines in the source file.

In [None]:
segments = text.split('\n\n')
segments[:3]

['',
 '1',
 'Byl pozdní večer – první máj –\nvečerní máj – byl lásky čas.\nHrdliččin zval ku lásce hlas,\nkde borový zaváněl háj.\nO lásce šeptal tichý mech;\nkvětoucí strom lhal lásky žel,\nsvou lásku slavík růži pěl,\nrůžinu jevil vonný vzdech.\nJezero hladké v křovích stinných\nzvučelo temně tajný bol,\nbřeh je objímal kol a kol;\na slunce jasná světů jiných\nbloudila blankytnými pásky,\nplanoucí tam co slzy lásky.']

The first stanza is segment no. 2 (indexing starts at 0).
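
Because the split also yields an empty string and bare stanza numbers, it can help to list which segments actually hold verse. The sketch below is one possible way to do that. The helper `stanza_indices` and the miniature sample text are hypothetical and not part of the notebook.

```python
def stanza_indices(segments):
    """Return indices of segments that hold text, skipping empty strings and bare numbers."""
    return [i for i, seg in enumerate(segments)
            if seg.strip() and not seg.strip().isdigit()]


# Hypothetical miniature input shaped like the file used above.
sample = ("\n\n1\n\nByl pozdní večer – první máj –\nvečerní máj – byl lásky čas."
          "\n\n2\n\nDruhá sloka.")
sample_segments = sample.split("\n\n")
print(stanza_indices(sample_segments))
```

Unlike a plain length check such as `len(segment) > 1`, this filter would also skip numeric headings with two or more digits.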

In [None]:
uri = "https://nlp.fi.muni.cz/languageservices/service.py"
orig_tokens = []
for i, segment in enumerate(segments):
    if len(segment) > 1:  # skip empty segments and one-character headings
        payload = {"call": "tagger",
                   "lang": "cs",
                   "output": "json",
                   "text": segment.replace(';', ',')
                  }
        r = requests.post(uri, data=payload)
        try:
            result = r.json()
            print("segment", str(i))
            # keep only complete (word, lemma, tag) rows and remember the segment index
            orig_tokens.extend([(i, token[0], token[1], token[2]) for token in result['vertical'] if len(token)==3])
        except (ValueError, KeyError):
            # the service did not return the expected JSON; show the raw response
            print(r.content)
print("Number of tokens", len(orig_tokens))

segment 2
segment 3
segment 4
segment 5
segment 6
b'<body bgcolor="#f0f0f8"><font color="#f0f0f8" size="-5"> -->\n<body bgcolor="#f0f0f8"><font color="#f0f0f8" size="-5"> --> -->\n</font> </font> </font> </script> </object> </blockquote> </pre>\n</table> </table> </table> </table> </table> </font> </font> </font><body bgcolor="#f0f0f8">\n<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="heading">\n<tr bgcolor="#6622aa">\n<td valign=bottom>&nbsp;<br>\n<font color="#ffffff" face="helvetica, arial">&nbsp;<br><big><big><strong>&lt;type \'exceptions.IOError\'&gt;</strong></big></big></font></td\n><td align=right valign=bottom\n><font color="#ffffff" face="helvetica, arial">Python 2.7.17: /usr/bin/python<br>Thu Jan  7 13:40:30 2021</font></td></tr></table>\n    \n<p>A problem occurred in a Python script.  Here is the sequence of\nfunction calls leading up to the error, in the order they occurred.</p>\n<table width="100%" cellspacing=0 cellpadding=0 border=0>\n<tr><td bgcolor=
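
The output above shows that the service can answer with an HTML error page instead of JSON. One way to cope with this is to retry the request a few times before giving up, sketched below. The helper `tag_segment` and its `max_tries` and `pause` parameters are hypothetical. `post` is assumed to behave like `requests.post`, and the response is assumed to have the `{'vertical': [[word, lemma, tag], ...]}` shape used in the cell above.

```python
import time


def tag_segment(post, uri, segment, max_tries=3, pause=1.0):
    """Return (word, lemma, tag) triples for one segment, or [] if every attempt fails.

    `post` is expected to behave like `requests.post`. This wrapper is a
    hypothetical helper, not part of the tagging service's API.
    """
    payload = {"call": "tagger", "lang": "cs", "output": "json",
               "text": segment.replace(";", ",")}
    for attempt in range(max_tries):
        try:
            rows = post(uri, data=payload).json()["vertical"]
            # Keep only complete rows, as the notebook cell does.
            return [tuple(row) for row in rows if len(row) == 3]
        except (ValueError, KeyError):
            # The answer was not the expected JSON; wait and try again.
            if attempt + 1 < max_tries:
                time.sleep(pause)
    return []
```

In the notebook this could be called as `tag_segment(requests.post, uri, segment)`. Network failures raised by `requests` itself are not caught here and would still need separate handling.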

In [None]:
orig_tokens[:10]

[(2, 'Byl', 'být', 'k5eAaImAgInS'),
 (2, 'pozdní', 'pozdní', 'k2eAgInSc1d1'),
 (2, 'večer', 'večer', 'k1gInSc1'),
 (2, '–', '–', 'k?'),
 (2, 'první', 'první', 'k4xOgInSc4'),
 (2, 'máj', 'máj', 'k1gInSc1'),
 (2, '–', '–', 'k?'),
 (2, 'večerní', 'večerní', 'k2eAgInSc4d1'),
 (2, 'máj', 'máj', 'k1gInSc1'),
 (2, '–', '–', 'k?')]

In [None]:
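# select only autosemantic (content-word) POS: k1 noun, k2 adjective, k5 verb, k6 adverb
# convert word forms to lowercase
tokens = []
for segment_id, word, lemma, tag in orig_tokens:
    if tag[:2] in ('k1', 'k2', 'k5', 'k6'):
        tokens.append((segment_id, word.lower(), lemma, tag))
print("Number of tokens", len(tokens))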

Number of tokens 3021
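
To see what the filter keeps, the tokens can be counted by part of speech. The sketch below assumes that the tag prefixes `k1`, `k2`, `k5` and `k6` stand for nouns, adjectives, verbs and adverbs. The names `POS_NAMES` and `pos_histogram` are hypothetical, and the sample tokens are copied from the tagger output shown earlier.

```python
from collections import Counter

# Assumed reading of the tag prefixes kept by the filter above.
POS_NAMES = {"k1": "noun", "k2": "adjective", "k5": "verb", "k6": "adverb"}


def pos_histogram(tokens):
    """Count (segment, word, lemma, tag) tuples by part of speech, ignoring other tags."""
    return Counter(POS_NAMES[tag[:2]] for _, _, _, tag in tokens if tag[:2] in POS_NAMES)


sample_tokens = [
    (2, "byl", "být", "k5eAaImAgInS"),
    (2, "pozdní", "pozdní", "k2eAgInSc1d1"),
    (2, "večer", "večer", "k1gInSc1"),
    (2, "–", "–", "k?"),
    (2, "máj", "máj", "k1gInSc1"),
]
print(pos_histogram(sample_tokens))
```

Punctuation tagged `k?` is dropped, just as in the filtering cell.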


In [None]:
tokens[:20]

[(2, 'byl', 'být', 'k5eAaImAgInS'),
 (2, 'pozdní', 'pozdní', 'k2eAgInSc1d1'),
 (2, 'večer', 'večer', 'k1gInSc1'),
 (2, 'máj', 'máj', 'k1gInSc1'),
 (2, 'večerní', 'večerní', 'k2eAgInSc4d1'),
 (2, 'máj', 'máj', 'k1gInSc1'),
 (2, 'byl', 'být', 'k5eAaImAgInS'),
 (2, 'lásky', 'láska', 'k1gFnSc2'),
 (2, 'čas', 'čas', 'k1gInSc1'),
 (2, 'hrdliččin', 'hrdliččin', 'k2eAgInSc1d1'),
 (2, 'zval', 'zvát', 'k5eAaImAgInS'),
 (2, 'lásce', 'láska', 'k1gFnSc3'),
 (2, 'hlas', 'hlas', 'k1gInSc1'),
 (2, 'kde', 'kde', 'k6eAd1'),
 (2, 'borový', 'borový', 'k2eAgMnSc1d1'),
 (2, 'zaváněl', 'zavánět', 'k5eAaImAgInS'),
 (2, 'háj', 'háj', 'k1gInSc1'),
 (2, 'lásce', 'láska', 'k1gFnSc6'),
 (2, 'šeptal', 'šeptat', 'k5eAaImAgInS'),
 (2, 'tichý', 'tichý', 'k2eAgInSc1d1')]

In [None]:
df = pd.DataFrame(tokens, columns=["segment", "word", "lemma", "tag"])
pd.options.display.max_rows = len(df)  # show every row when the frame is displayed
df

Unnamed: 0,segment,word,lemma,tag
0,2,byl,být,k5eAaImAgInS
1,2,pozdní,pozdní,k2eAgInSc1d1
2,2,večer,večer,k1gInSc1
3,2,máj,máj,k1gInSc1
4,2,večerní,večerní,k2eAgInSc4d1
5,2,máj,máj,k1gInSc1
6,2,byl,být,k5eAaImAgInS
7,2,lásky,láska,k1gFnSc2
8,2,čas,čas,k1gInSc1
9,2,hrdliččin,hrdliččin,k2eAgInSc1d1


In [None]:
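words = sorted(set(df.word))  # sorted, so the row order of the matrix is reproducible
word_count = Counter(df.word.values)
# one 300-dimensional fastText vector per distinct word form
embeddings_all = np.array([embeddings.get_word_vector(t) for t in words])
embeddings_all.shape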

(1526, 300)

In [None]:
word_count.most_common(20)

[('hlas', 29),
 ('tam', 26),
 ('čas', 24),
 ('den', 18),
 ('je', 18),
 ('jsem', 17),
 ('máj', 15),
 ('kde', 15),
 ('sen', 15),
 ('stín', 14),
 ('již', 14),
 ('noc', 14),
 ('vězně', 14),
 ('zemi', 14),
 ('zrak', 13),
 ('zář', 12),
 ('dál', 12),
 ('klín', 12),
 ('vězeň', 12),
 ('lásky', 11)]

In [None]:
# Note: newer scikit-learn releases rename `n_iter` to `max_iter`.
tsne_em = TSNE(n_components=2, perplexity=30.0, n_iter=1000, verbose=1,
               metric='cosine', init='pca', random_state=42).fit_transform(embeddings_all)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1526 samples in 0.001s...
[t-SNE] Computed neighbors for 1526 samples in 0.141s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1526
[t-SNE] Computed conditional probabilities for sample 1526 / 1526
[t-SNE] Mean sigma: 0.232269
[t-SNE] KL divergence after 250 iterations with early exaggeration: 75.556473
[t-SNE] KL divergence after 1000 iterations: 1.730198
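
With `metric='cosine'`, t-SNE compares word vectors by the angle between them rather than by their length. The plain-Python sketch below illustrates that distance, 1 - u·v / (|u| |v|). The function `cosine_distance` is a hypothetical illustration only. scikit-learn computes the distances internally.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

Vectors pointing in the same direction have a distance close to zero regardless of their length.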


In [None]:
tsne_em.shape

(1526, 2)

In [None]:
df_subset = df.copy()  # work on a copy so `df` keeps only the token columns
word_index = {w: i for i, w in enumerate(words)}  # constant-time row lookup into tsne_em
rows = [word_index[w] for w in df.word.values]
df_subset['tsne-2d-one'] = tsne_em[rows, 0]
df_subset['tsne-2d-two'] = tsne_em[rows, 1]
df_subset['count'] = [word_count[w] for w in df.word.values]
df_subset.head()

Unnamed: 0,segment,word,lemma,tag,tsne-2d-one,tsne-2d-two,count
0,2,byl,být,k5eAaImAgInS,-37.871258,21.348797,8
1,2,pozdní,pozdní,k2eAgInSc1d1,3.21056,46.294643,6
2,2,večer,večer,k1gInSc1,0.325021,50.543339,5
3,2,máj,máj,k1gInSc1,-4.407549,-8.332747,15
4,2,večerní,večerní,k2eAgInSc4d1,4.801791,46.577759,8


In [None]:
df_subset[df_subset['word']=='večer']

Unnamed: 0,segment,word,lemma,tag,tsne-2d-one,tsne-2d-two,count
2,2,večer,večer,k1gInSc1,0.325021,50.543339,5
325,10,večer,večer,k1gInSc4,0.325021,50.543339,5
2389,103,večer,večer,k1gInSc1,0.325021,50.543339,5
2725,116,večer,večer,k6eAd1,0.325021,50.543339,5
3008,120,večer,večer,k1gInSc1,0.325021,50.543339,5


In [None]:
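labels = df_subset['word']
x = df_subset['tsne-2d-one']
y = df_subset['tsne-2d-two']
segment = df_subset['segment']
first_segment = df_subset['segment'].values[0]
# Turbo256 has 256 colors; the modulo keeps segment indices above 127 in range
colors = [Turbo256[(c * 2) % 256] for c in df_subset['segment'].values]
# initially highlight the first segment: opaque markers and visible labels
textalpha = [1. if c == first_segment else 0. for c in df_subset['segment'].values]
alpha = [1. if c == first_segment else 0.1 for c in df_subset['segment'].values]
# marker radius grows with log frequency; words occurring once get radius 0 (log 1 = 0)
radii = np.log(df_subset['count']) / 2.

source = ColumnDataSource(data=dict(x=x, y=y, alpha=alpha, segment=segment, colors=colors, radii=radii, labels=labels, textalpha=textalpha))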

In [None]:
curdoc().clear()

output_notebook()

TOOLS="crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select,"

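# plot_width/plot_height match the Bokeh 2.3.2 installed above; Bokeh 3 renamed them to width/height
p = figure(tools=TOOLS, plot_width=1080, plot_height=800)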

p.scatter('x', 'y', source=source, radius='radii',
          fill_color='colors', fill_alpha='alpha',
          line_color=None)

labels = LabelSet(x='x', y='y', text='labels', level='glyph', text_alpha='textalpha',
              x_offset=5, y_offset=5, source=source, render_mode='canvas')

p.add_layout(labels)

slider = Slider(start=df_subset.segment.values[0], end=df_subset.segment.values[-1], value=df_subset.segment.values[0], step=1, title="Text segment")

callback = CustomJS(args=dict(source=source, segment=slider),
                    code="""
    // Highlight the selected segment, dim its direct neighbours, fade the rest.
    const data = source.data;
    const alpha = data['alpha'];
    const textalpha = data['textalpha'];
    for (let i = 0; i < alpha.length; i++) {
      const distance = Math.abs(data['segment'][i] - segment.value);
      if (distance == 0) {
        alpha[i] = 1.0;
        textalpha[i] = 1.0;
      } else if (distance == 1) {
        alpha[i] = 0.5;
        textalpha[i] = 0.5;
      } else {
        alpha[i] = 0.1;
        textalpha[i] = 0.0;
      }
    }
    source.change.emit();
""")



slider.js_on_change('value', callback)

layout = column(slider, p)

show(layout)
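
The slider callback runs in the browser, which makes its highlighting rule awkward to check from Python. The sketch below mirrors that rule in plain Python: tokens in the selected segment are fully opaque, tokens in the directly adjacent segments are half transparent, and all others are faded with hidden labels. The function `highlight` is hypothetical and is not used by the plot itself.

```python
def highlight(segments, selected):
    """Return (alpha, text_alpha) lists following the rule used in the slider callback."""
    alpha, text_alpha = [], []
    for seg in segments:
        distance = abs(seg - selected)
        if distance == 0:      # token belongs to the selected segment
            marker, label = 1.0, 1.0
        elif distance == 1:    # token belongs to a directly adjacent segment
            marker, label = 0.5, 0.5
        else:                  # everything else stays in the background
            marker, label = 0.1, 0.0
        alpha.append(marker)
        text_alpha.append(label)
    return alpha, text_alpha


marker_alpha, label_alpha = highlight([2, 2, 3, 4, 10], selected=3)
print(marker_alpha)
print(label_alpha)
```

A helper like this could also be used to precompute the initial `alpha` and `textalpha` columns, so that the first view matches what the slider produces.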