# Text Analysis using APIs
The aim of this notebook is to get started with the quantitative analysis of text data. We select a Czech text, split it into tokens, perform a frequency analysis, and observe the nature of the data.

In particular, we use the Czech tagger `desamb` via the Language Services API of the Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno.

The tagset is described in https://nlp.fi.muni.cz/raslan/2011/paper05.pdf.


JAKUBÍČEK, Miloš, Vojtěch KOVÁŘ and Pavel ŠMERK. Czech Morphological Tagset Revisited. In Horák, Rychlý. Proceedings of Recent Advances in Slavonic Natural Language Processing 2011. Brno: Tribun EU, 2011. pp. 29-42. ISBN 978-80-263-0077-9. https://www.muni.cz/en/research/publications/959110

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from collections import Counter
import numpy as np

[nltk_data] Downloading package punkt to /home/zuzana/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Read the whole text; assumes the file is UTF-8 encoded
with open('../resources/maj.txt', encoding='utf-8') as f:  # modify the path if needed
    text = f.read()

In [3]:
import requests
import json

# Make the API call
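In this step, we rely on a service provided by a third party. This dependency affects not only whether we can obtain the annotation at all, but also the reproducibility of the results: if the service changes or becomes unavailable, the same output may no longer be obtainable.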
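
One common way to reduce this dependency is to store the service response locally and reuse it on later runs. The following is a minimal sketch of that idea, not part of the original notebook: `load_or_fetch` and the cache file name are hypothetical, and `fetch` stands for any function that returns JSON-serializable data.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached JSON data if the file exists; otherwise call fetch() and cache its result."""
    path = Path(cache_path)
    if path.exists():
        # Reuse the stored response instead of calling the service again
        return json.loads(path.read_text(encoding="utf-8"))
    result = fetch()
    # ensure_ascii=False keeps Czech characters readable in the cache file
    path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

In this notebook, `fetch` could wrap the `requests.post(...)` call and return `r.json()`, so the tagger would be contacted only when no cached file is present.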

The response to the API call includes an HTTP status code (the API is called via an HTTP POST request). The meaning of the codes is listed at https://www.restapitutorial.com/httpstatuscodes.html. In practice, only a few codes occur. For example, 200 means OK, and 404 is the well-known Not Found.
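
The names of the status codes are also available in the Python standard library. The sketch below uses `http.HTTPStatus` to turn a numeric code into a readable label; `describe_status` is a hypothetical helper introduced only for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the numeric HTTP status code together with its standard phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the registered HTTP status codes
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

With a `requests` response `r`, this could be called as `describe_status(r.status_code)`. The `requests` library also offers `r.raise_for_status()`, which raises an exception for 4xx and 5xx responses.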

In [5]:
data = {"call": "tagger", 
        "lang": "cs",
        "output": "json",
        "text": text.replace(';', ',')
       }
uri = "https://nlp.fi.muni.cz/languageservices/service.py"
r = requests.post(uri, data=data)
r

<Response [200]>

In [6]:
if r.status_code != 200:
    print(r.content)

Note that `r.json()` parses the JSON body of the response into Python objects (dictionaries and lists). JSON is one of the standard data formats used when two machines communicate.
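
To see what this parsing does without calling the service, the same conversion can be tried on a small JSON string with the standard `json` module. The string below is a shortened, hand-written imitation of the tagger response, used only for illustration.

```python
import json

# A tiny JSON document shaped like the tagger response
raw = '{"vertical": [["<s>"], ["Byl", "být", "k5eAaImAgInS"], ["</s>"]]}'

parsed = json.loads(raw)

print(type(parsed).__name__)              # dict
print(type(parsed["vertical"]).__name__)  # list
print(parsed["vertical"][1][1])           # být
```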

In [7]:
data = r.json()
data

{'vertical': [['<s>'],
  ['1', '#num#', 'k4'],
  ['Byl', 'být', 'k5eAaImAgInS'],
  ['pozdní', 'pozdní', 'k2eAgInSc1d1'],
  ['večer', 'večer', 'k1gInSc1'],
  ['–', '–', 'k?'],
  ['první', 'první', 'k4xOgInSc4'],
  ['máj', 'máj', 'k1gInSc1'],
  ['–', '–', 'k?'],
  ['večerní', 'večerní', 'k2eAgInSc4d1'],
  ['máj', 'máj', 'k1gFnSc1'],
  ['–', '–', 'k?'],
  ['byl', 'být', 'k5eAaImAgInS'],
  ['lásky', 'láska', 'k1gFnSc2'],
  ['čas', 'čas', 'k1gInSc1'],
  ['<g/>'],
  ['.', '.', 'kIx.'],
  ['</s>'],
  ['<s desamb="1">'],
  ['Hrdliččin', 'hrdliččin', 'k2eAgInSc1d1'],
  ['zval', 'zvát', 'k5eAaImAgInS'],
  ['ku', 'k', 'k7c3'],
  ['lásce', 'láska', 'k1gFnSc3'],
  ['hlas', 'hlas', 'k1gInSc1'],
  ['<g/>'],
  [',', ',', 'kIx,'],
  ['kde', 'kde', 'k6eAd1'],
  ['borový', 'borový', 'k2eAgMnSc1d1'],
  ['zaváněl', 'zavánět', 'k5eAaImAgInS'],
  ['háj', 'háj', 'k1gInSc1'],
  ['<g/>'],
  ['.', '.', 'kIx.'],
  ['</s>'],
  ['<s desamb="1">'],
  ['O', 'o', 'k7c6'],
  ['lásce', 'láska', 'k1gFnSc6'],
  ['šeptal',

**TASK 1**: Observe the data. Without the tagger documentation, is it possible to understand the data?

Explanation: The tagger splits the text into sentences (delimited by `<s>` and `</s>`), then into tokens. The tokens are usually separated by spaces. When they are not (typically around punctuation), the tagger adds the "glue" tag `<g/>`. Each token is represented by a list with three elements: word, lemma, tag.
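
The structure described above can be sketched in code. The function below walks through the rows, joins words with spaces unless a `<g/>` row precedes them, and closes a sentence at `</s>`. `rebuild_sentences` is a hypothetical helper, and it assumes that structural rows appear exactly as in the sample output.

```python
def rebuild_sentences(vertical):
    """Rebuild sentence strings from the tagger's list of rows."""
    sentences = []
    current = ""   # text of the sentence being built
    glue = False   # True if the previous row was <g/>
    for row in vertical:
        if len(row) == 3:            # a token: [word, lemma, tag]
            if current and not glue:
                current += " "       # tokens are separated by a space by default
            current += row[0]
            glue = False
        elif row == ["<g/>"]:        # the next token attaches without a space
            glue = True
        elif row == ["</s>"]:        # end of sentence
            sentences.append(current)
            current, glue = "", False
        # opening rows such as <s> or <s desamb="1"> need no action
    if current:                      # keep an unterminated final sentence
        sentences.append(current)
    return sentences


sample = [
    ['<s desamb="1">'],
    ['Hrdliččin', 'hrdliččin', 'k2eAgInSc1d1'],
    ['zval', 'zvát', 'k5eAaImAgInS'],
    ['ku', 'k', 'k7c3'],
    ['lásce', 'láska', 'k1gFnSc3'],
    ['hlas', 'hlas', 'k1gInSc1'],
    ['<g/>'],
    [',', ',', 'kIx,'],
    ['kde', 'kde', 'k6eAd1'],
    ['borový', 'borový', 'k2eAgMnSc1d1'],
    ['zaváněl', 'zavánět', 'k5eAaImAgInS'],
    ['háj', 'háj', 'k1gInSc1'],
    ['<g/>'],
    ['.', '.', 'kIx.'],
    ['</s>'],
]
print(rebuild_sentences(sample))
# ['Hrdliččin zval ku lásce hlas, kde borový zaváněl háj.']
```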

**TASK 2**: Check the paper describing the tagset and try to "decipher" the annotation of one token into a human-readable description such as "feminine noun in plural instrumental".
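
The sample output suggests that each tag is a sequence of two-character pairs: an attribute letter followed by a one-character value (for example `k1`, `gF`, `nS`, `c1`). The sketch below splits a tag under that assumption. `split_tag` is a hypothetical helper, and the meaning of each attribute and value still has to be looked up in the tagset paper.

```python
def split_tag(tag):
    """Split a tag into {attribute: value}, assuming two-character attribute-value pairs."""
    if len(tag) % 2 != 0:
        # The assumption of two-character pairs does not hold for this tag
        raise ValueError(f"unexpected tag length: {tag!r}")
    return {tag[i]: tag[i + 1] for i in range(0, len(tag), 2)}


print(split_tag('k1gFnSc3'))
# {'k': '1', 'g': 'F', 'n': 'S', 'c': '3'}
```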

In [8]:
# Keep only real tokens [word, lemma, tag]; structural rows (<s>, </s>, <g/>) have a single element
tokens = [token for token in data['vertical'] if len(token) == 3]
df = pd.DataFrame(tokens, columns=["word", "lemma", "tag"])
pd.options.display.max_rows = len(df)  # display all rows
df

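| | word | lemma | tag |
|---|---|---|---|
| 0 | 1 | #num# | k4 |
| 1 | Byl | být | k5eAaImAgInS |
| 2 | pozdní | pozdní | k2eAgInSc1d1 |
| 3 | večer | večer | k1gInSc1 |
| 4 | – | – | k? |
| 5 | první | první | k4xOgInSc4 |
| 6 | máj | máj | k1gInSc1 |
| 7 | – | – | k? |
| 8 | večerní | večerní | k2eAgInSc4d1 |
| 9 | máj | máj | k1gFnSc1 |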


In [9]:
# The first two characters of each tag (attribute `k` and its value) encode the part of speech
pos = [tag[0:2] for tag in df["tag"]]
df["pos"] = pos
df

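| | word | lemma | tag | pos |
|---|---|---|---|---|
| 0 | 1 | #num# | k4 | k4 |
| 1 | Byl | být | k5eAaImAgInS | k5 |
| 2 | pozdní | pozdní | k2eAgInSc1d1 | k2 |
| 3 | večer | večer | k1gInSc1 | k1 |
| 4 | – | – | k? | k? |
| 5 | první | první | k4xOgInSc4 | k4 |
| 6 | máj | máj | k1gInSc1 | k1 |
| 7 | – | – | k? | k? |
| 8 | večerní | večerní | k2eAgInSc4d1 | k2 |
| 9 | máj | máj | k1gFnSc1 | k1 |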
