A Textual Analysis Example

Analyze Chinese data
(by Shaojing Sun, PhD - October 2018)


This example shows how a Chinese-language corpus is imported into T-LAB for further analysis. The corpus is drawn from 509 Chinese newspaper articles concerning food consumption.


All the Chinese articles are compiled and saved as a single CSV file (Fig 1), which is then imported into T-LAB to build the corpus (Fig 2). Articles are distinguished by ID number and by type of newspaper (i.e., print vs. online).

Fig 1

Fig 2

The CSV file above, with one article per row, can be imported into T-LAB and then segmented and lemmatized.
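
As an illustration of the one-article-per-row layout described above, the following minimal Python sketch writes a few records to CSV. The column names (`id`, `source`, `text`) and the sample rows are assumptions made for this example; they are not taken from the actual corpus file or from T-LAB's import requirements.

```python
import csv
import io

# Hypothetical sample records; the real corpus contains 509 articles.
articles = [
    {"id": 1, "source": "print", "text": "第一篇文章的正文"},
    {"id": 2, "source": "online", "text": "第二篇文章的正文"},
]

def write_corpus_csv(rows, handle):
    """Write one article per row: ID, newspaper type, and article text."""
    writer = csv.DictWriter(handle, fieldnames=["id", "source", "text"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
write_corpus_csv(articles, buffer)
csv_text = buffer.getvalue()
```

To produce a file on disk, the same function can be called with a handle such as `open("corpus.csv", "w", encoding="utf-8", newline="")`. The file name and encoding are placeholders, and the encoding should match whatever the import step expects.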

N.B.: T-LAB can segment long Chinese texts supplied in different file formats. Fig 3 below shows how the input text file is segmented before being used for further analysis. Users can edit the segmented text file with ordinary editing tools, for example to add new categorical variables, in order to conduct more sophisticated analyses.

Fig 3


T-LAB provides a list of more than 2000 keywords, listed according to their frequency in the text (Fig 4). We set a threshold and retained only those keywords with a frequency higher than 40. As a result, 201 keywords were retained for co-occurrence analysis and 772 keywords were used for thematic analysis.
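
The frequency threshold described above can be sketched in a few lines of Python. This only illustrates the filtering idea on already segmented tokens; it is not T-LAB's own keyword selection procedure, and the function name and toy tokens are hypothetical.

```python
from collections import Counter

def keywords_above_threshold(tokens, min_exclusive):
    """Return (word, frequency) pairs with frequency strictly above the threshold."""
    counts = Counter(tokens)
    kept = {word: n for word, n in counts.items() if n > min_exclusive}
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))

# Toy token list standing in for a segmented corpus.
tokens = ["食品"] * 5 + ["安全"] * 3 + ["监管"] * 2 + ["走私"]
selected = keywords_above_threshold(tokens, min_exclusive=2)
```

With the threshold used in this example, the call would be `keywords_above_threshold(tokens, min_exclusive=40)`.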

Fig 4


Fig. 5

Fig. 6

Co-occurrence analysis presents a map linking a central keyword (e.g., "food safety") to other associated words. Here, the distance between the central word and an associated word reflects how frequently the two co-occur. "Food safety" seems to be defined by words pointing to supervision, smuggling, and food type.
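
To make the notion of co-occurrence concrete, the sketch below counts how many context units contain both a central keyword and each other word. It is a simplified illustration using raw counts; T-LAB computes its own association measures, and the function name, context units, and words here are assumptions made for the example.

```python
from collections import Counter

def cooccurrence_with(central, units):
    """Count, for each word, the context units it shares with the central keyword."""
    counts = Counter()
    for unit in units:
        words = set(unit)  # count each word at most once per unit
        if central in words:
            counts.update(words - {central})
    return counts

# Toy context units (already segmented); the words are illustrative only.
units = [
    ["食品安全", "监管", "走私"],
    ["食品安全", "监管"],
    ["走私", "冻肉"],
]
assoc = cooccurrence_with("食品安全", units)
```

Words with higher counts in `assoc` are the ones that would be expected to sit closer to the central keyword on a map of this kind.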

By further paring down the list, we retained 100 keywords for concept mapping (Fig 7). Below, each color shows a potential concept pertaining to food safety. Those concepts relate to food smuggling, added ingredients, food transportation, supervision, and crime.

Fig. 7



To better interpret the themes implied in the whole corpus, we expanded the list to the 500 keywords with the highest frequency. The table below (Fig 8) shows four major themes emerging from the corpus. Each theme is described by a list of words ordered by chi-square value. These keywords help researchers interpret and label the themes. Researchers can also go back and create other types of graphs (e.g., network, word cloud) to obtain a different view of the overall relationships among the keywords.

Fig. 8
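
As a rough illustration of how a chi-square value can rank the words that characterize a theme, the sketch below computes the statistic for a 2x2 table (word present or absent, inside or outside the theme). Whether T-LAB uses exactly this uncorrected form is an assumption, and the counts are invented for the example.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table, without continuity correction.

    a: units in the theme that contain the word
    b: units in the theme that do not contain the word
    c: units outside the theme that contain the word
    d: units outside the theme that do not contain the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: a row or column total is zero
    return n * (a * d - b * c) ** 2 / denom

# Invented counts: the word appears in 30 of 100 theme units
# and in 10 of 200 units outside the theme.
value = chi_square_2x2(30, 70, 10, 190)
```

A word that is spread evenly inside and outside a theme yields a value near zero, while a word concentrated in one theme yields a large value and therefore appears near the top of that theme's list.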

Fig. 9

We can also plot the four themes mentioned earlier in a 3-D space (Fig 9) to show their proximity. The 3-D plot above shows that the first two themes appear to lie closer together than the remaining themes do.

To further visualize the linkages between keywords and themes, a variety of graphs (e.g., a network graph, see Fig 10) can be created to show how central the keywords are and how close they lie to one another. By incorporating more keywords into the network analysis, the graph in Fig 10 illustrates the fourth theme, which pertains to public opinion on, and supervision of, expired meat.

Fig. 10
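
The idea of keyword centrality in a network graph can be sketched with a simple degree centrality computed from an edge list. This is only one of several possible centrality measures and is not claimed to be the one T-LAB uses; the function name and edges below are invented for illustration.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Share of the other nodes each node is directly linked to (0.0 to 1.0)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Invented keyword links standing in for edges of a network graph.
edges = [
    ("过期肉", "监管"),
    ("过期肉", "舆论"),
    ("监管", "舆论"),
    ("过期肉", "企业"),
]
centrality = degree_centrality(edges)
```

In this toy network the keyword linked to every other keyword receives the highest score, which corresponds to a hub position in a graph of this kind.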