www.tlab.it

A Textual Analysis Example

Dealing with open-ended questions (T-LAB survey 2018)
(by Cinzia Castiglioni, PhD - May 28th 2018)

1 - INTRODUCTION

This example shows a textual analysis performed on the open-ended questions of the Survey about T-LAB and Text Analysis Software. The survey was conducted by a research team at the Catholic University of Milan (Italy) and aimed to analyse the user experience of T-LAB and other software for text analysis. It also aimed to compare the perceived image of T-LAB to that of its main competitors. The survey included four open-ended questions aimed at understanding T-LAB's perceived strengths, its areas for improvement, its main competitors' strengths, and a final open comment. Table 1 summarizes the open-ended questions as they appeared in the questionnaire, the variable names, and the response rate for each question (186 respondents completed the questionnaire).

Table 1

This example will show how to deal with open-ended questions. More precisely, it will explain the following processes:
a) how to prepare the corpus with answers to many questions (see section 2 below);
b) how to manage multilingual texts with T-LAB (see the short Appendix at the end of this page);
c) how to perform certain multivariate analyses (see section 3 below).

2 - BUILDING THE CORPUS

A single file (tabular format, CSV) was imported. It included 27 different variables, some of which were recoded into categorical variables. For example, the variable 'T-LAB Overall evaluation', which was originally on a scale from 0 to 10, was recoded into five different categories: very high (8.4 to 10), high (>7.7 to 8.3), medium (>7.1 to 7.7), low (>6 to 7.1), very low (0 to 6). Since the texts to be analysed must all be in a single column, each record was duplicated once for every open-ended question the participant answered. In other words, a participant who answered all four open-ended questions contributed four records, one per answer (Fig. 1).
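The recoding and the "one record per answer" reshaping described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the survey: the column names 'TLABIMPROVE' and 'COMMENT', the 'overall' column, and all answers are made up, while 'TLABSTRENGHTS' and 'OTHERSTRENGHT' are spelled as they appear in this example. The sketch also assumes that scores between 8.3 and 8.4 (a range the category list does not mention) fall into 'very high'.

```python
import csv
import io

# Hypothetical wide-format file: one row per respondent, one column per
# open-ended question. Column names and answers are made up for illustration.
WIDE_CSV = """id,overall,TLABSTRENGHTS,TLABIMPROVE,OTHERSTRENGHT,COMMENT
1,9.0,Clear graphs,More tutorials,Free licence,Thanks
2,7.5,Easy to use,,,
3,6.0,,Lower price,Open source,
"""

OPEN_COLUMNS = ["TLABSTRENGHTS", "TLABIMPROVE", "OTHERSTRENGHT", "COMMENT"]


def recode_overall(score):
    """Map a 0-10 score onto the five categories listed in the text.

    Assumption: scores between 8.3 and 8.4 are treated as 'very high'.
    """
    if score <= 6:
        return "very low"
    if score <= 7.1:
        return "low"
    if score <= 7.7:
        return "medium"
    if score <= 8.3:
        return "high"
    return "very high"


def wide_to_long(rows):
    """Emit one record per answered open-ended question."""
    long_rows = []
    for row in rows:
        for question in OPEN_COLUMNS:
            answer = (row[question] or "").strip()
            if not answer:
                continue  # unanswered questions produce no record
            long_rows.append({
                "id": row["id"],
                "TOVERALL": recode_overall(float(row["overall"])),
                "OPENQUEST": question,
                "text": answer,
            })
    return long_rows


rows = list(csv.DictReader(io.StringIO(WIDE_CSV)))
long_rows = wide_to_long(rows)
for record in long_rows:
    print(record)
```

Each respondent's categorical variables are repeated on every record, so the single text column can later be analysed against them.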

Fig. 1

Then, using the T-LAB Corpus Builder tool, the corpus file was edited and analysed in just a few clicks (Fig. 2).

Fig. 2

 

3 - TEXTUAL ANALYSIS

For the following analyses, the User profile 'Expert' was selected. First, a Co-word Analysis was performed in order to identify the most desirable characteristics of textual analysis software. This analysis did not focus specifically on T-LAB, but rather on software in general. Next, a Thematic Document Classification was performed to identify the most recurring themes within the open questions. Finally, a Multiple Correspondence Analysis made it possible to investigate the relationships between the thematic clusters and other categorical variables.

3.1 - Before starting...

Before starting, you can have a quick look at the imported variables and their modalities. On the main menu, select 'Other Tools -> Variable Manager'. A dialog box will open, showing a list of all imported variables in the left column. By clicking on the box next to a variable, you can check the categories of that variable (Fig. 3).

Fig. 3

3.2 - The Co-Word Analysis

First, a new variable ('RCOPEN') was created by joining two categories of the variable 'OPENQUEST'. That is, the modalities 'TLABSTRENGHTS' and 'OTHERSTRENGHT' of the variable 'OPENQUEST' became 'RC_STRENGHTS' of the variable 'RCOPEN'. By joining the strengths of both T-LAB and other software, it became possible to investigate the most appreciated characteristics of text analysis software in general.
To see which key-words (lemmas or categories) are most frequently used when referring to the strengths of textual analysis software (both T-LAB and others), a Co-word Analysis was performed. To do that, click on 'Co-occurrence analysis' in the main menu, then select 'Co-word Analysis and Concept Mapping'. In order to analyse key-word occurrences and co-occurrences in relation to the strengths of textual analysis software, I selected 'subset' as the analysis context and then selected the category 'RC_STRENGTH' of the variable RCOPEN (see Fig. 4 and Fig. 5).
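The logic of these two steps (merging two modalities into one, then counting occurrences and co-occurrences only within the resulting subset) can be sketched as follows. This is a simplified illustration rather than T-LAB's own procedure: the records, the modality 'TLABIMPROVE' and the stop-word list are made up, and the merged category is written 'RC_STRENGHTS', following the first spelling used above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records (one answer each); texts are invented for illustration.
records = [
    {"OPENQUEST": "TLABSTRENGHTS", "text": "clear graphs and easy interface"},
    {"OPENQUEST": "OTHERSTRENGHT", "text": "free and open source"},
    {"OPENQUEST": "OTHERSTRENGHT", "text": "easy interface and free licence"},
    {"OPENQUEST": "TLABIMPROVE", "text": "more tutorials and lower price"},
]

# Two modalities of OPENQUEST are joined into one modality of RCOPEN;
# every other modality is kept unchanged.
MERGE = {"TLABSTRENGHTS": "RC_STRENGHTS", "OTHERSTRENGHT": "RC_STRENGHTS"}
STOPWORDS = {"and"}  # assumption: a minimal stop-word list

for record in records:
    record["RCOPEN"] = MERGE.get(record["OPENQUEST"], record["OPENQUEST"])

# Restrict the analysis context to the 'strengths' subset.
subset = [r for r in records if r["RCOPEN"] == "RC_STRENGHTS"]

occurrences = Counter()    # number of answers containing each word
cooccurrences = Counter()  # number of answers containing both words of a pair
for record in subset:
    words = sorted(set(record["text"].split()) - STOPWORDS)
    occurrences.update(words)
    cooccurrences.update(combinations(words, 2))

print(occurrences.most_common(3))
print(cooccurrences[("easy", "interface")])
```

Here each answer is treated as one context unit, so a pair co-occurs when both words appear in the same answer.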

Fig. 4

Fig. 5

Right-click on the resulting map and select 'Dominant Words -> Yes'. This will allow you to see the dominant words on the Multidimensional Scaling Map (MDS, Fig. 6). This map allows us to interpret the relationships between the "objects" and the dimensions that organize the space in which they are represented. Unlike word-cloud generators, which only rank words by their occurrences, the MDS also takes their co-occurrences into account. In this way, it is possible to identify different semantic clusters representing different aspects which are desirable in text analysis software:
- Graphic Interface (top-left quadrant);
- Usability (bottom-left quadrant);
- Analysis Issues (top-right quadrant);
- Price/Open-source (bottom-right quadrant).
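To make the difference between occurrences and co-occurrences concrete, the sketch below turns raw counts into an association index and then into a dissimilarity, which is the kind of input an MDS procedure works on. The cosine coefficient is used here as one common choice; this example does not state which index T-LAB applied, and all counts are made up.

```python
import math

# Made-up counts for illustration: number of text units containing each
# word, and number of text units containing both words of a pair.
occurrences = {"easy": 20, "interface": 16, "price": 9}
cooccurrences = {
    ("easy", "interface"): 12,
    ("easy", "price"): 3,
    ("interface", "price"): 0,
}


def cosine_index(word_a, word_b):
    """Cosine coefficient between two words' presence/absence profiles."""
    pair = tuple(sorted((word_a, word_b)))
    shared = cooccurrences.get(pair, 0)
    return shared / math.sqrt(occurrences[word_a] * occurrences[word_b])


def dissimilarity(word_a, word_b):
    """Distance-like value: words that often co-occur end up close together."""
    return 1.0 - cosine_index(word_a, word_b)


for a, b in cooccurrences:
    print(a, b, round(cosine_index(a, b), 3), round(dissimilarity(a, b), 3))
```

Two words with high frequencies but no shared contexts get the maximum dissimilarity, which a frequency-only word cloud cannot express.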

Fig. 6

3.3 - Thematic Document Classification

A new variable (DOC_CLUSTER) was created as a result of the Thematic Document Classification. The Thematic Document Classification allows us to construct document clusters based on their similarities (i.e. documents characterized by the same patterns of key-words and/or co-occurrences of semantic traits). Five clusters were identified (Fig. 7):
- Cluster 1 (17%) consists of comments which mostly refer to the software licensing.
- Cluster 2 (38%) is related to the software interface.
- Cluster 3 (11%) deals with the ease of use of the software.
- Cluster 4 (17%) relates to the details of data analysis.
- Cluster 5 (11%) deals with the topic of open-source.
- Only 6% of the documents were not classified.
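The notion of "similarity between documents" that underlies this kind of classification can be illustrated with a cosine similarity between key-word count vectors. This is only a sketch of the idea: the clustering algorithm T-LAB actually uses is not described in this example, and the three documents are made up.

```python
import math
from collections import Counter

# Made-up documents (answers) for illustration.
docs = {
    "d1": "licence price too high",
    "d2": "price of the licence",
    "d3": "interface graphs are clear",
}


def vectorize(text):
    """Represent a document as a bag of word counts."""
    return Counter(text.lower().split())


def cosine(vec_a, vec_b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a if w in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = {name: vectorize(text) for name, text in docs.items()}


def nearest(name):
    """Return the other document most similar to the given one."""
    others = [other for other in vectors if other != name]
    return max(others, key=lambda other: cosine(vectors[name], vectors[other]))


print(cosine(vectors["d1"], vectors["d2"]))
print(nearest("d1"))
```

A clustering procedure groups documents so that pairs like d1 and d2, which share key-words, end up in the same cluster.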

Fig. 7

A Correspondence Analysis was performed on this new variable, excluding the documents (6%) that the Thematic Document Classification was not able to classify ('Comparative Analysis -> Correspondence Analysis', see Fig. 8 and Fig. 9). The results can be drawn as graphs that represent the relationships between the corpus subsets and the lexical units that make them up. Figure 9 shows the relationships between the active variable (DOC_CLUSTER) and lemmas.
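As background, a Correspondence Analysis starts from a contingency table (here, clusters by lemmas) and decomposes its standardized residuals, i.e. the deviations from what independence between rows and columns would predict. The sketch below computes only those residuals and the total inertia, not the factorial axes themselves; the table is made up and the cluster labels are hypothetical.

```python
import math

# Made-up contingency table: rows are clusters, columns are lemmas,
# cells are occurrence counts.
rows = ["CLUSTER_1", "CLUSTER_2"]
cols = ["licence", "interface", "easy"]
table = [
    [30, 5, 5],
    [10, 40, 10],
]

total = sum(sum(row) for row in table)
row_mass = [sum(row) / total for row in table]
col_mass = [
    sum(table[i][j] for i in range(len(rows))) / total
    for j in range(len(cols))
]


def standardized_residuals():
    """(observed - expected proportion) / sqrt(expected proportion)."""
    result = []
    for i, row in enumerate(table):
        result.append([
            (row[j] / total - row_mass[i] * col_mass[j])
            / math.sqrt(row_mass[i] * col_mass[j])
            for j in range(len(cols))
        ])
    return result


residuals = standardized_residuals()
# Total inertia equals the chi-square statistic divided by the grand total.
inertia = sum(value * value for row in residuals for value in row)

for name, row in zip(rows, residuals):
    print(name, [round(value, 3) for value in row])
print("total inertia:", round(inertia, 4))
```

A positive residual means a lemma is over-represented in a cluster; on the factorial map such a lemma is drawn towards that cluster.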

Fig. 8

Fig. 9

Fig. 10

Fig. 11

Interestingly, even though their order is different, the four quadrants of the above picture match those of the MDS analysis (see Fig. 6 above).
Moreover, on the left of the above picture there is a clear reference to the contrast between 'open-source' and 'pricing' of software licenses.

3.4 - Multiple Correspondence Analysis

To verify the relationship between the Thematic map (Fig. 11) and the Multidimensional Scaling Map (Fig. 6), I performed a Multiple Correspondence Analysis. This analysis, which may be considered an extension of the simple Correspondence Analysis, allows us to analyse the relationships between two or more categorical variables.
Click on 'Comparative Analysis -> Multiple Correspondence Analysis' (Fig. 12). The following variables were selected: CATOPEN (categories of the open questions), TOVERALL (overall evaluation of T-LAB), and DOC_CLUSTER (thematic document classification).
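A Multiple Correspondence Analysis is commonly computed as a Correspondence Analysis of an indicator matrix (one 0/1 column per category) or of the corresponding Burt table. The sketch below builds both for the three variables named above; the category values and the records are made up, and the decomposition step itself is omitted.

```python
# Hypothetical records; the variable names come from the text, the
# category values are invented for illustration.
records = [
    {"CATOPEN": "STRENGTHS", "TOVERALL": "high", "DOC_CLUSTER": "CLUSTER_2"},
    {"CATOPEN": "IMPROVE", "TOVERALL": "low", "DOC_CLUSTER": "CLUSTER_1"},
    {"CATOPEN": "STRENGTHS", "TOVERALL": "very high", "DOC_CLUSTER": "CLUSTER_2"},
]
VARIABLES = ["CATOPEN", "TOVERALL", "DOC_CLUSTER"]

# One column per (variable, category) pair observed in the data.
columns = sorted({(var, rec[var]) for rec in records for var in VARIABLES})

# Indicator matrix: 1 if the record falls in that category, else 0.
indicator = [
    [1 if rec[var] == cat else 0 for (var, cat) in columns]
    for rec in records
]

# Burt table: cross-tabulation of every category with every other category.
size = len(columns)
burt = [
    [sum(row[a] * row[b] for row in indicator) for b in range(size)]
    for a in range(size)
]

for (var, cat), line in zip(columns, burt):
    print(f"{var}={cat}", line)
```

Each row of the indicator matrix sums to the number of variables, since every record takes exactly one category per variable.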

Fig. 12

After the analysis was launched, I removed the categories of those items which had not been classified or were not available. To do that, right-click on the resulting graphs and select 'Add/Remove items' -> 'Open' (see Fig. 13 and Fig. 14).

Fig. 13

Fig. 14

For a better visualization of the graph, I decided to export it to Excel. To do that, right-click on the resulting graph and select 'Export'. Then, in the dialog boxes, select 'Text/Data' -> 'Export' -> 'Maximum precision' -> 'Export' (see Fig. 15 and Fig. 16).

Fig. 15

Fig. 16

Open an Excel file and click 'Paste'. A list of key-words and categories with their coordinates will appear (see Fig. 17).
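The same list of coordinates can also be processed outside Excel. The sketch below assumes, purely for illustration, that the exported data is tab-separated with a label and two coordinates per line; the actual layout of the T-LAB export is not described here, and the column names 'LABEL', 'X', 'Y' and all values are made up.

```python
import csv
import io

# Hypothetical pasted export: label, X coordinate, Y coordinate.
PASTED = (
    "LABEL\tX\tY\n"
    "INTERFACE\t-0.52\t0.41\n"
    "PRICE\t0.63\t-0.37\n"
    "EASY\t-0.48\t-0.22\n"
)


def quadrant(x, y):
    """Name the quadrant of the map in which a point falls."""
    vertical = "top" if y >= 0 else "bottom"
    horizontal = "right" if x >= 0 else "left"
    return f"{vertical}-{horizontal}"


points = []
for row in csv.DictReader(io.StringIO(PASTED), delimiter="\t"):
    x, y = float(row["X"]), float(row["Y"])
    points.append((row["LABEL"], x, y, quadrant(x, y)))

for label, x, y, where in points:
    print(f"{label:10s} {x:6.2f} {y:6.2f} {where}")
```

Grouping labels by quadrant in this way gives a quick textual summary of which key-words and categories lie close together on the map.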

Fig. 17

Fig. 18

In the above picture, three main clusters are highlighted; they show the relationships between the words used by the respondents and their corresponding variable values.
Interestingly, it seems that the features of T-LAB which have a high score in the overall evaluation are, at the same time, both 'strengths' and something 'to improve'.

SHORT APPENDIX

The version of T-LAB used was T-LAB Plus 2018.

The comments of respondents were in three languages: English, Italian and Spanish.
When dealing with a multilingual corpus there are two viable options. The first one is to translate the whole textual corpus into the same language before importing it. The second one is to import the textual corpus as it is and then perform some operations on the corpus dictionary through the Dictionary building function. If you choose the second option, you need to deselect the Automatic Lemmatization function when importing the corpus. In the Dictionary building menu, you can manually connect each word, regardless of its language, to a lemma of your choice. For example, the lemma 'DATA' includes the words 'dati' (Italian), 'data' (English), and 'datos' (Spanish).
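The effect of such a customized dictionary can be sketched as a simple word-to-lemma mapping. Only the 'DATA' entry comes from this example; the sample answers are made up, and the sketch does not reproduce how T-LAB stores or applies its dictionary.

```python
from collections import Counter

# Word-to-lemma mapping; the three entries are the example given in the text.
LEMMA_MAP = {"dati": "DATA", "data": "DATA", "datos": "DATA"}


def lemmatize(text):
    """Replace each mapped word with its lemma; leave other words as they are."""
    return [LEMMA_MAP.get(token, token) for token in text.lower().split()]


# Made-up answers in the three languages of the survey.
answers = ["analisi dei dati", "data analysis", "analisis de datos"]

counts = Counter(token for answer in answers for token in lemmatize(answer))
print(counts["DATA"])
```

Once words from different languages share a lemma, their occurrences are counted together in every subsequent analysis.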

Below is an excerpt from the customized dictionary which has been used in this analysis.