The Da Vinci Code

T-LAB Tools for Text Analysis

www.tlab.it

Short Samples:
The Da Vinci Code
(last update: March 15th, 2005. The version of T-LAB used was 4.2)

NOTICE: The following example was produced with an old version of T-LAB (4.2).
The latest version includes new tools and a new charting system. Click here to find out more.

The idea of this short sample came from a conversation between the author of T-LAB and a reader of Dan Brown's novel.
The former, without having read the novel, was interested in testing a clustering algorithm (see Thematic Document Classification).
The latter, a precise and passionate reader, provided valuable suggestions for the analysis.

Common objective: to verify whether, and how well, T-LAB works as a tool for constructing a representation of the book's "contents".

Methods:
- transformation of the novel into a corpus subdivided into 105 context units (i.e. primary documents), each corresponding to a chapter;
- use of T-LAB functions for linguistic pre-processing, in particular: a) grouping the proper nouns used to identify the characters (e.g. "Aringarosa" and "Bishop Aringarosa", "Sophie" and "Sophie Neveu", "Collet" and "Lieutenant Collet", etc.); b) automatic lemmatization;
- selection, by means of a T-LAB function, of 1,052 lexical units (i.e. words, lemmas or lexies);
- use of a clustering algorithm (a version of bisecting K-means) for analysing a 105 x 1,052 matrix (context units x lexical units);
- measure of similarity used: the cosine coefficient.
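
The last two steps above can be illustrated with a minimal sketch, assuming a tiny made-up matrix in place of the 105 x 1,052 one. It is not T-LAB's implementation: the seeding rule, the choice to always split the largest cluster, and every function name below are assumptions made for the example. Rows are scaled to unit length so that a dot product equals the cosine coefficient.

```python
import math

def normalize(row):
    """Scale a vector to unit length, so that a dot product is a cosine."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)

def cosine(a, b):
    """Cosine coefficient of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroid(rows):
    """Unit-length mean of a group of rows."""
    return normalize([sum(col) / len(rows) for col in zip(*rows)])

def two_means(rows, iterations=20):
    """Split rows in two; returns two lists of positions within rows."""
    # Assumed deterministic seeding: the row least similar to the overall
    # centroid, then the row least similar to that first seed.
    center = centroid(rows)
    first = min(rows, key=lambda r: cosine(r, center))
    second = min(rows, key=lambda r: cosine(r, first))
    centers = [first, second]
    groups = ([], [])
    for _ in range(iterations):
        new_groups = ([], [])
        for i, row in enumerate(rows):
            side = 0 if cosine(row, centers[0]) >= cosine(row, centers[1]) else 1
            new_groups[side].append(i)
        converged = new_groups == groups
        groups = new_groups
        if converged or not groups[0] or not groups[1]:
            break
        centers = [centroid([rows[i] for i in g]) for g in groups]
    return groups

def bisecting_kmeans(matrix, k):
    """Repeatedly split the largest cluster until k clusters exist."""
    rows = [normalize(r) for r in matrix]
    clusters = [list(range(len(rows)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)
            break
        left, right = two_means([rows[i] for i in biggest])
        if not left or not right:  # the cluster could not be split
            clusters.append(biggest)
            break
        clusters.append([biggest[i] for i in left])
        clusters.append([biggest[i] for i in right])
    return sorted(clusters)

# Hypothetical toy data: 6 context units x 4 lexical units (frequencies).
toy_matrix = [
    [3, 2, 0, 0], [2, 3, 0, 1], [4, 1, 1, 0],
    [0, 0, 3, 2], [1, 0, 2, 3], [0, 1, 4, 4],
]
clusters = bisecting_kmeans(toy_matrix, 2)
print(clusters)
```

Each inner list holds the row numbers (here, context units) assigned to one cluster, so every row belongs to exactly one cluster.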

Results:
- after a number of checks, a solution with 4 clusters was chosen (NB: the experimental version of this T-LAB function allows us to easily explore subdivisions from 3 to 10 clusters). The following tables summarize the clusters' characteristics: the first 35 typical words of each cluster, selected by chi-square test.

As we can observe, the clusters allow us to identify four different "themes", i.e. four subsets of words co-occurring within the same context units.
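
The selection of typical words by chi-square test can be sketched as follows. The source does not say how T-LAB computes the statistic, so the 2 x 2 table used here (occurrences of a word inside versus outside a cluster, against all other words) is an assumption, and the counts and cluster names are made up for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0

def typical_words(counts, cluster, top=35):
    """Words over-represented in a cluster, ranked by chi-square.

    counts maps each word to a dict {cluster name: occurrences}.
    """
    cluster_total = sum(freqs.get(cluster, 0) for freqs in counts.values())
    grand_total = sum(sum(freqs.values()) for freqs in counts.values())
    scored = []
    for word, freqs in counts.items():
        a = freqs.get(cluster, 0)                # word, inside the cluster
        b = sum(freqs.values()) - a              # word, outside the cluster
        c = cluster_total - a                    # other words, inside
        d = grand_total - cluster_total - b      # other words, outside
        if a * d > b * c:                        # keep over-represented words only
            scored.append((chi_square_2x2(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]

# Hypothetical counts for two clusters, "A" and "B".
toy_counts = {
    "grail": {"A": 30, "B": 5},
    "police": {"A": 4, "B": 40},
    "say": {"A": 50, "B": 50},
}
print(typical_words(toy_counts, "A"))
print(typical_words(toy_counts, "B"))
```

Because only over-represented words are kept, a word such as "say" can appear in the list of one cluster with a low score, which is consistent with the same word being able to appear in more than one cluster.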
NB: Whereas the same "word" can appear in more than one cluster, each chapter (i.e. each row of the matrix analysed) belongs to one cluster only. The distribution of chapters across clusters is as follows:

The same T-LAB function allows us to analyse a table of words x clusters (in this case 1,052 rows x 4 columns) and to represent it by means of Correspondence Analysis. The following is one of the charts produced.
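
As a rough illustration of what Correspondence Analysis starts from, the sketch below computes the standardized residuals of a words x clusters table and their sum of squares (the total inertia, equal to the chi-square statistic divided by the grand total). The table is made up, the function names are assumptions, and the decomposition of the residuals into factorial axes, which produces the actual chart coordinates, is left out.

```python
import math

def standardized_residuals(table):
    """Residuals (p_ij - r_i * c_j) / sqrt(r_i * c_j) of a contingency table.

    Assumes no row or column of the table sums to zero.
    """
    n = sum(sum(row) for row in table)
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]
    return [
        [(cell / n - r * c) / math.sqrt(r * c) for cell, c in zip(row, col_mass)]
        for row, r in zip(table, row_mass)
    ]

def total_inertia(table):
    """Sum of squared standardized residuals."""
    return sum(s * s for row in standardized_residuals(table) for s in row)

# Hypothetical table: 3 words (rows) x 2 clusters (columns).
toy_table = [
    [30, 5],
    [4, 40],
    [50, 50],
]
residuals = standardized_residuals(toy_table)
inertia = total_inertia(toy_table)
print(residuals)
print(inertia)
```

A positive residual marks a word that occurs in a cluster more often than independence would predict; a table whose rows are proportional to each other has zero inertia and nothing to display.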

NB:

- The same procedure was applied to the Italian version of Dan Brown's novel. For the most part, the results match (see);
- The T-LAB function that we are testing will allow clustering of two kinds of context units: primary documents defined by the user (e.g. newspaper articles, web pages, responses to open-ended questions, etc.) and elementary contexts corresponding to sentences. In the first case, the rows will contain frequency values; in the second, presence/absence values (1/0).
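
The difference between the two kinds of rows can be shown with a short sketch. The tokenized contexts, the vocabulary and the function name are hypothetical; the code only illustrates frequency versus presence/absence coding, not T-LAB's own data structures.

```python
from collections import Counter

def build_matrix(contexts, vocabulary, binary=False):
    """Build a context units x lexical units matrix.

    contexts: list of token lists, one per context unit.
    binary=False gives frequency values; binary=True gives 1/0 values.
    """
    matrix = []
    for tokens in contexts:
        counts = Counter(tokens)
        row = [counts[word] for word in vocabulary]
        if binary:
            row = [1 if value else 0 for value in row]
        matrix.append(row)
    return matrix

# Hypothetical, already lemmatized context units.
toy_contexts = [
    ["sophie", "langdon", "sophie"],
    ["grail", "langdon"],
]
toy_vocabulary = ["grail", "langdon", "sophie"]

frequency_rows = build_matrix(toy_contexts, toy_vocabulary)
presence_rows = build_matrix(toy_contexts, toy_vocabulary, binary=True)
print(frequency_rows)
print(presence_rows)
```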

About The Da Vinci Code: a further T-LAB tool (Word Associations) allows us to have fun in a different way (see below).

To download the demo click here.
