www.tlab.it
n-gram
In T-LAB an n-gram is a sequence of two (bi-gram) or more contiguous
key words present within the same elementary context (i.e. sentence,
text fragment or paragraph).
When used for computing word co-occurrences, n-gram segmentation overlooks
both stop-words and punctuation marks.
Let's consider the following example:
The Citizens of each State shall be entitled to all Privileges and Immunities of Citizens in the several States.
Assuming that the seven highlighted items (Citizens, State, entitled,
Privileges, Immunities, Citizens, States) are included in our
key-term list and that automatic lemmatization has been applied,
a bi-gram segmentation produces the following co-occurrence
contexts:
citizen & state
state & entitle
entitle & privilege
privilege & immunity
immunity & citizen
citizen & state.
By contrast, a three-gram segmentation produces the following
co-occurrence contexts:
citizen & state & entitle
state & entitle & privilege
entitle & privilege & immunity
privilege & immunity & citizen
immunity & citizen & state
citizen & state.
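The segmentation illustrated above can be sketched in a few lines of Python. This is only a minimal illustration, not T-LAB's implementation: the lemma map, the key-term list and the function names are hypothetical and cover the example sentence only. The sketch emits full windows only, so it does not reproduce the shorter trailing context that closes the three-gram list above.

```python
import re

# Hypothetical lemma map and key-term list, valid for the example sentence only.
LEMMAS = {
    "citizens": "citizen",
    "states": "state",
    "entitled": "entitle",
    "privileges": "privilege",
    "immunities": "immunity",
}
KEY_TERMS = {"citizen", "state", "entitle", "privilege", "immunity"}


def key_word_sequence(text):
    """Drop punctuation, lemmatize, and keep only key terms in text order."""
    words = re.findall(r"[a-z]+", text.lower())
    lemmas = [LEMMAS.get(word, word) for word in words]
    # Stop-words and any other non-key words are skipped here.
    return [lemma for lemma in lemmas if lemma in KEY_TERMS]


def ngrams(sequence, n):
    """Return every window of n contiguous items of the sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


sentence = (
    "The Citizens of each State shall be entitled to all Privileges "
    "and Immunities of Citizens in the several States."
)
keys = key_word_sequence(sentence)
bigrams = ngrams(keys, 2)
trigrams = ngrams(keys, 3)

for context in bigrams:
    print(" & ".join(context))
```

Changing the second argument of `ngrams` from 2 to 3 yields the three-gram windows.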
It is worth recalling that, when texts are segmented into elementary
contexts, co-occurrences depend on the presence (or absence) of key
words within each context, whereas with an n-gram segmentation
co-occurrences indicate a sequential relationship between words.
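The difference can be made concrete with the lemmatized key words of the example sentence. The sketch below assumes one simple reading of each approach, which may differ from how T-LAB actually counts: presence-based co-occurrence pairs every distinct key word found in the same context, while bi-gram co-occurrence pairs only adjacent key words.

```python
from itertools import combinations

# Lemmatized key words of the example sentence, in text order.
keys = ["citizen", "state", "entitle", "privilege", "immunity", "citizen", "state"]

# Presence-based: every unordered pair of distinct key words in the context.
presence_pairs = {tuple(sorted(pair)) for pair in combinations(set(keys), 2)}

# Sequence-based (bi-gram): only adjacent key words, order preserved.
sequence_pairs = list(zip(keys, keys[1:]))

print(len(presence_pairs), "presence-based pairs")
print(len(sequence_pairs), "bi-gram pairs")
```

Under these assumptions, a pair such as citizen & privilege is counted by the presence-based approach but never appears as a bi-gram, because the two words are not adjacent in the key-word sequence.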
In T-LAB an n-gram based co-occurrence analysis can be performed with
the advanced options of the Word Association tool; moreover, a
Markovian analysis of bi-grams can be performed with the Sequence
Analysis tool.
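As a rough idea of what a Markovian analysis of bi-grams involves, the sketch below estimates first-order transition probabilities from bi-gram counts. It is a generic textbook estimate (count of a bi-gram divided by the count of its first word as a predecessor), not a description of the Sequence Analysis tool; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# Lemmatized key words of the example sentence, in text order.
keys = ["citizen", "state", "entitle", "privilege", "immunity", "citizen", "state"]

# Count each bi-gram (predecessor, successor).
bigram_counts = Counter(zip(keys, keys[1:]))

# Count how often each word occurs as a predecessor.
totals = Counter()
for (first, _second), count in bigram_counts.items():
    totals[first] += count

# Transition probability: P(second | first) = count(first, second) / count(first).
transition = defaultdict(dict)
for (first, second), count in bigram_counts.items():
    transition[first][second] = count / totals[first]

for first, successors in transition.items():
    for second, probability in successors.items():
        print(f"{first} -> {second}: {probability:.2f}")
```

With a single short sentence every predecessor has only one successor, so the estimates are degenerate; a real corpus is needed for the probabilities to be informative.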