Corpus normalization has two goals:
a) allowing words to be correctly detected as raw forms;
b) resolving some cases of ambiguity.
To this end, T-LAB first performs a number of operations on the
file under analysis: removal of excess blank spaces, apostrophe
marking, addition of a space after punctuation marks, reduction
of capital letters, etc.
Secondly, T-LAB marks a set of strings recognized
as proper nouns; it then converts the
sequences of raw forms recognized as multiwords into unitary strings, so that
they can be used in that form during the analysis ("in terms of" and
"point of view" become "in_terms_of" and "point_of_view" respectively).
The parameters of these operations cannot be modified by the user.
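
The two steps described above can be illustrated with a minimal Python sketch. It is not T-LAB's actual routine: the function name normalize, the MULTIWORDS list and the choice to lowercase everything are assumptions made for the example, and apostrophe marking and proper noun marking are left out.

```python
import re

# Hypothetical multiword list; T-LAB's own dictionary is not reproduced here.
MULTIWORDS = ["in terms of", "point of view"]


def normalize(text, multiwords=MULTIWORDS):
    """Illustrative normalization: spacing, case, multiword joining."""
    # Add a space after a punctuation mark that is directly followed by a letter.
    text = re.sub(r"([,;:.!?])(?=[^\W\d_])", r"\1 ", text)
    # Collapse runs of whitespace into a single blank space.
    text = re.sub(r"\s+", " ", text).strip()
    # Reduce capital letters (assumption: plain lowercasing).
    text = text.lower()
    # Join each multiword into a unitary string, longest entries first.
    for mw in sorted(multiwords, key=len, reverse=True):
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in mw.split()) + r"\b"
        text = re.sub(pattern, mw.replace(" ", "_"), text)
    return text


print(normalize("From my   Point of View,this works in terms of speed."))
```

Processing the longest multiwords first is a design choice of the sketch: it keeps a shorter entry from being matched inside a longer one.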
To recognize raw forms correctly, the normalization routine
in T-LAB uses the following marks:
, ; : . ! ? ' " ( ) < > + / = [ ]
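
As a rough illustration of how such marks can delimit raw forms, the Python sketch below splits a text on the marks listed above and on blank spaces. Treating the marks as plain separators, as well as the names SEPARATORS and split_raw_forms, are assumptions of the example; the source does not describe how T-LAB applies them internally.

```python
# The marks listed above, taken as separators (an assumption of this sketch).
SEPARATORS = set(",;:.!?'\"()<>+/=[]")


def split_raw_forms(text):
    """Split text into raw forms on whitespace and on the listed marks."""
    forms, current = [], []
    for ch in text:
        if ch.isspace() or ch in SEPARATORS:
            # A separator closes the form being built, if there is one.
            if current:
                forms.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the last form when the text does not end with a separator.
    if current:
        forms.append("".join(current))
    return forms


print(split_raw_forms('in_terms_of cost (a/b) = "ok"!'))
```

Since the underscore is not among the listed marks, a multiword such as "in_terms_of" stays a single raw form.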