EasyNLP

Words and their Components

This section explores how linguists analyze words, focusing on their internal structure and the components that form them.

Nikhila Reddy

01 Tokens

Tokens are the basic units of text, often separated by whitespace or punctuation in languages like English. The process of splitting text into tokens is called tokenization.

Example: "I love reading the newspaper" -> the tokens are I, love, reading, the, newspaper.

There are three types of tokens:
1. Word tokens: The result of dividing text into individual words.
2. Sub-word tokens: The result of dividing a word into smaller units. For example, splitting "newspaper" into "news" and "paper".
3. Character tokens: The result of dividing text character by character. For example, breaking the word "reading" into individual characters -> r, e, a, d, i, n, g.
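
The three token types above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names are made up for this example, the word tokenizer is a simple regular expression suited to English-like text, and the sub-word vocabulary is a tiny hypothetical set chosen to reproduce the "newspaper" example.

```python
import re

def word_tokens(text):
    # Runs of word characters become word tokens; any other
    # non-space character (e.g. punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokens(word, vocab):
    # Greedy longest-match-first segmentation against a small,
    # hypothetical vocabulary of known sub-word units.
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches here: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces

def char_tokens(word):
    # Character tokens are simply the characters of the word.
    return list(word)

sentence = "I love reading the newspaper"
print(word_tokens(sentence))
print(subword_tokens("newspaper", {"news", "paper"}))
print(char_tokens("reading"))
```

Real sub-word tokenizers learn their vocabulary from data rather than using a hand-written set; the greedy matching here only illustrates the idea of splitting one word into smaller known units.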

02 Lexemes

A lexeme is the abstract vocabulary unit that underlies a set of related word forms. Its base or canonical (dictionary) form is called the lemma.

Example: The base form of "running" and "ran" is "run".
Here "running" and "ran" are inflected forms (inflections), and "run" is the base or canonical form.

The process of analyzing these inflections and converting them into the base or canonical form is called lemmatization.
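
As a rough sketch of what lemmatization does, the snippet below maps inflected forms to a base form with a lookup table. The table and the `lemmatize` function are hypothetical and cover only the "run" example from this section; practical lemmatizers rely on large dictionaries and morphological rules rather than a hand-written mapping.

```python
# Hypothetical lookup table: inflected form -> base (canonical) form.
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
}

def lemmatize(word, table=LEMMA_TABLE):
    # Normalize case, then look the word up; words that are not
    # in the table are returned unchanged (in lowercase).
    key = word.lower()
    return table.get(key, key)

for form in ["running", "ran", "run"]:
    print(form, "->", lemmatize(form))
```

Note that an irregular form such as "ran" cannot be reduced to "run" by stripping a suffix, which is why a dictionary lookup of some kind is usually part of lemmatization.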

03 Morphemes

A morpheme is the smallest meaningful unit of a language. By combining morphemes, one can construct meaningful words.

There are two types of morphemes:
1. Free morphemes: These are independent units that carry meaning on their own and cannot be divided further. For example, the word "agree" is a free morpheme.
2. Bound morphemes: These units depend on other morphemes to form a complete word. In the word "disagreement", for example, "dis-" and "-ment" are bound morphemes that attach to the root "agree".

Breaking a word down into its morphemes in this way is known as morphological analysis.
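
The "disagreement" example can be reproduced with a naive affix-stripping sketch. The prefix and suffix lists and the `segment` function below are assumptions made for illustration; they are not a complete description of English morphology.

```python
# Small, hypothetical lists of bound morphemes.
PREFIXES = ("dis", "un", "re")
SUFFIXES = ("ment", "ness", "ing")

def segment(word):
    # Strip at most one known prefix and one known suffix,
    # keeping a stem of at least three characters.
    prefixes, suffixes = [], []
    stem = word
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) > len(p) + 2:
            prefixes.append(p + "-")
            stem = stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) > len(s) + 2:
            suffixes.append("-" + s)
            stem = stem[:-len(s)]
            break
    return prefixes + [stem] + suffixes

print(segment("disagreement"))
print(segment("agree"))
```

Pure string matching like this is easy to fool: it would wrongly split "reading" into "re-" and "ading", because it has no lexicon to confirm that the remaining stem is a real morpheme. Morphological analyzers typically combine affix rules with a dictionary of roots for that reason.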

04 Typology

Typology is defined as the classification and categorization of languages based on their structure or grammatical features.

Languages are often grouped by how they combine morphemes to form words:
1. Isolating (or analytic) languages: These languages have very few words formed by more than one morpheme.
Examples include Chinese, Vietnamese, and Thai.
2. Synthetic languages: These are the opposite of isolating languages, characterized by having many words formed by more than one morpheme. Agglutinative and fusional languages are both synthetic.
3. Agglutinative languages: These languages build words by chaining morphemes together, with each morpheme typically carrying a single grammatical function, so individual words can be long and complex.
Examples include Korean, Japanese, Finnish, and Tamil.
4. Fusional languages: These have a high feature-per-morpheme ratio, meaning a single morpheme can encode several grammatical features at once.
Examples include Arabic and Sanskrit (as a similar case in English, the "-s" in "runs" marks third person, singular, and present tense all at once).
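
The isolating versus synthetic distinction can be made roughly quantitative by counting morphemes per word. The sketch below assumes the words have already been segmented by hand, with hyphens marking morpheme boundaries; the `morphemes_per_word` function and the sample inputs are hypothetical and reuse the English examples from earlier sections.

```python
def morphemes_per_word(segmented_words):
    # Each word is a string whose morphemes are separated by hyphens,
    # e.g. "dis-agree-ment" has three morphemes.
    counts = [len(word.split("-")) for word in segmented_words]
    return sum(counts) / len(counts)

# Hand-segmented version of "I love reading the newspaper".
sentence = ["I", "love", "read-ing", "the", "news-paper"]
print(morphemes_per_word(sentence))

# A single heavily affixed word.
print(morphemes_per_word(["dis-agree-ment"]))
```

A ratio close to 1 is what one would expect from text in an isolating language, while higher ratios point toward a synthetic language. The measure says nothing about whether a language is agglutinative or fusional, since that depends on how many features each morpheme carries rather than on how many morphemes each word has.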
