A pipeline for tokenizing words, built up in four steps of increasing specificity:

1) tr -sc 'A-Za-z' '\n' < sh.txt
2) tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
3) tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
4) tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r

What each step adds:

1) Replaces every run of non-alphabetic characters with a single newline, so each word lands on its own line.
2) Sorts the words alphabetically and collapses duplicates, prefixing each distinct word with its count.
3) Maps uppercase letters to lowercase before sorting, so that "The" and "the" are counted together.
4) Sorts the counted words numerically by frequency, most frequent first.
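
The four steps above can be tried end to end on a small input. The following is a minimal sketch, not a definitive implementation: the sample sentence and the helper name `word_freq` are assumptions made for illustration, and a temporary file stands in for sh.txt.

```shell
#!/bin/sh
# Hypothetical sample text standing in for sh.txt (not from the source).
sample=$(mktemp)
printf 'The cat saw the dog. The dog ran!\n' > "$sample"

# word_freq is an illustrative helper name, not a standard command.
# It runs the full four-stage pipeline on the file passed as $1.
word_freq() {
    tr -sc 'A-Za-z' '\n' < "$1" |   # non-letters become newlines: one word per line
        tr 'A-Z' 'a-z' |            # fold uppercase to lowercase
        sort |                      # bring identical words together
        uniq -c |                   # collapse duplicates, prefix each with its count
        sort -n -r                  # order by count, most frequent first
}

word_freq "$sample"
rm -f "$sample"
```

On this sample the most frequent word is "the", with a count of 3, followed by "dog" with a count of 2. The amount of whitespace that uniq -c puts before each count differs between implementations, so later processing should not depend on the exact column width.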


University of California, Santa Cruz

Unix commands such as tr, sort, and uniq can be used for simple normalization, tokenization, and frequency computation.

Unix Tools for Crude Tokenization and Normalization


A helpful NLP textbook that is still in progress:
https://web.stanford.edu/~jurafsky/slp3/
