| prepare_and_tokenize | Split Text on Spaces | 
| prepare_text | Prepare Text for Tokenization | 
| remove_control_characters | Remove Non-Character Characters | 
| remove_diacritics | Remove Diacritical Marks on Characters | 
| remove_replacement_characters | Remove the Unicode Replacement Character | 
| space_cjk | Add Spaces Around CJK Ideographs | 
| space_punctuation | Add Spaces Around Punctuation | 
| squish_whitespace | Remove Extra Whitespace | 
| tokenize_space | Break Text at Spaces | 
| validate_utf8 | Clean Up Text to UTF-8 |
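
To show how several of the listed steps might fit together, here is a minimal Python sketch using only the standard library. The function names mirror entries in the table, but the bodies are assumptions written for illustration, not the package's actual implementation; details such as which characters count as punctuation, or the order of the steps, may differ in the real functions.

```python
import unicodedata


def remove_replacement_characters(text: str) -> str:
    # Drop U+FFFD, the Unicode replacement character.
    return text.replace("\ufffd", "")


def remove_diacritics(text: str) -> str:
    # Decompose, drop combining marks (category Mn), then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def space_punctuation(text: str) -> str:
    # Assumption: "punctuation" means any Unicode category starting with P.
    return "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def squish_whitespace(text: str) -> str:
    # Collapse whitespace runs to single spaces and trim both ends.
    return " ".join(text.split())


def tokenize_space(text: str) -> list[str]:
    # Break the text at single spaces.
    return text.split(" ")


def prepare_and_tokenize(text: str) -> list[str]:
    # Hypothetical ordering of the steps; the real pipeline may differ.
    text = remove_replacement_characters(text)
    text = remove_diacritics(text)
    text = space_punctuation(text)
    text = squish_whitespace(text)
    return tokenize_space(text)


tokens = prepare_and_tokenize("Caf\u00e9\ufffd,  d\u00e9j\u00e0 vu!")
print(tokens)
```

Squishing whitespace after spacing punctuation keeps the final split from producing empty tokens, which is why the sketch orders the steps this way.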