Here’s a great article on three techniques for pre-processing raw text input for use in text classification/natural language processing applications.

Modern neural networks cannot interpret labeled text as described above and data must be pre-processed before it can be given to a network for training. One straightforward way to do this is with a bag of words. A bag of words is created by scanning through every element in a data set and creating a dictionary for each unique word seen that can act as an index.