FBI Says ‘Mass Casualty Assault Thwarted’ With Arrest of 21-Year-Old in Corpus Christi
Other tasks, such as computing word frequencies, can be handled easily with the NLTK library. To identify the language, you can use a language identifier such as this one (based on Google's language-detection) or this one (based on guesslanguage.cpp by Jacob R Rideout). The tool does not need to do anything linguistic: raw HTML is usable and plain Unicode text is better, but if it can also do things like word frequency, normalization, and lemmatization, that would be a great bonus. [...]
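As a rough illustration of the word-frequency and normalization steps mentioned above, here is a minimal sketch that uses only the Python standard library. The function name `word_frequencies`, the regex tokenizer, and the choice of NFKC normalization with case folding are assumptions made for this example, not something the text prescribes. It does not cover language identification or lemmatization.

```python
import re
import unicodedata
from collections import Counter


def word_frequencies(text):
    """Count word occurrences in plain Unicode text (hypothetical helper)."""
    # Normalize Unicode forms and fold case so "CAT" and "cat" count together.
    normalized = unicodedata.normalize("NFKC", text).casefold()
    # Naive tokenizer (an assumption): runs of word characters are tokens.
    tokens = re.findall(r"\w+", normalized)
    return Counter(tokens)


if __name__ == "__main__":
    freqs = word_frequencies("The cat sat. The CAT ran!")
    # Print the two most frequent words with their counts.
    print(freqs.most_common(2))
```

NLTK's `FreqDist` is a subclass of `collections.Counter`, so the same `most_common` call should carry over if you swap in an NLTK tokenizer later.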