Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:

The data available here are in gz bundles. Each subdirectory in the bundle represents a newsgroup; each file in a subdirectory is the text of a document that was posted to that newsgroup.

The first ("19997") is the original, unmodified version. The second ("bydate") is sorted by date into training (60%) and test (40%) sets, does not include cross-posts (duplicates), and does not include newsgroup-identifying headers (Xref, Newsgroups, Path, Followup-To, Date).

Originally, the main feature of Google Groups was reportedly the ability to search the Usenet archive, in which scientific articles had been stored since 1981.
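The directory layout described above (one subdirectory per newsgroup, one file per document) can be read with a few lines of Python. This is only a minimal sketch: the function name `load_newsgroups` is hypothetical, it assumes the bundle has already been unpacked to a local directory, and the `latin-1` encoding is an assumption chosen because it decodes any byte sequence without raising.

```python
from pathlib import Path


def load_newsgroups(root):
    """Return (texts, labels) from a directory with one subdirectory per newsgroup.

    `root` is assumed to be the top-level directory of an unpacked bundle.
    """
    texts, labels = [], []
    # Sort for a deterministic order across platforms.
    for group_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for doc_path in sorted(p for p in group_dir.iterdir() if p.is_file()):
            # Assumption: decode as latin-1, since posts are not guaranteed to be UTF-8.
            texts.append(doc_path.read_text(encoding="latin-1"))
            # The subdirectory name is the newsgroup label.
            labels.append(group_dir.name)
    return texts, labels
```

Calling `load_newsgroups("path/to/unpacked/bundle")` (a placeholder path) would return two parallel lists, so `labels[i]` is the newsgroup of `texts[i]`.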

Some of the newsgroups are very closely related to each other (e.g. hardware / mac.hardware), while others are highly unrelated (e.g. / soc.religion.christian).

So the Matlab version (below) represents 18824 documents. I used the following two scripts to produce the data files:

[Added 1/14/08] The following file contains the vocabulary for the indexed data.