I also created lists of the most frequent word forms in the Internet, LDC, Wikipedia and CCA corpora, as well as in the legal corpus.
Following lemmatisation by Majdi Sawalha, there is also a frequency list of lemmas and roots in the Arabic Internet corpus.
The corpora are:
- The Internet corpus: compiled using the procedure described in my paper in the WaCky book.
- The Al Hayat corpus: from Al Hayat data (1999-2001) compiled by the LDC.
- The Wikipedia corpus: from the public Wiki data retrieved on July 28, 2008.
- The CCA corpus: from Latifa Al-Sulaiti.
- The Arabic Legal Corpus: from keywords collected by Hanem El-Farahaty, a Leeds PhD student.
- The Computer Science corpus of Arabic: from keywords collected by Latifa Al-Sulaiti.
The interface was developed by Serge Sharoff; contact me at s.sharoffleeds.ac.uk if you have further queries.