We created a test set of communications by asking workers on Amazon Mechanical Turk to respond to 10 hypothetical communication situations. Workers created one sentence in the form of a statement and one sentence in the form of a question. We manually reviewed the data, dropping unusable responses and correcting obvious spelling or grammar errors.
The zip below contains various forms of the test set for use in evaluating predictive text entry interfaces designed to produce conversational-style text. It also contains the list of unique words used by workers as well as a unigram language model trained on the data. These resources may be particularly useful for researchers in augmentative and alternative communication (AAC).
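For readers who want to derive a similar word list or unigram model from their own sentences, the following is a minimal sketch only. It is not the script used to build the files in the zip. The tokenization (lowercasing, splitting on whitespace, stripping surrounding punctuation), the log10 probability output, the function names, and the sample sentences are all assumptions made for illustration.

```python
import math
import string
from collections import Counter


def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed scheme)."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w]


def unique_words(sentences):
    """Return the sorted list of distinct words across all sentences."""
    return sorted({w for s in sentences for w in tokenize(s)})


def train_unigram(sentences):
    """Return a dict mapping each word to its log10 maximum-likelihood probability."""
    counts = Counter()
    for s in sentences:
        counts.update(tokenize(s))
    total = sum(counts.values())
    return {w: math.log10(c / total) for w, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the COMM2 test set.
    sample = ["Are you OK?", "You are late."]
    print(unique_words(sample))
    for word, logprob in sorted(train_unigram(sample).items()):
        print(f"{logprob:.4f}\t{word}")
```

An unsmoothed model like this assigns no probability to unseen words, so a practical evaluation would normally add smoothing or an unknown-word token.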
Further details about the collection methodology and analysis can be found in this ASSETS '13 poster:
@inproceedings{vertanen_comm2,
  author = {Keith Vertanen},
  title = {A Collection of Conversational AAC-like Communications},
  booktitle = {ASSETS '13: Proceedings of the ACM SIGACCESS Conference on Computers and Accessibility},
  year = {2013},
}
Our procedure follows the one described in this paper:
@article{venkatagiri_efficient,
  author = "Horabail Venkatagiri",
  title = "Efficient keyboard layouts for sequential access in augmentative and alternative communication",
  journal = "Augmentative and Alternative Communication",
  volume = {15},
  number = {2},
  year = {1999},
  pages = {126--134},
}
The COMM2 collection is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.
You can try this demo page to see the process we used to collect data from the Amazon Mechanical Turk workers. This demo page does not save any of your input.
Corpus:
237K | Zip containing the test set and other resources
3K | Description of the files in the test set (contained in the zip file)
73K | COMM2 test set, mixed case (contained in the zip file; for other variants see the zip)