
Shared by Carlos Siordia (


CDC’s first public crowdsourcing competition for natural language processing kicks off today!

Would you please help us promote the CDC Text Classification Marathon?

Thank you in advance for your help!

Carlos Siordia

CDC Text Classification Marathon


Work-related injury records are generated every day. To reduce the human effort spent coding these records, the Centers for Disease Control and Prevention (CDC) National Institute for Occupational Safety and Health (NIOSH), in close partnership with the Laboratory for Innovation Science at Harvard (LISH), wants to improve its NLP/ML model, which automatically reads injury records and classifies them according to the Occupational Injury and Illness Classification System (OIICS).


The task is a well-defined classification problem. Solutions may be written only in Python or R.


The training file is a CSV file with a header row and four columns (text, sex, age, and event).

  1. text. The raw text of the injury description.

  2. sex. A categorical variable giving the sex of the person the record describes.

  3. age. A positive integer giving the age of the person the record describes.

  4. event. The target variable: the OIICS label to be predicted. There are 48 unique labels in total.
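
As a rough illustration of the file layout above, the following Python sketch reads a training file and checks it against the description. The function name `load_training_rows`, the assumption that the header names are lowercase, and every value in the sample data are invented for illustration; they are not taken from the competition materials.

```python
import csv
import io

# Assumption: the header uses exactly these lowercase names, in this order.
EXPECTED_COLUMNS = ["text", "sex", "age", "event"]


def load_training_rows(handle):
    """Read training rows from an open CSV handle and check the schema."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        # age is described as a positive integer, so convert and check it
        row["age"] = int(row["age"])
        if row["age"] <= 0:
            raise ValueError(f"non-positive age on line {reader.line_num}")
        rows.append(row)
    return rows


# Invented placeholder data, only to show the expected shape of the file.
sample = io.StringIO(
    "text,sex,age,event\n"
    "WORKER FELL FROM LADDER,M,45,label_a\n"
    "STRUCK HAND WITH HAMMER,F,31,label_b\n"
)
rows = load_training_rows(sample)
labels = {row["event"] for row in rows}
print(len(rows), "rows,", len(labels), "distinct labels")
```

On the real training file, the number of distinct labels found this way should match the 48 stated above.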


Your task is to build a model from this training data and use it to make predictions for the test file described below.
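
To make the modelling step concrete, here is a minimal baseline sketch, not the organizers' model or a recommended approach. It is a plain multinomial naive Bayes classifier over whitespace tokens with add-one smoothing, written with the standard library only. It uses the text column alone and ignores sex and age. The class name `NaiveBayesBaseline`, the example sentences, and the labels "fall" and "struck" are all hypothetical stand-ins, not OIICS codes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace."""
    return text.lower().split()


class NaiveBayesBaseline:
    """Multinomial naive Bayes over tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_tokens = {
            label: sum(counts.values())
            for label, counts in self.token_counts.items()
        }
        return self

    def predict_one(self, text):
        n_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training carry no information, so skip them.
        tokens = [tok for tok in tokenize(text) if tok in self.vocab]
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / n_docs)  # log prior
            denom = self.total_tokens[label] + vocab_size
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; the labels are placeholders, not OIICS codes.
train_texts = [
    "fell from ladder",
    "slipped and fell on wet floor",
    "struck by falling box",
    "hit by forklift",
]
train_labels = ["fall", "fall", "struck", "struck"]

model = NaiveBayesBaseline().fit(train_texts, train_labels)
print(model.predict_one("fell on floor"))
print(model.predict_one("hit by box"))
```

A competitive entry would likely need a stronger text representation, but a baseline like this gives a reference score to compare against.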


The test file is a CSV file with a header row and only three columns (text, sex, and age). Its format matches the training file, except that the event column is missing. Once your model is trained, it should be able to consume the test file and produce the prediction file by filling in the event column. Specifically, your output must be a CSV file with all four columns, in the same order as the test file.
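
The output step can be sketched as follows, assuming lowercase header names and a hypothetical `predict` callable that maps one test row to a label. The function name `write_predictions` and the sample rows are invented for illustration. Rows are written in the order they are read, so the prediction file keeps the ordering of the test file.

```python
import csv
import io

# Assumption: header names are lowercase and appear in this order.
TEST_COLUMNS = ["text", "sex", "age"]
OUTPUT_COLUMNS = TEST_COLUMNS + ["event"]


def write_predictions(test_handle, out_handle, predict):
    """Copy each test row to the output, adding a predicted event label."""
    reader = csv.DictReader(test_handle)
    if reader.fieldnames != TEST_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    writer = csv.DictWriter(out_handle, fieldnames=OUTPUT_COLUMNS)
    writer.writeheader()
    n_rows = 0
    for row in reader:
        row["event"] = predict(row)  # predict is a hypothetical callable
        writer.writerow(row)
        n_rows += 1
    return n_rows


# Invented placeholder rows, and a dummy predictor that always answers
# with the same made-up label.
test_csv = io.StringIO(
    "text,sex,age\n"
    "FELL FROM LADDER,M,45\n"
    "STRUCK BY BOX,F,31\n"
)
out = io.StringIO()
n_written = write_predictions(test_csv, out, lambda row: "label_a")
output_lines = out.getvalue().splitlines()
print(output_lines[0])
```

When writing to a real file rather than an in-memory buffer, open it with `newline=""` as the `csv` module documentation recommends, so that line endings are not doubled on some platforms.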

