MIT has removed a massive dataset after finding it contained racist and misogynistic terms, as well as offensive images.
Artificial intelligence (AI) and machine learning (ML) systems learn from datasets that serve as training data. MIT created the Tiny Images dataset, which contained some 80 million images.
In an open letter, MIT professors Bill Freeman and Antonio Torralba, along with NYU professor Rob Fergus, outlined the issues they had become aware of and the steps they took to address them.
“It has been brought to our attention that the Tiny Images dataset contains some derogatory terms as categories and offensive images,” write the professors. “This was a consequence of the automated data collection procedure that relied on nouns from WordNet. We are greatly concerned by this and apologize to those who may have been affected.
“The dataset is too large (80 million images) and the images are so small (32 x 32 pixels) that it can be difficult for people to visually recognize its content. Therefore, manual inspection, even if feasible, will not guarantee that offensive images can be completely removed.
“We therefore have decided to formally withdraw the dataset. It has been taken offline and it will not be put back online. We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded.”
Offensive and biased content has been an ongoing issue with AI and ML training data, with some experts warning that it is far too easy for these systems to inadvertently develop biases from the data they are trained on. By withdrawing the dataset, MIT appears to be doing its part to address that problem.