The ImageNet researchers attribute the inclusion of offensive and insensitive categories to the overall size of the task, which ultimately involved 50,000 workers who evaluated 160 million candidate images. They also point out that only a fraction of the “person” images were actually used in practice. That’s because references to ImageNet typically mean a smaller version of the dataset used in the ImageNet Challenge, a competition among research teams to build AI that detects and classifies objects in the images. Out of the 20,000 or so classes of objects, the competition was limited to 1,000, representing just over a million images. Only three “person” categories—scuba diver, groom, and baseball players—were included. The best models trained using that limited version are typically the ones used in other research and real-world applications.
Paglen says the debiasing effort is a positive step, but he finds it revealing that the data apparently went unexamined for 10 years. “The people building these datasets seem to have had no idea what’s in them,” he says. (The ImageNet team says the debiasing project is part of an “ongoing” effort to make machine learning more fair.)
Wong, the Waterloo professor, who has studied biases within ImageNet, says the inattention was likely in part because, at the time the database was made, researchers were focused on the basics of getting their object detection algorithms to work. The enormous success of deep learning took the field by surprise. “We’re now getting to a point where AI is usable, and now people are looking at the social ramifications,” he says.
The ImageNet creators acknowledge that their initial attempts at quality control were ineffective. The full dataset persisted online until January, when the researchers removed all but the ImageNet Challenge images. The new release will include fewer than half of the original person images. It will also allow users to flag additional images and categories as offensive, an acknowledgement that “offensiveness is subjective and also constantly evolving,” the ImageNet team writes.
The removal of images has itself proved controversial. “I was surprised a large chunk of the data just disappeared in January without anybody saying anything,” Paglen says. “This is a historically important database.” He points out that the data is likely still in the wild, downloaded on various servers and home computers; removing the data from an accessible home only makes biases more difficult to reproduce and study, he says.
Even researchers were surprised to find out that the data was removed as part of a debiasing project. Chris Dulhanty, one of Wong’s graduate students, says he had reached out to the ImageNet team to request data earlier this year, but didn’t hear back. He assumed removal had to do with technical issues on the aging ImageNet site. (The ImageNet team did not respond to questions about the decision to remove the data, but said they would discuss with other researchers the possibility of making it available again.)
In a paper accompanying ImageNet Roulette, Paglen and Crawford liken the removal of images from ImageNet to similar moves by other institutions. In June, for example, Microsoft removed its “MS-Celeb” database after a Financial Times investigation.
The ImageNet debiasing effort is a good start, says Wong. But he hopes the team will make good on plans to look at bias beyond the person categories. About 15 percent of the “non-person” images do, in fact, contain people somewhere in the frame, he notes. That could lead to inadvertent associations—say, between black people and the label “basketball,” as one research team noted, or between objects related to computers and people who are young, white, and male. Those biases are more likely to be embedded in widely used models than any contained in the “person” labels.
Paglen says that attempts to debias may be futile. “There’s no such thing as a neutral way of organizing information,” he says. He and Crawford point to other, more recent datasets that have attempted a more nuanced approach to sensitive labels, such as an IBM effort to bring more “diversity” to face data by measuring facial dimensions. The authors hope it’s an improvement over human judgments, but note that it raises new questions. Is skin tone a better measure? The answers will reflect evolving social values. “Any system of classification is going to be of its moment in time,” he says. Paglen is opening an exhibition in London next week that intends to illustrate AI’s blind spots in that area. It begins with a Magritte painting of an apple, labeled “Ceci n’est pas une pomme.” Good luck convincing an AI algorithm of that.