Datasets and Benchmarks Column: Introduction

Editors: Bart Thomee, Martha Larson

Datasets are critical for research and development as, rather obviously, data is required for performing experiments, validating hypotheses, analyzing designs, and building applications. Over the years, many multimedia collections have been put together, ranging from one-off instances created exclusively to support the work presented in a single paper or demo to those created with multiple related or separate endeavors in mind. Unfortunately, the collected data is often not made publicly available. In some cases, a public release may not be possible due to the proprietary or sensitive nature of the data, but other forces are also at work. For example, one might be reluctant to share data freely because of the often substantial amount of time, effort, and money that was invested in collecting it.

Once a dataset has been made public, though, it becomes possible to validate results reported in the literature and to compare methods using the same source of truth, although matters are complicated when the source code of the methods is not published or the ground truth labels are not made available. Benchmarks offer a useful compromise by providing a particular task to solve, along with the data that one is allowed to use and the evaluation metrics that dictate what is considered success and failure. While benchmarks may not offer the cutting edge of research challenges for which utilizing the freshest data is an absolute requirement, they are a useful sanity check to ensure that methods that appear to work on paper also work in practice and are indeed as good as claimed.

Several efforts are underway to stimulate sharing of datasets and code, as well as to promote the reproducibility of experiments. These efforts provide encouragement to overcome the reluctance to share data by underlining the ways in which data becomes more valuable with community-wide use. They also offer insights on how researchers can put datasets that are publicly available to the best possible use. We provide here a couple of key examples of ongoing efforts. At the MMSys conference series, there is a special track for papers on datasets, and Qualinet maintains an index of known multimedia collections. The ACM Artifact Review and Badging policy proposal recommends that journals and conferences adopt a reviewing procedure in which submitted papers can be granted special badges to indicate to what extent the performed experiments are repeatable, replicable, and reproducible. For example, the “Artifacts Evaluated – Reusable” badge would indicate that artifacts associated with the research are found to be documented, consistent, complete, exercisable, and include appropriate evidence of verification and validation to the extent that reuse and repurposing is facilitated.

In future posts appearing in this column, we will highlight new public datasets and upcoming benchmarks through a series of invited guest posts, and we will provide insights and updates on the latest developments in this area. The column is edited by Bart Thomee and Martha Larson (see our bios at the end of this post).

To establish a baseline of popular multimedia datasets and benchmarks that have been used over the years by the research community, we refer to the table below, which shows the state of the art as of 2015, when the data was compiled by Bart for his paper on the YFCC100M dataset. The sizes of the datasets have steadily increased over the years, licenses have become less restrictive, and it is now the norm to also release additional metadata, precomputed features, and/or ground truth annotations together with the dataset. The last three entries in the table are benchmarks that include tasks such as video surveillance and object localization (TRECVID), diverse image search and music genre recognition (MediaEval), and life-logging event search and medical image analysis (ImageCLEF), to name just a few. The table is most certainly not exhaustive, although it does reflect the evolution of datasets over the last two decades. We will use this table to provide context for the datasets and benchmarks that we will cover in our upcoming columns, so stay tuned for our next post!


Bart Thomee is a Software Engineer at Google/YouTube in San Bruno, CA, USA, where he focuses on web-scale real-time streaming and batch techniques to fight abuse, spam, and fraud. He was previously a Senior Research Scientist at Yahoo Labs and Flickr, where his research centered on the visual and spatiotemporal dimensions of media, in order to better understand how people experience and explore the world, and how to better assist them with doing so. He led the development of the YFCC100M dataset released in 2014, and previously was part of the efforts leading to the creation of both MIRFLICKR datasets. He has furthermore been part of the organization of the ImageCLEF photo annotation tasks 2012–2013, the MediaEval placing tasks 2013–2016, and the ACM MM Yahoo-Flickr Grand Challenges 2015–2016. In addition, he has served on the program committees of, amongst others, ACM MM, ICMR, SIGIR, ICWSM and ECIR. He was part of the Steering Committee of the Multimedia COMMONS 2015 workshop at ACM MM and co-chaired the workshop in 2016; he also co-organized the TAIA workshop at SIGIR 2015.

Martha Larson is a professor in the area of multimedia information technology at Radboud University in Nijmegen, Netherlands. Previously, she researched and lectured in the area of audio-visual retrieval at Fraunhofer IAIS, Germany, and at the University of Amsterdam, Netherlands. Larson is co-founder of the MediaEval international benchmarking initiative for Multimedia Evaluation. She has contributed to the organization of various other challenges, including CLEF NewsREEL 2015–2017, ACM RecSys Challenge 2016, and TRECVid Video Hyperlinking 2016. She has served on the program committees of numerous conferences in the area of information retrieval, multimedia, recommender systems, and speech technology. Other forms of service have included Area Chair at ACM Multimedia 2013, 2014, and 2017, and TPC Chair at ACM ICMR 2017. Currently, she is an Associate Editor for IEEE Transactions on Multimedia. She is a founding member of the ISCA Special Interest Group on Speech and Language in Multimedia and serves on the IAPR Technical Committee 12 Multimedia and Visual Information Systems. Together with Hayley Hung, she developed and currently teaches an undergraduate course in Multimedia Analysis at Delft University of Technology, where she maintains a part-time membership in the Multimedia Computing Group.
