Author: Jaeyoung Choi
Affiliation: International Computer Science Institute / TU Delft
Editors: Martha Larson and Bart Thomee
The Yahoo-Flickr Creative Commons 100 Million (YFCC100M), the largest freely usable multimedia dataset released so far, is widely used by students, researchers and engineers on topics in multimedia that range from computer vision to machine learning. However, its sheer volume, one of the traits that make the dataset unique and valuable, can pose a barrier to those who do not have access to powerful computing resources. In this article, we introduce useful information and tools to boost the usability and accessibility of the YFCC100M, including the supplemental material provided by the Multimedia Commons (MMCOMMONS) community. In particular, we provide a practical guide on how to set up a feasible and cost-effective research and development environment, locally or in the cloud, that can access the data without having to download it first.
YFCC100M: The Largest Multimodal Public Multimedia Dataset
Datasets are unarguably one of the most important components of multimedia research. In recent years there has been a growing demand for a dataset that is not biased or targeted towards certain topics, and that is sufficiently large, truly multimodal, and freely usable without licensing issues.
The YFCC100M dataset was created to meet these needs and overcome many of the issues affecting existing multimedia datasets. It is, so far, the largest publicly and freely available multimedia collection of metadata, representing about 99.2 million photos and 0.8 million videos, all of which were uploaded to Flickr between 2004 and 2014. Metadata included in the dataset are, for example, title, description, tags, geo-tag, uploader information, capture device information, and the URL to the original item. Additional information was later released in the form of expansion packs to supplement the dataset, namely autotags (presence of visual concepts, such as people, animals, objects, events, architecture, and scenery), Exif metadata, and human-readable place labels. All items in the dataset were published under one of the Creative Commons commercial or noncommercial licenses, whereby approximately 31.8% of the dataset is marked for commercial use and 17.3% has the most liberal license, which only requires attribution to the photographer. For academic purposes, the entire dataset can be used freely, which enables fair comparisons and reproducibility of published research works.
Two articles from the people who created the dataset, YFCC100M: The New Data in Multimedia Research and Ins and Outs of the YFCC100M, give more detail about the motivation, collection process, and interesting characteristics and statistics of the dataset. Since its initial release in 2014, the YFCC100M quickly gained popularity and is widely used in the research community. As of September 2017, the dataset had been requested over 1400 times and cited over 300 times in research publications on multimedia topics ranging from computer vision to machine learning. Specific topics include, but are not limited to, image and video search, tag prediction, captioning, learning word embeddings, travel routing, event detection, and geolocation prediction. Demos that use the YFCC100M can be found here.
MMCOMMONS: Making YFCC100M More Useful and Accessible
Out of the many things that the YFCC100M offers, its sheer volume is what makes it especially valuable, but it is also what makes the dataset nontrivial to work with. The metadata alone spans 100 million lines of text and is 45GB in size, not including the expansion packs. To work with the images and/or videos of the YFCC100M, they need to be downloaded first using the individual URLs contained in the metadata. Aside from the time required to download all 100 million items, which would further occupy 18TB of disk space, the main problem is that a growing number of images and videos are becoming unavailable due to the natural lifecycle of digital items, where people occasionally delete what they have shared online. In addition, the time needed just to process and analyze the images and videos is generally prohibitive for students and scientists in small research groups who do not have access to high-performance computing resources.
These issues were noted upon the creation of the dataset, and the MMCOMMONS community was formed to coordinate efforts for making the YFCC100M more useful and accessible to all, and to persist the contents of the dataset over time. To that end, MMCOMMONS provides an online repository that holds supplemental material to the dataset, which can be mounted and used to directly process the dataset in the cloud. The images and videos included in the YFCC100M can be accessed and even downloaded freely from an AWS S3 bucket, which was made possible courtesy of the Amazon Public Dataset program. Note that a tiny percentage of images and videos are missing from the bucket, as they had already disappeared when the organizers started the download process right after the YFCC100M was published. This notwithstanding, the images and videos hosted in the bucket still serve as a useful snapshot that researchers can use to ensure proper reproduction of and comparison with their work. Also included in the Multimedia Commons repository are visual and aural features extracted from the image and video content. The MMCOMMONS website provides a detailed description of conventional features and deep features, which include HybridNet, VGG and VLAD. These features can be a good starting point for those who would like to jump right into using the dataset for their research or application.
The Multimedia Commons has been supporting multimedia researchers by generating annotations (see the YLI Media Event Detection and MediaEval Placing tasks), developing tools, and organizing competitions and workshops for exchanging ideas and collaborating.
Setting up a Research Environment for YFCC100M and MMCOMMONS
Even with pre-extracted features available, to do meaningful research one still needs a lot of computing power to process the large amount of YFCC100M and MMCOMMONS data. We would like to lower the barrier of entry for students and scientists who don’t have access to dedicated high-performance resources. In the following we describe how one can easily set up a research environment for handling the large collection. We introduce how Apache MXNet, Amazon EC2 Spot Instance and AWS S3 can be used to create a research development environment that can handle the data in a cost-efficient way, as well as other ways to use it more efficiently.
1) Use a subset of the dataset
It is not necessary to work with the entire dataset just because you can. Depending on the use case, it may make more sense to use a well-chosen subset. For instance, the YLI-GEO and YLI-MED subsets released by the MMCOMMONS can be useful for geolocation and multimedia event detection tasks, respectively. For other needs, the data can be filtered to generate a customized subset.
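As a rough illustration of such filtering, the sketch below scans metadata lines and keeps the identifiers of items that carry at least one tag of interest. It assumes the metadata is tab-separated with comma-separated tags in one column; the function name `filter_by_tag`, the column positions, and the toy records are hypothetical and should be adapted to the actual layout of the metadata file.

```python
def filter_by_tag(lines, wanted_tags, tag_column, id_column=0):
    """Yield the identifier of every record that carries at least one wanted tag.

    Assumption: each line is tab-separated and the tag field is a
    comma-separated list. Column positions are supplied by the caller.
    """
    wanted = {tag.lower() for tag in wanted_tags}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(tag_column, id_column):
            continue  # skip malformed or truncated lines
        tags = {tag.lower() for tag in fields[tag_column].split(",") if tag}
        if tags & wanted:
            yield fields[id_column]


# Toy records for illustration only (identifier, tags).
sample = [
    "1\tbeach,sunset\n",
    "2\tcat\n",
    "3\tSunset,city\n",
    "broken\n",
]
subset_ids = list(filter_by_tag(sample, ["sunset"], tag_column=1))
```

Because the function consumes any iterable of lines, it can be fed an open file handle, so the 45GB metadata file is streamed line by line rather than loaded into memory.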
The YFCC100M Dataset Browser is a web-based tool you can use to search the dataset by keyword. It provides an interactive visualization with statistics that helps to better understand the search results. You can generate a list file (.csv) of the items that match the search query, which you can then use to fetch the images and/or videos afterwards. The limitations of this browser are that it only supports keyword search on the tags and that it only accepts ASCII text as valid input, so queries in non-Roman scripts, which would require Unicode, are not supported. Also, queries can take up to a few seconds to return results.
A more flexible way to search the collection with lower latency is to set up your own Apache Solr server and index (a subset of) the metadata. For instance, the autotags metadata can be indexed to search for images that have visual concepts of interest. A step-by-step guide to setting up a Solr server environment with the dataset can be found here. You can write Solr queries in most programming languages by using one of the Solr wrappers.
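For readers who prefer not to use a wrapper, a query can also be issued over plain HTTP against Solr's standard `/select` handler. The sketch below only builds the request URL; the host, the core name `yfcc100m`, and the field name `autotags` are assumptions that depend entirely on how your own server and schema are set up.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler that returns JSON results.

    Assumption: the core name and the fields used in `query` match your
    own Solr configuration.
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local server, core, and field name.
url = build_solr_query_url("http://localhost:8983", "yfcc100m", "autotags:beach", rows=5)
```

The resulting URL can be fetched with any HTTP client, and the JSON response parsed to obtain the identifiers of the matching items.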
2) Work directly with data from AWS S3
Apache MXNet, a deep learning framework you can run locally on your workstation, allows training with S3 data. Most training and inference modules in MXNet accept data iterators that can read data from and write data to a local drive as well as AWS S3.
The MMCOMMONS provides a data iterator for YFCC100M images, stored as a RecordIO file, so you can process the images in the cloud without ever having to download them to your computer. If you are working with a subset that is sufficiently large, you can further filter it to generate a custom RecordIO file that suits your needs. Since the images stored in the RecordIO file are already resized and saved compactly, generating a RecordIO from an existing RecordIO file by filtering on-the-fly is more time and space efficient than downloading all images first and creating a RecordIO file from scratch. However, if you are using a subset that is relatively small, it is recommended to download just those images you need from S3 and then create a RecordIO file locally, as that will considerably speed up processing the data.
While one would generally set up Apache MXNet to run locally, note that the I/O latency of using S3 data can be greatly reduced if you run it on an Amazon EC2 instance in the same region as where the S3 data is stored (namely, us-west-2, Oregon), see Figure 2. Instructions for setting up a deep learning environment on Amazon EC2 can be found here.
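When only a handful of items are needed, a publicly readable S3 object can also be fetched over HTTPS. The sketch below composes a virtual-hosted-style S3 URL from a bucket name, region, and object key. The bucket name and key in the usage line are placeholders, not the actual names used by the MMCOMMONS repository, which should be taken from its documentation.

```python
from urllib.parse import quote


def s3_object_url(bucket, region, key):
    """Return the virtual-hosted-style HTTPS URL of an S3 object.

    The key is percent-encoded, with '/' kept as the path separator.
    """
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


# Placeholder bucket and key, for illustration only.
example_url = s3_object_url("example-bucket", "us-west-2", "data/images/abc.jpg")
```

Fetching from a machine in the same region as the bucket keeps latency low, in line with the advice above.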
3) Save cost by using Amazon EC2 Spot Instances
Cloud computing has become considerably cheaper in recent years. However, the cost of using a GPU instance to process the YFCC100M and MMCOMMONS data can still be quite high. For instance, Amazon EC2’s on-demand p2.xlarge instance (with an NVIDIA Tesla K80 GPU and 12GB of GPU memory) costs 0.9 USD per hour in the us-west-2 region. This would cost approximately $650 (€540) a month if used full-time.
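The monthly figure follows from simple arithmetic, sketched below under the assumption of a 30-day month of uninterrupted use. The function name is hypothetical, and the hourly rate is the one quoted above.

```python
def monthly_cost(hourly_rate_usd, hours_per_day=24, days_per_month=30):
    """Estimate the cost of running one instance for a month.

    Assumption: a 30-day month and a constant hourly rate.
    """
    return hourly_rate_usd * hours_per_day * days_per_month


# On-demand p2.xlarge at 0.9 USD per hour, used full-time.
on_demand = monthly_cost(0.9)
```

At 0.9 USD per hour this comes to about 648 USD, which is rounded to $650 in the text.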
One way to reduce the cost is to set up a persistent Spot Instance environment. If you request an EC2 Spot Instance, you can use the instance as long as its market price is below your maximum bidding price. If the market price goes beyond your maximum bid, the instance gets terminated after a two-minute warning. To deal with such interruptions, it is important to frequently store your intermediate results to persistent storage, such as AWS S3 or AWS EFS. The market price of the EC2 instance fluctuates, see Figure 3, so there is no guarantee as to how much you can save or how long you have to wait for your final results to be ready. However, if you are willing to experiment with pricing, the savings can be substantial: in our case we were able to reduce the costs by 75% during the period January-April 2017.
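One way to make a job tolerant of such interruptions is to checkpoint its progress and resume from the last checkpoint on restart. The following is a minimal sketch of that idea, not a description of our own setup. It assumes the checkpoint path points to storage that survives the instance (for example a mounted EFS volume), and all function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path, default):
    """Return the stored state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return json.load(handle)


def process_items(items, checkpoint_path, work, save_every=100):
    """Apply `work` to each item, resuming after the last checkpointed index."""
    state = load_checkpoint(checkpoint_path, {"next_index": 0})
    for index in range(state["next_index"], len(items)):
        work(items[index])
        state["next_index"] = index + 1
        if state["next_index"] % save_every == 0:
            save_checkpoint(checkpoint_path, state)
    save_checkpoint(checkpoint_path, state)
    return state["next_index"]
```

With this pattern, a terminated instance loses at most `save_every` items of work, so a smaller interval trades a little extra I/O for less recomputation after an interruption.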
4) Apply for academic AWS credits
Consider applying for the AWS Cloud Credits for Research Program to receive AWS credits to run your research in the cloud. In fact, thanks to the grant we were able to release LocationNet, a pre-trained geolocation model that used all geotagged YFCC100M images.
The YFCC100M is at the moment the largest multimedia dataset released to the public, but its sheer volume poses a high barrier to actually using it. To boost the usability and accessibility of the dataset, the MMCOMMONS community provides an additional AWS S3 repository with tools, features, and annotations to facilitate creating a feasible research development environment for those with fewer resources at their disposal. In this column, we provided a guide on how a subset of the dataset can be created for specific scenarios, how the hosted YFCC100M and MMCOMMONS data on S3 can be used directly for training a model with Apache MXNet, and finally how Spot Instances and academic AWS credits can make running experiments cheaper or even free.
Join the Multimedia Commons Community
Please let us know if you’re interested in contributing to the MMCOMMONS. This is a collaborative effort among research groups at several institutions (see below). We welcome contributions of annotations, features, and tools around the YFCC100M dataset, and may potentially be able to host them on AWS. What are you working on?
See this page for information about how to help out.
This dataset would not have been possible without the effort of many people, especially those at Yahoo, Lawrence Livermore National Laboratory, International Computer Science Institute, Amazon, ISTI-CNR, and ITI-CERTH.