MPEG Column: 128th MPEG Meeting in Geneva, Switzerland

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

The 128th MPEG meeting concluded on October 11, 2019 in Geneva, Switzerland with the following topics:

  • Low Complexity Enhancement Video Coding (LCEVC) Promoted to Committee Draft
  • 2nd Edition of Omnidirectional Media Format (OMAF) has reached the first milestone
  • Genomic Information Representation – Part 4 Reference Software and Part 5 Conformance Promoted to Draft International Standard

The corresponding press release of the 128th MPEG meeting can be found here: https://mpeg.chiariglione.org/meetings/128. In this report we will focus on video coding aspects (i.e., LCEVC) and immersive media applications (i.e., OMAF). At the end, we will provide an update related to adaptive streaming (i.e., DASH and CMAF).

Low Complexity Enhancement Video Coding

Low Complexity Enhancement Video Coding (LCEVC) has been promoted to committee draft (CD) which is the first milestone in the ISO/IEC standardization process. LCEVC is part two of MPEG-5 or ISO/IEC 23094-2 if you prefer the always easy-to-remember ISO codes. We introduced MPEG-5 already in previous posts and LCEVC is about a standardized video coding solution that leverages other video codecs in a manner that improves video compression efficiency while maintaining or lowering the overall encoding and decoding complexity.

The LCEVC standard uses a lightweight video codec to add up to two layers of encoded residuals. The aim of these layers is correcting artefacts produced by the base video codec and adding detail and sharpness for the final output video.
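The layering idea can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the LCEVC algorithm: the "base codec" is simulated by coarse quantization of a half-resolution frame, the residuals are stored losslessly, and all function names (`toy_base_codec`, `encode`, `decode`) are hypothetical. It only shows how one residual layer can correct base-codec artefacts while a second adds detail at the output resolution.

```python
import numpy as np

def downsample2x(frame):
    """Average each 2x2 block (frame height and width must be even)."""
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x(frame):
    """Nearest-neighbour upsampling by a factor of two."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def toy_base_codec(frame, step=16.0):
    """Stand-in for a real base codec: coarse quantization is the only loss."""
    return np.round(frame / step) * step

def encode(frame):
    """Return the base layer plus two residual layers (kept lossless here)."""
    low = downsample2x(frame)
    base = toy_base_codec(low)
    residual_1 = low - base                             # corrects base-codec artefacts
    residual_2 = frame - upsample2x(base + residual_1)  # adds detail at full resolution
    return base, residual_1, residual_2

def decode(base, residual_1=None, residual_2=None):
    """Reconstruct a full-resolution frame from whichever layers are available."""
    corrected = base if residual_1 is None else base + residual_1
    full = upsample2x(corrected)
    return full if residual_2 is None else full + residual_2
```

A decoder that ignores both residual layers still produces a picture from the base layer alone, which is the backwards-compatibility property; each additional layer can only bring the reconstruction closer to the source in this toy setting.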

The target of this standard comprises software or hardware codecs with extra processing capabilities, e.g., mobile devices, set top boxes (STBs), and personal computer based decoders. Additional benefits are the reduction in implementation complexity or a corresponding expansion in spatial resolution.

LCEVC is based on existing codecs which allows for backwards-compatibility with existing deployments. Supporting LCEVC enables “softwareized” video coding allowing for release and deployment options known from software-based solutions which are well understood by software companies and, thus, opens new opportunities in improving and optimizing video-based services and applications.

Research aspects: in video coding, research efforts are mainly related to coding efficiency and complexity (as usual). However, as MPEG-5 basically adds a software layer on top of what is typically implemented in hardware, all kinds of aspects related to software engineering could become an active area of research.

Omnidirectional Media Format

The scope of the Omnidirectional Media Format (OMAF) is about 360° video, images, audio and associated timed text and specifies (i) a coordinate system, (ii) projection and rectangular region-wise packing methods, (iii) storage of omnidirectional media and the associated metadata using ISOBMFF, (iv) encapsulation, signaling and streaming of omnidirectional media in DASH and MMT, and (v) media profiles and presentation profiles.

At this meeting, the second edition of OMAF (ISO/IEC 23090-2) has been promoted to committee draft (CD) which includes

  • support of improved overlay of graphics or textual data on top of video,
  • efficient signaling of videos structured in multiple sub parts,
  • enabling more than one viewpoint, and
  • new profiles supporting dynamic bitstream generation according to the viewport.

As for the first edition, OMAF includes encapsulation and signaling in ISOBMFF as well as streaming of omnidirectional media (DASH and MMT). It will reach its final milestone by the end of 2020.

360° video is certainly a vital use case towards a fully immersive media experience. Devices to capture and consume such content are becoming increasingly available and will probably contribute to the dissemination of this type of content. However, it is also understood that the complexity increases significantly, specifically with respect to large-scale, scalable deployments due to increased content volume/complexity, timing constraints (latency), and quality of experience issues.

Research aspects: understanding the increased complexity of 360° video or immersive media in general is certainly an important aspect to be addressed towards enabling applications and services in this domain. We may even start thinking that 360° video actually works (e.g., it’s possible to capture, upload to YouTube and consume it on many devices) but the devil is in the detail in order to handle this complexity in an efficient way to enable seamless and high quality of experience.

DASH and CMAF

The 4th edition of DASH (ISO/IEC 23009-1) will be published soon and MPEG is currently working towards a first amendment which will be about (i) CMAF support and (ii) event processing model. An overview of all DASH standards is depicted in the figure below, notably part one of MPEG-DASH referred to as media presentation description and segment formats.

Figure: MPEG-DASH standard status.

The 2nd edition of the CMAF standard (ISO/IEC 23000-19) will become available very soon and MPEG is currently reviewing additional tools in the so-called technologies under considerations document as well as conducting various explorations. A working draft for additional media profiles is also under preparation.

Research aspects: with CMAF, low-latency support is added to DASH-like applications and services. However, the implementation specifics are actually not defined in the standard and are subject to competition (e.g., here). Interestingly, the Bitmovin video developer reports from both 2018 and 2019 highlight the need for low-latency solutions in this domain.

At the ACM Multimedia Conference 2019 in Nice, France I gave a tutorial entitled “A Journey towards Fully Immersive Media Access” which includes updates related to DASH and CMAF. The slides are available here.

Outlook 2020

Finally, let me try giving an outlook for 2020, not so much content-wise but events planned for 2020 that are highly relevant for this column:

  • MPEG129, Jan 13-17, 2020, Brussels, Belgium
  • DCC 2020, Mar 24-27, 2020, Snowbird, UT, USA
  • MPEG130, Apr 20-24, 2020, Alpbach, Austria
  • NAB 2020, Apr 18-22, Las Vegas, NV, USA
  • ICASSP 2020, May 4-8, 2020, Barcelona, Spain
  • QoMEX 2020, May 26-28, 2020, Athlone, Ireland
  • MMSys 2020, Jun 8-11, 2020, Istanbul, Turkey
  • IMX 2020, June 17-19, 2020, Barcelona, Spain
  • MPEG131, Jun 29 – Jul 3, 2020, Geneva, Switzerland
  • NetSoft, QoE Mgmt Workshop, Jun 29 – Jul 3, 2020, Ghent, Belgium
  • ICME 2020, Jul 6-10, London, UK
  • ATHENA summer school, Jul 13-17, Klagenfurt, Austria
  • … and many more!

Dataset Column: Report from the MMM 2019 Special Session on Multimedia Datasets for Repeatable Experimentation (MDRE 2019)

Special Session

Information retrieval and multimedia content access have a long history of comparative evaluation, and many of the advances in the area over the past decade can be attributed to the availability of open datasets that support comparative and repeatable experimentation. Sharing data and code to allow other researchers to replicate research results is needed in the multimedia modeling field, as it helps to improve the performance of systems and the reproducibility of published papers.

This report summarizes the special session on Multimedia Datasets for Repeatable Experimentation (MDRE 2019), which was organized at the 25th International Conference on MultiMedia Modeling (MMM 2019), which was held in January 2019 in Thessaloniki, Greece.

The intent of these special sessions is to be a venue for releasing datasets to the multimedia community and discussing dataset related issues. The presentation mode in 2019 was to have short presentations (8 minutes) with some questions, and an additional panel discussion after all the presentations, which was moderated by Björn Þór Jónsson. In the following we summarize the special session, including its talks, questions, and discussions.

The special session presenters: Luca Rossetto, Cathal Gurrin and Minh-Son Dao.

Presentations

A Test Collection for Interactive Lifelog Retrieval

The session started with a presentation about A Test Collection for Interactive Lifelog Retrieval [1], given by Cathal Gurrin from Dublin City University (Ireland). In their work, the authors introduced a new test collection for interactive lifelog retrieval, which consists of multi-modal data from 27 days, comprising nearly 42 thousand images and other personal data (health and activity data; more specifically, heart rate, galvanic skin response, calorie burn, steps, blood pressure, blood glucose levels, human activity, and diet log). The authors argued that, although other lifelog datasets already exist, their dataset is unique in terms of the multi-modal character, and has a reasonable and easily manageable size of 27 consecutive days. Hence, it can also be used for interactive search and provides newcomers with an easy entry into the field. The published dataset has already been used for the Lifelog Search Challenge (LSC) [5] in 2018, which is an annual competition run at the ACM International Conference on Multimedia Retrieval (ICMR).

The discussion about this work started with a question about the plans for the dataset and whether it should be extended over the years, e.g. to increase the challenge of participating in the LSC. However, the problem with public lifelog datasets is the fact that there is a conflict between releasing more content and safeguarding privacy. There is a strong need to anonymize the contained images (e.g. blurring faces and license plates), where the rules and requirements of the EU GDPR regulations make this especially important. However, anonymizing content unfortunately is a very slow process. An alternative to removing and/or masking actual content from the dataset for privacy reasons would be to create artificial datasets (e.g. containing public images or only faces from people who consent to publish), but this would likely also be a non-trivial task. One interesting aspect could be the use of Generative Adversarial Networks (GANs) for the anonymization of faces, for instance by replacing all faces appearing in the content with generated faces learned from a small group of people who gave their consent. Another way to preemptively mitigate the privacy issues could be to wear conspicuous ‘lifelogging stickers’ during recording to make people aware of the presence of the camera, which would give them the possibility to object to being filmed or to avoid being captured altogether.

SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives

The second presentation was given by Minh-Son Dao from the National Institute of Information and Communications Technology (NICT) in Japan about SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives [2]. This is a dataset that aims at combining the conditions of the environment with health-related aspects (e.g., pollution or weather data with cardio-respiratory or psychophysiological data). The creation of the dataset was motivated by the fact that people in larger cities in Japan very often do not want to go out (e.g., for some sports activities), because they are very concerned about pollution, i.e., health conditions. So it would be beneficial to have a map of the city with assigned pollution ratings, or a system that allows users to perform related queries. Their dataset contains sensor data collected on routes by a few dozen volunteers over seven days in Fukuoka, Japan. More particularly, they collected data about the location, O3, NO2, PM2.5 (particulates), temperature, and humidity in combination with heart rate, motion behavior (from a 3-axis accelerometer), relaxation level, and other personal perception data from questionnaires.

This dataset has also been used for multimedia benchmark challenges, such as the Lifelogging for Wellbeing task at MediaEval. In order to define the ground truth, volunteers were presented with specific use cases and annotation rules, and were asked to collaboratively annotate the dataset. The collected data (the feelings of participants at different locations) was also visualized using an interactive map. Although the dataset may have some inconsistent annotations, it is easy to filter them out since labels of corresponding annotators and annotator groups are contained in the dataset as well.

V3C – a Research Video Collection

The third presentation was given by Luca Rossetto from the University of Basel (Switzerland) about V3C – a Research Video Collection [3]. This is a large-scale dataset for multimedia retrieval, consisting of nearly 30,000 videos with an overall duration of about 3,800 hours. Although many other video datasets are available already (e.g., IACC.3 [6], or YFCC100M [8]), the V3C dataset is unique in the aspects of timeliness (more recent content than many other datasets and therefore more representative content for current ‘videos in the wild’) and diversity (represents many different genres or use cases), while also having no copyright restrictions (all contained videos were labelled with a Creative Commons license by their uploaders). The videos have been collected from the video sharing platform Vimeo (hence the name ‘Vimeo Creative Commons Collection’ or V3C in short) and represent video data currently used on video sharing platforms. The dataset comes together with a master shot-boundary detection ground truth, as well as keyframes and additional metadata. It is partitioned into three major parts (V3C1, V3C2, and V3C3) to make it more manageable, and it will be used by the TRECVID and the Video Browser Showdown (VBS) evaluation campaigns for several years. Although the dataset was not specifically built for retrieval, it is suitable for any use case that requires a larger video dataset.

The shot-boundary detection used to provide the master-shot reference for the V3C dataset was implemented using Cineast, which is an open source software available for download. It divides every frame into a 3×3 grid and computes color histograms for all 9 areas, which are then concatenated into a ‘regional color histogram’ feature vector that is compared between all adjacent frames. This seems to work very well for hard cuts and gradual transitions, although for grayscale content (and flashlights etc.) it is not very stable. The additional metadata provided with the dataset includes information about resolution, frame rate, uploading user and the upload date, as well as any semantic information provided by the uploader (title, description, tags, etc.). 
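The regional color histogram comparison described above can be sketched as follows. This is a minimal illustration, not Cineast's implementation: the bin count, the L1 distance, the threshold value, and the function names are all assumptions, and the sketch only flags hard cuts (gradual transitions would need a comparison over a window of frames).

```python
import numpy as np

def regional_color_histogram(frame, grid=3, bins=8):
    """Concatenate per-channel histograms of every cell in a grid x grid layout.

    `frame` is an H x W x C uint8 array; `grid` and `bins` are illustrative defaults.
    """
    height, width, channels = frame.shape
    counts = []
    for row in range(grid):
        for col in range(grid):
            cell = frame[row * height // grid:(row + 1) * height // grid,
                         col * width // grid:(col + 1) * width // grid]
            for channel in range(channels):
                hist, _ = np.histogram(cell[..., channel], bins=bins, range=(0, 256))
                counts.append(hist)
    feature = np.concatenate(counts).astype(float)
    return feature / feature.sum()  # normalize so frames of any size are comparable

def detect_hard_cuts(frames, threshold=0.5):
    """Return the indices of frames that start a new shot.

    Adjacent frames are compared by the L1 distance of their feature
    vectors, which lies in [0, 2] because both vectors sum to one.
    """
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        current = regional_color_histogram(frame)
        if previous is not None and np.abs(current - previous).sum() > threshold:
            cuts.append(index)
        previous = current
    return cuts
```

The sketch also makes the reported weakness plausible: grayscale content carries identical information in all three channels, and a sudden flash changes every regional histogram at once, so a fixed threshold on color differences is less reliable in both cases.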

Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition

Originally a fourth presentation was scheduled about Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition [4], but unfortunately no author was on site to give the presentation. This dataset contains audio samples with a duration of 30 seconds (as well as extracted features and ground truth) from a metropolitan city (Athens, Greece), which were recorded over a period of about four years by 10 different people with the aim of providing a collection of city sounds. The metadata includes geospatial coordinates, timestamp, rating, and tagging of the sound by the recording person. The authors demonstrated in a baseline evaluation that their dataset allows the soundscape quality in the city to be predicted with about 42% accuracy.

Discussion

After the presentations, Björn Þór Jónsson moderated a panel discussion in which all presenters participated.

The panel started with a discussion on the size of datasets, whether the only way to make challenges more difficult is to keep increasing the dataset, or whether there are alternatives to this. Although this heavily depends on the research question one would like to solve, it was generally agreed that there is a definite need for evaluation with large datasets, because for small datasets some problems are trivial. Moreover, too small datasets often introduce some kind of content bias, so that they do not fully reflect the practical situation.

For now, it seems there is no real alternative to using larger datasets, although it is clear that this will introduce additional challenges/hurdles for data management and data processing. All presenters (and the audience too) agreed that introducing larger datasets will also necessitate closer collaboration with other research communities (in fields like data science, data management/engineering, and distributed and high-performance computing) in order to manage the higher data load.

However, even though we need larger datasets, we might not be ready yet to really go for true large-scale. For example, the V3C dataset is still far away from a true web-scale video search dataset; it originally was intended to be even bigger, but there were concerns from the TRECVID and VBS communities about the manageability. Datasets that are too large would set the entrance barrier for newcomers so high that an evaluation benchmark may not attract enough participants. This problem could possibly disappear in a few years (as hardware becomes cheaper and faster/larger), but it still needs to be addressed from an organizational viewpoint.

There were notes from the audience that instead of focusing on size alone, we should also consider the problem we want to solve. It appears many researchers use datasets for use cases for which they were not designed and to which they are not suited. Instead of blindly going for larger size, datasets could be kept small and simple for solving essential research questions, for example by truly optimizing them to the problem to solve; different evaluations would then use different datasets. However, this would lead to considerable dataset fragmentation and necessitate combining several datasets for broader/larger evaluation tasks, which has been shown to be quite challenging in the past. For example, there are already a lot of health datasets available, and it would be interesting to benefit from them, but the workload for the integration into competitions is often too high in practice.

Another issue that should be addressed more intensively by the research community is to figure out the situation for personal datasets that are compliant with GDPR regulations, since currently nobody really knows how to deal with this.

Acknowledgments

The session was organized by the authors of the report, in collaboration with Duc-Tien Dang-Nguyen (Dublin City University), Michael Riegler (Center for Digitalisation and Engineering & University of Oslo), and Luca Piras (University of Cagliari). The panel format of the special session made the discussions much more lively and interactive than that of a traditional technical session. We would like to thank the presenters and their co-authors for their excellent contributions, as well as the members of the audience who contributed greatly to the session.

References

[1] Gurrin, C., Schoeffmann, K., Joho, H., Munzer, B., Albatal, R., Hopfgartner, F., … & Dang-Nguyen, D. T. (2019, January). A test collection for interactive lifelog retrieval. In International Conference on Multimedia Modeling (pp. 312-324). Springer, Cham.
[2] Sato, T., Dao, M. S., Kuribayashi, K., & Zettsu, K. (2019, January). SEPHLA: Challenges and Opportunities Within Environment-Personal Health Archives. In International Conference on Multimedia Modeling (pp. 325-337). Springer, Cham.
[3] Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019, January). V3C–A Research Video Collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.
[4] Giannakopoulos, T., Orfanidi, M., & Perantonis, S. (2019, January). Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition. In International Conference on Multimedia Modeling (pp. 338-348). Springer, Cham.
[5] Dang-Nguyen, D. T., Schoeffmann, K., & Hurst, W. (2018, June). LSC2018 Panel-Challenges of Lifelog Search and Access. In Proceedings of the 2018 ACM Workshop on The Lifelog Search Challenge (pp. 1-2). ACM.
[6] Awad, G., Butt, A., Curtis, K., Lee, Y., Fiscus, J., Godil, A., … & Kraaij, W. (2018, November). Trecvid 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search.
[7] Lokoč, J., Kovalčík, G., Münzer, B., Schöffmann, K., Bailer, W., Gasser, R., … & Barthel, K. U. (2019). Interactive search or sequential browsing? a detailed analysis of the video browser showdown 2018. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1), 29.
[8] Kalkowski, S., Schulze, C., Dengel, A., & Borth, D. (2015, October). Real-time analysis and visualization of the YFCC100M dataset. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions (pp. 25-30). ACM.

Dataset Column: Datasets for Online Multimedia Verification

Introduction

Online disinformation is a problem that has been attracting increased interest by researchers worldwide as the breadth and magnitude of its impact is progressively manifested and documented in a number of studies (Boididou et al., 2014; Zhou & Zafarani, 2018; Zubiaga et al., 2018). This emerging area of research is inherently multidisciplinary and there have been numerous treatments of the subject, each having a distinct perspective or theme, ranging from the predominant perspectives of media, journalism and communications (Wardle & Derakhshan, 2017) and political science (Allcott & Gentzkow, 2017) to those of network science (Lazer et al., 2018), natural language processing (Rubin et al., 2015) and signal processing, including media forensics (Zampoglou et al., 2017). Given the multimodal nature of the problem, it is no surprise that the multimedia community has taken a strong interest in the field.

From a multimedia perspective, two research problems have attracted the bulk of researchers’ attention: a) detection of content tampering and content fabrication, and b) detection of content misuse for disinformation. The first was traditionally studied within the field of media forensics (Rocha et al., 2011), but has recently been under the spotlight as a result of the rise of deepfake videos (Güera & Delp, 2018), i.e. videos produced by a special class of generative models that are capable of synthesizing highly convincing media content from scratch or based on some authentic seed content. The second concerns multimedia misuse or misappropriation, i.e. the use of media content out of its original context with the goal of spreading misinformation or false narratives (Tandoc et al., 2018).

Developing automated approaches to detect media-based disinformation relies to a great extent on the availability of relevant datasets, both for training supervised learning models and for evaluating their effectiveness. Yet, developing and releasing such datasets is a challenge in itself for a number of reasons:

  1. Identifying, curating, understanding, and annotating cases of media-based misinformation is a very effort-intensive task. More often than not, the annotation process requires careful and extensive reading of pertinent news coverage from a variety of sources similar to the journalistic practice of verification (Brandtzaeg et al., 2016).
  2. Media-based disinformation is largely manifested in social media platforms and relevant datasets are therefore hard to collect and distribute due to the temporary nature of social media content and the numerous technical restrictions and challenges involved in collecting content (mostly due to limitations or complete lack of appropriate support by the respective APIs), as well as the legal and ethical issues in releasing social media-based datasets (due to the need to comply with the respective Terms of Service and any applicable data protection law).

In this column, we present two multimedia datasets that could be of value to researchers who study media-based disinformation and develop automated approaches to tackle the problem. The first, called Fake Video Corpus (Papadopoulou et al., 2019) is a manually curated collection of 200 debunked and 180 verified videos, along with relevant annotations, accompanied by a set of 5,193 near-duplicate instances of them that were posted on popular social media platforms. The second, called FIVR-200K (Kordopatis-Zilos et al., 2019), is an automatically collected dataset of 225,960 videos, a list of 100 video queries and manually verified annotations regarding the relation (if any) of the dataset videos to each of the queries (i.e. near-duplicate, complementary scene, same incident).

For each of the two datasets, we present the design and creation process, focusing on issues and questions regarding the relevance of the collected content, the technical means of collection, and the process of annotation, which had the dual goal of ensuring high accuracy and keeping the manual annotation cost manageable. Given that each dataset is accompanied by a detailed journal article, in this column we limit our description to high-level information, emphasizing the utility and creation process in each case rather than detailed statistics, which are disclosed in the respective papers.

Following the presentation of the two datasets, we then proceed to a critical discussion, highlighting their limitations and some caveats, and delineating future steps towards high quality dataset creation for the field of multimedia-based misinformation.

Related Datasets

The complexity and challenge of the multimedia verification problem has led to the creation of numerous datasets and benchmarking efforts, each designed specifically for a particular task within this area. We can broadly classify these efforts into three areas: a) multimedia forensics, b) multimedia retrieval, and c) multimedia post classification. Datasets that are focused on the text modality, e.g. Fake News Challenge, Clickbait Challenge, Hyperpartisan News Detection, RumourEval (Derczynski et al., 2017), etc. are beyond the scope of this post and are hence not included in this discussion.

Multimedia forensics: Generating high-quality multimedia forensics datasets has always been a challenge, since creating convincing forgeries is normally a manual task requiring a fair amount of skill, and as a result such datasets have generally been few and limited in scale. With respect to image splicing, our own survey (Zampoglou et al., 2017) listed a number of datasets that had been made available by that point, including our own Wild Web tampered image dataset, which consists of real-world forgeries that have been collected from the Web, including multiple near-duplicates, making it a large and particularly challenging collection. Recently, the Realistic Tampering Dataset (Korus et al., 2017) was proposed, offering a large number of convincing forgeries for evaluation. On the other hand, copy-move image forgeries pose a different problem that requires specially designed datasets. Three such commonly used datasets are those produced by MICC (Amerini et al., 2011), the Image Manipulation Dataset (Christlein et al., 2012), and CoMoFoD (Tralic et al., 2013). These datasets are still actively used in research.

With respect to video tampering, there has been a relative scarcity of high-quality large-scale datasets, which is understandable given the difficulty of creating convincing forgeries. The recently proposed Multimedia Forensics Challenge datasets include some large-scale sets of tampered images and videos for the evaluation of forensics algorithms. Finally, there has recently been increased interest in the automatic detection of forgeries made with the assistance of particular software, and specifically face-swapping software. As the quality of produced face-swaps is constantly improving, detecting face-swaps is an important emerging verification task. The FaceForensics++ dataset (Rössler et al., 2019) is a very large-scale dataset containing face-swapped videos (and untampered face videos) from a number of different algorithms, intended for the evaluation of face-swap detection algorithms.

Multimedia retrieval: Several cases of multimedia verification can be considered to be an instance of a near-duplicate retrieval task, in which the query video (video to be verified) is run against a database of past cases/videos to check whether it has already appeared before. The most popular publicly available dataset for near-duplicate video retrieval is arguably the CC_WEB_VIDEO dataset (Wu et al., 2007). This consists of 12,790 user-generated videos collected from popular video sharing websites (YouTube, Google Video, and Yahoo! Video). It is organized in 24 query sets, for each of which the most popular video was selected to serve as query, and the rest of the videos were manually annotated based on whether and how they duplicate the query. Another relevant dataset is VCDB (Jiang et al., 2014), which was compiled and annotated as a benchmark for the partial video copy detection problem and is composed of videos from popular video platforms (YouTube and Metacafe). VCDB contains two subsets of videos: a) the core, which consists of 28 discrete sets of videos with a total of 528 videos with over 9,000 pairs of manually annotated partial copies, and b) the distractors, which consists of 100,000 videos intended to make the video copy detection problem more challenging.
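The query-against-database setup can be sketched in a few lines. This is an illustrative example under stated assumptions, not the method of any cited paper: it presumes every video has already been reduced to one fixed-length, non-zero feature vector (how that vector is computed is out of scope), and the function name and the similarity threshold are hypothetical.

```python
import numpy as np

def rank_near_duplicates(query, database, threshold=0.9):
    """Return (index, similarity) pairs for database videos similar to the query.

    `query` is a 1-D feature vector and `database` a 2-D array with one
    row per known video. Cosine similarity is used; results are sorted
    from most to least similar and cut off at `threshold`.
    """
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    similarities = db @ q
    order = np.argsort(-similarities)
    return [(int(i), float(similarities[i])) for i in order
            if similarities[i] >= threshold]
```

For verification, a non-empty result means that the video under examination (or a variant of it) may have appeared before, which is the cue a human fact-checker would then follow up.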

Multimedia post classification: A benchmark task under the name “Verifying Multimedia Use” (Boididou et al., 2015; Boididou et al., 2016) was organized in the context of MediaEval 2015 and 2016. The task made available a dataset of 15,629 tweets containing images and videos, each of which made a false or factual claim with respect to the shared image/video. The released tweets were posted in the context of breaking news events (e.g. Hurricane Sandy, Boston Marathon bombings) or hoaxes.

Video Verification Datasets

The Fake Video Corpus (FVC)

The Fake Video Corpus (Papadopoulou et al., 2018) is a collection of 380 user-generated videos and 5,193 near-duplicate versions of them, all collected from three online video platforms: YouTube, Facebook, and Twitter. The videos are annotated either as “verified” (“real”) or as “debunked” (“fake”) depending on whether the information they convey is accurate or misleading. Verified videos are typically user-generated takes of newsworthy events, while debunked videos include various types of misinformation, including staged content posing as UGC, real content taken out of context, or modified/tampered content (see Figure 1 for examples). The near-duplicates of each video are arranged in temporally ordered “cascades”, and each near-duplicate video is annotated with respect to its relation to the first video of the cascade (e.g. whether it is reinforcing or debunking the original claim). The FVC is the first, to our knowledge, large-scale dataset of debunked and verified user-generated videos (UGVs). The dataset contains different kinds of metadata for its videos, including channel (user) information, video information, and community reactions (number of likes, shares and comments) at the time of their inclusion.

  
  
Figure 1. A selection of real (top row) and fake (bottom row) videos from the Fake Video Corpus.

The initial set of 380 videos was collected and annotated using various sources including the Context Aggregation and Analysis (CAA) service developed within the InVID project and fact-checking sites such as Snopes. To build the dataset, all videos submitted to the CAA service between November 2017 and January 2018 were collected in an initial pool of approximately 1,600 videos, which were then manually inspected and filtered. The remaining videos were annotated as “verified” or “debunked” using established third party sources (news articles or blog posts), leading to the final pool of 180 verified and 200 fake unique videos. Then, keyword-based search was run on the three platforms, and near-duplicate video detection was used to identify the video duplicates within the returned results. More specifically, for each of the 380 videos, its title was reformulated in a more general form, and translated into four major languages: Russian, Arabic, French, and German. The original title, the general form and the translations were submitted as queries to YouTube, Facebook, and Twitter. Then, the near-duplicate retrieval algorithm of Kordopatis-Zilos et al. (2017) was used on the resulting pool, and the results were manually inspected to remove erroneous matches.

The purpose of the dataset is twofold: i) to be used for the analysis of the dissemination patterns of real and fake user-generated videos (by analyzing the traits of the near-duplicate video cascades), and ii) to serve as a benchmark for the evaluation of automated video verification methods. The relatively large size of the dataset is important for both of these tasks. With respect to the study of dissemination patterns, the dataset provides the opportunity to study the dissemination of the same or similar content by analyzing associations between videos that are not provided by the original platform APIs, combined with the wealth of associated metadata. In parallel, a collection of 5,573 videos annotated as “verified” or “debunked”, even if many are near-duplicate versions of the 380 cases, can be used for the evaluation (or even training) of verification systems, based either on visual content or on the associated video metadata.
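To make the first use more concrete, the temporally ordered cascades can be represented with a few lines of Python. This is only a minimal sketch: the record fields (`video_id`, `original_id`, `published`, `relation`) and the sample values are hypothetical and do not reflect the actual file format of the corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NearDuplicate:
    video_id: str        # hypothetical identifier
    original_id: str     # first video of the cascade this item belongs to
    published: datetime  # upload time on the platform
    relation: str        # e.g. "reinforcing" or "debunking" the original claim


def build_cascades(records):
    """Group near-duplicates by original video and order each group by upload time."""
    cascades = defaultdict(list)
    for rec in records:
        cascades[rec.original_id].append(rec)
    return {orig: sorted(items, key=lambda r: r.published)
            for orig, items in cascades.items()}


def first_debunk_delay(cascade):
    """Time between the first near-duplicate and the first debunking one, or None."""
    for rec in cascade:
        if rec.relation == "debunking":
            return rec.published - cascade[0].published
    return None


# Made-up sample data, for illustration only.
records = [
    NearDuplicate("b", "v1", datetime(2018, 1, 3), "debunking"),
    NearDuplicate("a", "v1", datetime(2018, 1, 1), "reinforcing"),
    NearDuplicate("c", "v2", datetime(2018, 2, 1), "reinforcing"),
]
cascades = build_cascades(records)
```

A statistic such as `first_debunk_delay` is just one example of a cascade trait that could be compared between real and fake videos.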

The Fine-grained Incident Video Retrieval Dataset (FIVR-200K)

The FIVR-200K dataset (Kordopatis-Zilos et al., 2019) consists of 225,960 videos associated with 4,687 Wikipedia events and 100 selected video queries (see Figure 2 for examples). It has been designed to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). The objective of this problem is: given a query video, retrieve all associated videos considering several types of associations with respect to an incident of interest. FIVR contains several retrieval tasks as special cases under a single framework. In particular, we consider three types of association between videos: a) Duplicate Scene Videos (DSV), which share at least one scene (originating from the same camera) regardless of any applied transformation, b) Complementary Scene Videos (CSV), which contain part of the same spatiotemporal segment, but captured from different viewpoints, and c) Incident Scene Videos (ISV), which capture the same incident, i.e. they are spatially and temporally close, but have no overlap.

For the collection of the dataset, we first crawled Wikipedia’s Current Event page to collect a large number of major news events that occurred between 2013 and 2017 (five years). Each news event is accompanied by a topic, headline, text, date, and hyperlinks. To collect videos of the same category, we retained only news events with the topic “Armed conflicts and attacks” or “Disasters and accidents”. This ultimately led to a total of 4,687 events after filtering. To gather videos around these events and build a large collection with numerous video pairs that are associated through the relations of interest (DSV, CSV and ISV), we queried the public YouTube API with the event headlines. To ensure that the collected videos capture the corresponding event, we retained only the videos published within a timespan of one week from the event date. This process resulted in the collection of 225,960 videos.
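The two filtering rules described above (by topic and by publication date) can be sketched as follows. The dictionary keys and the sample events are hypothetical, and the sketch assumes that “within one week from the event date” means on or after the event date and at most seven days later; the original crawler may have applied the window differently.

```python
from datetime import date, timedelta

KEPT_TOPICS = {"Armed conflicts and attacks", "Disasters and accidents"}
WINDOW = timedelta(days=7)  # one week from the event date


def keep_event(event):
    """Retain only events from the two selected topics."""
    return event["topic"] in KEPT_TOPICS


def keep_video(published, event_date, window=WINDOW):
    """Assumption: keep videos published from the event date up to one week later."""
    return timedelta(0) <= published - event_date <= window


# Made-up sample events, for illustration only.
events = [
    {"headline": "Example earthquake", "topic": "Disasters and accidents",
     "date": date(2016, 5, 10)},
    {"headline": "Example election", "topic": "Politics and elections",
     "date": date(2016, 5, 11)},
]
kept_events = [e for e in events if keep_event(e)]
```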

  
Figure 2. A selection of query videos from the Fine-grained Incident Video Retrieval dataset. Larger versions, links to the YouTube videos, and several associated videos are given in Appendix B.

Next, we proceeded with the selection of query videos. We set up an automated filtering and ranking process that implemented the following criteria: a) query videos should be relatively short and ideally focus on a single scene, b) queries should have many near-duplicates or same-incident videos within the dataset that are published by many different uploaders, c) among a set of near-duplicate/same-incident videos, the one that was uploaded first should be selected as query. This selection process was implemented using a graph-based clustering approach and resulted in the selection of 635 query videos, of which we used the top 100 (ranked by corresponding cluster size) as the final query set.
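A simplified version of such a selection step is sketched below. It is not the procedure actually used for FIVR-200K: it assumes that the clusters are the connected components of a graph whose edges link near-duplicate or same-incident videos, and it only covers criterion c) and the ranking by cluster size. All identifiers and values are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as lists of nodes."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        components.append(component)
    return components


def select_queries(upload_time, edges, top_k):
    """Pick the earliest upload of each cluster, largest clusters first."""
    clusters = connected_components(list(upload_time), edges)
    clusters.sort(key=len, reverse=True)
    return [min(cluster, key=upload_time.get) for cluster in clusters[:top_k]]


# Made-up upload times (smaller = earlier) and similarity edges.
upload_time = {"a": 3, "b": 1, "c": 2, "d": 5, "e": 4, "f": 9}
edges = [("a", "b"), ("b", "c"), ("d", "e")]
queries = select_queries(upload_time, edges, top_k=2)
```

In this toy example the clusters are {a, b, c}, {d, e} and {f}, so the two selected queries are the earliest uploads of the two largest clusters.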

For the annotation of similarity relations among videos, we followed a multi-step process, in which we presented annotators with the results of a similarity-based video retrieval system and asked them to indicate the type of relation through a drop-down list of the following labels: a) Near-Duplicate (ND), a special case where the whole video is near-duplicate to the query video, b) Duplicate Scene (DS), where only some scenes in the candidate video are near-duplicates of scenes in the query video, c) Complementary Scenes (CS), d) Incident Scene (IS), and e) Distractors (DI), i.e. irrelevant videos.

To make sure that annotators were presented with as many potentially relevant videos as possible, we used visual-only, text-only and hybrid similarity in turn. As a result, each annotator reviewed video candidates that had very high similarity with the query video in terms of their visual content, their text metadata (title and description), or the combination of the two. Once an initial set of annotations had been produced by two independent annotators, the annotators went through the annotations two more times to ensure consistency and accuracy.
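The idea of building the candidate pool from three rankings can be sketched as follows. How the visual and text similarities were actually combined is not stated here, so the plain average used below is purely an assumption, and the scores are made up.

```python
def top_k(scores, k):
    """Return the k candidate ids with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def annotation_pool(visual, text, k):
    """Union of the top-k candidates by visual, text and hybrid similarity."""
    # Assumption: the hybrid score is the plain average of the two similarities.
    hybrid = {vid: 0.5 * (visual[vid] + text[vid]) for vid in visual}
    pool = []
    for ranking in (top_k(visual, k), top_k(text, k), top_k(hybrid, k)):
        for vid in ranking:
            if vid not in pool:
                pool.append(vid)
    return pool


# Made-up similarity scores between one query and four candidate videos.
visual = {"v1": 0.9, "v2": 0.2, "v3": 0.6, "v4": 0.1}
text = {"v1": 0.1, "v2": 0.8, "v3": 0.7, "v4": 0.2}
pool = annotation_pool(visual, text, k=1)
```

Here each ranking contributes a different candidate, which illustrates why using the three similarities in turn surfaces more potentially relevant videos than any single one.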

FIVR-200K was designed to serve as a benchmark that poses real-world challenges for the problem of reverse video search. Given a query video to be verified, the analyst would want to know whether the same or a very similar version of it has already been published. In that way, the user would be able to easily debunk cases of out-of-context video use (i.e. misappropriation) and on the other hand, if several videos are found that depict the same scene from different viewpoints at approximately the same time, then they could be considered to corroborate the video of interest.

Discussion: Limitations and Caveats

We are confident that the two video verification datasets presented in this column can be valuable resources for researchers interested in the problem of media-based disinformation and could serve both as training sets and as benchmarks for automated video verification methods. Yet, both of them suffer from certain limitations and care should be taken when using them to draw conclusions. 

A first potential issue has to do with the video selection bias arising from the particular way in which each of the two datasets was created. The videos of the Fake Video Corpus were selected in a mixed manner: the aim was to include a number of cases that were known to the dataset creators and their collaborators, and the corpus was also enriched by a pool of test videos that were submitted for analysis to a publicly available video verification service. As a result, it is likely to be more focused on viral and popular videos. Also, only videos for which debunking or corroborating information was found online were included, which introduces yet another source of bias, potentially towards cases that were more newsworthy or clear-cut. In the case of the FIVR-200K dataset, videos were intentionally collected from only two categories of newsworthy events with the goal of ending up with a relatively homogeneous collection, which would be challenging in terms of content-based retrieval. This means that certain types of content, such as political events, sports and entertainment, are very limited or not present at all in the dataset.

A question that is related to the selection bias of the above datasets pertains to their relevance for multimedia verification and for real-world applications. In particular, it is not clear whether the video cases offered by the Fake Video Corpus are representative of actual verification tasks that journalists and news editors face in their daily work. Another important question is whether these datasets offer a realistic challenge to automatic multimedia analysis approaches. In the case of FIVR-200K, it was clearly demonstrated (Kordopatis-Zilos et al., 2019) that the dataset is a much harder benchmark for near-duplicate detection methods compared to previous datasets such as CC_WEB_VIDEO and VCDB. Even so, we cannot safely conclude that a method that performs very well on FIVR-200K would perform equally well on a dataset of much larger scale (e.g. millions or even billions of videos).

Another issue that affects access to these datasets and the reproducibility of experimental results relates to the ephemeral nature of online video content. A considerable (and increasing) part of these video collections has been taken down (either by their own creators or by the video platform), which makes it impossible for researchers to gain access to the exact video set that was originally collected. To give a better sense of the problem, 21% of the Fake Video Corpus and 11% of the FIVR-200K videos were no longer available online in September 2019. This issue, which affects all datasets that are based on online multimedia content, raises the more general question of whether there are steps that can be taken by online platforms such as YouTube, Facebook and Twitter that could facilitate the reproducibility of social media research without violating copyright legislation or the platforms’ terms of service.

The ephemeral nature of online content is not the only factor that renders the value of multimedia datasets very sensitive to the passing of time. Especially in the case of online disinformation, there seems to be an arms race, where new machine learning methods constantly get better at detecting misleading or tampered content, but at the same time new types of misinformation emerge, which are increasingly AI-assisted. This is particularly pronounced in the case of deepfakes, where the main research paradigm is based on the concept of competition between a generator (adversary) and a detector (Goodfellow et al., 2014).

Last but not least, one may always be concerned about the potential ethical issues arising when publicly releasing such datasets. In our case, reasonable concerns about privacy risks, which are always relevant when dealing with social media content, are addressed by complying with the relevant Terms of Service of the source platforms and by making sure that any annotation (label) assigned to the dataset videos is accurate. Additional ethical issues pertain to the potential “dual use” of the datasets, i.e. their use by adversaries to craft better tools and techniques to make misinformation campaigns more effective. A recent pertinent case was OpenAI’s delayed release of their very powerful GPT-2 model, which sparked numerous discussions and criticism, and made clear that there is no commonly accepted practice for ensuring reproducibility of research results (and empowering future research) while at the same time making sure that risks of misuse are eliminated.

Future work

Given the challenges of creating and releasing a large-scale dataset for multimedia verification, the main conclusions from our efforts towards this direction so far are the following:

  • The field of multimedia verification is in constant motion and therefore the concept of a static dataset may not be sufficient to capture the real-world nuances and latest challenges of the problem. Instead, new benchmarking models, e.g. in the form of open data challenges, and resources, e.g. a constantly updated repository of “fake” multimedia, appear to be more effective for empowering future research in the area.
  • The role of social media and multimedia sharing platforms (incl. YouTube, Facebook, Twitter, etc.) seems to be crucial in enabling effective collaboration between academia and industry towards addressing the real-world consequences of online misinformation. While there have been recent developments towards this direction, including the announcements by both Facebook and Alphabet’s Jigsaw of new deepfake datasets, there is also doubt and scepticism about the degree of openness and transparency that such platforms are ready to offer, given the conflicts of interest that are inherent in the underlying business model. 
  • Building a dataset that is fit for a highly diverse and representative set of verification cases appears to be a task that would require a community effort instead of effort from a single organisation or group. This would not only help towards distributing the massive dataset creation cost and effort to multiple stakeholders, but also towards ensuring less selection bias, richer and more accurate annotation and more solid governance.

References

Allcott, H., Gentzkow, M., “Social media and fake news in the 2016 election”, Journal of economic perspectives, 31(2), pp. 211–36, 2017.
Amerini, I., Ballan, L., Caldelli, R., Del Bimbo, A., Serra, G., “A SIFT-based forensic method for copy-move attack detection and transformation recovery”, IEEE Transactions on Information Forensics and Security, 6(3), pp. 1099–1110, 2011.
Boididou, C., Papadopoulos, S., Kompatsiaris, Y., Schifferes, S., Newman, N., “Challenges of computational verification in social multimedia”, In Proceedings of the 23rd ACM International Conference on World Wide Web, pp. 743–748, 2014.
Boididou, C., Andreadou, K., Papadopoulos, S., Dang-Nguyen, D.T., Boato, G., Riegler, M., Kompatsiaris, Y., “Verifying multimedia use at MediaEval 2015”. In Proceedings of MediaEval 2015, 2015.
Boididou C., Papadopoulos S., Dang-Nguyen D., Boato G., Riegler M., Middleton S.E., Petlund A., Kompatsiaris Y., “Verifying multimedia use at MediaEval 2016”. In Proceedings of MediaEval 2016, 2016.
Brandtzaeg, P.B., Lüders, M., Spangenberg, J., Rath-Wiggins, L., Følstad, A., “Emerging journalistic verification practices concerning social media”. Journalism Practice, 10(3), pp. 323–342, 2016.
Christlein V., Riess C., Jordan J., Riess C., Angelopoulou, E., “An evaluation of popular copy-move forgery detection approaches”. IEEE Transactions on Information Forensics & Security, 7(6), pp. 1841–1854, 2012.
Derczynski, L., Bontcheva, K., Liakata, M., Procter, R., Hoi, G.W.S., Zubiaga, A., “Semeval-2017 Task 8: Rumoureval: determining rumour veracity and support for rumours”, Proceedings of the 11th International Workshop on Semantic Evaluation, pp. 69–76, 2017.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Bengio, Y., “Generative adversarial nets”. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J., “MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation”, In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops, pp. 63–72, 2019.
Güera, D., Delp, E.J., “Deepfake video detection using recurrent neural networks”, In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, 2018.
Jiang, Y. G., Jiang, Y., Wang, J., “VCDB: A large-scale database for partial copy detection in videos”. In Proceedings of the European Conference on Computer Vision, pp. 357–371, 2014.
Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., Stein, B., Potthast, M., “Semeval-2019 Task 4: Hyperpartisan news detection”. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829–839, 2019.
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, I., “FIVR: Fine-grained incident video retrieval”. IEEE Transactions on Multimedia, 21(10), pp. 2638–2652, 2019.
Korus, P., Huang, J., “Multi-scale analysis strategies in PRNU-based tampering localization”, IEEE Transactions on Information Forensics & Security, 12(4), pp. 809–824, 2017.
Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Schudson, M., “The science of fake news”, Science, 359(6380), pp. 1094–1096, 2018.
Papadopoulou, O., Zampoglou, M., Papadopoulos, S., Kompatsiaris, I., “A corpus of debunked and verified user-generated videos”. Online Information Review, 43(1), pp. 72–88, 2019.
Rocha, A., Scheirer, W., Boult, T., Goldenstein, S., “Vision of the unseen: Current trends and challenges in digital image and video forensics”, ACM Computing Surveys, 43(4), art. 26, 2011.
Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M., “Faceforensics++: Learning to detect manipulated facial images”, In Proceedings of the IEEE International Conference on Computer Vision, 2019.
Rubin, V.L., Chen, Y., Conroy, N.J., “Deception detection for news: Three types of fakes”, In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, art. 83, 2015.
Tandoc Jr, E.C., Lim, Z.W., Ling, R., “Defining “fake news”: A typology of scholarly definitions”, Digital journalism, 6(2), pp. 137–153, 2018.
Tralic, D., Zupancic I., Grgic S., Grgic M., “CoMoFoD – New database for copy-move forgery detection”. In Proceedings of the 55th International Symposium on Electronics in Marine, pp. 49–54, 2013.
Wardle, C., Derakhshan, H., “Information disorder: Toward an interdisciplinary framework for research and policy making”, Council of Europe Report, 27, 2017.
Wu, X., Hauptmann, A.G., Ngo, C.-W., “Practical elimination of near-duplicates from web video search”, In Proceedings of the 15th ACM International Conference on Multimedia, pp. 218–227, 2007.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Detecting image splicing in the wild (web)”, In Proceedings of the 2015 IEEE International Conference on Multimedia & Expo Workshops, 2015.
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., “Large-scale evaluation of splicing localization algorithms for web images”, Multimedia Tools and Applications, 76(4), pp. 4801–4834, 2017.
Zhou, X., Zafarani, R., “Fake news: A survey of research, detection methods, and opportunities”. arXiv preprint arXiv:1812.00315, 2018.
Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., Procter, R., “Detection and resolution of rumours in social media: A survey”, ACM Computing Surveys, 51(2), art. 32, 2018.

Appendix A: Examples of videos in the Fake Video Corpus.

Real videos


US Airways Flight 1549 ditched in the Hudson River.


A group of musicians playing in an Istanbul park while bombs explode outside the stadium behind them.


A giant alligator crossing a Florida golf course.

Fake videos


“Syrian boy rescuing a girl amid gunfire” – Staged (fabricated content): The video was filmed by Norwegian Lars Klevberg in Malta.


“Golden Eagle Snatches Kid” – Tampered: The video was created by a team of students in Montreal as part of their course on visual effects.


“Pope Francis slaps Donald Trump’s hand for touching him” – Satire/parody: The video was digitally manipulated, and was made for the late-night television show Jimmy Kimmel Live.

Appendix B: Examples of videos in the Fine-grained Incident Video Retrieval dataset.

Example 1


Query video from the American Airlines Flight 383 fire at Chicago O’Hare International Airport on October 28, 2016.


Duplicate scene video.


Complementary scene video.


Incident scene video.

Example 2


Query video from the Boston Marathon bombing on April 15, 2013.


Duplicate scene video.


Complementary scene video.


Incident scene video.

Example 3


Query video from the Las Vegas shooting on October 1, 2017.


Duplicate scene video.


Complementary scene video.


Incident scene video.

JPEG Column: 84th JPEG Meeting in Brussels, Belgium

The 84th JPEG meeting was held in Brussels, Belgium.

This meeting was characterised by significant progress in most JPEG projects as well as in exploratory studies. JPEG XL, the new image coding system, has issued its Committee Draft, giving shape to this new and effective solution for the future of image coding. Part 1 (Framework) and Part 2 (Light field coding) of JPEG Pleno, the standard for new imaging technologies, have also reached Draft International Standard status.

Moreover, exploration studies are ongoing in the domain of media blockchain and on the application of learning solutions for image coding (JPEG AI). Both have triggered a number of activities that provide new knowledge and open new possibilities for the use of these technologies in future JPEG standards.

The 84th JPEG meeting had the following highlights:

  • JPEG XL issues the Committee Draft
  • JPEG Pleno Parts 1 and 2 reach Draft International Standard status
  • JPEG AI defines Common Test Conditions
  • JPEG exploration studies on Media Blockchain
  • JPEG Systems – JLINK working draft
  • JPEG XS

In the following, a short description of the most significant activities is presented.

 

JPEG XL

The JPEG XL Image Coding System (ISO/IEC 18181) has completed the Committee Draft of the standard. The new coding technique allows storage of high-quality images at one-third the size of the legacy JPEG format. Moreover, JPEG XL can losslessly transcode existing JPEG images to about 80% of their original size, which simplifies interoperability and accelerates wider deployment.

The JPEG XL reference software, ready for mobile and desktop deployments, will be available in Q4 2019. The current contributors have committed to releasing it publicly under a royalty-free and open source license.

 

JPEG Pleno

A significant milestone has been reached during this meeting: the Draft International Standards (DIS) for both JPEG Pleno Part 1 (Framework) and Part 2 (Light field coding) have been completed. A draft architecture of the Reference Software (Part 4) and development plans have also been discussed and defined.

In addition, JPEG has completed an in-depth analysis of existing point cloud coding solutions and a new version of the use-cases and requirements document has been released reflecting the future role of JPEG Pleno in point cloud compression. A new set of Common Test Conditions has been released as a guideline for the testing and evaluation of point cloud coding solutions with both a best practice subjective testing protocol and a set of objective metrics.

JPEG Pleno holography activities made significant advances in the definition of use cases and requirements and in the description of Common Test Conditions. New quality assessment methodologies for holographic data, defined in the framework of a collaboration between JPEG and Qualinet, were established. Moreover, JPEG Pleno continues collecting microscopic and tomographic holographic data.

 

JPEG AI

The JPEG Committee continues to carry out exploration studies with deep learning-based image compression solutions, typically with an auto-encoder architecture. The promise that these types of codecs hold, especially in terms of coding efficiency, will be evaluated with several studies. At this meeting, a Common Test Conditions document was produced, which includes a plan for subjective and objective quality assessment experiments as well as coding pipelines for anchor and learning-based codecs. Moreover, a JPEG AI dataset was proposed and discussed, and a double stimulus impairment scale experiment (side-by-side) was performed with a mix of experts and non-experts in a controlled environment.

 

JPEG exploration on Media Blockchain

Fake news, copyright violation, media forensics, privacy and security are emerging challenges in digital media. JPEG has determined that blockchain and distributed ledger technologies (DLT) have great potential as a technology component to address these challenges in transparent and trustable media transactions. However, blockchain and DLT need to be integrated closely with a widely adopted standard to ensure broad interoperability of protected images. JPEG calls for industry participation to help define use cases and requirements that will drive the standardization process. In order to clearly identify the impact of blockchain and distributed ledger technologies on JPEG standards, the committee has organised several workshops to interact with stakeholders in the domain.

The 4th public workshop on media blockchain was organized in Brussels on Tuesday the 16th of July 2019 during the 84th ISO/IEC JTC 1/SC 29/WG1 (JPEG) Meeting. The presentations and program of the workshop are available on jpeg.org.

The JPEG Committee has issued an updated version of the white paper entitled “Towards a Standardized Framework for Media Blockchain” that elaborates on the initiative, exploring relevant standardization activities, industrial needs and use cases.

To keep informed and to get involved in this activity, interested parties are invited to register to the ad hoc group’s mailing list.

 

JPEG Systems – JLINK

At the 84th meeting, IS text reviews for ISO/IEC 19566-5 JUMBF and ISO/IEC 19566-6 JPEG 360 were completed; IS publication will be forthcoming. Work began on adding functionality to JUMBF, Privacy & Security, and JPEG 360, and on initial planning towards developing a software implementation of these parts of the JPEG Systems specification. Work also began on the new ISO/IEC 19566-7 Linked media images (JLINK) with the development of a working draft.

 

JPEG XS

The JPEG Committee is pleased to announce new Core Experiments and Exploration Studies on compression of raw image sensor data. The JPEG XS project aims at the standardization of a visually lossless, low-latency and lightweight compression scheme that can be used as a mezzanine codec in various markets. Video transport over professional video links (SDI, IP, Ethernet), real-time video storage in and outside of cameras, memory buffers, machine vision systems, and data compression on board autonomous vehicles are among the targeted use cases for raw image sensor compression. This new work on raw sensor data will pave the way towards highly efficient close-to-sensor image compression workflows with JPEG XS.

 

Final Quote

“Completion of the Committee Draft of JPEG XL, the new standard for image coding is an important milestone. It is hoped that JPEG XL can become an excellent replacement of the widely used JPEG format which has been in service for more than 25 years.” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.

About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission, (ISO/IEC JTC 1/SC 29/WG 1) and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JPEG, JPEG 2000, JPEG XR, JPSearch, JPEG XT and more recently, the JPEG XS, JPEG Systems, JPEG Pleno and JPEG XL families of imaging standards.

More information about JPEG and its work is available at www.jpeg.org.

Future JPEG meetings are planned as follows:

  • No 85, San Jose, California, U.S.A., November 2 to 8, 2019
  • No 86, Sydney, Australia, January 18 to 24, 2020

MPEG Column: 127th MPEG Meeting in Gothenburg, Sweden

The original blog post can be found at the Bitmovin Techblog and has been modified/updated here to focus on and highlight research aspects.

Plenary of the 127th MPEG Meeting in Gothenburg, Sweden.

The 127th MPEG meeting concluded on July 12, 2019 in Gothenburg, Sweden with the following topics:

  • Versatile Video Coding (VVC) enters formal approval stage, experts predict 35-60% improvement over HEVC
  • Essential Video Coding (EVC) promoted to Committee Draft
  • Common Media Application Format (CMAF) 2nd edition promoted to Final Draft International Standard
  • Dynamic Adaptive Streaming over HTTP (DASH) 4th edition promoted to Final Draft International Standard
  • Carriage of Point Cloud Data Progresses to Committee Draft
  • JPEG XS carriage in MPEG-2 TS promoted to Final Draft Amendment of ISO/IEC 13818-1 7th edition
  • Genomic information representation – WG11 issues a joint call for proposals on genomic annotations in conjunction with ISO TC 276/WG 5
  • ISO/IEC 23005 (MPEG-V) 4th Edition – WG11 promotes the Fourth edition of two parts of “Media Context and Control” to the Final Draft International Standard (FDIS) stage

The corresponding press release of the 127th MPEG meeting can be found here: https://mpeg.chiariglione.org/meetings/127

Versatile Video Coding (VVC)

The Moving Picture Experts Group (MPEG) is pleased to announce that Versatile Video Coding (VVC) has progressed to Committee Draft; experts predict a 35–60% improvement over HEVC.

The development of the next major generation of video coding standard has achieved excellent progress, such that MPEG has approved the Committee Draft (CD, i.e., the text for formal balloting in the ISO/IEC approval process).

The new VVC standard will be applicable to a very broad range of applications and will also provide additional functionalities. Its main benefit is a substantial improvement in coding efficiency relative to existing standards: a bit rate reduction in the range of 35–60% relative to HEVC is expected, although it has not yet been formally measured. Here, “relative to HEVC” means at equivalent subjective video quality, at picture resolutions such as 1080p HD, 4K or 8K UHD, for either standard dynamic range video or high dynamic range and wide color gamut content, and at levels of quality appropriate for use in consumer distribution services. The focus during the development of the standard has primarily been on 10-bit 4:2:0 content; the 4:4:4 chroma format will also be supported.
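To get a feeling for what a 35–60% bit rate reduction would mean in practice, the following back-of-the-envelope sketch converts an HEVC bit rate into the corresponding VVC range at equivalent quality. The function name and the 10 Mbit/s example are purely illustrative, and the percentages are the predicted, not yet formally measured, figures quoted above.

```python
def vvc_bitrate_range(hevc_kbps, low_saving_pct=35, high_saving_pct=60):
    """Return (best_case, worst_case) VVC bit rates in kbit/s for a given HEVC bit rate."""
    best_case = hevc_kbps * (100 - high_saving_pct) / 100
    worst_case = hevc_kbps * (100 - low_saving_pct) / 100
    return best_case, worst_case


# Illustrative example: an HEVC stream encoded at 10 Mbit/s.
best, worst = vvc_bitrate_range(10_000)
print(f"Expected VVC bit rate: {best:.0f} to {worst:.0f} kbit/s")
```

Under these assumptions, a 10 Mbit/s HEVC stream would need roughly 4 to 6.5 Mbit/s with VVC.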

The VVC standard is being developed in the Joint Video Experts Team (JVET), a group established jointly by MPEG and the Video Coding Experts Group (VCEG) of ITU-T Study Group 16. In addition to a text specification, the project also includes the development of reference software, a conformance testing suite, and a new standard ISO/IEC 23002-7 specifying supplemental enhancement information messages for coded video bitstreams. The approval process for ISO/IEC 23002-7 has also begun, with the issuance of a CD consideration ballot.

Research aspects: VVC represents the next generation video codec to be deployed in 2020+ and basically the same research aspects apply as for previous generations, i.e., coding efficiency, performance/complexity, and objective/subjective evaluation. Luckily, JVET documents are freely available, including the actual standard (committee draft), software (and its description), and common test conditions. Thus, researchers utilizing these resources are able to conduct reproducible research when contributing their findings and code improvements back to the community at large.

Essential Video Coding (EVC)

MPEG-5 Essential Video Coding (EVC) promoted to Committee Draft

Interestingly, at the same meeting as VVC, MPEG promoted MPEG-5 Essential Video Coding (EVC) to Committee Draft (CD). The goal of MPEG-5 EVC is to provide a standardized video coding solution to address business needs in some use cases, such as video streaming, where existing ISO video coding standards have not been as widely adopted as might be expected from their purely technical characteristics.

The MPEG-5 EVC standard includes a baseline profile that contains only technologies that are over 20 years old or are otherwise expected to be royalty-free. Additionally, a main profile adds a small number of additional tools, each providing significant performance gain. All main profile tools are capable of being individually switched off or individually switched over to a corresponding baseline tool. Organizations making proposals for the main profile have agreed to publish applicable licensing terms within two years of FDIS stage, either individually or as part of a patent pool.

Research aspects: Similar research aspects can be described for EVC, and from a software engineering perspective it could also be interesting to further investigate the switching mechanism of individual tools and/or the fallback option to baseline tools. Naturally, a comparison with next generation codecs such as VVC is interesting per se. The licensing aspects themselves are probably interesting for other disciplines, but that is another story…

Common Media Application Format (CMAF)

MPEG ratified the 2nd edition of the Common Media Application Format (CMAF)

The Common Media Application Format (CMAF) enables efficient encoding, storage, and delivery of digital media content (incl. audio, video, subtitles among others), which is key to scaling operations to support the rapid growth of video streaming over the internet. The CMAF standard is the result of widespread industry adoption of an application of MPEG technologies for adaptive video streaming over the Internet, and widespread industry participation in the MPEG process to standardize best practices within CMAF.

The 2nd edition of CMAF adds support for a number of specifications that were a result of significant industry interest. Those include

  • Advanced Audio Coding (AAC) multi-channel;
  • MPEG-H 3D Audio;
  • MPEG-D Unified Speech and Audio Coding (USAC);
  • Scalable High Efficiency Video Coding (SHVC);
  • IMSC 1.1 (Timed Text Markup Language Profiles for Internet Media Subtitles and Captions); and
  • additional HEVC video CMAF profiles and brands.

This edition also introduces CMAF supplemental data handling as well as new structural brands for CMAF that reflect the common practice of the significant deployment of CMAF in industry. Companies adopting CMAF technology will find the specifications introduced in the 2nd Edition particularly useful for further adoption and proliferation of CMAF in the market.

Research aspects: see below (DASH).

Dynamic Adaptive Streaming over HTTP (DASH)

MPEG approves the 4th edition of Dynamic Adaptive Streaming over HTTP (DASH)

The 4th edition of MPEG-DASH comprises the following features:

  • a service description through which the service provider indicates how the service is expected to be consumed;
  • a method to indicate the times corresponding to the production of associated media;
  • a mechanism to signal DASH profiles and features, employed codec and format profiles; and
  • supported protection schemes present in the Media Presentation Description (MPD).

It is expected that this edition will be published later this year. 

Research aspects: The 2nd edition of CMAF and the 4th edition of DASH come along with a rich feature set enabling a plethora of use cases. The underlying principles are still the same, and research issues arise from updated application and service requirements with respect to content complexity, time aspects (mainly delay/latency), and quality of experience (QoE). The DASH-IF awards the excellence in DASH award at the ACM Multimedia Systems conference and an overview of its academic efforts can be found here.

Carriage of Point Cloud Data

MPEG progresses the Carriage of Point Cloud Data to Committee Draft

At its 127th meeting, MPEG promoted the carriage of point cloud data to the Committee Draft stage, the first milestone of the ISO standard development process. This standard is the first to introduce support for volumetric media in the industry-famous ISO base media file format family of standards.

This standard supports the carriage of point cloud data comprising individually encoded video bitstreams within multiple file format tracks in order to support the intrinsic nature of the video-based point cloud compression (V-PCC). Additionally, it also allows the carriage of point cloud data in one file format track for applications requiring multiplexed content (i.e., the video bitstream of multiple components is interleaved into one bitstream).

This standard is expected to support efficient access and delivery of some portions of a point cloud object considering that in many cases that entire point cloud object may not be visible by the user depending on the viewing direction or location of the point cloud object relative to other objects. It is currently expected that the standard will reach its final milestone by the end of 2020.

Research aspects: MPEG’s Point Cloud Compression (PCC) comes in two flavors, video-based and geometry-based, but the compressed data still needs to be packaged into file and delivery formats. MPEG’s choice here is the ISO base media file format, and the efficient carriage of point cloud data is characterized by both functionality (i.e., enabling the required use cases) and performance (such as low overhead).

MPEG-2 Systems/Transport Stream

JPEG XS carriage in MPEG-2 TS promoted to Final Draft Amendment of ISO/IEC 13818-1 7th edition

At its 127th meeting, WG11 (MPEG) has extended ISO/IEC 13818-1 (MPEG-2 Systems) – in collaboration with WG1 (JPEG) – to support ISO/IEC 21122 (JPEG XS) in order to support industries using still image compression technologies for broadcasting infrastructures. The specification defines a JPEG XS elementary stream header and specifies how the JPEG XS video access unit (specified in ISO/IEC 21122-1) is put into a Packetized Elementary Stream (PES). Additionally, the specification also defines how the System Target Decoder (STD) model can be extended to support JPEG XS video elementary streams.
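
To illustrate what putting an access unit into a PES packet involves, here is a minimal sketch of generic MPEG-2 PES packetization with a presentation timestamp (PTS). It only shows the common PES header layout from ISO/IEC 13818-1; the JPEG XS elementary stream header defined by the amendment is not modelled, the payload is treated as opaque bytes, and the function names as well as the stream_id passed by the caller are assumptions made for illustration.

```python
import struct


def encode_pts(pts):
    """Encode a 33-bit PTS (90 kHz units) into the 5-byte PES field."""
    pts &= (1 << 33) - 1
    return bytes([
        0x20 | ((pts >> 30) & 0x07) << 1 | 0x01,  # '0010', PTS[32..30], marker bit
        (pts >> 22) & 0xFF,                       # PTS[29..22]
        ((pts >> 15) & 0x7F) << 1 | 0x01,         # PTS[21..15], marker bit
        (pts >> 7) & 0xFF,                        # PTS[14..7]
        (pts & 0x7F) << 1 | 0x01,                 # PTS[6..0], marker bit
    ])


def build_pes_packet(stream_id, pts, payload):
    """Wrap an opaque access unit in a PES packet that carries only a PTS.

    stream_id is supplied by the caller; the value required for JPEG XS
    is defined in the amendment and is not asserted here.
    """
    header_data = encode_pts(pts)
    # Flags: '10' marker bits, then PTS_DTS_flags = '10' (PTS only),
    # followed by PES_header_data_length.
    optional = bytes([0x80, 0x80, len(header_data)]) + header_data
    length = len(optional) + len(payload)
    if length > 0xFFFF:
        # 0 means "unbounded"; permitted only for video in a transport stream.
        length = 0
    return b"\x00\x00\x01" + struct.pack(">BH", stream_id, length) + optional + payload


# Placeholder stream_id (0xBD, private_stream_1) and a dummy payload.
packet = build_pes_packet(0xBD, 90000, b"\x00" * 16)
print(len(packet))
```

A real multiplexer would additionally split the PES packet into 188-byte transport stream packets and signal the stream in the Program Map Table, which is outside the scope of this sketch.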

Genomic information representation

WG11 issues a joint call for proposals on genomic annotations in conjunction with ISO TC 276/WG 5

The introduction of high-throughput DNA sequencing has led to the generation of large quantities of genomic sequencing data that have to be stored, transferred and analyzed. So far WG 11 (MPEG) and ISO TC 276/WG 5 have addressed the representation, compression and transport of genome sequencing data by developing the ISO/IEC 23092 standard series also known as MPEG-G. They provide a file and transport format, compression technology, metadata specifications, protection support, and standard APIs for the access of sequencing data in the native compressed format.

An important element in the effective usage of sequencing data is the association of the data with the results of the analysis and annotations that are generated by processing pipelines and analysts. At the moment, such association happens as a separate step; standard and effective ways of linking data and meta information derived from sequencing data are not available.

At its 127th meeting, MPEG and ISO TC 276/WG 5 issued a joint Call for Proposals (CfP) addressing this problem. The call seeks submissions of technologies that can provide efficient representation and compression solutions for the processing of genomic annotation data.

Companies and organizations are invited to submit proposals in response to this call. Responses are expected to be submitted by the 8th January 2020 and will be evaluated during the 129th WG 11 (MPEG) meeting. Detailed information, including how to respond to the call for proposals, the requirements that have to be considered, and the test data to be used, is reported in the documents N18648, N18647, and N18649 available at the 127th meeting website (http://mpeg.chiariglione.org/meetings/127). For any further question about the call, test conditions, required software or test sequences please contact: Joern Ostermann, MPEG Requirements Group Chair (ostermann@tnt.uni-hannover.de) or Martin Golebiewski, Convenor ISO TC 276/WG 5 (martin.golebiewski@h-its.org).

ISO/IEC 23005 (MPEG-V) 4th Edition

WG11 promotes the Fourth edition of two parts of “Media Context and Control” to the Final Draft International Standard (FDIS) stage

At its 127th meeting, WG11 (MPEG) promoted the 4th edition of two parts of ISO/IEC 23005 (MPEG-V; Media Context and Control) standards to the Final Draft International Standard (FDIS). The new edition of ISO/IEC 23005-1 (architecture) enables ten new use cases, which can be grouped into four categories: 3D printing, olfactory information in virtual worlds, virtual panoramic vision in car, and adaptive sound handling. The new edition of ISO/IEC 23005-7 (conformance and reference software) is updated to reflect the changes made by the introduction of new tools defined in other parts of ISO/IEC 23005. More information on MPEG-V and its parts 1-7 can be found at https://mpeg.chiariglione.org/standards/mpeg-v.


Finally, the unofficial highlight of the 127th MPEG meeting we certainly found while scanning the scene in Gothenburg on Tuesday night…

MPEG127_Metallica

Qualinet Databases: Central Resource for QoE Research – History, Current Status, and Plans

Introduction

Datasets are an enabling tool for successful technological development and innovation in numerous fields. Large-scale databases of multimedia content play a crucial role in the development and performance evaluation of multimedia technologies, most importantly in audiovisual signal processing, for example coding, transmission, subjective/objective quality assessment, and QoE (Quality of Experience) [1]. Publicly available and widely accepted datasets are necessary for a fair comparison and validation of systems under test; they are crucial for reproducible research. In the public domain, large amounts of relevant multimedia content are available, for example, ACM SIGMM Records Dataset Column (http://sigmm.hosting.acm.org/category/datasets-column/), MediaEval Benchmark (http://www.multimediaeval.org/), MMSys Datasets (http://www.sigmm.org/archive/MMsys/mmsys14/index.php/mmsys-datasets.html), etc. However, the description of these datasets is usually scattered – for example in technical reports, research papers, online resources – and it is cumbersome to find the most appropriate dataset for a particular need.

The Qualinet Multimedia Databases Online platform is one of many efforts to provide an overview and comparison of multimedia content datasets – especially for QoE-related research – all in one place. The platform was introduced in the frame of ICT COST Action IC1003 European Network on Quality of Experience in Multimedia Systems and Services – Qualinet (http://www.qualinet.eu). The platform, abbreviated “Qualinet Databases” (http://dbq.multimediatech.cz/), is used to share information on databases with the community [3], [4]. Qualinet was supported as a COST Action between November 8, 2010, and November 7, 2014. It has continued as an independent entity with a new structure, activities, and management since 2015. The Qualinet Databases platform fulfills the initial goal to provide a rich and internationally recognized database and has been running since 2010. It is widely considered one of Qualinet’s most notable achievements.

In the following paragraphs, there is a summary on Qualinet Databases, including its history, current status, and plans.

Background

A commonly recognized database for multimedia content is a crucial resource required not only for QoE-related research. Among the first published efforts in this field are the image and video quality resources website by Stefan Winkler (https://stefan.winklerbros.net/resources.html) and related publications providing in-depth analysis of multimedia content databases [2]. Since 2010, one of the main interests of Qualinet and its Working Group 4 (WG4) entitled Databases and Validation (Leader: Christian Timmerer, Deputy Leaders: Karel Fliegel, Shelley Buchinger, Marcus Barkowsky) was to create an even broader database with extended functionality and take the necessary steps to make it accessible to all researchers.

Qualinet first decided to list and summarize available multimedia databases based on a literature search and feedback from the project members. As the number of databases in the list was rapidly increasing, the handling of the necessary updates became inefficient. Based on these findings, WG4 started the implementation of the Qualinet Databases online platform in 2011. Since then, the website has been used as Qualinet’s central resource for sharing the datasets among Qualinet members and the scientific community. To the best of our knowledge, there is no other publicly available resource for QoE research that offers similar functionality. The Qualinet Databases platform is intended to provide more features than other known similar solutions such as the Consumer Digital Video Library (http://www.cdvl.org). The main difference lies in the fact that Qualinet Databases acts as a hub to various scattered resources of multimedia content, especially with the available data, such as MOS (Mean Opinion Score), raw data from subjective experiments, eye-tracking data, and detailed descriptions of the datasets including scientific references.

In the development of Qualinet DBs within the frame of COST Action IC1003, there are several milestones, which are listed in the timeline below:

  • March 2011 (1st Qualinet General Assembly (GA), Lisbon, Portugal), an initial list of multimedia databases collected and published internally for Qualinet members, creation of Web-based portal proposed,
  • September 2011 (2nd Qualinet GA, Brussels, Belgium), Qualinet DBs prototype portal introduced, development of publicly available resource initiated,
  • February 2012 (3rd Qualinet GA, Prague, Czech Republic), hosting of the Qualinet DBs platform under development at the Czech Technical University in Prague (http://dbq.multimediatech.cz/), Qualinet DBs Wiki page (http://dbq-wiki.multimediatech.cz/) introduced,
  • October 2012 (4th Qualinet GA, Zagreb, Croatia), White paper on Qualinet DBs published [3], Qualinet DBs v1.0 online platform released to the public,
  • March 2013 (5th Qualinet GA, Novi Sad, Serbia), Qualinet DBs v1.5 online platform published with extended functionality,
  • September 2013 (6th Qualinet GA, Novi Sad, Serbia), Qualinet DBs Information leaflet published, Task Force (TF) on Standardization and Dissemination established, QoMEX 2013 Dataset Track organized,
  • March 2014 (7th Qualinet GA, Berlin, Germany), ACM MMSys 2014 Dataset Track organized, liaison with Ecma International (https://www.ecma-international.org/) on possible standardization of Qualinet DBs subset established,
  • October 2014 (8th Final Qualinet GA and Workshop, Delft, The Netherlands), final development stage v3.00 of Qualinet DBs platform reached, code freeze.

Qualinet Databases became Qualinet’s primary resource for sharing datasets with Qualinet members and, after registration, also with the broad scientific community. At the final Qualinet General Assembly under the COST Action IC1003 umbrella (October 2014, Delft, The Netherlands) it was concluded – also based on numerous testimonials – that Qualinet DBs is one of the major assets created throughout the project. Thus it was decided that the sustainability of this resource must be ensured for the years to come. Since 2015, the Qualinet DBs platform has been kept running through the effort of a newly established Task Force, TF4 Qualinet Databases (Leader: Karel Fliegel, Deputy Leaders: Lukáš Krasula, Werner Robitza). The status and achievements are discussed regularly at Qualinet’s Annual Meetings collocated with QoMEX (International Conference on Quality of Multimedia Experience), i.e., 7th QoMEX 2015 (Costa Navarino, Greece), 8th QoMEX 2016 (Lisbon, Portugal), 9th QoMEX 2017 (Erfurt, Germany), 10th QoMEX 2018 (Sardinia, Italy), and 11th QoMEX 2019 (Berlin, Germany).

Current Status

The basic functionality of the Qualinet Databases online platform, see Figure 1, is based on the idea that registered users (Qualinet members and other interested users from the scientific community) have access through an easy-to-use Web portal providing a list of multimedia databases. Based on their user rights, they are allowed to browse information about the particular database and eventually download the actual multimedia content from the link provided by the database owner.

Figure 1. Qualinet Databases online platform and its current interface.

Selected users – Database Owners in particular – have rights to upload or edit their records in the list of databases. Most of the multimedia databases have a flag of “Publicly Available” and are accessible to the registered users outside Qualinet. Only Administrators (Task Force leader and deputy leaders) have the right to delete records in the database. Qualinet DBs does not contain the actual multimedia content but only the access information with provided links to the dataset files saved at the server of the Database Owner.

The Qualinet DBs is accessible to all registered users after entering valid login data. Depending on the level of the rights assigned to the particular account, the user can browse the list of the databases with description (all registered users) and has access to the actual multimedia content via a link entered by the Database Owner. It provides the user with a powerful tool to find the multimedia database that best suits his/her needs.

In the User Settings, the user can select which fields are visible in the list of databases, namely:

  • Database name, Institution, Qualinet Partner (Yes/No),
  • Link, Description (abstract), Access limitations, Publicly available (Yes/No), Copyright Agreement signed (Yes/No),
  • Citation, References, Copyright notice, Database usage tracking,
  • Content type, MOS (Yes/No), Other (Eye tracking, Sensory, …),
  • Total number of contents, SRC, HRC,
  • Subjective evaluation method (DSCQS, …), Number of ratings.

Full-text search within the selected visible fields is available. In the current version of the Qualinet DBs, users can sort databases alphabetically based on the visible fields or use the search field as described above.
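
The search and sort behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: records are modelled as plain dictionaries, and the function name and field names are hypothetical rather than taken from the platform's code.

```python
def search_and_sort(records, visible_fields, query="", sort_by=None):
    """Filter records by a case-insensitive full-text match restricted to
    the visible fields, then optionally sort alphabetically by one of them."""
    needle = query.lower()
    hits = [
        rec for rec in records
        if any(needle in str(rec.get(field, "")).lower() for field in visible_fields)
    ]
    if sort_by is not None:
        if sort_by not in visible_fields:
            raise ValueError(f"{sort_by!r} is not a visible field")
        hits.sort(key=lambda rec: str(rec.get(sort_by, "")).lower())
    return hits


# Hypothetical sample records for illustration only.
records = [
    {"name": "Beta Video Set", "institution": "University A", "content_type": "video"},
    {"name": "alpha images", "institution": "University B", "content_type": "image"},
]
for rec in search_and_sort(records, ["name", "institution"], "university", sort_by="name"):
    print(rec["name"])
```

Because the match is restricted to the visible fields, hiding a field also removes it from the search scope, which mirrors the behaviour described in the text.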

The list of databases allows:

  • Opening a card with details on particular database record (accessible to all users),
  • Editing database record (accessible to the database owners and administrators),
  • Deleting database record (accessible only to administrators),
  • Requesting deletion of a database record (accessible to the database owners),
  • Requesting assignment as the database owner (accessible to all users).
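
The access rules in the list above can be summarized as a small role-to-action mapping. The following is a minimal sketch only; the role and action names are illustrative assumptions derived from the list, not identifiers from the Qualinet DBs implementation.

```python
from enum import Enum


class Role(Enum):
    USER = "registered user"
    OWNER = "database owner"
    ADMIN = "administrator"


# Hypothetical action names, one per operation in the list above.
PERMISSIONS = {
    "open_record": {Role.USER, Role.OWNER, Role.ADMIN},
    "edit_record": {Role.OWNER, Role.ADMIN},
    "delete_record": {Role.ADMIN},
    "request_deletion": {Role.OWNER},
    "request_ownership": {Role.USER, Role.OWNER, Role.ADMIN},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return role in PERMISSIONS.get(action, set())


print(is_allowed(Role.OWNER, "edit_record"))
print(is_allowed(Role.OWNER, "delete_record"))
```

Keeping deletion with administrators while letting owners only request it matches the split between the third and fourth list items.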

As for the records available in Qualinet DBs, the listed multimedia databases are a crucial resource for various tasks in multimedia signal processing. The Qualinet DBs is focused primarily on QoE research [1] related content, where, while designing objective quality assessment algorithms, it is necessary to perform (1) Verification of model during development, (2) Validation of model after development, and (3) Benchmarking of various models.

Annotated multimedia databases contain essential ground truth, that is, test material from the subjective experiment annotated with subjective ratings. Qualinet DBs also lists other material without subjective ratings for other kinds of experiments. Qualinet DBs covers mostly image and video datasets, including special contents (e.g., 3D, HDR) and data from subjective experiments, such as subjective quality ratings or visual attention data.

A timeline with statistics on the number of records and users registered in Qualinet DBs throughout the years can be seen in Figure 2. Throughout Qualinet COST Action IC1003 the number of registered datasets grew from 64 in March 2011 to 201 in October 2014. The number of datasets created by the Qualinet partner institutions grew from 30 in September 2011 to 83 in October 2014. The number of registered users increased from 37 in March 2013 to 222 in October 2014. After the end of COST Action IC1003 in November 2014 the number of datasets increased to 246 and the number of registered users to 491. The average yearly increase of registered users is approximately 56 users, which illustrates continuous interest and value of Qualinet DBs for the community.
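
The growth figures quoted above can be cross-checked with a few lines of arithmetic. The reference date for the latest counts (246 datasets, 491 users) is not stated in the text, so this sketch does not assume one; instead it derives the period implied by the stated average of roughly 56 new users per year.

```python
# Figures quoted in the paragraph above.
users_oct_2014 = 222
users_latest = 491
stated_yearly_increase = 56  # approximate value given in the text

datasets_mar_2011 = 64
datasets_oct_2014 = 201
datasets_latest = 246

new_users = users_latest - users_oct_2014
implied_years = new_users / stated_yearly_increase

print(f"New users since October 2014: {new_users}")
print(f"Period implied by about 56 users per year: {implied_years:.1f} years")
print(f"Datasets added during the COST Action: {datasets_oct_2014 - datasets_mar_2011}")
print(f"Datasets added afterwards: {datasets_latest - datasets_oct_2014}")
```

The implied period of a little under five years is consistent with counts taken a few years after the COST Action ended in November 2014.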

Figure 2. Qualinet Databases statistics on the number of records and users.

Besides the Qualinet DBs online platform (http://dbq.multimediatech.cz/), there are also additional resources available for download via the Wiki page (http://dbq-wiki.multimediatech.cz) and Qualinet website (http://www.qualinet.eu/). Two documents are available: (1) “QUALINET Multimedia Databases v6.5” (May 28, 2017) with a detailed description of registered datasets, and (2) “List of QUALINET Multimedia Databases v6.5” in a searchable spreadsheet with records as of May 28, 2017.

Plans

There are indicators – especially the number of registered users – showing that Qualinet DBs is a valuable resource for the community. However, the current platform as described above has not been updated since 2014, and there are several issues to be solved, such as the burden on one institution to host and maintain the system, possible instability and an obsolete interface, issues with the Wiki page, and the lack of a file repository. Moreover, in the current system, user registration is required. Registration is very useful for usage tracking and for ensuring database privacy, but at the same time it can put some people off from using and adding new datasets, and it requires handling of personal data. There are also numerous obsolete links in Qualinet DBs; these are useful for the record, but the respective databases should be archived.

A proposal for a new platform for Qualinet DBs has been presented at the 13th Qualinet General Meeting in June 2019 (Berlin, Germany) and was subsequently supported by the assembly. The new platform is planned to be based on a Git repository so that the system will be open-source and text-based, and no database will be needed. The user-friendly interface is to be provided by a static website generator; the website itself will be hosted on GitHub. A similar approach has been successfully implemented for the VQEG Software & Tools (https://vqeg.github.io/software-tools/) web portal. Among the main advantages of the new platform are (1) easier access (i.e., fast performance with simple interface, no hosting fees and thus long term sustainability, no registration necessary and thus no entry barrier), (2) lower maintenance burden (i.e., minimal technical maintenance effort needed, easy code editing), and (3) future-proofness (i.e., databases are just text files with easy format conversion, and hosting can be done on any server).
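
To make the text-based approach more concrete, here is a minimal sketch of how dataset records kept as plain text files could be turned into a static index page. The record format of the new platform is not specified in this article, so the JSON layout, the field names (name, institution, link), and the function names are all assumptions for illustration; an actual deployment would rely on an existing static website generator instead.

```python
import json
from pathlib import Path


def load_records(directory):
    """Read every *.json file in the directory as one dataset record."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.json"))
    ]


def render_index(records):
    """Render the records as a Markdown table, sorted by dataset name."""
    lines = ["| Name | Institution | Link |", "| --- | --- | --- |"]
    for rec in sorted(records, key=lambda r: r["name"].lower()):
        lines.append(f"| {rec['name']} | {rec['institution']} | {rec['link']} |")
    return "\n".join(lines) + "\n"


# Hypothetical record, shown inline instead of being read from a file.
example = [{"name": "Example Set", "institution": "Example University",
            "link": "https://example.org/dataset"}]
print(render_index(example))
```

Since every record is an ordinary text file under version control, adding or correcting a dataset becomes a normal Git change that can be reviewed before it is published.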

The new platform will not support user registration and login, which also helps to prevent data privacy issues. As a consequence, tracking of registered users will no longer be available, but database usage tracking is planned to be provided via, for example, Google Analytics. There are three levels of dataset availability in the current platform: (1) Publicly available dataset, (2) Information about dataset but data not available/available upon request, and (3) Not publicly available (e.g., Qualinet members only; not supported in the new platform). The migration of Qualinet DBs to the new platform is to be completed by mid-2020. Current data are to be checked and sanitized, and obsolete records moved to the archive.

Conclusions

Broad audiovisual content with diverse characteristics, annotated with data from subjective experiments, is an enabling resource for research in multimedia signal processing, especially when QoE is considered. The availability of training and testing data becomes even more important nowadays, with ever-increasing utilization of machine learning approaches. Qualinet Databases helps to facilitate reproducible research in the field and has become a valuable resource for the community.

References

  • [1] Le Callet, P., Möller, S., Perkis, A. Qualinet White Paper on Definitions of Quality of Experience, European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Lausanne, Switzerland, Version 1.2, March 2013. (http://www.qualinet.eu/images/stories/QoE_whitepaper_v1.2.pdf)
  • [2] Winkler, S. Analysis of public image and video databases for quality assessment, IEEE Journal of Selected Topics in Signal Processing, 6(6):616-625, 2012. (https://doi.org/10.1109/JSTSP.2012.2215007)
  • [3] Fliegel, K., Timmerer, C. (eds.) WG4 Databases White Paper v1.5: QUALINET Multimedia Database enabling QoE Evaluations and Benchmarking, Prague/Klagenfurt, Czech Republic/Austria, Version 1.5, March 2013.
  • [4] Fliegel, K., Battisti, F., Carli, M., Gelautz, M., Krasula, L., Le Callet, P., Zlokolica, V. 3D Visual Content Datasets. In: Assunção P., Gotchev A. (eds) 3D Visual Content Creation, Coding and Delivery. Signals and Communication Technology, Springer, Cham, 2019. (https://doi.org/10.1007/978-3-319-77842-6_11)

Note: Readers interested in actively contributing to the continued success of Qualinet Databases are referred to Qualinet (http://www.qualinet.eu/) and invited to join its Task Force on Qualinet Databases via the email reflector. To subscribe, please send an email to dbq.wg4.qualinet-subscribe@listes.epfl.ch. This work was partially supported by the project No. GA17-05840S “Multicriteria optimization of shift-variant imaging system models” of the Czech Science Foundation.

Report from MMSYS 2019 – by Alia Sheikh

Alia Sheikh (@alteralias) is researching immersive and interactive content. At present she is interested in the narrative language of immersive environments and how stories can best be choreographed within them.

Being part of an international academic research community and actually meeting said international research community are not exactly the same thing, it turns out. After attending the 2019 ACM MMSys conference, I have decided that leaving the office and actually meeting the people behind the research is very much worth doing.

This year I was invited to give an overview presentation at ACM MMSys ’19, which was being hosted at the University of Massachusetts. The MMSys, NOSSDAV and MMVE (International Workshop on Immersive Mixed and Virtual Environment Systems) conferences happen back to back, in a different location each year. I was asked to talk about some of our team’s experiments in immersive storytelling at MMVE. This included our current work on lightfields and my work on directing attention in, and the cinematography of, immersive environments.

To be honest it wasn’t the most convenient time to decide to catch a plane to New York and then a train to Boston for a multi-day conference, but it felt like the right time to take a break from the office and find out what the rest of the community had been working on.

Fig. 1: A picturesque scene from the wonderful University of Massachusetts Amherst campus

I arrived at Amherst the day before the conference and (along with another delegate who had taken the same bus) wandered the tranquil university grounds slightly lost before being rescued by the ever calm and cheerful Michael Zink. Michael is the chair of the MMSys organising committee and someone who later spent much of the conference introducing people with shared interests to each other – he appeared to know every delegate by name.

Once installed in my UMass hotel room, I proceeded to spend the evening on my usual pre-conference ritual: entirely rewriting my presentation.

As the timetable would have it, I was going to be the first speaker.

Fig. 2: Attendees at MMSys 2019 taking their seats

Fig. 3: Alia in full flow during our talk on day 1

I don’t actually know why I do this to myself, but there is something about turning up to the event proper that gives you a sense of what will work for that particular audience, and Michael had given me a brilliantly concise snapshot of the type of delegate that MMSys attracts – highly motivated, expert on the nuts and bolts of how to get data to where it needs to be and likely to be interested in a big picture overview of how these systems can be used to create a meaningful human connection.

Using selected examples from our research, I put together a talk on how the experience of stories in high tech immersive environments differs from more traditional formats, but, once the language of immersive cinematography is properly understood, we find that we are able to create new narrative experiences that are both meaningful and emotionally rich.

The next morning I walked into an auditorium full of strangers filing in, gave my talk (I thought it went well?) and then sank happily into a plush red flip-seat chair safe in the knowledge that I was free to enjoy the rest of the event.

The next item was the keynote and easily one of the best talks I have ever experienced at a conference. Presented by Professor Nimesha Ranasinghe, it was a masterclass in taking an interesting problem (how do we transmit a full sensory experience over a network?) and presenting it in such a way as to neatly break down and explain the science (we can electrically stimulate the tongue to recreate a taste!) while never losing sight of the inherent joy in working on the kind of science you dream of as a child (therefore electrified cutlery!).

Fig. 4: Professor Nimesha Ranasinghe during his talk on Multisensory experiences

Fig. 5: Multisensory enhanced multimedia – experiences of the future?

Fig. 6: Networking and some delicious lunch

At lunch I discovered the benefit of having presented my talk early – I made a lot of friends with people who had specific questions about our work, and got a useful heads up on work they were presenting either in the afternoon’s long papers session or the poster session.

We all spent the evening at the welcome reception on the top floor of UMass Hotel, where we ate a huge variety of tiny, delicious cakes and got to know each other better. It was obvious that, in some cases, researchers who might collaborate remotely all year were able to use MMSys as an excellent opportunity to catch up. As a newcomer to this ACM conference, however, I have to say that I found it a very welcoming event, and I met a lot of very friendly people, many of them working on research that was entirely different to my own but which seemed to offer an interesting insight or area of overlap.

I wasn’t surprised that I really enjoyed MMVE – virtual environments are very much my topic of interest right now. But I was delighted by how much of MMSys was entirely up my street. ACM MMSys provides a forum for researchers to present and share their latest research findings in multimedia systems, and the conference cuts across all media/data types to showcase the intersections and the interplay of approaches and solutions developed for different domains. This year, the work presented on how to best encode and transport mixed reality content, as well as predict head motion to better encode and deliver the part of a spherical panorama a viewer was likely to be looking at, was particularly interesting to me. I wondered whether comparing the predicted path of user attention to the desired path of user attention would teach us how to better control a user’s attention within a panoramic scene, or whether people’s viewing patterns were simply too variable. In the Open Datasets & Software track, I was fascinated by one particular dataset: “A Dataset of Eye Movements for the Children with Autism Spectrum Disorder”. This was a timely reminder for me that diversity within the audience needs to be catered for when designing multimedia systems, to avoid consigning sections of our audience to a substandard experience.

Of the demos, there were too many interesting ones to list, but I was hugely impressed by the demo for Multi-Sensor Capture and Network Processing for Virtual Reality Conferencing. This used cameras and Kinects to turn me into a point cloud and put a live 3D representation of my own physical body in a virtual space. It is a brilliantly simple and incredibly effective idea, and I found myself sitting next to the people responsible for it at a talk later that day, discussing ways to optimise their data compression.

Despite wearing a headset that allowed me to see the other participants, I was still able to see and therefore use my own hands in the real world – even extending to picking up and using my phone.

Fig. 7: Trying out some cool demos during a bustling demo session

Fig. 8: An example of the social media interaction from my “tweeting”

Amusingly, I found that I was (virtually) sat next to a point cloud of TNO researcher Omar Niamut, which led to my favourite Twitter exchange of the whole conference. I knew Omar from online, but we had never actually managed to meet in real life. Still, this was the most life-like digital incarnation yet!

I really should mention the Women’s and Diversity lunch event which (pleasingly) was attended by both men and women and offered some absolutely fascinating insights.

These included: the value of mentors over the course of a successful academic life, how a gender pay gap is inextricably related to work-family policies, and steps that have successfully been taken by some countries and organisations to improve work-life balance for all genders.

It was incredibly refreshing to see these topics being discussed both scientifically and openly. The conversations I had with people afterwards, as they opened up about their own experiences of work and parenthood, were among the most interesting I have ever had on the topic.

Another nice surprise – MMSys offers childcare grants for conference attendees who are bringing small children to the conference and require on-site childcare, or who incur extra expenses in leaving their children at home. It was very cheering to see that the Inclusion Policy did not stop at simply providing interesting talks, but also translated into specific inclusive action.

Fig. 9: Women’s and Diversity lunch! What a wonderful initiative – well done MMSys and SIGMM

I am delighted that I made the decision to attend MMSys. I had not realised that I was feeling somewhat detached from my peers and the academic research community in general, until I was put in an environment which contained a concentrated amount of interesting research, interesting researchers and an air of collaboration and sheer good will. It is easy to get tunnel vision when you are focused on your own little area of work, but every conversation I had at the conference reminded me that research does not happen in a vacuum.

Fig. 10: A fascinating talk at the Women’s and Diversity lunch – it initiated great post event discussions!

Fig. 11: The food truck experience – one of many wonderful social aspects to MMSys 2019

I could write a thousand more words about every interesting thing I saw or person I met at MMSys, but that would only give you my own specific experience of the conference. (I did live tweet* a lot of the talks and demos just for my own records and that can all be found here: https://twitter.com/Alteralias/status/1148546945859952640?s=20)

Fig. 12: Receiving the SIGMM Social Media Reporter Award for MMSys 2019!

Whether you were someone I was sitting next to at a paper session, a person I spoke to while standing in line at the food truck (one of the many sociable meal events) or someone who demoed their PhD work to me, thank you so much for sharing this event with me.

Maybe I will see you at MMSys 2020.

* P.S. It turns out that if you live-tweet an entire conference, Niall gives you a Social Media Reporter award.

An Interview with Professor Susanne Boll

Describe your journey into research from your youth up to the present. What foundational lessons did you learn from this journey? 

My journey into research started at school, with an early interest in computers and computer science. I liked all the STEM subjects and was very good at them. I got in touch with programming and the first Mac in high school, when my physics teacher started the first basic programming course. After high school, I continued on this journey and became a Mathematical-Technical Assistant¹, then continued studying CS and went on to do a PhD, always driven by the desire to learn more and to explore and understand more of this field.

Why were you initially attracted to Multimedia? 

Susanne Boll at the beginning of her research career in 2001

I was initially attracted to multimedia when information systems started to look at novel methods of integrating large amounts of unstructured multimedia and different media types into structured database systems. I joined the GMD Institute for Integrated Publication and Information Systems, which was working on multimedia database systems. My PhD was on multimedia document models for representing and replaying multimedia presentations in the context of multimedia information systems. One of the most inspiring early events was a small but very nice IFIP working conference on Database Semantics – Semantic Issues in Multimedia Systems in New Zealand in 1999, where I met many researchers from the multimedia community, some of whom I still consider my research friends today. I stayed in the field of multimedia, but as my work always related to the applications of multimedia and the interaction with the user, it was not surprising that I moved into the field of Human Computer Interaction and SIGCHI, in which I am still an active member today. Over the last three decades I have worked in the field of interactive multimedia and human computer interaction – in different application domains from personal media to health, from mobility to industry 4.0. To cite a much valued friend of mine whom I just met again – “I enjoy when my research makes me smile”, when I can see how research can be translated into applications for a better use.

Why did I volunteer for the role of the director for diversity and outreach? 

Professor Susanne Boll in 2019

For more than three decades now I have been supporting gender equality as a mentor, in different roles, in committees and institutions, by speaking up and by driving actions. Within the multimedia community I observed that there are many individuals supporting and acting for better gender equality; however, these remained the efforts of individuals, and we as a community were not able to turn this into a collective understanding.

There were actually a few recent events related to SIGMM that made me truly sad and made me consider whether I should leave this community, which I at the same time consider my home community. Some years ago I was observing a panel in which only men were discussing the future and challenges of multimedia. Observing this was painful for me. I had known and met each of them individually over the years, and they were interesting researchers and great mentors. But that panel made it obvious again that we as a community had failed to be inclusive with regard to women. Why was there not an excellent woman who would have her say in that panel? Why would someone organizing the panel not consider being inclusive with regard to gender? Why would the panelists, when they are invited, not ask who else would be on the panel and encourage this?

When I talk about gender equality these days, I almost immediately get the reaction that gender is not diversity. People say that looking at gender equality would be too short-sighted and that I should care more about diversity and not gender alone. So let me clearly say that I am well aware that diversity is not gender; it is much more than that. But don’t let the perfect be the enemy of the good. My personal story starts with gender equality in STEM fields. Looking at women’s participation in SIGMM, I decided that the actions described in the “25 in 25” strategy would be a good starting point for my new role – it is just the beginning.

What are my plans serving in this position?

Within SIGMM, we need to understand and fully embrace the different dimensions of diversity. We should not use the term in the sense of an easy cover-up of a multitude of aspects in which the individual needs get blurred. I sometimes have the feeling as if one aspect of diversity could be traded for another one, and the term was used as if there was a measure that there is “sufficient” diversity in some setting. 

As director for diversity and outreach I will be caring about the richness of diversity. I want to bring the different dimensions of diversity into the multimedia community and make us understand, embrace, listen and take action for better diversity and outreach of SIGMM.


¹ Mathematical-Technical Assistant (MaTA, MA or MTA for short; also: mathematical-technical software developer) is the occupational title of a recognised training occupation according to the Vocational Training Act in Germany, which has existed since the mid-1960s. It is the first non-academic training occupation in data processing.


Bios

Prof. Susanne Boll: 

Susanne Boll is a full professor for Media Informatics and Multimedia Systems at the University of Oldenburg and a member of the board of the OFFIS-Institute for Information Technology. OFFIS ranks among the top 5% of non-university research institutes in computer science in Germany. Over the last two decades, she has consistently achieved highly competitive research results in the field of multimedia and human–computer interaction. She has actively been driving these fields of research through many scientific research projects and the organization of highly visible events in the field. Her scientific results have been published in competitive peer-reviewed international conferences such as Multimedia, CHI, MobileHCI, AutomotiveUI, DIS, and IDC, as well as internationally recognized journals. Her research makes competitive contributions to the field of human computer interaction and ubiquitous computing. Her research projects also have a strong connection to industry partners and application partners and address highly relevant challenges in the application fields of automation in transportation systems as well as health care technologies. She is an active member of the scientific community and has co-chaired and organized many international events in her field. Her teaching follows a combination of theoretical foundations with team-oriented and research-oriented practical assignments. She currently leads a highly visible international team of researchers (PhD students, research associates, post docs, senior principal scientists).


 

Opinion Column: Fairness, Accountability and Transparency (in Multimedia)

The inclusiveness and transparency of automatic information processing methods is a research topic that has attracted growing interest in the past years. In the era of digitized decision-making software, where the push for artificial intelligence happens worldwide and at different strata of the socio-economic tissue, the consequences of biased, unexplainable and opaque methods for content analysis can be dramatic.

Several initiatives have arisen to address these issues in different communities. From 2014 to 2018, the FAT/ML workshop was co-located with the International Conference on Machine Learning. This year, the FATE/CV workshop (E standing for Ethics) was co-located with the International Conference on Computer Vision and Pattern Recognition. Similarly, the FAT/MM workshop is co-located with ACM Multimedia 2019. These initiatives, and specifically the FAT/ML series of workshops, led to the birth of the ACM FAT* conference, which had its first edition in New York in 2018 and its second this year in Atlanta; the third edition will take place next year in Barcelona.

ACM FAT* is a very recent interdisciplinary conference dedicated to bringing together a multidisciplinary community of researchers from computer science, law, social sciences, and humanities to investigate and tackle issues in this emerging area. The focus of the conference is not limited to technological solutions regarding potential bias; it also addresses the question of whether decisions should be outsourced to data- and code-driven computing systems. This question is very timely given the impressive number of algorithmic systems (adopted in a growing number of contexts) fueled by big data. These systems aim to filter, sort, score, recommend, personalize, and shape human experience. They increasingly make or inform decisions with major impact on credit, insurance, healthcare, and immigration, to cite a few key fields with inherent critical risks.

In this context, we believe that the multimedia community should put together the necessary efforts in the same direction, investigating how to transform the current technical tools and methodologies to derive computational models that are transparent and inclusive. Information processing is one of the fundamental pillars of multimedia: whether data is processed for content delivery, experience or systems applications, the automatic analysis of content is used in every corner of our community. Typical risks of large-scale computational models include model bias and algorithmic discrimination. These risks become particularly prominent in the multimedia field, which historically has been focusing on user-centered technologies. This is why it is crucial to start bringing the notion of fairness, accountability and transparency into ACM Multimedia.

ACM Multimedia 2019 in Nice will benefit from two main initiatives to start engaging with the trend of Fairness, Accountability and Transparency. First, one of the workshops co-located with ACM Multimedia 2019 (as mentioned above) will deal with Fairness, Accountability and Transparency in Multimedia (FAT/MM, held on October 27th). The FAT/MM workshop is the first attempt to foster research efforts that focus on addressing fairness, accountability and transparency issues in the Multimedia field. To ensure a healthy and constructive development of the best multimedia technologies, this workshop offers a space to discuss how to develop fair, unbiased, representative, and transparent multimedia models, bringing together researchers from different areas to present computational solutions to these issues.

Second, one of the two selected Conference Ambassadors of SIGMM for 2019 attended the FATE/CV workshop at CVPR earlier this year, identified a speaker that could be of great interest for the Multimedia field, and invited them to FAT/MM to meet and discuss with the Multimedia community. The paper selected covers topics such as age bias in datasets and the impact this could have in real-world applications, such as autonomous driving or recommendation systems.

We hope that, by organising and getting strongly involved in these two initiatives, we can raise awareness within our community and eventually create a group of researchers interested in analysing and solving potential issues associated with fairness, accountability and transparency in multimedia.

The V3C1 Dataset: Advancing the State of the Art in Video Retrieval

Download

In order to download the video dataset as well as its provided analysis data, please follow the instructions described here:

https://github.com/klschoef/V3C1Analysis/blob/master/README.md

Introduction

Standardized datasets are of vital importance in multimedia research, as they form the basis for reproducible experiments and evaluations. In the area of video retrieval, widely used datasets such as the IACC [5], which has formed the basis for the TRECVID Ad-Hoc Video Search Task and other retrieval-related challenges, have started to show their age. For example, IACC is no longer representative of video content as it is found in the wild [7]. This is illustrated by the figures below, showing the distribution of video age and duration across various datasets in comparison with a sample drawn from Vimeo and YouTube.

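[Figures: distribution of video age and video duration across various datasets, compared with a sample drawn from Vimeo and YouTube]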

The recently released spiritual successor of IACC, the Vimeo Creative Commons Collection (V3C) [3], aims to remedy this discrepancy by offering a collection of freely reusable content sourced from the video hosting platform Vimeo (https://vimeo.com). The figures below show the age and duration distributions of the Vimeo sample from [7] in comparison with the properties of the V3C.

[Figures: age and duration distributions of the Vimeo sample from [7] compared with the V3C]

The V3C comprises three shards, consisting of 1000h, 1300h and 1500h of video content respectively. It contains not only the original videos themselves, but also video shot-boundary annotations, as well as representative key-frames and thumbnail images for every such video shot. In addition, all the technical and semantic video metadata that was available on Vimeo is provided as well. The V3C has already been used in the 2019 edition of the Video Browser Showdown [2] and will also be used for the TRECVID AVS Tasks (https://www-nlpir.nist.gov/projects/tv2019/) starting in 2019, with plans to continue using it over the coming years. This video provides an overview of the type of content found within the dataset.

Dataset & Collections

The three shards of V3C (V3C1, V3C2, and V3C3) contain Creative Commons videos sourced from the video hosting platform Vimeo. For this reason, the elements of the dataset may be freely used and publicly shared. The following table presents the composition of the dataset and the characteristics of its shards, as well as information on the dataset as a whole.

Partition | V3C1 | V3C2 | V3C3 | Total
File Size (videos) | 1.3TB | 1.6TB | 1.8TB | 4.8TB
File Size (total) | 2.4TB | 3.0TB | 3.3TB | 8.7TB
Number of Videos | 7’475 | 9’760 | 11’215 | 28’450
Combined Video Duration | 1’000 hours, 23 minutes, 50 seconds | 1’300 hours, 52 minutes, 48 seconds | 1’500 hours, 8 minutes, 57 seconds | 3’801 hours, 25 minutes, 35 seconds
Mean Video Duration | 8 minutes, 2 seconds | 7 minutes, 59 seconds | 8 minutes, 1 second | 8 minutes, 1 second
Number of Segments | 1’082’659 | 1’425’454 | 1’635’580 | 4’143’693
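As a quick sanity check, the "Total" column can be recomputed from the per-shard figures. The following is a minimal sketch that uses only the video counts and combined durations copied from the table above; the helper names (`duration`, `format_hms`) are illustrative and not part of any V3C tooling.

```python
from datetime import timedelta

# Figures copied from the table above: (number of videos, hours, minutes, seconds).
SHARDS = {
    "V3C1": (7475, 1000, 23, 50),
    "V3C2": (9760, 1300, 52, 48),
    "V3C3": (11215, 1500, 8, 57),
}


def duration(hours, minutes, seconds):
    """Build a timedelta from an hours/minutes/seconds triple."""
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(td):
    """Format a timedelta as hours, minutes and seconds (hours may exceed 24)."""
    total = int(td.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours} h {minutes} min {seconds} s"


# Sum the video counts and the combined durations over all three shards.
total_videos = sum(videos for videos, *_ in SHARDS.values())
total_duration = sum(
    (duration(h, m, s) for _, h, m, s in SHARDS.values()), timedelta()
)

# Mean duration per video over the whole collection, in seconds.
mean_seconds = total_duration.total_seconds() / total_videos

print("videos:", total_videos)
print("combined duration:", format_hms(total_duration))
print("mean duration:", format_hms(timedelta(seconds=round(mean_seconds))))
```

The recomputed totals can then be compared against the last column of the table.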

Similarly to IACC, V3C contains a master shot reference, which segments every video into non-overlapping shots based on the visual content of the videos. For every single shot, a representative keyframe is included, as well as the thumbnail version of that keyframe. Furthermore, for each video, identified by a unique ID, a metadata file is available that contains both technical and semantic information, such as the categories. Vimeo assigns every video to categories and subcategories. Some of the categories were determined to be non-relevant for visual-based multimedia retrieval and analytical tasks, and were dropped during the sourcing process of V3C. For simplicity, subcategories were generalized into their parent categories and are therefore not included. The remaining Vimeo categories are:

  • Arts & Design
  • Cameras & Techniques
  • Comedy
  • Fashion
  • Food
  • Instructionals
  • Music
  • Narrative
  • Reporting & Journals

Ground Truth and Analysis Data

As described above, the ground truth of the dataset consists of (deliberately over-segmented) shot boundaries as well as keyframes. Additionally, for the first shard of the V3C, the V3C1, we have already performed several analyses of the video content and metadata in order to provide an overview of the dataset [1].

In particular, we have analyzed specific content characteristics of the dataset, such as:

  • Bitrate distribution of the videos
  • Resolution distribution of the videos
  • Duration of shots
  • Dominant color of the keyframes
  • Similarity of the keyframes in terms of color layout, edge histogram, and deep features (weights extracted from the last fully-connected layer of GoogLeNet)
  • Confidence range distribution of the best class for shots detected by NasNet (using the best result out of the 1000 ImageNet classes)
  • Number of different classes for a video detected by NasNet (using the best result out of the 1000 ImageNet classes)
  • Number of shots/keyframes for a specific content class
  • Number of shots/keyframes for a specific number of detected faces

This additional analysis data is available via GitHub, so that other researchers can take advantage of it. For example, one could use a specific subset of the dataset (only shots with blue keyframes, only videos with a specific bitrate or resolution, etc.) for performing further evaluations (e.g., for multimedia streaming, video coding, but also for image and video retrieval, of course). Additionally, thanks to the public dataset and the analysis data, one could easily create an image and video retrieval system and use it either for participation in competitions like the Video Browser Showdown [2], or for submitting other evaluation runs (TRECVID Ad-hoc Video Search Task).
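To illustrate the kind of subset selection mentioned above, here is a minimal sketch in Python. It does not read the actual analysis files: the `ShotRecord` schema, its field names and the sample rows are all assumptions made up for this example, so the loading step would need to be adapted to the real file layout documented in the GitHub repository.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ShotRecord:
    """One shot with a few analysis attributes (hypothetical schema)."""
    video_id: str
    shot_index: int
    bitrate_kbps: int
    height: int
    dominant_color: str


def select_shots(
    records: Iterable[ShotRecord],
    *,
    dominant_color: Optional[str] = None,
    min_height: Optional[int] = None,
    max_bitrate_kbps: Optional[int] = None,
) -> List[ShotRecord]:
    """Return the shots that satisfy every criterion that was given."""
    selected = []
    for record in records:
        if dominant_color is not None and record.dominant_color != dominant_color:
            continue
        if min_height is not None and record.height < min_height:
            continue
        if max_bitrate_kbps is not None and record.bitrate_kbps > max_bitrate_kbps:
            continue
        selected.append(record)
    return selected


# Invented sample rows, for illustration only (not taken from V3C1).
SAMPLE = [
    ShotRecord("00001", 1, 2500, 720, "blue"),
    ShotRecord("00001", 2, 2500, 720, "green"),
    ShotRecord("00002", 1, 8000, 1080, "blue"),
    ShotRecord("00003", 1, 900, 360, "blue"),
]

# Example: shots with a blue keyframe from videos that are at least 720 pixels high.
blue_hd = select_shots(SAMPLE, dominant_color="blue", min_height=720)
print([(shot.video_id, shot.shot_index) for shot in blue_hd])
```

Keeping each criterion optional makes it easy to combine conditions, for example restricting an evaluation to low-bitrate videos or to a single dominant keyframe color.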

Conclusion

In the broad field of multimedia retrieval and analytics, one of the key components of research is having useful and appropriate datasets in place to evaluate multimedia systems’ performance and benchmark their quality. The usage of standard and open datasets enables researchers to reproduce analytical experiments based on these datasets and thus validate their results. In this context, the V3C dataset proves to be very diverse in several useful aspects (upload time, visual concepts, resolutions, colors, etc.). It also has no dominating characteristics and exhibits low self-similarity (i.e., few near duplicates) [3].

Further, the richness of V3C in terms of content diversity and content attributes enables benchmarking multimedia systems in close-to-reality test environments. In contrast to other video datasets (cf. YouTube-8M [4] and IACC [5]), V3C also provides a vast number of different video encodings and bitrates, so that it enables research focusing on video retrieval and analytical tasks regarding those attributes. The large number of different video resolutions (and to a lesser extent frame-rates) makes this dataset interesting for video transport and storage applications such as the development of novel encoding schemes, streaming mechanisms or error-correction techniques. Finally, in contrast to many current datasets, V3C also provides support for creating queries for evaluation competitions, such as VBS and TRECVID [6].

References

[1] Fabian Berns, Luca Rossetto, Klaus Schoeffmann, Christian Beecks, and George Awad. 2019. V3C1 Dataset: An Evaluation of Content Characteristics. In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR ’19). ACM, New York, NY, USA, 334-338.

[2] Jakub Lokoč, Gregor Kovalčík, Bernd Münzer, Klaus Schöffmann, Werner Bailer, Ralph Gasser, Stefanos Vrochidis, Phuong Anh Nguyen, Sitapa Rujikietgumjorn, and Kai Uwe Barthel. 2019. Interactive Search or Sequential Browsing? A Detailed Analysis of the Video Browser Showdown 2018. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1, Article 29 (February 2019), 18 pages.

[3] Rossetto, L., Schuldt, H., Awad, G., & Butt, A. A. (2019). V3C–A Research Video Collection. In International Conference on Multimedia Modeling (pp. 349-360). Springer, Cham.

[4] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

[5] Paul Over, George Awad, Alan F. Smeaton, Colum Foley, and James Lanagan. 2009. Creating a web-scale video collection for research. In Proceedings of the 1st workshop on Web-scale multimedia corpus (WSMC ’09). ACM, New York, NY, USA, 25-32. 

[6] Smeaton, A. F., Over, P., and Kraaij, W. 2006. Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (Santa Barbara, California, USA, October 26 – 27, 2006). MIR ’06. ACM Press, New York, NY, 321-330.

[7] Luca Rossetto & Heiko Schuldt (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.