Knoche, H., De Meer, H., Kirsh, D. Utility Curves: Mean opinion scores considered biased. Proceedings of the Seventh International Workshop on Quality of Service 1999.

4. Allowance for ambiguous results.

Ad 1: Disturbances that are not consciously noticeable can by definition not be measured with MOS. If, for example, audio and video are out of lip synchronization to some degree, people could, while being completely unaware of it, perceive a different sound than the one actually uttered by the speaker when confronted with such a stimulus. It is a well-known cognitive fact that a wrongly perceived, or misheard, voice can be triggered by the lip motion seen. Such an effect is referred to as the McGurk effect [2]. A mean opinion score would clearly not be sensitive to it since the subjects have no means to verify their percepts.

Ad 2: MOS is an indirect measure, thereby reflecting meta-cognition like 'How do I like this?'. So gradual differences that might be apparent to the subjects of the study might be lost due to individual differences in judging, mood, a priori estimates etc. There is no absolute scale that can be used for MOS.

Ad 3: Subjects often approach a given problem from a certain perspective which is often unknown to the questioners. With respect to audio quality, for example, some subjects could be more concerned with understandability and others with tonal fidelity, depending on the prospective purpose.

Ad 4: Multimedia systems must include multimodal tuning. Considering the quote from [1], it might be hard to actually use the MOS for future decision making for the different media, since MOS do not provide any knowledge of how the envisioned quality will affect the users.

3 Task Oriented Performance Measures

Task-oriented performance measures take a different approach: they expose the subjects to different levels of the stimuli (e.g., different frame rates) and objectively measure the outcomes.
The performed task is related to a given context, and the measured performance is thus relevant to an application that requires this task. Common tasks are, e.g., repetition or memorization of words or sentences. This represents an operationalized, direct way of dealing with the subjects' percepts, such that the additional level of self-reflection is removed and validation of the obtained data is eased. By this approach, unconscious effects can be detected since they degrade performance. When low frame rates are responsible for McGurk effects, as described by Nakazono [5], we can measure the degradation in performance through wrong answers with the TPM approach, whereas the MOS might wrongly indicate a good presentation, perhaps a slightly jerky one. Instead of relying on users' opinion, TPMs provide an objective, yet individual, means to overcome the limitations identified for MOS. We envision standard task performance tests which can be universally applied to certain scenarios, achieving comparability and reproducibility.

4 Some Empirical Results

4.1 Preliminaries

In our ongoing experiments, constraints on frame rates and on audio-visual skew as well as many interdependencies between these two factors are investigated. Some of the intramedia parameters have already been proven elsewhere to affect human speech perception. However, only very few earlier studies have addressed how different frame rates and skews affect task performance. To our knowledge, no study has ever systematically addressed the interdependency of frame rate and audio-visual skew.

4.2 Outline of Experiments

The 15 subjects were mostly students of the University of California San Diego, between 18 and 32 years old. The experiment consisted of 8 blocks interleaved by breaks. Each block was made up of 60 stimuli, resulting in a total of 480 stimuli shown to each subject, which took about 50 minutes. The task was to identify the second consonant in a four-syllable nonsense word.
The words spanned all permutations of three consonants interleaved and headed by the vowel `a' using the four consonants `b', `d', `g', `v', e.g., `adavaga'. This resulted in a total of 64 different stimulus words, which were prepared with 30 different combinations of frame rates (30, 15, 10 fps) and skews (±160, ±120, ±80 and 0 ms) (a negative skew indicating that the audio is leading the video signal). Two of the consonants (`b', `v') are labial, whereas the other two (`d', `g') do not require lip movement. The experiment follows a within-subject design, i.e., all subjects are exposed to all the different configurations of frame rate and audio-visual skew. Sumby and Pollack [8] reported that the relative contribution of visual information is independent of the signal-to-noise ratio (S/N), but that the absolute contribution could be more profitably exploited at low S/N. In order to explore the effects of frame rate, skew and the interactions between frame rate and skew, the audio signal was mixed with white noise about 11 dB louder than the signal. This set the base performance to a level that ensured no clipping of the effects at the top of the scale. The experimental setup incorporated knowledge from earlier work, such as Massaro [3], Pandey et al. [6], McGrath et al. [4], and Nakazono [5].

4.3 Results & Discussion

Part of the results is depicted in the attached figure. A clear effect is evident for 160 ms, where the performance drastically decreases. Whereas a positive skew seems to degrade the performance more gradually, a negative skew beyond 120 ms has a more abrupt effect.

In the study carried out by Steinmetz [7], the subjects had to detect an audio-visual skew, and a MOS was defined as the level of annoyance perceived by the subjects. The outcome can be regarded as based on a task-oriented measure. But it does not resemble tasks in the general and more application-oriented sense that we suggest.
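
As a brief aside on the task of Section 4.2, the stimulus set and the scoring of responses can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the experiments: the function names and the trial record format `(skew_ms, word, response)` are hypothetical, and the toy responses are made up solely to exercise the code.

```python
from collections import defaultdict
from itertools import product

CONSONANTS = ("b", "d", "g", "v")  # the four consonants named in Section 4.2


def make_words():
    """Enumerate every a-C-a-C-a-C-a nonsense word (4**3 = 64 words)."""
    return ["a" + "a".join(triple) + "a" for triple in product(CONSONANTS, repeat=3)]


def second_consonant(word):
    """Return the target of the identification task: the second consonant."""
    return word[3]


def identification_rate(trials):
    """Map each skew (ms) to the fraction of correctly identified targets.

    `trials` uses a hypothetical record format: (skew_ms, word, response).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for skew_ms, word, response in trials:
        total[skew_ms] += 1
        if response == second_consonant(word):
            correct[skew_ms] += 1
    return {skew: correct[skew] / total[skew] for skew in total}


if __name__ == "__main__":
    words = make_words()
    print(len(words), "adavaga" in words)

    # Made-up responses purely to exercise the function; not data from the study.
    toy_trials = [
        (0, "adavaga", "v"),
        (0, "abadaga", "d"),
        (-160, "adavaga", "b"),
        (-160, "agabava", "b"),
    ]
    print(identification_rate(toy_trials))
```

Scoring against the presented word rather than against a subject's rating is what makes the measure a task performance measure: a misperception the subject is unaware of still shows up as a wrong answer.
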
It is a well-accepted fact that speech perception is best for an audiovisual skew of +80 ms (video leading audio). However, 30% of Steinmetz's subjects already found that skew subjectively annoying. Although more than 90%/70% of the subjects detected distortions in synchronization for +120/-120 ms, our experiments suggest that such de-synchronizations can often be tolerated as long as task performance is of concern. The illustrated experiments indicate how TPMs come up with results that are more appropriate than those provided by MOSs. Explicit detection of skew is a very special task which may not generalize to other circumstances. The impact of distortions in synchronization can be both over- and underestimated with MOS, depending on the task at hand.

[Figure: Task Performance for 30 fps. x-axis: audiovisual skew in ms (-160 to 160); y-axis: consonant identification in % (tick values 0.55 to 0.75).]

5 Summary

In the Coqos project, Task Performance Measures and a corresponding framework are suggested and pursued as a novel and more suitable means for determining utility curves. TPMs are intended to avoid limits inherent in traditional measures like Mean Opinion Scores. MOS rely merely on subjective ratings rather than on more objective performance in relation to a particular task or application of interest. Informational relevance and its impact on subjects can be measured more effectively by TPMs. Inhibiting psychological and cognitive effects, like consciousness or non-consciousness of degradations or the individual focus and perspectives of subjects, can be more appropriately evaluated and dealt with by means of TPMs. The increasing importance of adaptation as a means for QoS provisioning, in particular with the advance of MPEG-4, in both wireless and wired environments, requires sensible techniques to effectively determine utility curves.

References

[1] Goodman, D. J., Nash, R. D.
`Subjective quality of the same speech transmission condition in seven different countries', IEEE Transactions on Communications, Vol. COM-30, No. 4, Apr. 1982
[2] McGurk, H., MacDonald, J. `Hearing lips and seeing voices', Nature, Vol. 264, No. 5588, 23 Dec. 1976
[3] Massaro, D. W., Cohen, M. M., Smeele, P. M. T. `Perception of asynchronous and conflicting visual and auditory speech', Journal of the Acoustical Society of America, 100 (3), Sep. 1996
[4] McGrath, M., Summerfield, Q. `Intermodal timing relations and audio-visual speech recognition by normal hearing adults', Journal of the Acoustical Society of America, 77 (2), Feb. 1985
[5] Nakazono, K. `Frame rate as a QoS parameter and its influence on Speech Perception', Multimedia Systems, 6, 1998
[6] Pandey, C., Kunov, H., Abel, S. M. `Disruptive effects of auditory signal delay on speech perception with lipreading', Journal of Auditory Research, 26 (1), Jan. 1986
[7] Steinmetz, R. `Human perception of jitter and media synchronization', IEEE Journal on Selected Areas in Communications, Vol. 14, No. 1, Jan. 1996
[8] Sumby, W., Pollack, I. `Visual contributions to speech intelligibility in noise', Journal of the Acoustical Society of America, 26, 1954
[9] Wolf, S., Dvorak, C. A., Kubichek, R. F., South, C. R., Schaphorst, R. A., Voran, S. D. `Future work relating objective and subjective telecommunications system performance', Proceedings IEEE Globecom, 1991