Video Description Markup Language
EXECUTIVE SUMMARY
I frequently watch tutorial videos on YouTube to hone my skills
in several different practices. Today I
was watching a video and it occurred to me that videos would be much more
useful if the elements in a video were independently addressable. By that I mean you could copy and paste text
from a video screen/scene, change your view or perspective on a scene, copy the
narration into text, scale, rotate, and translate objects from the video
(including actors, furniture, and spaceships), or view a scene with and without
the actors who were originally in it.
In order to enable these activities and many more, a standard
(or standards) would need to be developed for creating and consuming this
type of information. Such a standard
could be an XML derivative called Video Description Markup Language (VidDML). VidDML would provide a data stream
for consumers of this information and would offer a new and superior way of
transmitting video, enabling experiences well beyond what is available today. Current video
technologies and container formats such as MPEG-4 already provide the ability to
store and transmit such information (https://en.wikipedia.org/wiki/MPEG-4_Part_14),
although they were not created specifically for working with VidDML.
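To make this concrete, here is a rough sketch of what a VidDML fragment might look like. No such schema exists today; every element and attribute name below is invented purely for illustration:

    <!-- hypothetical VidDML fragment; all names are invented for illustration -->
    <viddml version="0.1">
      <scene id="scene-1" start="00:00:00" end="00:01:30">
        <object id="anchor-desk" type="furniture" model="models/desk.pc"/>
        <actor id="reporter-1" model="models/reporter-1.pc"/>
        <camera id="default-view" position="0 1.6 3.0" lookAt="reporter-1"/>
      </scene>
    </viddml>

Because every element carries an id, each piece of the scene would be independently addressable by a consumer.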
The use of VidDML would provide a number of advantages over
today's video transmission. Each
individual media object in a video could be individually addressable, searchable,
reusable, manipulable, and available for scrutiny. Consumers of VidDML / MPEG-4 streams could
create mashups to tell new stories, facilitate understanding, enable
search on images, characters, video, audio, scenes, emotions, and speed, and
automate a number of processes which are not possible with today's
technologies.
DETAIL
I will further expound on how this process could work, using
current technologies as examples to describe the functionality. There is no
existing framework for this today; new technologies, or revisions of existing
ones, would be needed to implement these capabilities.
Creating Streams
To develop a data stream, a VidDML / MPEG-4 source of data,
many technologies would need to be expanded, streamlined, and coordinated. A good, all-encompassing example of a video
that uses multiple element types is the evening news on
television. The evening news
is composed of people, music, speech, on-screen text, images, and a variety of
topics, specialties, and expertise.
VidDML imagery could be composed using a device like Microsoft's
Kinect. The Kinect and its software had the ability to create 3D object
point clouds with colors and textures.
In our example of a news broadcast, a Kinect-like device would
be used to create 3D “painted” models of the on-screen reporters, their desks
and furniture, and objects in the reported stories such as crashed cars, guns,
firefighters and their vehicles, homes, business buildings, and much more.
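As a sketch only, a captured “painted” model might be represented along these lines; the element names and the point-cloud format are assumptions, not an existing specification:

    <!-- hypothetical representation of a Kinect-style capture -->
    <object id="crashed-car" type="prop">
      <pointCloud points="1200000" format="xyz-rgb" src="capture/car-0042.pc"/>
      <texture src="capture/car-0042-albedo.png"/>
      <transform position="4.1 0.0 -2.3" rotation="0 35 0" scale="1 1 1"/>
    </object>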
Not only should 3D models be created of the people and objects
in the newscast, but their facial expressions, limb and extremity movements,
and emotional affect should also be recorded and interpreted.
That covers the imagery aspect quite well; how about sound
effects, music, and speech?
Sound effects could be embedded in the VidDML / MPEG-4 stream
as well. The sound effect itself could
be stored in a library and referenced, or provided directly in the stream. Music would be similar. A riff or the audio of a musical score could
be stored in the stream. Better yet, the
representation of the musical score would be embedded in the stream and would be
searchable and independently addressable.
Perhaps the best idea would be to leverage MIDI technology and embed
MIDI commands in the stream.
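A sound-effects and music stream might look something like the sketch below. The library reference scheme and the MIDI-as-XML encoding are assumptions for illustration; an actual standard might embed raw MIDI events instead:

    <!-- hypothetical audio stream: one referenced effect, one embedded score -->
    <audioStream id="audio-1">
      <effect ref="library:glass-break-03" at="00:12:04.5" volume="0.8"/>
      <midi channel="1" program="piano">
        <note at="0" pitch="60" velocity="90" duration="480"/>
        <note at="480" pitch="64" velocity="90" duration="480"/>
      </midi>
    </audioStream>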
Speech is similar.
There should be a textual representation of the spoken word. Voice information could be embedded as well, so
that a text-to-speech capability could use the phonemes of an actual individual to
provide searchable speech. Data that modifies
inflection, volume, and emotion should be streamed as well. Text-to-speech technology needs finessing;
like the other technologies mentioned in this document, it is not quite ready
for prime time for use in our example of newscasters.
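A speech stream could pair the text with phonemes and prosody data, roughly as sketched below; the phoneme notation and the voice-profile reference are invented for the example:

    <!-- hypothetical speech stream tying text, phonemes, and prosody together -->
    <speechStream actor="reporter-1">
      <utterance start="00:00:12.0" end="00:00:15.2">
        <text>Good evening, and welcome to the evening news.</text>
        <phonemes voice="voices/reporter-1.profile">G UH D . IY V N IH NG</phonemes>
        <prosody inflection="neutral" volume="0.7" emotion="calm"/>
      </utterance>
    </speechStream>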
Consuming Streams
To implement a VidDML / MPEG-4 based solution, an enormous amount
of computing power is needed. Today's 3D
video games stretch the boundaries of what is possible, yet they do not facilitate
the manipulation of individual media elements in real time. Nor do video games provide real-time search
and rule-based interpretation and manipulation across all the media channels.
2D / 3D Objects - Orientation / Perspective
As mentioned earlier, a device like Microsoft's discontinued
Kinect can create 3D objects with colors and textures. Other companies have technologies which
provide the same result. In our news presentation,
the newscasters could be imaged in 3D, and the viewer of the newscast could
place themselves in various locations in the studio. In addition, news stories
on location could use 3D imaging devices to create 3D “scenes” where the viewer
could be placed within the scene and look in varying directions.
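Reusing the invented camera element from the earlier sketch, a viewer-chosen vantage point might be expressed as simply as this (again, a sketch rather than a defined format):

    <!-- hypothetical viewer-side viewpoint, placed anywhere in the captured studio -->
    <camera id="viewer-choice" position="-2.0 1.7 1.5" lookAt="anchor-desk"/>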
Another possibility, when technology enables it, would be to
provide for the manipulation of 3D models of newscasters, news story subjects,
and actors. There could be an action
stream in the VidDML / MPEG-4 that specifies the visible activities of the
newscasters. There could also be an emotion
stream that acts as a filter, modifying the voice and actions of the actors
to change what the viewer sees and hears.
Having 3D models of the newscasters, an emotion stream, and a
physical action stream would enable the viewer to substitute the actor of their
choice as a newscaster, complete with that actor's gestures, mannerisms, and voice,
in place of the one provided by default by the producer of the newscast.
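One hypothetical shape for those streams, plus a consumer-side substitution, is sketched below; every element here is invented to illustrate the idea:

    <!-- hypothetical action and emotion streams for one actor -->
    <actionStream actor="reporter-1">
      <gesture at="00:00:12.0" type="head-nod" intensity="0.6"/>
      <pose at="00:00:13.5" skeleton="frames/pose-0338.skel"/>
    </actionStream>
    <emotionStream actor="reporter-1">
      <state at="00:00:12.0" mood="serious" intensity="0.8"/>
    </emotionStream>
    <!-- viewer-side override replacing the default model and voice -->
    <substitute actor="reporter-1" model="models/chosen-actor.pc" voice="voices/chosen-actor.profile"/>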
Sound / Speech / Score / Music
Sounds could have multiple channels in the stream. Descriptions of a sound, its volume, and
other audio characteristics could be projected, searched against, and modified. Similarly, speech could have a channel where
the speaker's voice, emotional affect, and vocal mannerisms are represented. Music streams could be composed of MIDI data:
modifiable, searchable, and replaceable.
Actors / Activity / Emotion
Each actor / newscaster would have associated physical
actions, voice, mood, and emotional affect, all of which would be encoded, stored,
searchable, and modifiable. These
characters would not only represent people; they could also be animals, machines,
or indoor and outdoor objects.
Text
A text stream would be the most useful
to implement. This stream would be
encoded into the VidDML / MPEG-4 data.
Every other stream would have a textual representation to facilitate
searching, modifying, and reuse. If
there is text in the video, then like the other types of media, you would be able to
select it and copy/paste it from the video into other apps. You could also copy text from other apps
and insert it to modify other streams.
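A text stream might be as simple as the sketch below, with on-screen text and a link to the transcript of the speech stream; as with the other examples, the element names are assumptions:

    <!-- hypothetical text stream: on-screen text plus a transcript reference -->
    <textStream>
      <caption id="lower-third-1" start="00:00:10" end="00:00:20">Breaking: downtown fire contained</caption>
      <transcript ref="speechStream/reporter-1"/>
    </textStream>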
Close
Herein is a description of a
capability which consists of many parts.
Many of the technologies mentioned exist in one state of
maturity or another. I expect a usable system using
what is discussed here could be created, with an acceptable level of finesse, in 5 to 10
years.
Equipment, software, and usability
need to be developed to enable the possibilities described herein. Sounds like fun! I would like to help make this happen!