Richards Media Net is a company that specializes in helping you, your company, or your organization take advantage of Internet-delivered media to deliver your message!

Friday, March 27, 2020

Video Description Markup Language


EXECUTIVE SUMMARY
I frequently watch tutorial videos on YouTube to hone my skills in several different practices.  Today, while watching a video, it occurred to me that videos would be much more useful if the elements in a video were independently addressable.  By that I mean you could copy and paste text from a video screen/scene, change your view or perspective on a scene, copy narration out as text, scale, rotate, and translate objects from the video (actors, furniture, spaceships), or view a scene with and without the actors who were originally in it.

In order to enable these activities and many more, a standard would need to be developed for creating and consuming this type of information.  Such a standard could be an XML derivative called Video Description Markup Language (VidDML).  VidDML would provide a datastream for consumers of this information and offer a new, novel, and superior way of transmitting video, one that provides for a unique experience well above and beyond what is available today.  Current video technologies and file formats such as MPEG-4 already provide the ability to store and transmit such information (https://en.wikipedia.org/wiki/MPEG-4_Part_14), although they were not created specifically for working with VidDML.
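
To make this concrete, here is a sketch of what a VidDML fragment might look like.  Every element and attribute name below is hypothetical; the point is simply that each media object carries its own id, making it independently addressable, searchable, and reusable.

    <?xml version="1.0" encoding="UTF-8"?>
    <viddml version="0.1">
      <!-- Hypothetical sketch: every media object gets its own addressable id -->
      <scene id="scene-1" start="00:00:00" end="00:01:30">
        <object id="desk-1" type="furniture" model="models/desk.pointcloud"/>
        <actor id="anchor-1" role="news anchor" model="models/anchor.pointcloud"/>
        <text id="lower-third-1" start="00:00:05">Breaking News</text>
        <audio id="theme-1" kind="music" ref="library/theme.mid"/>
        <speech id="speech-1" actor-ref="anchor-1">Good evening.</speech>
      </scene>
    </viddml>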

The use of VidDML would provide a number of advantages over today’s video transmission.  Each individual media object in a video could be individually addressable, searchable, reusable, manipulable, and available for scrutiny.  Consumers of VidDML / MPEG-4 streams could create mashups to tell new stories, facilitate understanding, enable search on images, characters, video, audio, scenes, emotions, and speed, and automate a number of processes that are not possible with today’s technologies.

DETAIL
I will further expound on how this process could work, using current technologies as examples to describe the functionality.  There is no existing framework for this; new technologies, or revisions of existing ones, would be needed to implement these capabilities.

Creating Streams
To develop a data stream, a VidDML / MPEG-4 source of data, many technologies would need to be expanded, streamlined, and coordinated.  A good, all-encompassing example of a video that uses multiple elements is the evening news on television.  The evening news is composed of people, music, speech, on-screen text, images, and a variety of topics, specialties, and expertise.

VidDML imagery could be composed using a device like Microsoft’s Kinect.  The Kinect and its software had the ability to create 3D object point clouds with colors and textures.  In our example of a news broadcast, a Kinect-like device would be used to create 3D “painted” models of the on-screen reporters, their desks and furniture, and objects in the reported stories such as crashed cars, guns, firefighters and their vehicles, homes, business buildings, and much more.  Not only should 3D models be created of the people and objects in the newscast; their facial expressions, the motion of their limbs and extremities, and their emotional affect should also be recorded and interpreted.
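
A hypothetical sketch of how such a capture might be described follows; the element names, the point-cloud file references, and the expression/motion tracks are all assumptions of mine, not part of any existing format.

    <actor id="reporter-1" role="field reporter">
      <!-- "Painted" 3D model: point cloud plus color and texture data -->
      <geometry ref="capture/reporter-1.pointcloud" textures="capture/reporter-1.textures"/>
      <!-- Recorded and interpreted motion: face, limbs, extremities, affect -->
      <track kind="facial-expression" start="00:00:00">smile</track>
      <track kind="limb-motion" start="00:00:02">raise-left-hand</track>
      <track kind="affect" start="00:00:00">calm</track>
    </actor>
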
That covers the imagery aspect quite well; how about sound effects, music, and speech?

Sound effects could be embedded in the VidDML / MPEG-4 stream as well.  The sound effect itself could be stored in a library and referenced, or provided directly in the stream.  Music would be similar: a riff or the audio of a musical score could be stored in the stream.  Better yet, the musical score representation could be embedded in the stream, where it would be searchable and independently addressable.  Perhaps the best idea would be to leverage MIDI technology and embed MIDI commands in the stream.
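
For illustration, a sound effect might either reference a shared library or carry its data inline, and a music channel might wrap MIDI-style events.  Every name below is invented for the sketch.

    <audio id="sfx-1" kind="effect" start="00:00:12" volume="0.8"
           ref="library/glass-break.wav"/>  <!-- referenced from a library -->
    <audio id="sfx-2" kind="effect" start="00:00:15" encoding="base64">
      UklGRiQAAABXQVZF...  <!-- or provided directly in the stream -->
    </audio>
    <music id="theme-1" start="00:00:00">
      <!-- Searchable, independently addressable score: MIDI-style note events -->
      <midi-event tick="0" type="note-on" note="60" velocity="90"/>
      <midi-event tick="480" type="note-off" note="60"/>
    </music>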

Speech is similar.  There should be a textual representation of the spoken word.  Voice information could be embedded as well, so that a text-to-speech capability could use the phonemes of an actual individual to provide searchable speech.  Data that modifies inflection, volume, and emotion should be streamed too.  Text-to-speech technology needs finessing; like the other technologies mentioned in this document, it is not quite ready for prime time for use in our example of newscasters.
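
A speech element might pair the text of the spoken word with voice and prosody data, something like this hypothetical fragment:

    <speech id="speech-2" actor-ref="anchor-1" start="00:00:20">
      <text>Good evening, and welcome to the nightly news.</text>
      <!-- Voice reference so text-to-speech can use this speaker's own phonemes -->
      <voice ref="voices/anchor-1.phonemes"/>
      <!-- Streamed modifiers for inflection, volume, and emotion -->
      <prosody inflection="rising" volume="0.9" emotion="neutral"/>
    </speech>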

Consuming Streams
To implement a VidDML / MPEG-4 based solution, an enormous amount of computing power is needed.  Today’s 3D video games stretch the boundaries of what is possible, yet they do not facilitate the manipulation of individual media elements in real time, nor do they provide real-time search and rule-based interpretation/manipulation across all the media channels.

2D / 3D Objects - Orientation / Perspective
As mentioned earlier, a device like Microsoft’s discontinued Kinect can create 3D objects with colors and textures, and other companies have technologies that provide the same result.  In our news presentation, the newscasters could be imaged in 3D, and the viewer of the newscast could place themselves in various locations in the studio.  In addition, news stories on location could use 3D imaging devices to create 3D “scenes” in which the viewer could be placed and look in varying directions.
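
In VidDML terms, placing yourself in the studio might amount to overriding a camera element, along these purely illustrative lines:

    <!-- Producer's default viewpoint -->
    <camera id="cam-default" position="0,1.6,4" look-at="anchor-1"/>
    <!-- Viewer override: stand beside the news desk and look across the studio -->
    <camera id="cam-viewer" position="-2,1.7,-1" look-at="scene-1" active="true"/>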

Another possibility, when technology enables it, would be the manipulation of 3D models of newscasters, news story subjects, and actors.  There could be an action stream in the VidDML / MPEG-4 data that specifies the visible activities of the newscasters.  There could also be an emotion stream that acts as a filter on the voices and actions of actors, modifying what the viewer sees.
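
These two streams might be encoded as timed entries that a player can apply, filter, or replace; all of the names here are assumptions:

    <action-stream actor-ref="anchor-1">
      <action start="00:00:05" end="00:00:08">turn-to-camera</action>
      <action start="00:00:09" end="00:00:11">shuffle-papers</action>
    </action-stream>
    <emotion-stream actor-ref="anchor-1">
      <!-- Acts as a filter: a player could tone this down or swap it out -->
      <emotion start="00:00:05" intensity="0.6">concern</emotion>
    </emotion-stream>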

Having 3D models of the newscasters, an emotion stream, and a physical action stream would enable the viewer to substitute the actor of their choice as a newscaster, complete with that actor’s gestures, mannerisms, and voice, in place of those provided by default by the producer of the newscast.
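
That substitution could then be a consumer-side directive that rebinds the model and voice while keeping the original action and emotion streams, perhaps like this (again, entirely hypothetical):

    <!-- Viewer preference: replace the default anchor with a model of their choice -->
    <substitute target="anchor-1"
                model="library/favorite-actor.pointcloud"
                voice="voices/favorite-actor.phonemes"
                keep-streams="action emotion"/>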

Sound / Speech / Score / Music
Sounds could have multiple channels in the stream.  Descriptions of a sound, its volume, and other audio characteristics could be projected, searched against, and modified.  Similarly, speech could have a channel where the speaker’s voice, emotional affect, and vocal mannerisms are represented.  Music streams could be composed of MIDI data: modifiable, searchable, and replaceable.
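
Each channel might carry descriptive metadata so sounds can be searched against and modified without decoding the audio itself; the fields below are illustrative guesses:

    <audio-channel id="ambience-1" kind="effect">
      <description>rain on pavement, distant traffic</description>
      <characteristics volume="0.4" pan="left" loop="true"/>
    </audio-channel>
    <audio-channel id="voice-1" kind="speech" actor-ref="reporter-1">
      <description>measured pace, mild concern</description>
    </audio-channel>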

Actors / Activity / Emotion
Each actor / newscaster would have associated physical actions, voice, mood, and emotional affect, all of which would be encoded, stored, searchable, and modifiable.  These characters would not only represent people; they could be animals, machines, or indoor and outdoor objects.
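
A character’s encoded profile might tie together its model, activity, and affect, and the same structure could describe animals or machines as easily as people.  A hypothetical sketch:

    <character id="k9-unit-1" kind="animal" species="dog">
      <geometry ref="capture/k9-unit-1.pointcloud"/>
      <track kind="activity" start="00:02:10">search-vehicle</track>
      <affect default="alert"/>
    </character>
    <character id="firetruck-1" kind="machine">
      <geometry ref="capture/firetruck-1.pointcloud"/>
      <track kind="activity" start="00:02:30">extend-ladder</track>
    </character>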

Text
A text stream would be the most useful to implement.  This stream would be encoded into the VidDML / MPEG-4 data, and every other stream would have a textual representation to facilitate searching, modification, and reuse.  If there is text in the video, then, like the other types of media, you would be able to select it and copy/paste from the video into other apps.  You could also copy text from other apps and use it to modify other streams.
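
A text stream might carry both on-screen text and textual stand-ins for the other streams, timecoded so a player can support select/copy/paste; again, purely illustrative:

    <text-stream>
      <!-- On-screen text: selectable and copyable straight from the video -->
      <onscreen id="lower-third-2" start="00:03:00" end="00:03:10">
        Storm warning in effect until midnight
      </onscreen>
      <!-- Textual representation of another stream, for search and reuse -->
      <transcript stream-ref="speech-1">Good evening.</transcript>
    </text-stream>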

Close
Herein is a description of a capability that consists of many parts.  Many of the technologies mentioned exist in one state of maturity or another.  I expect a usable system based on what is discussed here could be created, with an acceptable level of finesse, in 5-10 years.

The equipment, software, and user experience needed to enable the possibilities described herein all remain to be developed.  Sounds like fun!  I would like to help make this happen!