This product concept is one of the most challenging and fun ideas I have explored. It will require significant advances on a number of technical fronts, but it could open up a whole new way to tell stories and to communicate ideas.
The basic idea is to turn the words of a novel (book, play or film script) into a movie using an AI system (e.g., IBM's Watson), natural language processing, automated (or accelerated) avatar generation, text to speech, and automated machinima / animation techniques.
Watson reads the book, Watson generates a movie.
I will necessarily be using some shorthand to describe this business concept, so forgive me where the explanation sounds too simplistic. Either I don't know what I'm talking about, or I have chosen to write at a high level to fit the limits of a blog post. You can decide which as you deem appropriate.
Key Components of the Process
- Categorization of story components. Parse the text via natural language processing (NLP) techniques to categorize each word, phrase and/or sentence as being related to (1) dialogue, (2) character description, (3) scene/setting, (4) objects/props, or (5) action. The first-generation approach could be simplified by using TV/film scripts or plays. (A minimal sketch of this kind of first-pass categorization follows this list.)
- Character generation. Start with a baseline human form and customize it using text-based descriptors. First generation -- either automate an existing avatar-generation package or select from a library of existing avatars.
- Scene/setting generation. Start with the basic elements of each scene (i.e., inside/outside, location, building type, etc.). First generation -- select from a library of existing sets. The AI system could eventually use resources like [FrameNet, VigNet, licensed 3rd-party libraries, etc.] and even do supplemental web-based research (perhaps using advances in automated image analysis) to gather information on time period, locale, building types, artist's renderings and the like, refining the scene to more closely match the author's description.
- Dialogue. Start with text-to-speech and basic lip-sync animation (e.g., Reallusion's CrazyTalk). [UPDATE: It looks like Speech Graphics' facial animation software might be the better option. Speech Graphics' software takes the audio spoken by an actor and creates a corresponding animation automatically, rather than requiring speech to be painstakingly animated by hand. It is being used in AAA games, music videos, etc. and is very impressive.]
- Action. Animate the characters based on the author's description of their actions (which might include blocking instructions in plays and scripts). This would include the use of props/objects as the system becomes more sophisticated.
- Audio. Add in music, sound effects and other relevant audio.
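To make the categorization bullet concrete, here is a minimal sketch of a first-pass, rule-based categorizer for a play-style script, written in Python. It assumes the common conventions that dialogue follows a "CHARACTER:" prefix and that stage directions sit in brackets or parentheses; the regexes and category names are my own illustrative assumptions, not anything from an existing product.

```python
import re

# First-pass categorization for a play-style script. Assumes dialogue
# follows "CHARACTER:" and stage directions appear in [brackets] or
# (parentheses); real scripts vary widely, so this is illustrative only.
DIALOGUE_RE = re.compile(r"^([A-Z][A-Z .'-]+):\s*(.+)$")
DIRECTION_RE = re.compile(r"^[\[(](.+)[\])]$")

def categorize_line(line: str) -> tuple[str, str]:
    """Return a (category, content) pair for one line of script text."""
    line = line.strip()
    m = DIALOGUE_RE.match(line)
    if m:
        # e.g., "HAMLET: To be, or not to be." -> dialogue spoken by HAMLET
        return ("dialogue", m.group(2))
    m = DIRECTION_RE.match(line)
    if m:
        # e.g., "[Enter GHOST]" -> an action or setting cue
        return ("direction", m.group(1))
    # Everything else is unclassified; per the general rule further below,
    # a first-generation system would simply ignore what it cannot parse.
    return ("unknown", line)

if __name__ == "__main__":
    for raw in ["HAMLET: To be, or not to be.", "[Enter GHOST]", "Act I, Scene 1"]:
        print(categorize_line(raw))
```

A real system would then route "direction" lines through a second pass to separate scene/setting, props and action, but even this crude split covers a surprising share of a typical play.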
Potential initial approaches and evolutionary path
- TV script to show. TV scripts would be relatively easy (compared to books) to parse and produce. TV shows use minimal sets (often only a single set), and the system could use ready-made sets and actor avatars based on actual stock content. The system could be used to produce TV show previz, facilitate directing and camera-angle selection, etc. It might also be used to pre-screen pilots, test new concepts, create fan fiction, or extend a show's brand -- for example, juried fan/guest screenwriting that could be viewed and critiqued on YouTube.
- Play to Film. Use a play, which can be parsed more easily than a book. Scene selection and dialogue are already crafted for a reasonable film length and timing (i.e., no story summarization is required). Plays include simple cues as to who is included in each scene and who is speaking, and they often set off action, setting, or character descriptions in brackets or other separated text.
- Script to Film. Help automate a filmmaker's previz process. Generate characters/avatars. First generation -- have the relevant characters in a scene read the scripted dialogue. Add in basic action where possible (via stage directions/blocking).
- Book to film. This would be the holy grail application, with likely the broadest market and largest commercial value. It requires the book to be summarized, which is not currently feasible via NLP techniques.
- Next generation after that. Add real-time natural-language-based editing. Imagine real-time storytelling with a visual background!
Basic framework for the process of parsing a play via NLP [admittedly way oversimplified]
- Segment, tokenize, and part-of-speech [POS] tag the entire text [using functionality from the Natural Language Toolkit (NLTK) and/or other NLP tools]. (A rough sketch of these first steps follows this list.)
- Develop a list of characters using Named Entity Recognition [NER] tools. Identify all mentions of each character.
- Parse each mention to determine whether it relates to (1) dialogue (perhaps the easiest), (2) a description of the character (via adjectives and other contextual cues), or (3) the character's actions (via verbs and other contextual cues). Parsing could rely on POS tags or, depending on the layout of the script, on the location of text cues (e.g., all text after a colon following a character name is dialogue).
- For avatar generation, consider whether it works to parse character mentions for adjectives, etc., and use probabilistic inference to determine which surrounding words describe the character. Perhaps take only high-probability descriptors, e.g., "character X's eyes were blue," "character X was short," etc. Alternatively, search each mention for certain key descriptive words (eyes, nose, height and size words, etc.). Query whether the most efficient route is to write an API that automates the avatar-creation process in existing animation software such as iClone5, Blender or others by selecting choices in the avatar-generator module based on applicable text equivalencies. (A purely hypothetical sketch of this mapping follows this list.)
- For scene/setting generation, parse text for location and scene descriptions.
- Associate all dialogue chronologically with each character and among all characters in a scene. Use existing text-to-speech and lip-sync software (e.g., CrazyTalk) to have each character speak its dialogue.
- For animating the action, parse the character mentions for verbs and related contextual clues to deduce the action. As with avatar generation above, do we start by trying to match action words in the text to the available actions in the body puppetry/animation modules of existing animation software such as iClone5?
- General rule -- ignore all text that is not understood as fitting into one of these categories.
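For the curious, here is a rough sketch of what the first few steps of this framework (segmentation, tokenization, POS tagging, NER and crude descriptor extraction) might look like using NLTK. The sentence-level co-occurrence heuristic for descriptors is a deliberately naive stand-in for the probabilistic inference described above.

```python
from collections import defaultdict

import nltk

# Assumes the standard NLTK models have been downloaded, e.g.:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
#   nltk.download("maxent_ne_chunker"); nltk.download("words")

def extract_characters_and_descriptors(text: str) -> dict:
    """Return {character_name: [adjectives seen in the same sentence]}."""
    descriptors = defaultdict(list)
    for sentence in nltk.sent_tokenize(text):      # segment into sentences
        tokens = nltk.word_tokenize(sentence)      # tokenize
        tagged = nltk.pos_tag(tokens)              # POS tag
        tree = nltk.ne_chunk(tagged)               # named entity recognition
        people = [" ".join(word for word, tag in subtree.leaves())
                  for subtree in tree.subtrees()
                  if subtree.label() == "PERSON"]
        # Crude heuristic: any adjective in a sentence that mentions a
        # character becomes a candidate descriptor for that character.
        adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
        for person in people:
            descriptors[person].extend(adjectives)
    return dict(descriptors)

print(extract_characters_and_descriptors(
    "Scarlett was short and her eyes were green. Rhett smiled."))
```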
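And on the API question in the avatar bullet, here is a purely hypothetical sketch of mapping high-probability descriptors onto an avatar via Blender's Python API (bpy). The shape-key names and the descriptor table are invented for illustration; iClone5 or any other package would expose different hooks.

```python
import bpy

# Hypothetical mapping from parsed descriptors to shape-key settings.
# The key names ("height", "weight") are invented for this sketch; a
# real character rig would define its own morph targets.
DESCRIPTOR_TO_SHAPE_KEY = {
    "short": ("height", 0.2),
    "tall":  ("height", 0.8),
    "thin":  ("weight", 0.2),
    "heavy": ("weight", 0.8),
}

def apply_descriptors(avatar_name: str, descriptors: list[str]) -> None:
    """Set shape-key values on a base mesh from parsed text descriptors."""
    obj = bpy.data.objects[avatar_name]
    key_blocks = obj.data.shape_keys.key_blocks
    for word in descriptors:
        if word in DESCRIPTOR_TO_SHAPE_KEY:
            key_name, value = DESCRIPTOR_TO_SHAPE_KEY[word]
            if key_name in key_blocks:
                key_blocks[key_name].value = value

# e.g., apply_descriptors("BaseHuman", ["short", "thin"])
```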
Why this concept will not happen this week
This last part is where it all started to break down. My initial objective in researching this product idea was to determine whether the team required to pull it together was more like 3 PhDs working for 9 months (doable) or 15 PhDs working for 4 years (less doable). At the time I was pushing hard on this idea, the latter seemed far more likely. Several of the key challenges that need to be overcome (in no particular order) are as follows:
- State-of-the-art NLP does not facilitate summarizing a book, so the movie would be a literal page-by-page reproduction (which would take many hours, if not days, to watch). That is one of the benefits of using a script -- scripts are, in essence, a summary of the key elements of the novel.
- At this point, there is no commercially available software for automating the creation of an avatar (actor) for use in the film. Creating such an automated pipeline is probably doable by building a repository of body types and features that could be matched to descriptive language in the text, but that would be a monumental task. Presumably, early versions will start with generic figures, allow manual customization, or, in the case of existing shows, use images of the actual actors.
- While image analysis and recognition have come pretty far recently, actually using them in an automated process would require tremendous horsepower and a robust methodology for settling on the image to be used and then integrating it into a scene.
- At this point, there is no commercially available software for automating the movements of avatars in a scene, although some basic systems appear to be in the works.
- Consumers are used to perfection (or pretty close to it) in their animation, and at least initially this automated system would be a big step back in quality, with none of the filmmaker's art of editing. It would be a clunky, awkward series of animated scenes with bad sound and dubious editing "choices." I doubt anyone but the truly hardcore technology enthusiasts would appreciate the wonder of the fact that all of it had been created via an automated process.
All of that said, once we are able to overcome all (or at least most) of these challenges, there could be some amazing opportunities for new ways to tell stories. Once the movie has been created, changing out the characters (avatars) would be relatively straightforward. For a relaxing Friday night with friends, maybe you would use the Modern Family cast to play the characters of your favorite book. Imagine watching Gone With the Wind, but reversing the races of the key characters. Suddenly the storytelling possibilities open up dramatically.
For a view of the current state of the art of this idea, you may want to take a look at the Plotagon app, Muvizu, Bot Colony or the robots created by the Russian company working on this idea. Anyone know their name? I recall that they were the furthest along on the text to movie concept, but I can't find them on the web anymore.
Conclusion
At the end of the day, this concept seems to be pretty well nestled in the university research phase. I still like the idea and will revisit the state of the art of its technical components to determine when combining the pieces may be more feasible. I would enjoy hearing your thoughts.
And finally, for you incredibly hardy souls who have made it all the way to the end of this piece and who may be interested in more reading on the subject, I offer a short reading list of relevant papers.
Natural Language Processing
Annotation Tools and Knowledge Representation for a Text-To-Scene System, Bob Coyne, Alex Klapheke, Masoud Rouhizadeh, Richard Sproat, Daniel Bauer, Proceedings of COLING 2012 (2012)
Text to Scene Conversion
Text-to-Scene Conversion: An Introductory Survey, Shiqi Li, Tiejun Zhao, Hanjing Li, School of Computer Science, Harbin Institute of Technology, International Journal of Computational Science (2009)
AVDT – Automatic Visualization of Descriptive Texts, Christian Spika, Katharina Schwartz, Holger Dammertz, and Hendrik Lensch, Vision, Modeling and Visualization (2011) [Paper suggests an alternative to WordsEye, using parsed text (such as prepositions) for more realistic 3D scene creation, taking cues from the text rather than external sources.]
Automating the Animation Process
SceneMaker: Multimodal Visualization of Natural Language Film Scripts, Eva Hanser, Paul McKevitt, Tom Lunney, Joan Condell and Minhua Ma (2010)
Automatic Conversion of Natural Language to 3D Animation, PhD thesis by Minhua Ma, Faculty of Engineering, University of Ulster (2006)
Towards Automatic Animated Storyboarding, Patrick Ye and Timothy Baldwin, University of Melbourne, Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)
Automating the Creation of 3D Animation From Annotated Fiction Text, Kevin Glass and Shaun Bangay, IADIS International Conference Computer Graphics and Visualization (2008)
Automating the Transfer of a Generic Set of Behaviors Onto a Virtual Character, Feng Huang, Xu and Shapiro, In Proceedings of the 5th International Conference on Motion in Games (MIG), Rennes, France (2012)
EMOT – An Evolutionary Approach to 3D Computer Animation, Halina Kwasnicka and Piotr Wozniak, Wroclaw University of Technology, (2006)
Content Acquisition and Sources
VSEM: An open library for visual semantics representation, Bruni, Bordignon, Liska, Uijlings, and Sergienya, University of Trento, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (2013)
Collecting Spatial Information for Locations in a Text-to-Scene Conversion System, Masoud Rouhizadeh, Daniel Bauer, Bob Coyne, Owen Rambow and Richard Sproat
Text to Speech
A naïve, salience-based method for speaker identification in fiction books, Kevin Glass and Shaun Bangay, Rhodes University