The anatomy of a great AI image prompt for video

Ask two people to prompt the same shot and you will usually get two completely different images. It is rarely about taste. One of them gave the model a structure to work from, and the other gave it a wish. “A man walking down a city street, cinematic” is a wish. The model fills in every missing decision for you, and it makes a different decision every time you run it.

A good prompt for video is not a clever sentence. It is a small, repeatable structure. Once you can see the structure, every shot in your script gets faster to write and far more consistent from one frame to the next. Here is the shape we use, four parts in the same order, every single time.

The four parts of a cinematic prompt: subject, camera, light, and style, stacked into one shot

Subject: who, and what they are doing

Start with the thing the shot is actually about, described in concrete nouns and one clear action. Not “a person.” A 40-year-old man with a short grey beard, in a navy wool coat, reading a folded newspaper. The specifics are not decoration. They are the anchor the model returns to, and they are what you will reuse to keep the same character looking the same across a hundred shots.

Two rules that matter more than they should:

One subject, one action per shot. If your sentence has the character doing three things, the model averages them into mush.
Name the wardrobe and physical traits exactly, and keep those words identical everywhere. “Navy wool coat” in shot 1 and “dark blue jacket” in shot 40 is how characters quietly drift.

Camera: how the shot is framed

This is the part most people skip, and it is the fastest way to make a flat image look like a frame from a film. You are telling the model where the camera sits and what lens it is using. You do not need film school for this, just a handful of terms used consistently.

Shot size: wide shot, medium shot, close-up, extreme close-up.
Angle: eye level, low angle looking up, high angle looking down, over the shoulder.
Lens and depth: 35mm, 85mm portrait lens, shallow depth of field, background falling out of focus.

“Medium close-up, 85mm lens, shallow depth of field, slight low angle” does more for the look of a shot than another paragraph of adjectives about the subject.

Light: the mood, in physical terms

Mood words like “dramatic” or “moody” are weak prompts because the model has to guess what you mean. Describe light the way it actually behaves instead. Where is it coming from, how hard is it, what color.

Direction: soft window light from the left, hard sunlight from behind, a single lamp below the face.
Quality: soft and diffused, hard with sharp shadows, foggy and scattered.
Color and time: warm late-afternoon gold, cool blue pre-dawn, the green spill of a monitor in a dark room.

“Lit by a single desk lamp from below, hard shadows, warm tungsten color” is a mood. “Dramatic lighting” is a coin flip.

Style: the look that ties every shot together

Style is the layer that makes ten separate shots feel like one piece. This is your film stock, your color grade, your reference. Set it once and repeat it word for word on every shot in the video.

A medium: photoreal, 35mm film grain, hand-painted, anime cel.
A grade: muted teal and amber, high contrast and desaturated, soft pastel.
An optional reference: in the style of a 1970s thriller, cinema-grade color.

The trick is that the style block should be almost identical across your whole script. The subject and camera change shot to shot. The style does not.

One habit that pays off: write your subject description and your style block once, then paste them word for word into every shot. The boring repetition is exactly what keeps a character and a look consistent across hundreds of frames.

Putting it together

Here is a vague prompt and the same shot written with the structure. Same idea, very different odds of getting a usable frame.

Vague: “A detective in his office at night, cinematic, moody.”

Structured: “A 40-year-old detective with a short grey beard in a navy wool coat, sitting at a cluttered desk reading a file. Medium close-up, 85mm lens, shallow depth of field, slight low angle. Lit by a single desk lamp from below, hard shadows, warm tungsten color. Photoreal, 35mm film grain, muted teal and amber grade.”

The second one still leaves room for the model to surprise you, but every important decision is yours, and it will hold up when you run it again for the next shot in the scene.

The words that keep characters consistent

Consistency is the thing this audience asks about more than anything else, and most of it comes down to one unglamorous habit: exact-word repetition. Models latch onto specific tokens. “Navy bomber jacket” repeated identically across forty prompts holds far better than forty slightly reworded descriptions of a blue jacket. The same goes for faces, hair, and any signature prop. Pick the words once, lock them, and stop paraphrasing yourself.

This is also why writing prompts shot by shot, by hand, gets dangerous on a long script. By shot 80 your wording has drifted without you noticing, and the drift shows up as a character whose face and wardrobe slowly change. The fix is structural, not artistic: keep the subject and style blocks fixed, and only change what the shot actually requires.

There is already an automated way to do this

You just read the manual version. BatchFrames turns your whole script into structured prompts like these, keeps your characters worded identically across every shot, and routes the set straight to your image and video tools.

See how it works Join the waitlist

Where to go from here

You do not need a hundred adjectives. You need four parts, in the same order, with the subject and style locked so your shots feel like one film instead of a stack of unrelated images. Write one shot with the structure, then the next, and you will feel how much faster it gets once the decisions are made up front.

In the next post we will go deeper on character consistency specifically: reference images versus text, and how to keep one character steady across a few hundred shots without re-describing them every time.