Realistic Bakery

20 July 2025
veo3

For my next project with Veo 3, my goal was to create a realistic video, one that was almost indistinguishable from the real world. I had only made animated, cartoon-style story videos before, so producing this video was a whole different experience. Here is the final video:

Download this video ⬇️

Some sections of the final video looked very real, but others not so much.

Using ChatGPT to Generate Realistic Images

To make this video, I first used ChatGPT to generate realistic images from text descriptions. I had a conversation with ChatGPT to generate an image of what a kitchen could look like. I then edited it by telling the AI to make the previous image look more like a bakery kitchen with metal tables and large ovens.

I continued to edit the realistic image of the kitchen by writing more prompts.

Have the walls be painted light pink and have the room be brighter

Add a real life baker with an apron standing in front of the dough

Keep the kitchen the same but make it a female baker

Make the image 16:9

Going From Single Frame to Video

Veo has different modes like Frame to Video and Text to Video.

In Text to Video, Veo will use a prompt to generate a video.
In Frame to Video, you can upload up to two images and write a prompt. Veo creates clips using the prompt and the images. This mode is very helpful when extending a video. Veo 3 can only generate clips up to 8 seconds long, so to create a longer video, extending clips is necessary.
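
As a rough planning aid (not part of the original workflow), the 8-second limit can be turned into a quick count of how many clips a longer video needs. This is a minimal Python sketch: the function name clips_needed and the 30-second example are made up for illustration, and it assumes every clip runs the full 8 seconds.

```python
import math


def clips_needed(total_seconds: float, max_clip_seconds: float = 8) -> int:
    """Return how many clips are needed to cover total_seconds of video."""
    if total_seconds <= 0 or max_clip_seconds <= 0:
        raise ValueError("durations must be positive")
    # Round up, because a partly filled clip still has to be generated.
    return math.ceil(total_seconds / max_clip_seconds)


# Hypothetical target: a 30-second bakery video.
print(clips_needed(30))  # 4
```

In practice the count is likely to be higher, since some clips get cut short or thrown away.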

I uploaded the ChatGPT images into Veo, using the Frame to Video mode.

At the moment, a limitation of Veo 3 is that it can only generate short clips and can't handle too many actions in one prompt.

I noticed that when my prompts had too many actions, the videos weren't realistic. For example, my prompt was:

The baker picks up the tray of croissants, and walks over the oven. She uses one hand to open the door to the oven, and her other hand currently carrying the tray to put the croissants in the oven.

This was the result. Look carefully at the 5-second mark.

As the video shows, the tray of croissants moves straight through the oven glass without the baker opening the oven door. My prompt contains many actions: picking up the tray, walking, opening the oven with one hand, holding the tray with the other hand, and putting the tray in the oven. That is too many actions for one prompt, so the AI got confused.
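
One way to avoid overloading a single prompt is to give each clip only one action. The sketch below is a hypothetical helper, not a Veo feature and not the method used for this video: the function name and the wording of each action are assumptions, and each resulting prompt would still have to be entered into Veo by hand, one clip at a time.

```python
def one_action_prompts(subject: str, actions: list[str]) -> list[str]:
    """Build one short prompt per action so each clip has a single thing to do."""
    # Normalize each action so every prompt ends with exactly one period.
    return [f"{subject} {action.strip().rstrip('.')}." for action in actions]


# Hypothetical breakdown of the oven scene into single actions.
prompts = one_action_prompts(
    "The baker",
    [
        "picks up the tray of croissants",
        "walks over to the oven",
        "opens the oven door with one hand",
        "slides the tray of croissants into the oven",
    ],
)

for number, prompt in enumerate(prompts, start=1):
    print(f"Clip {number}: {prompt}")
```

Splitting the scene this way trades one confusing clip for several simpler ones, which then have to be joined by extending from clip to clip.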

Bloopers!

While using Veo 3, I ended up with a lot of clips that didn't quite make sense. Here they are:

Download the bloopers ⬇️

Veo Limitations

Veo 3 is a great resource for creating shorter videos, but it struggles to create a clip that continues the action of the previous clip. Veo 3 doesn't allow directly extending a clip, so when using Frame to Video, a character's voice doesn't stay the same, because Veo relies only on the image when generating the clip. Veo only uses the last frame of a video when you extend it, so voices and scene details might change.

In the video segment where the sign flips from closed to open, the flipping action is not realistic. The sign flips weirdly and turns upside down, still somehow attached to the string holding it up. I tried 15 different times, but none of the tries ended up being exactly what I wanted, so I just had to accept it and move on.

Veo is not capable of understanding prompts with too many details. In the bloopers above where two characters were speaking, my prompt clearly stated what each speaker would say, yet the AI still mixed them up. If a prompt has too little detail, Veo will make up parts like the background and what the characters look like. A prompt needs to have enough details about the character or scene, but can't include too many actions.

Veo in a Nutshell

Throughout this project, I noticed that, compared to the videos I had produced before, realistic videos were harder to produce than animated ones. Creating animated clips was more forgiving when producing a storyline, because cartoons don't always have to be realistic.

When working with Veo 2 and 3, I realized that although I may have a vision in my head for what I want my video to look and sound like, I have to be open-minded and able to pivot, because the clips will never quite match exactly what I want.
