
How I Vibe-Coded a Ghibli-Style AI Version of Myself

A weekend experiment in voice cloning, open-source LLMs, and animated AI characters.

I didn't mean to make an AI version of myself.

But one weekend, I ended up cloning my voice with ElevenLabs (though it doesn’t fully capture my natural accent - it came out more British-American), scripting a short monologue using Gemma (an open-source LLM), generating a Ghibli-style portrait with OpenAI's DALL·E, and animating the whole thing with Kaiber. I vibe-coded the pipeline together in Python, stitching together APIs, model inference, and asset flow into a lightweight prototype.

What came out was a soft-spoken, slightly-too-wise character who looked and sounded eerily like me.

She spoke. I listened.

And to be honest, it felt like she said things I hadn't yet let myself say out loud.

Why I Did It

As an investor, I’ve been actively exploring trends in voice AI — and there’s no better way to understand those trends than by actually building with the tools. Vibe-coding my way through this pipeline helped me see where the real friction and magic lie.

Initially, this was just a creative coding experiment. I wanted to see what would happen if I combined multiple generative tools into a single character experience.

I also really enjoyed wearing my engineering hat again. Even though I hadn't written Python in eight years, I was able to plug in different components, debug, and figure things out by reading threads on Reddit and X. It reminded me how empowering it is to go from idea to output in a weekend.

But then it got more personal. The voice was mine. The words were... not quite mine. And the face looked like a memory of me. It became a hybrid: part tool, part story, part self.

There's something strangely grounding about hearing your voice say something you didn’t write — but still believe.

The Stack

Voice Cloning: I used ElevenLabs to clone my voice from a short audio sample. Their Python SDK made it easy to generate audio programmatically by passing in text strings and controlling the output format. The model's emotional expressiveness made the synthetic voice feel uncannily human. One challenge was getting my accent exactly right — my natural tone blends Indian, British, and slight American inflections. The output leaned toward a British-American hybrid, which sounds like me at a base level, but softer and more neutral than my usual self. Without high-quality, studio-level recording equipment, it can be hard for the model to fully capture the intricacies of a non-native or blended accent. ElevenLabs performs impressively given minimal input, but the subtleties of regional inflection remain a challenge for voice cloning models in general.

ElevenLabs uses deep learning-based speech synthesis, and although they haven't disclosed the full technical details, their architecture likely builds on modern end-to-end neural TTS models such as VITS (Variational Inference Text-to-Speech) or transformer-based frameworks that capture both prosody and speaker identity.
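
In code, the synthesis step boiled down to a single call. Here's a minimal sketch, assuming the older generate/save helpers from the elevenlabs package (newer SDK versions expose a client object instead, so exact names depend on the version you install); the API key, voice ID, and model name below are placeholders:

    from elevenlabs import generate, save, set_api_key

    set_api_key("ELEVENLABS_API_KEY")  # placeholder; load from an env var in practice

    def synthesize(text: str, out_path: str = "voice.mp3") -> str:
        # Generate speech in the cloned voice and write it to an mp3 file.
        audio = generate(
            text=text,
            voice="my-cloned-voice-id",      # placeholder for the cloned voice's ID
            model="eleven_multilingual_v2",  # assumed model name; use whatever your account supports
        )
        save(audio, out_path)
        return out_path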

LLM Scripting: For dialogue generation, I used Gemma, a 2B parameter open-source model from Google, running locally via Ollama. I wrote a thin wrapper that accepted input prompts and returned streamed completions. Prompting it with "soft, introspective internal monologue" yielded voice lines that felt tonally consistent with the Ghibli aesthetic.
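
The wrapper really was thin. A minimal sketch, assuming Ollama is running locally with gemma:2b pulled (the prompt framing here is illustrative, not my exact prompt):

    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def monologue(prompt: str, model: str = "gemma:2b") -> str:
        # Stream a completion from the local Ollama server and join the chunks.
        payload = {
            "model": model,
            "prompt": f"Write a soft, introspective internal monologue. {prompt}",
            "stream": True,
        }
        pieces = []
        with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
            r.raise_for_status()
            # Ollama streams newline-delimited JSON objects, each with a "response" field.
            for line in r.iter_lines():
                if line:
                    pieces.append(json.loads(line).get("response", ""))
        return "".join(pieces)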

Portrait Generation: I used OpenAI's DALL·E through the Python API with a style-focused prompt: "Studio Ghibli-style portrait of a girl in a forest, watercolor, soft lighting." I ran several generations, selected the most expressive one, and lightly upscaled it.
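
A sketch of that step with the 1.x-style openai client (the model name and image size here are assumptions; the SDK reads OPENAI_API_KEY from the environment):

    import requests
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    def generate_portrait(out_path: str = "image.png") -> str:
        # Request one Ghibli-style portrait and download it locally.
        resp = client.images.generate(
            model="dall-e-3",  # assumed model name
            prompt=("Studio Ghibli-style portrait of a girl in a forest, "
                    "watercolor, soft lighting"),
            size="1024x1024",
            n=1,
        )
        image_url = resp.data[0].url
        with open(out_path, "wb") as f:
            f.write(requests.get(image_url, timeout=60).content)
        return out_path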

Animation & Lip Sync: I uploaded the portrait and audio into Kaiber and used its timeline editor to sync lip movements automatically. I used camera panning and slow-zoom effects to simulate the feel of an animated cutscene. The result felt closer to a cinematic AI short than a static demo.

Code & Pipeline: The glue was Python. I used the elevenlabs and openai SDKs, a local Gemma instance via Ollama, and pydub to layer any background ambience. Each stage of the pipeline output clean files (voice.mp3, image.png), which I fed into Kaiber for final animation.

The full flow looked like:

[text prompt] → [Gemma LLM] → [ElevenLabs voice synth] → [DALL·E image gen] → [Kaiber animation] → [video output]
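
In code, a full run is just the hypothetical helpers from the sketches above chained together; the last step stays manual because Kaiber is a web tool:

    # Chain the earlier sketches: script -> voice -> portrait, then hand off to Kaiber.
    script = monologue("Theme: saying things I haven't let myself say out loud.")
    voice_path = synthesize(script, out_path="voice.mp3")
    image_path = generate_portrait(out_path="image.png")
    print(f"Upload to Kaiber for animation: {voice_path}, {image_path}")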

What's Next

Next up, I’m prototyping a real-time conversational agent:

  • Speech-to-text with Whisper

  • LLM-generated reply via Gemma (or optionally GPT-4)

  • Text-to-speech output via ElevenLabs in my cloned voice

  • Voice looped back as output

Once complete, this will let me talk to my AI self live, in voice. Imagine a personal assistant, but Ghibli-coded.
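
A rough sketch of one turn of that loop, reusing the hypothetical monologue and synthesize helpers from above and the open-source whisper package for transcription (microphone capture and audio playback are left out):

    import whisper

    stt_model = whisper.load_model("base")  # small model to keep latency reasonable

    def one_turn(wav_path: str = "turn.wav") -> str:
        # speech -> text -> Gemma reply -> cloned-voice audio
        heard = stt_model.transcribe(wav_path)["text"]
        reply = monologue(f"Reply as my reflective AI self to: {heard}")
        return synthesize(reply, out_path="reply.mp3")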

I’m also considering releasing each voice clone monologue as a “chapter” in an AI short film series — one per theme (doubt, memory, self-worth, ambition, etc.).

If you’re building experimental voice interfaces, generative characters, or LLM + TTS pipelines, I’d love to connect.

Subscribe below to follow this journey — I’ll share code, pipelines, and new builds as they come.
