Using local LLMs to give life to game characters

I like to dabble and experiment with the idea of using LLMs to bring videogames to life.

I want to make an open discussion here about using LLMs in videogames, optimizations that could make them run faster, and ways to make them feel more realistic.

You can discuss and debate however you like, but I would like to start by asking a question:

Do you think it’s possible to strip an open-source LLM so that it only contains knowledge about its fictional game world along with conversational skills, and then make variants for each character in the game, each also trained on their respective lore?

Do you think these models would be fast enough to run during a game on the average gaming PC, and do you think it’s practical enough to consider?
Disclaimer: I’ve worked with LLM APIs quite a bit, but I’ve never trained an LLM.

1 Like

Well, using an API would require you either to pay for the requests yourself or to have players link their own API keys to pay for it. Both seem very unattractive for playing a game.

This means that the best approach would be to implement it into the game itself. As I’m assuming you’re interested in real-time generation, and not pre-generating dialogue that is then hardwired into the game files, this would mean you’d have to have an LLM running alongside the game itself. If that game is already resource-hungry, you’ll lose a big chunk of your potential audience.

Even though the idea of characters that will in real time interact with a player is quite fascinating and exciting, I think we’re not there yet. Nvidia is heavily experimenting with this:

As far as I can tell this is also tied to an API and not yet available to the public.

I’m tackling the same situation right now, more or less. Trying to bring together ML Agents, Barracuda ONNX models, OpenCV, Emerald AI, A* Pathfinding Pro, SALSA Lipsync Pro, Final IK, speech audio clips (I can’t find a suitable real-time TTS yet), and of course, LLMs :sweat_smile:.

When you say knowledge stripped down to just knowledge of the fictional world, that does invoke thoughts of using RAG, vector databases, and such. I haven’t tried RAG yet, so it would be interesting to know whether a lite LLM, say a 1.5B or 3B parameter model, would suffice. Then you take any or all of these assets and make something that’s able to converse with you, move around its virtual environment, start figuring things out, and call other AI to help it solve its own problems: develop its own learning curriculum, take control of Python scripts that are narrow in focus, and pass the results through a GAN or a more discerning AI to let it know whether they pass muster.
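A minimal sketch of the RAG idea, assuming each character gets its own small lore store (the blacksmith facts below are made up). A real setup would swap the toy bag-of-words similarity for a sentence-embedding model and a vector database, but the shape of the retrieval step is the same:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real pipeline would use
    # a sentence-embedding model here instead.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(lore: list[str], query: str, k: int = 2) -> list[str]:
    # Return the k lore facts most similar to the player's question;
    # only these go into the small model's prompt.
    q = embed(query)
    return sorted(lore, key=lambda doc: cosine(embed(doc), q), reverse=True)[:k]

# Hypothetical per-character lore store
blacksmith_lore = [
    "The blacksmith forged the king's sword before the war.",
    "The tavern serves ale brewed in the northern hills.",
    "Iron ore comes from the mines beneath Mount Karag.",
]
context = retrieve(blacksmith_lore, "Who made the king's sword?")
```

The appeal is that a 1.5B–3B model only has to converse, not memorize the world: the facts travel in the prompt.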

1 Like

Yes, but what if you trimmed out as much data as possible and then retrained the model on only the information it needs to know? That would make the model faster, but would it end up being fast enough, and would the resulting quality be worth the effort? I think that’s the real question.

If you look at the VTuber Neuro-sama, she is an LLM running locally on her developer’s PC. He also often plays/develops games with her running, so it should be possible at at least her quality level.

I’m not entirely sure how that works, but I know you can do it, since NSFW content gets stripped out of models all the time. I’m really just throwing ideas out there.

Another method would just be to train a base model from scratch, and then retrain it with added data for each specific agent, but I feel like that would take an insane amount of data and time.

1 Like

As for pathfinding, you could just keep a list of things an agent can interact with, and use A* (or another method) to path to the target the LLM specifies, without having to implement a complex LLM-driven navigation system.
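A small sketch of that split, assuming the LLM only replies with the name of a target (the object list and coordinates are made up). Fuzzy matching keeps sloppy model output from breaking the lookup, and the actual A* call stays in ordinary engine code:

```python
import difflib

# Hypothetical interactables with their world positions
INTERACTABLES = {
    "well": (12.0, 0.0, -3.5),
    "anvil": (4.0, 0.0, 8.0),
    "tavern door": (-7.5, 0.0, 2.0),
}

def resolve_target(llm_output: str):
    # Fuzzy-match the LLM's answer against known object names,
    # so "the Anvil" still resolves to "anvil".
    matches = difflib.get_close_matches(
        llm_output.strip().lower(), INTERACTABLES, n=1, cutoff=0.6
    )
    if not matches:
        return None  # re-prompt the LLM, or fall back to idle behavior
    name = matches[0]
    return name, INTERACTABLES[name]  # the position goes to A*, not the LLM

target = resolve_target("the Anvil")
```

The LLM never sees coordinates or the navmesh; it just picks from a menu the game already understands.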

1 Like

I’m just throwing ideas out there at this point, stuff that I haven’t tried but might be possible. Suppose you use Google Gemini 1.5 Pro with its 1-million-token context window, and you’ve used any other LLM of your choice to generate the personas of all the AI agents you would create, along with as detailed a description as you can make of the rules of the world, how agents would behave, and how they’d speak to you or each other. You collect all of those personas and feed them into whatever agentic solution you choose to bring the characters “to life” using the LLM of your choice: what sort of things each one talks about, how it talks, all the subtleties of unique speech.

Then you combine that with vector databases, or just simple solutions like text files unique to each NPC/agent/entity. They could share the same LLM; it’s just the knowledge they have access to that changes as needed. You could possibly train shared brains using ML Agents and create clones of them as the starting point for further fine-tuning, though that would get complicated as far as version control goes. This is just something I’m coming up with on the fly, haha.
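The shared-model idea with swapped-in knowledge could look something like this; the persona, facts, and prompt layout below are all made up for illustration:

```python
WORLD_RULES = "Medieval fantasy setting. Characters know nothing of the real world."

def build_prompt(persona: str, knowledge: list[str], player_line: str) -> str:
    # One shared model; only the persona and accessible facts change per NPC.
    lore = "\n".join(f"- {fact}" for fact in knowledge)
    return (
        f"World rules: {WORLD_RULES}\n"
        f"You are: {persona}\n"
        f"Things you know:\n{lore}\n"
        f"Player says: {player_line}\n"
        "Reply in character:"
    )

prompt = build_prompt(
    "Mira, a guarded innkeeper who speaks in short sentences.",
    ["A stranger in a grey cloak rented room 3 last night."],
    "Seen anyone unusual lately?",
)
```

Each NPC is then just a persona string plus whatever facts its text file or vector store hands back, all run through the same model.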

I’ve run into many issues while trying to use LLM APIs to create characters. Mainly, you run into problems with consistency and hallucinations, as well as the models ignoring information and instructions. There’s also the matter of cost. I don’t know about Gemini specifically, since I’ve only ever used their web interface/mobile assistant app and not the API, but when using GPT-4, the cost of processing many requests can get quite extreme, especially if you’re saving conversation history. This isn’t that big of a deal when we’re the only ones using the model, but if the game is ever published, the cost will grow rapidly as more and more users use the app/game.

Both of these are reasons why I think it might be a good idea to take the time to train custom models.
Though, your point about saving prompts per character and running those against a model isn’t a bad idea. But I think you would get better performance from a light model if it’s trained on those personas instead, since a model has to be quite a bit smarter to emulate an arbitrary persona correctly from a prompt alone.
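For what it’s worth, persona fine-tuning data is usually just instruction/response pairs. A made-up example of what one character’s records might look like as JSONL (the persona and dialogue are invented, and real fine-tuning formats vary by framework):

```python
import json

# Hypothetical persona and dialogue pairs for one character
persona = "Garruk, a gruff blacksmith who distrusts magic."
dialogues = [
    ("Can you enchant this blade?", "Enchant? Pah. Steel is honest. Magic is not."),
    ("How much for a repair?", "Five coppers. And don't haggle."),
]

records = [
    {
        "instruction": f"You are {persona} Reply in character.",
        "input": user,
        "output": reply,
    }
    for user, reply in dialogues
]
# One JSON object per line: the common fine-tuning file layout
jsonl = "\n".join(json.dumps(r) for r in records)
```

The per-character variants you described would each be fine-tuned on a file like this, on top of one shared base model.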

Keep in mind that I’m not well versed on LLM training, I only started researching this topic an hour or so ago, so I’m learning about this as we speak.

1 Like

I have not fine-tuned a model yet either, so I apologize if any advice I give turns out to be wrong. The Gemini API is currently free through their AI Studio portal. That gives you 60 API calls per minute, so it’s not something you’d distribute using your own API key; if the game were coded so that people bring their own API key, then everyone would get their own 60 calls a minute. It’s not really intended for production runs; that’s what the Vertex portal was created for. Perhaps I’m wrong about those API scenarios and use cases, but for local testing purposes, it’s pretty great in some regards. I’ve made a multi-threaded image prompt generator that calls the API in rapid-fire succession and didn’t max out the full 60 calls a minute.

And I have another suggestion: instead of using OpenAI’s GPT-4 API, use Copilot instead. It’s free; the trick is, you need some GitHub project to get you compatible endpoints. I’ve come across two, and there are probably scores more that mostly share the same code/strategy. The one I use is called SydneyQT by juzeon. There’s the Go/Wails UI, but if you look at the bottom, they mention now having a web API; that’s the part you want for a one-for-one swap with the OpenAI API. This certainly is not a viable thing to count on for a published build, but for testing purposes it seems viable enough, at least until Copilot can no longer be jailbroken. I tried it against Pythagora’s GPT Pilot and did some simple Python tests, and it worked all right.

I’m quite interested to find out how your per-character fine-tuning works out, as that’s going to be a very often pondered question: how to actually have very different agents that “think” or evaluate a passage of text differently. You could arrive at some sort of voting system, which would be useful when you want different perspectives weighing in on how to proceed.

60/min is not enough. Based on past projects, it’s more like 100–200 calls/min per agent (assuming they’re not just chatbots and actually think about their actions).

I saw that video; I think it could interest you somehow.

1 Like

I’m only asking the following because I’m trying to understand: why would you need anywhere near 200 LLM calls per agent per minute? I could envision getting to that scale and beyond for a singular agent that communicates externally across barriers, coordinating in a universally understood language (English, for example). But for agents that you control, interacting with each other internally, is all of that LLM usage necessary? Or could some of it be reduced to ML algorithms, behavior trees, reinforcement learning, or action spaces defined as indexed entries that are known to all agents: commands or instructions in the form of methods or functions that take some number of parameters, setting destinations, loiter times, what to do when there. As long as you have a system in place to keep track of it all, I think much of it could be done without LLM calls.

If you come up with non-blocking code and afford some flexibility to what your agents can do on your computer (and other end users’ devices as well), you could possibly multiplex many inter-agent data exchanges using UDP or TCP ports. UDP works on computers and Android, at least. Are you envisioning having all these agents speaking aloud, all at the same time, for the pleasure of the human user? Or just a couple or a few at a time at most, and sometimes none? I’m thinking agent observations can be shared in non-LLM ways, too. There would need to be time relevance to these observations: short-term and long-term memory, the latter tied into RAG vector-database entries for when you really do need to run an LLM with no way around it. I could also imagine a database maintainer that periodically cleans up irrelevant entries to keep things tidy and slim.

This whole topic you started isn’t just applicable to one-off game making. These are real problems that will be solved/addressed as AI entities, embodied or not, start working with each other, handshaking some representation of things that can be reduced in computational cost. It takes a lot of RAM, CPU, and GPU resources to maintain a language-based dialog. Perhaps there will be domain-specific templates/protocols for conversing intelligently about similar things, and it becomes more of a string- or variable-parsing exercise.
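The indexed-action-space idea could be as simple as having agents emit tiny structured commands instead of prose. A sketch with a made-up command grammar (the action names and syntax here are illustrative, not any standard protocol):

```python
import re

# Toy action protocol: agents exchange indexed commands instead of free text.
ACTIONS = {"move_to", "follow", "defend", "say"}

def parse_command(cmd: str):
    # e.g. "move_to(well, loiter=5)" -> ("move_to", ["well"], {"loiter": "5"})
    m = re.fullmatch(r"(\w+)\((.*)\)", cmd.strip())
    if not m or m.group(1) not in ACTIONS:
        return None  # unknown action: reject rather than guess
    args, kwargs = [], {}
    for part in filter(None, (p.strip() for p in m.group(2).split(","))):
        if "=" in part:
            key, value = part.split("=", 1)
            kwargs[key.strip()] = value.strip()
        else:
            args.append(part)
    return m.group(1), args, kwargs

action = parse_command("move_to(well, loiter=5)")
```

Parsing a command like this costs microseconds, so the heavyweight LLM only needs to be consulted when an agent genuinely has to reason or speak in natural language.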

Just sitting here watching MattVidPro AI’s video about Quiet-STaR and gpt-prompt-engineer/in-context distillation, and I was like, dang, this is perhaps what’s needed to give personas to game characters.

I don’t believe you can strip knowledge. But you can absolutely fine-tune to your world and characters. Fine-tuning could reduce your need for large prompt contexts, speeding things up and reducing costs.

There’s also something to be said for using large models to generate synthetic data to train/fine-tune smaller, cheaper, quicker models. I’d be interested to see how far you could get doing this.

That said, haven’t done it yet myself :stuck_out_tongue:

I think this question is the key, at least until hardware constraints improve. I often see people trying to use ML and LLMs when there are better tools for the job.

1 Like

I think both of you may be right. Back when I last did a large LLM project, I was making a simple ‘show’ where four agents were placed in a small scene with many interactable objects. The agents would talk to each other and decide what to interact with in the scene from a list; that’s why there were so many calls.

I’m also working on a project in this area.

On the question of running a local model, in my experience current open-source models (Mixtral, Llama, Gemma) are too primitive at this point (much worse than GPT-3.5). But maybe during the period you are working on your game, the models and user hardware will become capable enough. Maybe use the service APIs for now and switch to a local model once they are viable?

On the request-rate issue others have mentioned, one approach is to have the LLM only direct high-level behavior settings (e.g. move-to/follow/defend/attack that unit), as a human would do for an RTS game unit.
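A sketch of that pattern, where the LLM is consulted only when an order should change and the per-frame work stays in ordinary game code (the order set and `Unit` class are made up for illustration):

```python
from enum import Enum, auto

class Order(Enum):
    MOVE_TO = auto()
    FOLLOW = auto()
    DEFEND = auto()
    ATTACK = auto()

class Unit:
    def __init__(self):
        self.order = Order.MOVE_TO
        self.target = None

    def set_order(self, llm_choice: str, target=None):
        # One cheap LLM call picks the order; normalizing the string
        # tolerates outputs like "defend" or "move-to".
        self.order = Order[llm_choice.strip().upper().replace("-", "_")]
        self.target = target
        # Every frame after this, pathfinding/combat code executes the
        # order with zero LLM involvement.

u = Unit()
u.set_order("defend", target="gate")
```

The call rate then scales with how often situations change, not with the frame rate, which is what makes a 60-calls-per-minute budget plausible.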

Look for voice cloning on Hugging Face. Voice-cloning models are great for games because you can generate (or record) a voice using ElevenLabs, for example, export a sample, and then use that sample with a voice-cloning model to give characters their own voice in real time, consistently.

XTTS v2 (the best one I found, and I looked a lot, haha), for example, is lightning fast with low resource use. You can run it locally in the background as a localhost API, no problem. coqui/XTTS-v2 · Hugging Face
Demo: XTTS - a Hugging Face Space by coqui

I wrote code to run it locally with Gradio, send me a PM and I’ll send you the code if you need it.

Lol, yeah, that’s the exact model I’ve been using, via Coqui TTS. It works well, but it takes about 0.7x to 0.8x real-time speed to generate on my laptop. That’s probably as good as I’m going to get for a while, until either I get a better computer or some optimizations bring it down to around 0.1x real-time speed, e.g. parallel-processing sentences and then stitching them back together in the right order :man_shrugging:t2:. I think it’s something where you have to really get under the hood to squeeze every bit of your computer’s performance. An idea I just came up with: if you’re trying to read an LLM response aloud, you might process the first bit of the response and start playing it back, with hopes that the parallel-processed remainder is done before the first chunk finishes, haha. I think I’m going to try this approach today: pass a long passage in, split it into various-sized batches, and see if I get some speed-up.
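The split-and-stitch idea can be sketched with a thread pool, with a stand-in `synthesize` function in place of the real XTTS call. `pool.map` yields results in submission order, so the first chunk can start playing while later sentences are still rendering:

```python
import re
from concurrent.futures import ThreadPoolExecutor

def synthesize(sentence: str) -> bytes:
    # Stand-in for the real XTTS call; returns fake "audio" bytes.
    return f"<audio:{sentence}>".encode()

def speak(passage: str):
    # Split on sentence boundaries, synthesize in parallel, yield in order.
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves submission order, so playback stays correct
        # even if later sentences finish synthesizing out of order.
        yield from pool.map(synthesize, sentences)

chunks = list(speak("Hello there. How are you? Fine, thanks!"))
```

In a real pipeline you'd feed each chunk to the audio player as it arrives instead of collecting them into a list.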

1 Like

You wouldn’t happen to know how to create and use a NumPy embedding of a cloned voice, would you? I think that’s key to making parallel batch operations work: the source speech needs to have already been preprocessed. For one-off long passages, it’s been fine to do the source sampling every time, but to carry on a conversation with the AI and do parallel processing, I think each thread would need things simplified for it.

Oops, I also forgot what thread I was responding to. Sorry, KieeeShadow, for taking it down a tangent that probably wouldn’t work in the context of a shipped game; Python solutions are great for many things, but if this is to run on a mobile device or in WebGL, you can’t count on Python being part of the solution unless you have a dedicated server providing it.

Which does lead me to another question: how are you thinking about wrapping all this up? The service APIs for LLMs are one thing, but the local LLMs, how are you planning to run them? That’s a tricky road ahead: the llama.cpp-related bindings and all that. I haven’t dabbled with that much yet, but are you planning to ship fine-tuned models, or have them hosted somewhere and fetched during game installation? These are all interesting things to wonder about. Also, what game engine are you planning to use? The biggest thing I want to learn is how to build a wrapper around llama.cpp such that LLMs can be called entirely from within Unity with C# code, so there is no dependence on Python, LM Studio, or the like for serving one’s LLMs.

Actually, I just checked, and if you use Unity, it looks like somebody has already solved the very problem I posed. LLMUnity by undreamai appears to be a viable solution for computers, though there’s no mention of Android or iOS, so that probably needs much further reworking. Still, it positions your game’s LLM incorporation much better, if Unity is your engine of choice. If you made your game in Unreal, Godot, or something else, you might need to take the source code of something already built around the llama.cpp and llamafile libraries and strip it down to act as the necessary local server, or build up the workings of a server from scratch.

The asset is available for free on the Unity Asset Store as LLM For Unity, or on the GitHub page. I’m trying it out now; it looks very promising.

Following that developer’s Discord, there’s pertinent discussion about the TTS-for-Unity topic under the text-to-speech channel. I don’t think it’s quite figured out yet, but it’s a work in progress that people are trying to sort out.