Misadventures with Minecraft agents & transformers & hubris
Note: This is, practically verbatim, something I wrote in response to an application question. It was probably not the best response to that question, but I would like to preserve it so I can look back in the future and think "Wow, I didn't know anything back then." So here goes nothing.
Q: Tell us about a paper/blog post/tutorial that delighted you recently. What made it stand out?
A paper that delighted me recently was Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts (link for convenience). This delighted me because, well... I love Minecraft!
Okay, but really. This paper got me out of a doomed prototype spiral just today! Last weekend, I was confined to my parents' house due to a particularly nasty winter storm, and I couldn't resist the urge to Do Something. I'd been repressing my very strong Do Something proclivities since finals in the first week of December, and the inertia from that had kept me lazy up and through winter break.
It was now -- err, last-weekend-now -- almost the end of January, so I sat down with my crummy old laptop and got to work on making my Minecraft agent monstrosity. I had big, idealistic dreams about an "architecture" for it. I've passively consumed a lot of machine learning content through internet osmosis, but as an engineering student my coursework thus far has been all physics, circuits, statics, thermodynamics... not much CS, and certainly not any ML. I spent a little bit of time learning the ropes, then decided to dive straight into the deep end and try to pick up stuff about LLMs, mostly as an experiment and not because I thought I was actually ready for Real, Serious ML. This brought me to transformers, which eventually led me to... The Idea.
The Idea was nothing short of a Dunning-Kruger case study, but it was my Dunning-Kruger case study, dammit, and I learned a lot through trying and failing to make it work. I wanted to create a "low-level" movement model trained on gameplay, using raycasting to "see" the world around it, and a "high-level" planning and reasoning model meant to boss the low-level model around and hopefully accomplish some... things? I decided to let the higher-level details percolate as I set up the practical part of my system. I created a set of Minecraft mods and disseminated them amongst my friends, who graciously provided me with a lot of data while playing on my server.
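My mental picture of that "low-level" half looked roughly like the sketch below: cast a fan of rays from the player's eye, and turn each ray's hit distance into one slot of a fixed-size observation vector for the movement model. Everything here -- the toy voxel world, the function names, the ray count -- is a hypothetical illustration I'm writing for this post, not my actual mod code.

```python
import math

# Toy voxel world: a set of solid (x, y, z) block coordinates, standing in for
# what a Minecraft mod would expose. Flat floor at y=0 plus a small wall.
SOLID = {(x, 0, z) for x in range(-8, 9) for z in range(-8, 9)}
SOLID |= {(3, 1, z) for z in range(-2, 3)}

def cast_ray(origin, yaw_deg, pitch_deg, max_dist=16.0, step=0.25):
    """March along a ray until it enters a solid block; return distance in [0, 1]."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    dx = math.cos(pitch) * math.sin(yaw)
    dy = -math.sin(pitch)  # positive pitch looks downward here
    dz = math.cos(pitch) * math.cos(yaw)
    d = 0.0
    while d < max_dist:
        x, y, z = origin[0] + dx * d, origin[1] + dy * d, origin[2] + dz * d
        if (math.floor(x), math.floor(y), math.floor(z)) in SOLID:
            return d / max_dist  # hit: normalized distance as a feature
        d += step
    return 1.0  # nothing within range

def observe(origin, facing_yaw, n_rays=9, fov=90.0):
    """Fan of rays around the facing direction -> fixed-size feature vector."""
    half = fov / 2
    return [cast_ray(origin, facing_yaw - half + fov * i / (n_rays - 1), 10.0)
            for i in range(n_rays)]

# Eye height ~1.6 blocks, facing +x, looking slightly down at the floor.
obs = observe(origin=(0.0, 1.6, 0.0), facing_yaw=90.0)
```

The appeal (to me, at the time) was that a short vector of normalized distances is cheap to log and trivially feeds a small model, unlike full frames.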
Maybe it was sheer luck, but the first run of my movement model seemed to yield pretty promising results, considering it was trained on only an hour of gameplay. (The main issue was, of course, that it wasn't doing much, nor was it motivated by any goals. But it was cool, okay?)
(Original captions in blocks)
He is now Two Hours of data old and seems to be much better at Minecraft navigation... I'm pleasantly surprised bc most of this data was my friends strip mining and constantly dying. This is only the movement proof-of-concept, my full model is training rn
Also so excited to mess around more w/ embeddings for items and other random stuff I think of as I go. 95% prolly won't work out but I'm trying to actually learn ML stuff in this process lol
swimming is a struggle bc I don't think anyone swam during our session... should prolly do that tomorrow, but we're mostly just running an smp and not particularly focusing on data
But what about The Idea? Well, that was for my "reasoning model," which was... interesting. High on novelty fumes, I went: "Hey, what if I made a transformer... for Minecraft?" and that was that. For better or worse -- bar the destruction of society as we know it, and maybe not even that -- there was no way to drive me from that path.
I took in-game events like breaking blocks, specific types of movement, and crafting, and tokenized them to the best of my ability, pairing them with what I decided were "embeddings" of the stuff they involved, normalized by their positions in game event sequences. Basically, the mod loader (Fabric) serves events to my mod, which then keeps them in its logs. This happens on both the server (which is used to track "storylines," or individual users' actions over time) and on the client, though I found it much easier to track inventory deltas and important crafting events through server-side mods. Once I find time after my midterms, I will probably open source the logger set and the additional world rotator mod I created to ensure nice, clean testing data. Maybe someone else will find them useful :)
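The tokenization step, as I imagined it, was roughly the sketch below: each unique (event, item) pair gets an integer id from a lazily built vocabulary, paired with its normalized position in the storyline. The event names and the exact scheme here are invented for illustration; my real logs were messier.

```python
from collections import defaultdict

# One "storyline": a player's ordered event log, roughly as a server mod might
# emit it. These event/item names are hypothetical examples.
events = [
    ("break_block", "minecraft:oak_log"),
    ("break_block", "minecraft:oak_log"),
    ("craft", "minecraft:oak_planks"),
    ("craft", "minecraft:crafting_table"),
    ("place_block", "minecraft:crafting_table"),
]

# Lazily built vocabulary: the first time a (event, item) pair is seen,
# it is assigned the next integer id.
vocab = defaultdict(lambda: len(vocab))

def tokenize(storyline):
    """Map events to (token_id, position) pairs, position normalized to [0, 1]."""
    n = len(storyline)
    return [(vocab[ev], i / max(n - 1, 1)) for i, ev in enumerate(storyline)]

tokens = tokenize(events)
# Repeated events reuse their id: tokens[0] and tokens[1] share id 0.
```

The normalized position was my stand-in for "where in the play session this happened," so the model could distinguish an early tree-punch from a late one.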
The model was... sub-okay. I mean, it kind of understood the basics of how to play, though it was not smart at all. It predictably started to break down at any divergence or progression from the most common game-start sequence ("punch a tree -> make a crafting table").
I'd anticipated something like this happening, but drove those thoughts away because I was having so much fun building my nonsense machine. And there were obviously more problems than just the model. I mean, even if this had worked, how would it actually link to the lower-level one? I'd considered stuff like skills, but didn't have the data nor the compute to get them up and running. I'd rushed the movement model because I wanted to see results early. It felt like cheating to accept some sort of Mineflayer configuration (a bot command layer) over my beautiful, semi-functional, ego-fueled raycasting! I wasted a day running random experiments hoping it would magically work.
Finally, after a good night's rest, I decided to look and see if there were any recent papers on Minecraft models. I had looked a bit originally and found a few, but most seemed out of date. This time, I looked specifically for Chinese papers -- before, everything had defaulted to America-only. Lo and behold, there was Optimus-3, and another paper published this month about Minecraft NPC task optimization. After a reading break, I decided to mess around with other, more effective frameworks before moving forward; I had wanted to avoid using LLMs to instruct my agent originally, but I've realized they are probably inevitable here.
I still think the idea of raycast nav is interesting, and I really wanted my smart-alecky hack-shenanigans to work out, but obviously they did not, and once I convinced myself of that I was able to move on.
I think the best path for more generalized Minecraft agents is live video processing. DeepMind demoed something related to this way back in 2021, and there's been continuous progress on world modeling with this approach up and through this year. There's so much Minecraft gameplay on YouTube that, with the proper pipeline, you'd have an endless supply of decent data -- for every 10 bad videos you'll find 1 good one, but does that really matter when everyone and their mom and their horse's pet dog has a Let's Play up?
Right now, I've been captured by another silly project (finagling reasoning and perhaps formalized math out of a time-gated corpus). I'm also still trying to get the basics of ML down, since I don't think I will properly be able to appreciate or model these projects without a good grasp of the math behind them. However, there is no question that I will be returning to this. Even if someone more capable perfects Minecraft agents in my absence, I cannot resist the 'craft...