Considering how poorly penis detection algorithms have worked in sandbox games, I hope you’ll forgive me for not being terribly optimistic about object recognition. It’s easy for humans, but it’s fundamentally a very hard problem for computers, because humans have a contextual understanding of what things represent and mean that computers lack. After we’ve told a computer what a dog is, it can do a decent job of recognizing dogs. But it doesn’t really understand what a dog is. It’s a little like ascribing attributes to books (good, bad, sad, funny) and asking a computer to characterize other books based on those. The computer may make decent recommendations in general, but it doesn’t really understand why a book is good, just that books with characteristics x, y, and z are typically considered “good”, and that this book has a blend of characteristics that would make it “good”. It doesn’t understand the mental processes that lead humans to enjoy the book.
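To make the book analogy concrete, here’s a minimal sketch in Python. The trait names, the labels, and the `learn_trait_weights` and `predict` helpers are all made up for illustration, and real recommenders are far more sophisticated, but the point carries over: tally which traits show up in books people called “good”, then score a new book by those tallies.

```python
from collections import Counter

# Hypothetical training data: each book is a set of traits plus a human label.
LABELED_BOOKS = [
    ({"funny", "fast-paced", "short"}, "good"),
    ({"funny", "sad", "long"}, "good"),
    ({"sad", "slow", "long"}, "bad"),
    ({"slow", "short"}, "bad"),
]

def learn_trait_weights(labeled_books):
    """Give each trait +1 per 'good' book it appears in and -1 per 'bad' book."""
    weights = Counter()
    for traits, label in labeled_books:
        for trait in traits:
            weights[trait] += 1 if label == "good" else -1
    return weights

def predict(traits, weights):
    """Call a book 'good' if its traits sum to a positive score, else 'bad'."""
    # Counter returns 0 for traits it has never seen, so they add nothing.
    score = sum(weights[trait] for trait in traits)
    return "good" if score > 0 else "bad"

weights = learn_trait_weights(LABELED_BOOKS)
print(predict({"funny", "long"}, weights))  # scored purely on trait tallies
print(predict({"slow", "sad"}, weights))
```

Nothing in here knows why a funny book is enjoyable. A trait it has never seen simply contributes nothing to the score, so the label it hands back is a tally, not a judgment.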
That lack of understanding makes the whole object recognition thing particularly hard for computers. While a computer could probably learn to recognize chairs, for instance, it might be a lot worse at recognizing the broader category of “phallic-suggesting items”. Is a giant pencil in the category of “phallic-suggesting items”? ¯\_(ツ)_/¯ Possibly? It really depends on context. In a world of oversized books and desks, probably not. In a porno with a woman in a schoolgirl’s outfit? Most likely. Outside of any context, or in the ambiguous context of user-generated content? It’s likely impossible to tell. Some things in the “phallic-suggesting items” category may be more obviously phallic without context than others, but in, say, a Lego game, how do you distinguish Lego brick penises from Lego brick obelisks?
So I’m actually not terribly optimistic about computer vision as a solvable problem, at least as a general-purpose, scalable solution. Computer vision is likely going to be limited to recognizing certain types of things, as opposed to being able to recognize any object you put in front of it. And in a world as open to user-generated content as, say, Second Life (which had its own scripting engine for creating interactive objects whose whole state is user-generated), you’d need the general solution.