This shouldn't be too hard to achieve. But as you have already found, you can't just mix the music straight in dry, because it sounds unnatural for the scene in which it's supposedly being played. I expect some effects units and software have presets to do just this.
But to get you in the right ballpark: roll off most of the top-end frequencies and most of the bottom end too, until it sounds quite dull and boxy. Then put a room ambience on it using whatever effects you have available. You'll have to experiment with the settings until it starts to sound like it fits in with your scene. Be careful not to add too much ambience; it's easy to overdo, and compression later on could highlight it even more. You might want to check out some convolution reverbs too if you are serious about making the effect convincing, though that might be overkill.
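If you'd rather see the idea than hunt for a preset, here is a rough Python sketch of the two steps above: band-limit the music, then put a little room on it. Everything specific in it is my own assumption for illustration, not a recommended setting: the 300 Hz and 3 kHz cutoffs, the 23 ms delay, the feedback and wet amounts, and the function names. The "room" is just a single feedback delay, far cruder than a real reverb, so treat it as a starting point to tune by ear.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """Roll off content above cutoff_hz with a gentle one-pole filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

def one_pole_highpass(samples, cutoff_hz, sample_rate):
    """Roll off content below cutoff_hz: input minus its low-passed copy."""
    low = one_pole_lowpass(samples, cutoff_hz, sample_rate)
    return [x - l for x, l in zip(samples, low)]

def band_limit(samples, sample_rate, low_cut_hz=300.0, high_cut_hz=3000.0):
    """Strip most of the bottom and top so only a boxy midrange is left.
    The cutoff values are assumed starting points, not fixed rules."""
    band = one_pole_highpass(samples, low_cut_hz, sample_rate)
    return one_pole_lowpass(band, high_cut_hz, sample_rate)

def add_room(samples, sample_rate, delay_ms=23.0, feedback=0.4, wet=0.2):
    """Crude ambience: one feedback delay line mixed in under the dry signal.
    Keep `wet` low, since too much ambience is easy to overdo."""
    delay = max(1, int(sample_rate * delay_ms / 1000.0))
    buf = [0.0] * delay          # circular delay buffer
    idx = 0
    out = []
    for x in samples:
        delayed = buf[idx]
        buf[idx] = x + feedback * delayed
        idx = (idx + 1) % delay
        out.append((1.0 - wet) * x + wet * delayed)
    return out

def in_scene(samples, sample_rate):
    """Band-limit first, then add the room, as described above."""
    return add_room(band_limit(samples, sample_rate), sample_rate)
```

Here `samples` is a plain list of floats for one mono channel, so something like `in_scene(music, 44100)` gives you the treated version. The filters are deliberately gentle (6 dB per octave); a steeper EQ in your editor will sound boxier still.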
You need a good ear here though, as you are left with a lot of midrange, which may conflict with the spoken lines. So it's important to keep the spoken material nice and clean. The two sounds need different characteristics to be clearly differentiated from each other. But as the music level will likely be quite low anyway, it shouldn't be too much of a problem.
Check how it all sounds together with tons of compression piled on too, as this is what TV and radio stations do to bastardise sound quality in favour of loudness these days.
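To see why heavy compression can drag a quiet room ambience up in the mix, here is a very crude Python sketch. It is a static, per-sample compressor with no attack or release, which is not how a real broadcast processor works, and the threshold, ratio and makeup gain are numbers I made up just to show the effect.

```python
import math

def compress(samples, threshold=0.1, ratio=8.0, makeup=4.0):
    """Static compressor sketch: anything above `threshold` is squashed
    by `ratio`, then the whole signal is boosted by `makeup`.
    All three values are assumptions chosen for illustration."""
    out = []
    for x in samples:
        level = abs(x)
        if level > threshold:
            level = threshold + (level - threshold) / ratio
        out.append(math.copysign(level, x) * makeup)
    return out

# A loud peak (say dialogue) next to a quiet one (say a reverb tail).
loud_in, quiet_in = 0.8, 0.05
loud_out, quiet_out = compress([loud_in, quiet_in])
print("gap before:", loud_in / quiet_in)
print("gap after: ", loud_out / quiet_out)
```

The loud sample is held down while the quiet one just gets the makeup gain, so the gap between them shrinks a lot. That is the reason to keep the ambience subtle and to audition the mix under heavy compression before you sign it off.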
Good luck 🙂