I don’t think you even need examples of the actual thing to train a neural network to recognize it. For example, if I wanted to train a neural network to recognize pictures of lions but didn’t have any actual pictures of lions, I could use pictures of lion-shaped things, lion-colored things, and places where lions might appear. If a picture hits all three of those, it’s very likely to be a lion. A neural network only ever outputs a likelihood anyway, so that’s good enough for my purposes.
Available image generators can already produce those images without ever being trained on the real thing. Once a neural network can detect/generate two separate concepts, it can detect/generate the overlap. The result won’t be as refined as if it had been trained on the real thing, obviously, but it can still turn out scarily accurate.
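
As a rough sketch of the "hitting all three" idea: suppose three separate detectors each give a probability for one cue (shape, color, location), and we combine them into a single lion score. Everything here is made up for illustration: the function names, the example scores, and the 0.5 threshold. Multiplying the scores also assumes the cues are independent, which real ones won't quite be.

```python
from math import prod


def combined_score(scores):
    """Multiply per-cue probabilities into one score.

    Assumption: the cues are treated as independent, so the product
    is only a crude stand-in for the real joint probability.
    """
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1], got {value}")
    return prod(scores.values())


def looks_like_lion(scores, threshold=0.5):
    """Return True when the combined score reaches the (hypothetical) threshold."""
    return combined_score(scores) >= threshold


# Hypothetical detector outputs for one picture.
picture = {"lion_shape": 0.9, "lion_color": 0.9, "lion_habitat": 0.9}
print(combined_score(picture), looks_like_lion(picture))
```

A picture that scores high on only one or two cues gets dragged down by the weak one, which is the point: the score is only high when all three line up.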