Holding my Shiba Inu – TechCrunch
The AI world is still figuring out how to deal with the amazing feat that is DALL-E 2’s ability to draw/paint/imagine just about anything… but OpenAI isn’t the only one working on something like that. Google Research was quick to publicize a similar model it has been working on, which it says is even better.
Imagen (get it?) is a text-to-image diffusion-based generator built on large transformer language models that… okay, let’s slow down and unpack that real quick.
Text-to-image models take text inputs like “a dog on a bike” and output a corresponding image, something that’s been done for years but has recently seen huge leaps in quality and accessibility.
Part of that is the use of diffusion techniques, which basically start with a pure noise image and slowly refine it bit by bit until the model thinks it can’t make it look any more like a dog on a bike than it already does. This was an improvement over top-to-bottom generators that could get it hilariously wrong on the first guess, and others that could easily be led astray.
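To make the "start from noise, refine bit by bit" idea concrete, here is a toy sketch in Python. It is not how Imagen or any real diffusion model is implemented: the function name `toy_refine` is made up for illustration, and the `target` list stands in for what a trained denoiser would predict, whereas a real model has to learn that prediction from data.

```python
import random


def toy_refine(target, steps=50, seed=0):
    """Start from pure noise and nudge it toward `target` a little each step.

    Assumption for illustration only: `target` plays the role of the
    denoiser's prediction. A real diffusion model is never handed the answer.
    """
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise to begin with
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap each step, so early steps make small
        # corrections and the final step lands on the prediction.
        image = [x + (t - x) / remaining for x, t in zip(image, target)]
    return image


target = [0.2, 0.8, 0.5, 0.1]  # a four-"pixel" stand-in for a dog on a bike
result = toy_refine(target)
```

The point of the sketch is only the shape of the loop: many small corrections applied to noise, rather than one guess made in a single pass.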
The other part is improved language understanding through large language models using the transformer approach, the technical side of which I won’t (and can’t) get into here, but it and a few other recent advances have led to convincing language models like GPT-3 and others.
Imagen starts by generating a small image (64×64 pixels) and then does two “super-resolution” passes over it to bring it up to 1024×1024. This isn’t like normal upscaling, though, because AI super-resolution creates new details in harmony with the smaller image, using the original as a base.
Say, for example, you have a dog on a bike, and the dog’s eye is 3 pixels across in the first image. Not much room for expression! But in the second image, it’s 12 pixels across. Where does the detail needed for this come from? Well, the AI knows what a dog’s eye looks like, so it generates more detail as it draws. Then it happens again when the eye is redone, this time at 48 pixels across. But at no point did the AI have to just pull 48 pixels’ worth of dog eye out of its… let’s say magic bag. Like many artists, it started with the equivalent of a rough sketch, filled it out in a study, and then really went to town on the final canvas.
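The arithmetic behind those numbers can be sketched in a few lines of Python. The 4× factor per pass and the intermediate 256×256 size are inferred from the figures quoted above (64 to 1024 in two passes, and an eye growing from 3 to 12 to 48 pixels); the function names are made up for illustration.

```python
def cascade_sizes(base=64, factor=4, passes=2):
    """Image width after each super-resolution pass, starting at `base`."""
    sizes = [base]
    for _ in range(passes):
        sizes.append(sizes[-1] * factor)
    return sizes


def feature_width(width_at_base, base, size):
    """Width of a feature once the image has grown from `base` to `size`."""
    return width_at_base * size // base


sizes = cascade_sizes()                               # [64, 256, 1024]
eye = [feature_width(3, 64, s) for s in sizes]        # [3, 12, 48]
pixel_growth = (sizes[-1] ** 2) // (sizes[0] ** 2)    # 256
```

In other words, the final image has 256 times as many pixels as the first one, and nearly all of that extra detail has to be generated during the two passes rather than copied from the small image.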
This isn’t unprecedented, and in fact artists working with AI models already use this technique to create pieces much larger than the AI can handle in one go. If you split a canvas into several pieces and super-res each of them separately, you end up with something much larger and more detailed; you can even do it repeatedly. An interesting example from an artist I know:
The advances Google’s researchers claim with Imagen are several. They say that existing text models can be used for the text encoding portion, and that their quality matters more than simply increasing visual fidelity. That makes sense intuitively, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for.
For example, in the paper describing Imagen, they compare its results with DALL-E 2’s for “a panda making latte art.” In all of the latter’s images, it’s latte art of a panda; in most of Imagen’s, it’s a panda making the art. (Neither was able to render a horse riding an astronaut, showing the opposite in all attempts. It’s a work in progress.)
In Google’s tests, Imagen came out ahead in human evaluations of both accuracy and fidelity. This is obviously quite subjective, but to even match the perceived quality of DALL-E 2, which until today was considered a huge leap ahead of everything else, is pretty impressive. I’ll only add that while it’s pretty good, none of these images (from any generator) will withstand more than a cursory look before people notice they’re generated, or at least have serious suspicions.
OpenAI is a step or two ahead of Google in several respects, however. DALL-E 2 is more than a research paper; it’s a private beta with people using it, just as they used its predecessor and GPT-2 and 3. Ironically, the company with “open” in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to attempt it.
That’s more than clear from the choice DALL-E 2’s researchers made to curate the training dataset ahead of time and remove any content that might violate their own guidelines. The model couldn’t make something NSFW if it tried. Google’s team, however, used some large datasets known to include inappropriate material. In an insightful section of the Imagen site describing “Limitations and Societal Impact,” the researchers write:
The downstream applications of text-to-image models are varied and can have a complex impact on society. Potential risks of misuse raise concerns about responsible open-sourcing of code and demos. At this time, we have decided not to release any public code or demo.
The data requirements of text-to-image models have led researchers to rely heavily on large datasets, mostly uncurated and retrieved from the web. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and disparaging or otherwise harmful associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and unwanted content, such as pornographic images and toxic language, we also used the LAION-400M dataset which is known to contain a wide range of inappropriate content, including pornographic images, racial slurs and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and portrayals, which guides our decision not to release Imagen for public use without further safeguards in place.
While some might criticize this, saying that Google is afraid its AI isn’t politically correct enough, that’s an uncharitable and short-sighted view. An AI model is only as good as the data it’s trained on, and not every team can put in the time and effort it takes to remove the truly awful stuff these scrapers pick up as they assemble multi-million-image or multi-billion-word datasets.
Such biases are meant to show up during the research process, which exposes how the systems work and provides an unfettered testing ground for identifying these and other limitations. How else would we know that an AI can’t draw hairstyles common among Black people – hairstyles that any child could draw? Or that when asked to write stories about work environments, the AI invariably makes the boss a man? In these cases, an AI model is working perfectly and as designed – it has successfully learned the biases that pervade the media on which it is trained. Not unlike people!
But while unlearning systemic bias is a lifelong project for many humans, an AI has it easier, and its creators can remove the content that made it misbehave in the first place. Perhaps one day it will be necessary for an AI to write in the style of a racist and sexist pundit of the 1950s, but for now the benefits of including this data are small and the risks significant.
Either way, Imagen, like the others, is clearly still in the experimental phase, not ready to be used other than under strict human supervision. When Google gets around to making its capabilities more accessible, I’m sure we’ll learn more about how and why it works.