the worst part is that we've been able to do this since long before LLMs, and with much better results. There exist filters that will cartoonify your image! deterministic filters - they always spit out the same image given the same input.
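something like this classic OpenCV recipe, for example (just a sketch of the general technique - the filename and parameter values are placeholders, and no particular app's filter works exactly this way):

```python
# Deterministic "cartoonify": smooth the colors, trace the edges, combine.
# Same input image -> same output image, every time.
import cv2

img = cv2.imread("photo.jpg")  # placeholder path

# Smooth flat color regions while keeping edges sharp
color = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)

# Find dark edge lines on a blurred grayscale copy
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 7)
edges = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                              cv2.THRESH_BINARY, blockSize=9, C=2)

# Keep the smoothed colors only where there is no edge line
cartoon = cv2.bitwise_and(color, color, mask=edges)
cv2.imwrite("cartoon.jpg", cartoon)
```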
the problem with LLMs is that they go through this completely unnecessary level of indirection - they describe the image in text and then generate an image based on this text - essentially behaving like an extremely lossy compressor-decompressor pipe
Hmm, I’m not an expert on image AI, but I think your idea of how this works is basically close enough but not exactly right. The image is encoded into tokens (vectors) by an encoder model, and then those tokens are decoded into a new image. The intermediary tokens aren’t really text descriptions of the image, but maybe this distinction is kind of pointless? The lossiness is the same either way
so you’re saying that if I decoded these intermediate tokens I wouldn’t get coherent sentences, but rather something completely random that is just a convenient representation of the image, or perhaps some words that relate to the image (sth like “woman” “man” “marriage” “blonde” “dress” etc.)?
Somewhat. I am not familiar with this exact type of algorithm, but the general name is an “encoder-decoder” architecture. Broadly speaking, you have an input (the original image) and you want to create an output (obviously). You want the input and the output to be “very similar” according to some definition, and you imagine that the AI algorithm has two parts: the encoder, which extracts as much meaningful information as possible from the input, and a decoder, which takes that information and generates something new out of it. This information is practically stored as a list of numbers, and we do not impose any prior meaning on them (we do not say, for example, that the first number is the number of people in the image), but the algorithm learns to make the best of the encoding.
Two copies of the same algorithm trained independently might end up with completely different middle information. The only thing that matters is that the “encoder” and the “decoder” parts both know what’s going on. (Basically, yes, it’s random, but the computer knows how to interpret it - where “know” is used very loosely here)
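if it helps, here is what that looks like in code - a toy autoencoder sketch in PyTorch (my own illustration, with an arbitrary 32-number latent; real image models are far bigger, but the shape of the idea is the same):

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: squeeze a 28x28 grayscale image down to `latent_dim` numbers
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: rebuild an image from those numbers alone
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # z is the "middle information": just a list of floats with no
        # meaning assigned up front - training decides what they encode
        z = self.encoder(x)
        return self.decoder(z).view(-1, 1, 28, 28)

model = TinyAutoencoder()
x = torch.rand(1, 1, 28, 28)             # stand-in for an input image
recon = model(x)
loss = nn.functional.mse_loss(recon, x)  # "very similar" = low pixel-wise error
```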
Sorry for the rant! I hope you found it interesting
I believe so - and some may not really translate well into text at all, instead representing some kind of specific or abstract visual feature. There would be an entire separate neural network, or part of a neural network, specifically for decoding the tokens into text
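e.g. something vaguely like this (completely hypothetical - the tiny vocab, the linear probe, the latent size are all made up just to show the idea of mapping a latent vector to word scores):

```python
import torch
import torch.nn as nn

# Hypothetical "token -> words" probe: score a latent vector against a
# small vocabulary. Purely illustrative; real models do nothing this simple.
vocab = ["woman", "man", "marriage", "blonde", "dress", "outdoor", "smiling"]
probe = nn.Linear(32, len(vocab))   # 32 = latent_dim from the sketch above

z = torch.randn(32)                 # some image's latent vector
scores = probe(z)
top = scores.topk(3).indices
print([vocab[i] for i in top.tolist()])  # meaningless until the probe is trained
```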
is this what they’re using as a promotional graphic? these don’t look like the people in the photo at all
Yeah wtf lol “here are some cartoony somewhat similar photos of somewhat similar people in completely different poses” ???
this is what popped up when I opened the photos app, so yes, it is. they don’t resemble the real people at all