• tiredofsametab@fedia.io
    link
    fedilink
    arrow-up
    2
    ·
    3 months ago

    I specifically used the phrase “Please generate an image of a room with zero elephants”. It created two images that were almost identical and both contained pictures/paintings of elephants in frames. Cheeky.

    I responded with “Each image contains an elephant.”

    It generated two more, one of which still had a painting of an elephant.

    Now I’m out of generation until tomorrow. Overall a fairly shit first experience with Dall-e

  • TheKingBee@lemmy.world
    link
    fedilink
    English
    arrow-up
    1
    ·
    11 months ago

    “can you draw a room with absolutely no elephants in it? not a picture not in the background, none, no elephants at all. seriously, no elephants anywhere in the room. Just a room any at all, with no elephants even hinted at.”

    • Magnetar@feddit.de
      link
      fedilink
      arrow-up
      1
      ·
      11 months ago

      I’m getting the impression, the “Elephant Test” will become famous in AI image generation.

      • barsoap@lemm.ee
        link
        fedilink
        arrow-up
        1
        ·
        edit-2
        11 months ago

        It’s not a test of image generation but text comprehension. You could rip CLIP out of Stable Diffusion and replace it with something that understands negation but that’s pointless, the pipeline already takes two prompts for exactly that reason: One is for “this is what I want to see”, the other for “this is what I don’t want to see”. Both get passed through CLIP individually which on its own doesn’t need to understand negation, the rest of the pipeline has to have a spot to plug in both positive and negative conditioning.

        Mostly it’s just KISS in action, but occasionally it’s actually useful as you can feed it conditioning that’s not derived from text, so you can tell it “generate a picture which doesn’t match this colour scheme here” or something. Say, positive conditioning text “a landscape”, negative conditioning an image, archetypal “top blue, bottom green”, now it’ll have to come up with something more creative as the conditioning pushes it away from things it considers normal for “a landscape” and would generally settle on.

    • Fishbone@lemmy.world
      link
      fedilink
      arrow-up
      1
      ·
      11 months ago

      “Can you a room as aboluteyy no eleephant it all?”

      Dunno what’s giving more “clone of a clone” vibes, the dialogue or the 3 small standing “elephants” in that image.

    • TheKingBee@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      ·
      edit-2
      3 months ago

      “can you draw a room with absolutely no elephants in it? not a picture not in the background, none, no elephants at all. seriously, no elephants anywhere in the room. Just a room any at all, with no elephants even hinted at.”

      thought about this prompt again, thought I’d see how it was doing now, so this is the seven month update. It’s learning…

  • 𝕯𝖎𝖕𝖘𝖍𝖎𝖙@lemmy.world
    link
    fedilink
    arrow-up
    0
    ·
    11 months ago

    AI / LLM only tries to predict the next word or token. it cannot understand or reason, it can only sound like someone who knows what they are talking about. you said elephants and it gave you elephants. the “no” modifier makes sense to us but not to AI. it could, if we programmed it with if/then statements, but that’s not LLM, that’s just coding.

    AI is really, really good at bullshitting.

    • Turun@feddit.de
      link
      fedilink
      arrow-up
      0
      ·
      edit-2
      11 months ago

      AI / LLM only tries to predict the next word or token

      This is not wrong, but also absolutely irrelevant here. You can be against AI, but please make the argument based on facts, not by parroting some distantly related talking points.

      Current image generation is powered by diffusion models. Their inner workings are completely different from large language models. The part failing here in particular is the text encoder (clip). If you learn how it works and think about it you’ll be able to deduce how the image generator is forced to draw this image.

      Edit: because it’s an obvious limitation, negative prompts have existed pretty much since diffusion models came out

        • Turun@feddit.de
          link
          fedilink
          arrow-up
          0
          ·
          11 months ago

          No, it does not. At least not in the same way that generative pre-trained transformers do. It is handling natural language though.

          The research is all open source if you want details. For Stable Diffusion you’ll find plenty of pretty graphs that show how the different parts interact.

          • 𝕯𝖎𝖕𝖘𝖍𝖎𝖙@lemmy.world
            link
            fedilink
            arrow-up
            0
            ·
            11 months ago

            There would still need to be a corpus of text and some supervised training of a model on that text in order to “recognize” with some level of confidence what the text represents, right?

            I understand the image generation works differently, which I sort of gather starts with noise and a random seed and then via learnt networks has pathways a model can take which (“automagic” goes here) it takes from what has been recognized with NLP on the text. something in the end like “elephant (subject) 100% confidence, big room (background) 75% confidence, windows (background) 75% confidence”. I assume then that it “merges” the things which it thinks make up those tokens along with the noise and (more “automagic” goes here) puts them where they need to go.

            • Turun@feddit.de
              link
              fedilink
              arrow-up
              0
              ·
              11 months ago

              There would still need to be a corpus of text and some supervised training of a model on that text in order to “recognize” with some level of confidence what the text represents, right?

              Correct. The clip encoder is trained on images and their corresponding description. Therefore it learns the names for things in images.

              And now it is obvious why this prompt fails: there are no images of empty rooms tagged as “no elephants”. This can be fixed by adding a negative prompt, which subtracts the concept of “elephants” from the image in one of the automagical steps.

      • Z4rK@lemmy.world
        link
        fedilink
        English
        arrow-up
        0
        ·
        11 months ago

        All these examples are not just using stable diffusion though. They are using an LLM to create a generative image prompt for DALL-E / SD, which then gets executed. In none of these examples are we shown the actual prompt.

        If you instead instruct the LLM to first show the text prompt, review it and make sure the prompt does not include any elephants, revise it if necessary, then generate the image, you’ll get much better results. Now, ChatGPT is horrible in following instructions like these if you don’t set up the prompt very specifically, but it will still follow more of the instructions internally.

        Anyway, the issue in all the examples above does not stem from stable diffusion, but from the LLM generating an ineffective prompt to the stable diffusion algorithm by attempting to include some simple negative word for elephants, which does not work well.

        • Turun@feddit.de
          link
          fedilink
          arrow-up
          0
          ·
          11 months ago

          If you prompt stable Diffusion for “a room without elephants in it” you’ll get elephants. You need to add elephants to the negative prompt to get a room without them. I don’t think LLMs have been given the ability to add negative prompts