• schnurrito@discuss.tchncs.de
    arrow-up
    6
    ·
    7 hours ago

    Who is “we”? My understanding is that LLMs are mostly trained on a large amount of publicly available text, including both Reddit posts and research papers.

  • Trainguyrom@reddthat.com
    English
    arrow-up
    10
    ·
    12 hours ago

    Short answer: they already are

    Slightly longer answer: GPT models like ChatGPT are part of an experiment in “if we train the AI model on shedloads of data, does it make a more powerful AI model?” After OpenAI made such big waves, every company started copying them, training models similar to ChatGPT rather than trying to innovate and do more.

    Even longer answer: There’s tons of different AI models out there for doing tons of different things. Just look at the over 1 million models on Hugging Face (a company which operates as a repository for AI models among other services) and look at all of the different types of models you can filter for on the left.

    Training an image generation model on research papers probably would make it a lot worse at generating pictures of cats, but training a model that you want to either generate or process research papers on existing research papers would probably make a very high quality model for either goal.

    More to your point, there are some neat, very targeted models with smaller training sets out there, like Microsoft’s Phi-3, which is trained largely on heavily filtered web data and synthetic “textbook-quality” data.

    As for saving the world, I’m curious what you mean by that exactly? These generative text models are great at generating text similar to their training data, and summarization models are great at summarizing text. But ultimately AI isn’t going to save the world. Once the current hype cycle dies down, AI will be a better-known and more widely used technology, but ultimately it’s just a tool in the toolbox.

    • Umbrias@beehaw.org
      arrow-up
      1
      ·
      7 hours ago

      Also, the answer to that question (shitloads of data for a better AI) is yes… with logarithmic returns: massively underpriced (relative to what they cost to generate) returns that have a questionable value proposition at best.
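
      To make “logarithmic returns” concrete, here is a minimal sketch. It assumes a purely hypothetical quality score of the form a + b·log10(N), where N is the number of training tokens; the function name, the constants, and the token counts are illustrative assumptions, not measurements from any real model.

```python
import math

def assumed_quality(num_tokens: float, a: float = 0.0, b: float = 1.0) -> float:
    """Hypothetical quality score that grows with log10 of the training-set size."""
    return a + b * math.log10(num_tokens)

# Illustrative training-set sizes, each 10x larger than the previous one.
sizes = [1e9, 1e10, 1e11, 1e12]
scores = [assumed_quality(n) for n in sizes]

# Gain in score obtained from each 10x increase in data.
gains = [later - earlier for earlier, later in zip(scores, scores[1:])]

for n, s in zip(sizes, scores):
    print(f"{n:.0e} tokens -> score {s:.2f}")
print("gain per 10x more data:", [round(g, 6) for g in gains])
```

      Under that assumption, every tenfold increase in data buys the same fixed improvement, so each extra point of quality costs ten times more data than the previous one.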

  • howrar@lemmy.ca
    arrow-up
    21
    ·
    18 hours ago

    I find it amusing that everyone is answering the question with the assumption that the premise of OP’s question is correct. You’re all hallucinating the same way that an LLM would.

    LLMs are rarely trained on a single source of data exclusively. All the big ones you find will have been trained on a huge dataset including Reddit, research papers, books, letters, government documents, Wikipedia, GitHub, and much more.

    Example datasets:

    • andrewta@lemmy.world
      arrow-up
      6
      arrow-down
      1
      ·
      17 hours ago

      Rules of Lemmy

      Ignore facts, don’t do research to see if the comment/post is correct, don’t look at other comments to see if anyone else has corrected the post/comment already, there is only one right side (and that is the side of the loudest group)

  • Strayce@lemmy.sdf.org
    English
    arrow-up
    7
    ·
    edit-2
    15 hours ago

    They are. T&F (Taylor & Francis) recently cut a deal with Microsoft. Without the authors’ consent, of course.

    I’m fairly sure a few others have too, but that’s the only article I could find quickly.

    • spongebue@lemmy.world
      arrow-up
      18
      ·
      edit-2
      23 hours ago

      Machine learning has some pretty cool potential in certain areas, especially in the medical field. Unfortunately the predominant use of it now is slop produced by copyright laundering shoved down our throats by every techbro hoping they’ll be the next big thing.

  • TheOubliette@lemmy.ml
    arrow-up
    21
    arrow-down
    1
    ·
    20 hours ago

    “AI” is a parlor trick. Very impressive at first, then you realize there isn’t much to it that is actually meaningful. It regurgitates language patterns, patterns in images, etc. It can make a great Markov chain. But if you want to create an “AI” that just mines research papers, it will be unable to do useful things like synthesize information or describe the state of a research field. It is incapable of critical or analytical approaches. It will only be able to answer simple questions with dubious accuracy and to summarize texts (also with dubious accuracy).

    Let’s say you want to understand research on sugar and obesity using only a corpus of peer-reviewed articles. You want to ask something like, “What is the relationship between sugar and obesity?” What will LLMs do when you ask this question? Well, they will just attempt to make associations and construct reasonable-sounding sentences based on their set of research articles. They might even just take an actual sentence from an article and reframe it a little, just like a high schooler trying to get away with plagiarism. But they won’t be able to actually explain the underlying mechanisms, and they will fall flat on their face when trying to discern nonsense funded by food lobbies from critical research. LLMs do not think or criticize. If they do produce an answer that suggests controversy, it will be because they either recognized diversity in the papers or, more likely, their corpus contains review articles that criticize articles funded by the food industry. But an LLM will be unable to actually criticize the poor work or provide a summary of the relationship between sugar and obesity based on any actual understanding that questions, for example, whether this is even a valid question to ask in the first place (bodies are not simple!). It can only copy and mimic.

    • Brahvim Bhaktvatsal@lemmy.kde.social
      isiZulu
      arrow-up
      3
      ·
      13 hours ago

      They might even just take an actual sentence from an article and reframe it a little

      That’s the case for many things that can be answered via Stack Overflow searches. Even the order in which GPT-4o brings up points is exactly the same as in SO answers or comments.

      • TheOubliette@lemmy.ml
        arrow-up
        2
        ·
        13 hours ago

        Yeah it’s actually one of the ways I caught a previous manager using AI for their own writing (things that should not have been done with AI). They were supposed to write about something in a hyper-specific field and an entire paragraph ended up just being a rewording of one of two (third party) website pages that discuss this topic directly.

    • howrar@lemmy.ca
      arrow-up
      2
      ·
      edit-2
      14 hours ago

      Why does everyone keep calling them Markov chains? They’re missing all the required properties, including the eponymous Markovian property. Wouldn’t it be more correct to call them stochastic processes?

      Edit: Correction, turns out the only difference between a stochastic process and a Markov process is the Markovian property. It’s literally defined as “stochastic process but Markovian”.

      • TheOubliette@lemmy.ml
        arrow-up
        3
        ·
        17 hours ago

        Because it’s close enough. Turn off beam search and redefine your state space and the property holds.
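
        A minimal sketch of the “redefine your state space” argument, under loud assumptions: a toy model with a fixed context window of two tokens and a hand-written probability table. `WINDOW`, `next_token_probs`, and `to_state` are hypothetical names for illustration only. If the state is defined as the whole context window rather than the last token, the next-token distribution depends only on the current state, which is the Markov property.

```python
WINDOW = 2  # assumed context length of the toy model

# Hand-written toy table: context window -> next-token distribution.
TABLE = {
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("cat", "sat"): {"down": 1.0},
}

def to_state(history):
    """State = the last WINDOW tokens; everything earlier is forgotten."""
    return tuple(history[-WINDOW:])

def next_token_probs(history):
    """Toy 'model': the distribution depends only on the current state."""
    return TABLE.get(to_state(history), {"<eos>": 1.0})

# Two different histories that end in the same window.
a = ["once", "the", "cat"]
b = ["yesterday", "evening", "the", "cat"]

same_state = to_state(a) == to_state(b)
same_dist = next_token_probs(a) == next_token_probs(b)
print(same_state, same_dist)
```

        The catch is the size of that state space: with a realistic vocabulary and context length, the number of possible windows is astronomically large, which is part of why the label is contested in this thread.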

        • howrar@lemmy.ca
          arrow-up
          4
          ·
          17 hours ago

          Why settle for good enough when you have a term that is both actually correct and more widely understood?

                • howrar@lemmy.ca
                  arrow-up
                  2
                  ·
                  14 hours ago

                  That’s basically like saying that typical smartphones are square because it’s close enough to rectangle and rectangle is too vague of a term. The point of more specific terms is to narrow down the set of possibilities. If you use “square” to mean the set of rectangles, then you lose the ability to do that and now both words are equally vague.

  • RangerJosie@lemmy.world
    arrow-up
    4
    ·
    15 hours ago

    Saving the world isn’t profitable in the short term.

    Vulture capitalists don’t care about the future. They care about the immediate. Short term profitability. And nothing else.

  • ryathal@sh.itjust.works
    arrow-up
    34
    ·
    1 day ago

    Both are happening. Samples of casual writing are more valuable to use to generate an article than research papers though.

    • FaceDeer@fedia.io
      arrow-up
      9
      arrow-down
      1
      ·
      24 hours ago

      Yeah. Scientific papers may teach an AI about science, but Reddit posts teach AI how to interact with people and “talk” to them. Both are valuable.

      • geekwithsoul@lemm.ee
        English
        arrow-up
        9
        arrow-down
        3
        ·
        23 hours ago

        Hopefully not too pedantic, but no one is “teaching” AI anything. They’re just feeding it data in the hopes that it can learn probabilities for certain types of output. It “understands” neither the Reddit post nor the scientific paper.

        • hoshikarakitaridia@lemmy.world
          arrow-up
          4
          arrow-down
          2
          ·
          edit-2
          14 hours ago

          This might be a wild take but people always make AI out to be way more primitive than it is.

          Yes, in its most basic form an LLM can be described as an auto-complete for conversations. But let’s be real: the number of different optimizations and adjustments made before and after the fact is pretty complex, and the way the AI works is already pretty close to a brain. Hell, that’s where we started out: emulating a brain. And you can look into this: the basis for AI is usually neural networks, which learn to give specific parts of an input a specific amount of weight when generating the output. And when the output is not what we want, the AI slowly adjusts those weights to get closer.
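
          The “slowly adjusts those weights to get closer” step can be sketched with plain gradient descent on a single weight. This is a deliberately tiny illustration, assuming a one-weight linear model and a squared-error loss; `train_single_weight`, the learning rate, and the data are made up for the example, and real networks have billions of weights and more elaborate optimizers.

```python
def train_single_weight(inputs, targets, lr=0.1, steps=50):
    """Fit prediction = w * x by repeatedly nudging w against the error gradient."""
    w = 0.0  # start with an uninformed weight
    for _ in range(steps):
        # Gradient of the mean squared error (w*x - y)^2 with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(inputs, targets)) / len(inputs)
        w -= lr * grad  # small step in the direction that reduces the error
    return w

# Toy data where the desired output is always twice the input.
w = train_single_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 3))
```

          Each pass compares the output with what was wanted and shifts the weight a little in the direction that shrinks the error, which is the mechanism the paragraph above describes.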

          Our brain works the same in its most basic form. We use electric signals and we think in associative patterns. When an electric signal enters one node, this node is connected via stronger or weaker bridges to different nodes, forming our associations. Those bridges are exactly what we emulate when we use nodes with weighted connectors in artificial neural networks.

          Right now, AI output is pretty good quality-wise but pretty bad integrity- and security-wise (hallucinations, not following prompts, etc.). Still, saying it performs at the level of a three-year-old simultaneously undersells and oversells how AI performs. We should be aware that just because it’s AI doesn’t mean it’s good, but it doesn’t mean it’s bad either. It just means there’s a feature (which is hopefully optional), and then we can decide if it’s helpful or not.

          I do music production and I need cover art. As a student, I can’t afford commissioning good artworks every now and then, so AI is the way to go and it’s been nailing it.

          As a software developer, I’ve come to appreciate that after about 2y of bad code completion AIs, there’s finally one that is a net positive for me.

          AI is just like anything else: it’s a tool that brings change. How that change manifests depends on us as a collective. Let’s punish bad or dangerous AI (Copilot, Tesla self-driving, etc.) and promote good AI (Gmail text completion, ChatGPT, code completion, image generators), and let’s also realize that the best things we can get out of AI will not hit the ceiling of human products for a while. But if the human version costs too much, or you need quick pointers, at least you know where to start.

          • geekwithsoul@lemm.ee
            English
            arrow-up
            2
            arrow-down
            2
            ·
            14 hours ago

            This shows so many gross misconceptions and with such utter conviction, I’m not even sure where to start. And as you seem to have decided you like to get free stuff that is the result of AI trained off the work of others without them receiving any compensation, nothing I say will likely change your opinion because you have an emotional stake in not acknowledging the problems of AI.

        • Zexks@lemmy.world
          arrow-up
          5
          arrow-down
          4
          ·
          23 hours ago

          Describe how you ‘learned’ to speak. How do you know which word comes next? Until you can describe this process in a way that doesn’t make it ‘human’ or ‘biological’ only, it’s no different. The only thing they can’t do is adjust their weights dynamically. But that’s a limitation we gave them, not one intrinsic to the system.

          • geekwithsoul@lemm.ee
            English
            arrow-up
            6
            arrow-down
            2
            ·
            22 hours ago

            I inherited brain structures that are natural language processors. As well as the ability to understand and repeat any language sounds. Over time, my brain focused in on only the language sounds I heard the most and through trial and repetition learned how to understand and make those sounds.

            AI - as it currently exists - is essentially a babbling infant with none of the structures necessary to do anything more than repeat sounds back without understanding any of them. Anyone who tells you different is selling you something.

  • Stepos Venzny@beehaw.org
    English
    arrow-up
    16
    ·
    22 hours ago

    Training it on research papers wouldn’t make it smarter, it would just make it better at mimicking their writing style.

    Don’t fall for the hype.

  • Rampsquatch@sh.itjust.works
    arrow-up
    20
    ·
    23 hours ago

    You could feed all the research papers in the world to an LLM and it will still have zero understanding of what you trained it on. It will still make shit up, it can’t save the world.

  • ImplyingImplications@lemmy.ca
    arrow-up
    24
    ·
    1 day ago

    Because AI needs a lot of training data to reliably generate something appropriate. It’s easier to get millions of Reddit posts than millions of research papers.

    Even then, LLMs simply generate text but have no idea what the text means. The model just knows those words have a high probability of matching the expected response. It doesn’t check that what was generated is factual.
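
    As a rough illustration of “those words have a high probability”, here is a toy bigram generator. It is a sketch under heavy assumptions: real LLMs use neural networks over long contexts rather than word-pair counts, and the corpus and function names below are invented for the example. What carries over is that the loop only picks likely next words; no step checks whether the result is true.

```python
import random
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for the training data.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    """Turn follow-up counts into a probability distribution."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

def generate(start, length, seed=0):
    """Sample one word at a time from the learned probabilities."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        probs = next_word_probs(words[-1])
        if not probs:  # no known continuation for this word
            break
        choices, weights = zip(*probs.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words

probs_after_the = next_word_probs("the")
sample = generate("the", 6)
print(probs_after_the)
print(" ".join(sample))
```

    The output can read like a sentence about cats and fish, but the generator has no notion of cats, fish, or whether anything it says happened.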

    • Melatonin@lemmy.dbzer0.comOP
      arrow-up
      8
      ·
      21 hours ago

      Hmmm. Not sure if I’m being insulted. Is that one of those fish fossils that looks kind of like a horseshoe crab?

      • Tabooki@lemmy.world
        arrow-up
        1
        arrow-down
        2
        ·
        21 hours ago

        Dictionary definition (Oxford Languages): noun. (especially in prehistoric times) a person who lived in a cave; a hermit; a person who is regarded as being deliberately ignorant or old-fashioned.

  • Destide@feddit.uk
    English
    arrow-up
    12
    ·
    24 hours ago

    Redditors are always right, peer reviewed papers always wrong. Pretty obvious really. :D