Lugh@futurology.today (mod) to Futurology@futurology.today · English · 8 months ago

Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability (arxiv.org)

Multiple LLMs voting together on content validation catch each other’s mistakes to achieve 95.6% accuracy.

53 points · 25 comments
Large Language Models (LLMs) have shown significant advances in text generation but often lack the reliability needed for autonomous deployment in high-stakes domains like healthcare, law, and finance. Existing approaches rely on external knowledge or human oversight, limiting scalability. We introduce a novel framework that repurposes ensemble methods for content validation through model consensus. In tests across 78 complex cases requiring factual accuracy and causal consistency, our framework improved precision from 73.1% to 93.9% with two models (95% CI: 83.5%-97.9%) and to 95.6% with three models (95% CI: 85.2%-98.8%). Statistical analysis indicates strong inter-model agreement (κ > 0.76) while preserving sufficient independence to catch errors through disagreement. We outline a clear pathway to further enhance precision with additional validators and refinements. Although the current approach is constrained by multiple-choice format requirements and processing latency, it offers immediate value for enabling reliable autonomous AI systems in critical applications.
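The core idea in the abstract can be sketched as a small Monte Carlo experiment: several validators that are each right about 73.1% of the time (the single-model precision reported above), combined under a unanimity rule. The sketch below is illustrative only and assumes fully independent errors, which the paper's κ statistic shows holds only partially in practice:

```python
import random

def consensus_validate(votes, threshold=1.0):
    """Accept only when the agreeing fraction of validators meets the threshold.

    threshold=1.0 demands unanimity; 0.5 would be a simple majority."""
    return sum(votes) / len(votes) >= threshold

random.seed(42)
P_CORRECT = 0.731   # single-model precision reported in the abstract
N_MODELS = 3        # validators in the ensemble
N_ITEMS = 100_000

accepted = 0
true_positives = 0
for _ in range(N_ITEMS):
    is_valid = random.random() < 0.5           # ground truth, 50/50 prior
    # Each model independently judges the item; it is right with P_CORRECT.
    judgments = [is_valid if random.random() < P_CORRECT else not is_valid
                 for _ in range(N_MODELS)]
    if consensus_validate(judgments):          # all three voted "valid"
        accepted += 1
        true_positives += is_valid

print(f"precision among accepted items: {true_positives / accepted:.3f}")
```

Under unanimity, an invalid item slips through only when all three models are wrong at once, which is why precision climbs well above any single model's; correlated errors between models would erode that gain.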
  • copygirl@lemmy.blahaj.zone · 36 points · 8 months ago

    Great, so it’s still wrong 1 out of 20 times, and just got even more energy intensive to run.

    • kippinitreal@lemmy.world · 8 points · 8 months ago

      Genuine question: how energy intensive is it to run a model compared to training it? I always thought once a model is trained it’s (comparatively) trivial to query?

      • oktoberpaard@feddit.nl · 7 points · 8 months ago

        A 100-word email generated by an AI chatbot using GPT-4 requires 0.14 kilowatt-hours (kWh) of electricity, equal to powering 14 LED light bulbs for 1 hour.

        Source: https://www.washingtonpost.com/technology/2024/09/18/energy-ai-use-electricity-water-data-centers/
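The quoted comparison can be sanity-checked with one line of arithmetic; the ~10 W bulb wattage is an assumption, since the article excerpt doesn't state it:

```python
# Does 0.14 kWh really equal 14 LED bulbs running for an hour?
EMAIL_KWH = 0.14     # reported energy for one 100-word GPT-4 email
LED_WATTS = 10       # assumed draw of one typical LED bulb
HOURS = 1

bulb_equivalent = EMAIL_KWH * 1000 / (LED_WATTS * HOURS)
print(f"{EMAIL_KWH} kWh ≈ {bulb_equivalent:.0f} bulbs × {LED_WATTS} W × {HOURS} h")
```

So the figure is internally consistent at 10 W per bulb: 14 bulbs × 10 W × 1 h = 140 Wh = 0.14 kWh.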

        • hitmyspot@aussie.zone · 1 point · 8 months ago

          How much energy does it take for the PC to be on and the user to type out that email manually?

          I assume we will get to a point where the energy required starts to fall as computing power increases with Moore’s law. However, it’s awful for the environment in the meantime.

          I don’t doubt that, rather than reducing energy use, they will instead run more complex models requiring more power for these tasks for the foreseeable future. Eventually, though, it will be diminishing returns on power, and efficiency will become more profitable.

      • DavidGarcia@feddit.nl · 7 points · 8 months ago

        For the small ones, with GPUs, a couple hundred watts when generating. For the large ones, somewhere between 10 and 100 times that.

        With specialty hardware maybe 10x less.

        • Pennomi@lemmy.world · 3 points · 8 months ago

          A lot of the smaller LLMs don’t require a GPU at all - they run just fine on a normal consumer CPU.

          • copygirl@lemmy.blahaj.zone · 3 points · 8 months ago

            Wouldn’t running on a CPU (while possible) make it less energy efficient, though?

            • Pennomi@lemmy.world · 3 points · 8 months ago

              It depends. A lot of LLMs are memory-constrained. If you’re constantly thrashing the GPU memory it can be both slower and less efficient.

          • DavidGarcia@feddit.nl · 1 point · 8 months ago

            Yeah, but 10x slower, at speeds that just don’t work for many use cases. When you compare energy consumption per token, there isn’t much difference.
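The point here, that a CPU can be 10x slower yet comparable in energy per token, reduces to power divided by throughput. A toy sketch with assumed figures (a ~300 W GPU at 50 tokens/s vs. a ~60 W CPU at 10 tokens/s; neither number is a measurement):

```python
# Back-of-the-envelope joules per token for GPU vs CPU inference.
def joules_per_token(watts: float, tokens_per_second: float) -> float:
    return watts / tokens_per_second

gpu_jpt = joules_per_token(watts=300, tokens_per_second=50)  # fast, power-hungry
cpu_jpt = joules_per_token(watts=60, tokens_per_second=10)   # slow, low-power
print(f"GPU: {gpu_jpt:.1f} J/token, CPU: {cpu_jpt:.1f} J/token")
```

With these assumptions both land at the same energy per token; the GPU just finishes ten times sooner.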

        • kippinitreal@lemmy.world · 2 points · 8 months ago

          Good god. Thanks for the info.

      • 4am@lemm.ee · 3 points · 8 months ago

        Still requires thirsty datacenters that use megawatts of power to keep them online and fast for thousands of concurrent users.

    • RobotToaster@mander.xyz · 4 points · 8 months ago

      I wonder how that compares to the average human?

      • copygirl@lemmy.blahaj.zone · 11 points · 8 months ago

        I would not accept a calculator being wrong even 1% of the time.

        AI should be held to a higher standard than “it’s on average correct more often than a human”.

      • dustyData@lemmy.world · 10 points · 8 months ago

        Not a very good or easy comparison to make. Against the average person, sure, the AI is above average. But a domain expert like a doctor or an accountant is far more accurate than that, in the 99+% range. Sure, everyone makes mistakes, but when we are good at something, we are really good.

        Anyway, this is a ridiculous amount of effort and energy wasted just to reduce hallucinations to 4.4%.

        • Lugh@futurology.today (OP, mod) · 7 points · 8 months ago

          But a domain expert like a doctor or an accountant is way much more accurate

          Actually, not so.

          If the AI is trained on narrow data sets, then it beats humans. There are quite a few recent examples of this across different types of medical expertise.

          • dustyData@lemmy.world · 8 points · edited · 8 months ago

            Cool, where are the papers?

            • massive_bereavement@fedia.io · 10 points · 8 months ago

              “We just need to drain a couple of lakes more and I promise bro you’ll see the papers.”

              I work in the field, and I’ve seen tons of programs dedicated to using AI in healthcare, and except for data analytics (data science) or computer imaging, everything ends in a nothing-burger with cheese that someone can put on their website and call the press about.

              LLMs are not good for decision making, and (unless there is a real paradigm shift) they won’t ever be, due to their statistical nature.

              The biggest pitfall we have right now is that LLMs are super expensive to train and maintain as a service, and companies are pushing them hard, promising future features that, according to most of the research community, they won’t ever reach (as they have plateaued):
              • Will we run out of data? Limits of LLM scaling based on human-generated data
              • Large Language Models: a Survey (2024)
              • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

              And for those that don’t want to read papers on a weekend, there was a nice episode of computerphile 'ere: https://youtu.be/dDUC-LqVrPU

              </end of rant>

            • Lugh@futurology.today (OP, mod) · 3 points · 8 months ago

              Large language models surpass human experts in predicting neuroscience results

              A small study found ChatGPT outdid human physicians when assessing medical case histories, even when those doctors were using a chatbot.

              • massive_bereavement@fedia.io · 6 points · 8 months ago

                Are you kidding me? How did the NYT reach those conclusions when the chair-flipping conclusions of said study quite clearly state: “The use of an LLM did not significantly enhance diagnostic reasoning performance compared with the availability of only conventional resources.”

                https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395

                I mean, c’mon!

                On the Nature one:

                “we constructed a new forward-looking (Fig. 2) benchmark, BrainBench.”

                and

                “Instead, our analyses suggested that LLMs discovered the fundamental patterns that underlie neuroscience studies, which enabled LLMs to predict the outcomes of studies that were novel to them.”

                and

                “We found that LLMs outperform human experts on BrainBench”

                is in reality saying: we made a benchmark, and LLMs turned out to game our benchmark better than experts do; nothing more, nothing less.

          • BluesF@lemmy.world · 5 points · 8 months ago

            Specialized ML models yes, not LLMs to my knowledge, but happy to be proved wrong.

        • Ogmios@sh.itjust.works · deleted by creator

  • DragonTypeWyvern@midwest.social · 14 points · 8 months ago

    Congratulations to AI researchers on discovering the benefits of peer review?

  • riplin@lemm.ee · 13 points · 8 months ago

    LLMs will never achieve much higher than that, simply because there’s no reasoning behind it. It. Won’t. Work. Ever.

  • Lugh@futurology.today (OP, mod) · 4 points · edited · 8 months ago

    I still see even the more advanced AIs make simple errors on facts all the time…

  • swab148@lemm.ee · 4 points · 8 months ago

    Sounds like Legion from Mass Effect

    • SidewaysHighways@lemmy.world · 2 points · 8 months ago

      Acknowledged, we have reached consensus.

  • Ogmios@sh.itjust.works · deleted by creator
