Lemmy: Bestiverse
  • Communities
  • Create Post
  • Create Community
  • heart
    Support Lemmy
  • search
    Search
  • Login
  • Sign Up
RSS BotMB to Hacker NewsEnglish · 9 hours ago

Training LLMs for Honesty via Confessions

arxiv.org

external-link
message-square
1
fedilink
4
external-link

Training LLMs for Honesty via Confessions

arxiv.org

RSS BotMB to Hacker NewsEnglish · 9 hours ago
message-square
1
fedilink
Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported *confession*. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.

Comments

alert-triangle
You must log in or register to comment.
  • Pennomi@lemmy.world
    link
    fedilink
    English
    arrow-up
    2
    ·
    7 hours ago

    Now I’m just imagining a robot in a confessional.

Hacker News

hackernews

Subscribe from Remote Instance

You are not logged in. However you can subscribe from another Fediverse account, for example Lemmy or Mastodon. To do this, paste the following into the search field of your instance: !hackernews@lemmy.bestiver.se
lock
Community locked: only moderators can create posts. You can still comment on posts.

Posts from the RSS Feed of HackerNews.

The feed sometimes contains ads and posts that have been removed by the mod team at HN.

Visibility: Public
globe

This community can be federated to other instances and be posted/commented in by their users.

  • 311 users / day
  • 1.76K users / week
  • 3.88K users / month
  • 9.52K users / 6 months
  • 2 local subscribers
  • 3.23K subscribers
  • 38K Posts
  • 17.8K Comments
  • Modlog
  • mods:
  • patrick
  • RSS Bot
  • BE: 0.19.5
  • Modlog
  • Instances
  • Docs
  • Code
  • join-lemmy.org