Reinforcement Learning with Human Feedback #1

Open
opened 2024-04-06 03:42:22 +00:00 by james · 1 comment
Owner

Miku's entertainment factor is kind of hit-or-miss. Sometimes she repeats one of her catchphrases, like "I'm so fucking bored", "I'm so fucking hungry", "I am so fucking based", and "If I don't get laid soon I'm going to kill myself". By modifying her prompt we were able to get rid of the first two. The last two are entertaining, but only if used sparingly. Not all problems can be foreseen and prompted away, though, which is why we may want to turn to RLHF to generate better responses.

Some questions to be answered:

  1. Pretty much all emoji reactions left on a message indicate positive feedback, but negative feedback usually shows up in a reply message. Is it possible to reliably detect this programmatically with sentiment analysis (see the first sketch after this list), or is it just a matter of telling the end users to change their behavior in a way the bot can understand?

  2. How will this be implemented? I have to research some of the current solutions, but presumably RLHF in chatbots consists of generating a few candidate responses, scoring them with an MLP reward model, and picking the argmax, i.e. best-of-n reranking; see the second sketch after this list. Of interest is [BrainChulo](https://github.com/ChuloAI/BrainChulo), which has a lot of features you would want in a conversational bot.
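
For question 1, here's a rough sketch of what reply-based feedback detection could look like, assuming we run an off-the-shelf sentiment model from Hugging Face over replies to Miku. The `Message` fields are hypothetical stand-ins for however we actually store Discord history:

```python
# Sketch of question 1: score user feedback on Miku's messages.
# Assumes an off-the-shelf sentiment model via transformers; the
# Message fields below are hypothetical.
from dataclasses import dataclass

from transformers import pipeline

# Defaults to a DistilBERT model fine-tuned on SST-2 (labels: POSITIVE/NEGATIVE).
sentiment = pipeline("sentiment-analysis")

@dataclass
class Message:
    content: str
    reaction_count: int
    is_reply_to_miku: bool

def feedback_score(msg: Message) -> float:
    """Scalar feedback for the Miku message this user message responds to."""
    score = 0.1 * msg.reaction_count  # reactions are (almost) always positive
    if msg.is_reply_to_miku and msg.content:
        result = sentiment(msg.content[:512])[0]  # crudely truncate long replies
        sign = 1.0 if result["label"] == "POSITIVE" else -1.0
        score += sign * result["score"]
    return score
```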

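For question 2, a sketch of the "MLP + argmax" idea, i.e. best-of-n reranking: sample several candidate replies, score each with a small reward head, and keep the top one. `generate_reply` and `embed` are hypothetical hooks into whatever generation/embedding backend we pick, and the reward head would still need to be trained on our feedback data:

```python
# Sketch of question 2: best-of-n reranking with a small MLP reward head.
# generate_reply and embed are hypothetical hooks; the head is untrained here.
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Tiny MLP mapping a (prompt, reply) embedding to a scalar reward."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def best_of_n(prompt, generate_reply, embed, reward_head, n=4):
    """Sample n candidate replies and return the highest-scoring one."""
    candidates = [generate_reply(prompt) for _ in range(n)]
    with torch.no_grad():
        scores = reward_head(torch.stack([embed(prompt, c) for c in candidates]))
    return candidates[int(scores.argmax())]
```
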
james added the enhancement label 2024-04-06 03:42:22 +00:00
james added the question label 2024-04-06 03:44:32 +00:00
Author
Owner

Here's an idea: we already have a ton of training data, with reactions and all. If we presume that more reactions are given to "better" messages, whether that means provocative, funny, or something else, then rather than learning a reward function, we might be able to achieve a similar effect by randomly dropping "bad" messages from the training set.
Messages from 2+ years ago won't have as many reactions as the newer ones, since they predate our adoption of custom emojis, but honestly I don't really care that much if old data is left out. Miku should be using more relevant jokes and conversation topics anyway, and I've already limited training to the last 10,000 conversation threads so the training time is reasonable.
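
A quick sketch of what that dropout could look like, assuming each stored message carries a reaction count; the threshold and drop probability are made-up knobs to tune:

```python
# Sketch of reaction-weighted dropout: keep every well-reacted message,
# and keep only a random fraction of low-reaction ones, so the
# fine-tuning set skews toward messages people liked.
import random

def drop_bad_messages(messages, min_reactions=1, drop_prob=0.8, seed=1234):
    rng = random.Random(seed)  # fixed seed so the filtered dataset is reproducible
    return [
        msg for msg in messages
        if msg.reaction_count >= min_reactions or rng.random() >= drop_prob
    ]
```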
