Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate response preference by using the output probabilities of response pairs under contrastive prompt pairs, which achieves better performance on LLaMA2-7B and LLaMA2-13B compared to RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we evaluate the generated preference data using contrastive prompt pairs and calculate a self-rewarding score. Finally, we use the DPO algorithm to effectively align LLMs by incorporating this self-rewarding score. In experiments, our DLMA method surpasses the RLHF method without relying on human-annotated preference data.
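As a rough illustration of the scoring idea described above, the sketch below contrasts a response's log-probability under a "positive" prompt (e.g. asking for a helpful, harmless answer) with its log-probability under a "negative" prompt, and prefers the response with the larger gap. This is a minimal assumption-laden sketch, not the paper's exact formula: the function names and the use of a simple log-probability difference are illustrative choices, and a real implementation would obtain per-token log-probabilities from an actual LLM.

```python
# Hedged sketch of contrastive-prompt self-rewarding (illustrative only).
# Assumption: the self-rewarding score is the response's log-probability
# under the positive contrastive prompt minus that under the negative one.

def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into a sequence log-probability."""
    return sum(token_logprobs)

def self_reward(logps_under_pos_prompt, logps_under_neg_prompt):
    """Self-rewarding score: how much more likely the response is under
    the positive contrastive prompt than under the negative one."""
    return (sequence_logprob(logps_under_pos_prompt)
            - sequence_logprob(logps_under_neg_prompt))

def label_preference(resp_a_logps, resp_b_logps):
    """Return ('chosen', 'rejected') labels for responses a and b,
    preferring the response with the higher self-rewarding score."""
    score_a = self_reward(*resp_a_logps)
    score_b = self_reward(*resp_b_logps)
    return ("a", "b") if score_a >= score_b else ("b", "a")

# Toy per-token log-probabilities (hypothetical numbers):
# response a looks much more likely under the positive prompt,
# response b is indifferent between the two prompts.
resp_a = ([-1.0, -2.0], [-3.0, -4.0])   # self-reward = -3 - (-7) = 4
resp_b = ([-2.0, -2.0], [-2.0, -2.0])   # self-reward = 0
chosen, rejected = label_preference(resp_a, resp_b)
print(chosen, rejected)
```

Preference pairs labeled this way could then be fed to DPO, with the score's magnitude optionally weighting the loss, as the abstract indicates.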