GenAI by Hung-yi Lee 2024-03

It’s been a while since I last followed Dr. Lee’s 2024 lectures. But I watched his video about GPT-4o yesterday and would like to continue this series.

0 RLHF

First, let me summarize the last step of alignment: RLHF.
The key takeaway is that instruction fine-tuning focuses on the process, while RLHF focuses on the results. We can also train a reward model to simulate human preferences. But over-optimizing against the reward model can also lead to issues.
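To make the reward-model idea concrete, here is a minimal sketch of training on human preference pairs with a pairwise (Bradley-Terry) loss; the tiny MLP and random embeddings are stand-ins for a real LM backbone and real response pairs.

```python
# Minimal sketch of reward-model training on preference pairs, assuming a
# pairwise (Bradley-Terry) objective. Toy model and data, not a real setup.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        # Stand-in for "LM backbone + scalar head": maps a response
        # embedding to a single preference score.
        self.score = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.score(x).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: embeddings of the human-preferred vs. rejected responses.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

for _ in range(100):
    # Bradley-Terry loss: push score(chosen) above score(rejected).
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The resulting scalar score is only a proxy for human preference; pushing a policy too hard against it is exactly the over-optimization failure mode mentioned above.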

1 Speech models

A summary of speech models: both Meta and Google have models based on discrete speech units.
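As a sketch of what "speech units" means in practice: unit-based models typically quantize frame-level self-supervised features (e.g., HuBERT-style) with k-means, so audio becomes a sequence of discrete IDs. The random features below are stand-ins for real extractor output.

```python
# Minimal sketch of turning audio into discrete "speech units" by k-means
# clustering of frame-level features. Random frames stand in for the
# self-supervised features a real system would cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 64))  # stand-in: 1000 frames x 64-dim features

# Learn a unit inventory (codebook) of 100 clusters.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)

# An utterance becomes a sequence of unit IDs, usable like text tokens.
utterance = rng.normal(size=(50, 64))
units = kmeans.predict(utterance)
print(units[:10])
```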

Text is actually a special type of compression for audio, but one that drops expression (prosody, emotion, and so on). So one possible solution is to combine encoders: a semantic one for content plus an acoustic one for expression.
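A minimal sketch of the combined-encoder idea, with illustrative linear layers standing in for real semantic and acoustic encoders:

```python
# Sketch: fuse a semantic encoder (text-like content) with an acoustic
# encoder (prosody/expression) before the language model. The encoders
# here are illustrative, not any specific published model.
import torch
import torch.nn as nn

d_model = 256
semantic_encoder = nn.Linear(64, d_model)   # stand-in for a unit/text encoder
acoustic_encoder = nn.Linear(128, d_model)  # stand-in for an expression encoder

semantic_feats = torch.randn(1, 50, 64)     # (batch, frames, dim)
acoustic_feats = torch.randn(1, 50, 128)

# Sum (or concatenate) the two streams so the LM sees content + expression.
fused = semantic_encoder(semantic_feats) + acoustic_encoder(acoustic_feats)
print(fused.shape)  # torch.Size([1, 50, 256])
```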

2 Speaker Diarization

This is used to distinguish between different speakers, i.e., to determine who spoke when.
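A common diarization recipe is to embed short audio segments with a speaker encoder and then cluster the embeddings, so each cluster corresponds to one speaker. Below is a minimal sketch with simulated embeddings in place of a real speaker encoder.

```python
# Minimal sketch of the clustering stage of speaker diarization: embed each
# audio segment, then cluster the embeddings (one cluster = one speaker).
# Embeddings are simulated here rather than produced by a speaker encoder.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Simulate 2 speakers: segments drawn around two different "voice" centroids.
speaker_a = rng.normal(loc=0.0, size=(10, 32))
speaker_b = rng.normal(loc=3.0, size=(10, 32))
segments = np.vstack([speaker_a, speaker_b])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(segments)
print(labels)  # segments with the same label are attributed to the same speaker
```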

3 Training

  1. It’s reported that OpenAI used over 1M hours of YouTube video for pre-training.
  2. The model could also learn some of the background music; this could be a feature rather than a bug.
  3. 1M hours × 60 min × 100 tokens/min = 6 billion tokens, while Llama 3 used 15 trillion tokens, which is 2500× more (see the arithmetic check after this list).
  4. Use GPT models as initialization
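
A quick check of the token-budget arithmetic from point 3:

```python
# Checking the token-budget arithmetic from the list above.
hours = 1_000_000
tokens_per_minute = 100
audio_tokens = hours * 60 * tokens_per_minute
print(f"{audio_tokens:,}")           # 6,000,000,000 -> 6 billion tokens

llama3_tokens = 15_000_000_000_000   # 15 trillion
print(llama3_tokens / audio_tokens)  # 2500.0 -> Llama 3 saw ~2500x more text
```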

4 Speaking and listening

The challenge is telling when the AI should listen and when it should speak. A possible solution is to have the model emit live decisions to either stay quiet or speak.
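One way to picture those live "be quiet / speak" decisions: a frame-level loop where the model keeps predicting whether to hold the floor. The energy threshold below is a toy stand-in for a learned turn-taking classifier.

```python
# Sketch of frame-level turn-taking: at every audio frame, decide whether
# to speak or keep listening. A simple energy rule stands in for a model.
import numpy as np

rng = np.random.default_rng(0)
frame_energy = rng.random(20)  # stand-in for incoming audio frames
SILENCE_THRESHOLD = 0.3        # assumption: low energy ~ user stopped talking

state = "listening"
silent_frames = 0
for t, energy in enumerate(frame_energy):
    if energy < SILENCE_THRESHOLD:
        silent_frames += 1
    else:
        silent_frames = 0
        state = "listening"    # user is talking (or barged in): stay quiet
    if silent_frames >= 3:     # enough silence: the model may take the turn
        state = "speaking"
    print(t, state)
```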

5 Look

Attention is applied over previously seen images stored in memory.
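A minimal sketch of that mechanism: cross-attention from the current query over a buffer of past image embeddings (all dimensions and the memory size here are arbitrary).

```python
# Sketch: cross-attention from the current query over previously seen
# images kept in a memory buffer.
import torch
import torch.nn as nn

d_model, n_images = 64, 5
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

memory = torch.randn(1, n_images, d_model)  # embeddings of past images
query = torch.randn(1, 1, d_model)          # current question / dialogue state

out, weights = attn(query, memory, memory)
print(weights)  # how much the model attends to each remembered image
```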
