GenAI by Hung-yi Lee 2024-03
It’s been a while since I last followed Dr. Lee’s 2024 lectures, but I watched his video about GPT-4o yesterday and would like to continue this series.
0 RLHF
First, let me summarize the last step of alignment: RLHF.
The key takeaway is that instruction fine-tuning focuses on the process, whereas RLHF focuses on the results.
We can also train a reward model to simulate human preferences.
But over-optimizing against the reward model can also lead to issues (reward hacking).
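Since the notes mention training a reward model on human preferences, here is a minimal sketch of the standard pairwise (Bradley-Terry) loss used for this; the tensors and values below are made up for illustration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy batch of scalar rewards for (chosen, rejected) response pairs.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.8, -0.1])
print(reward_model_loss(chosen, rejected))  # smaller when chosen > rejected
```

The loss only cares about the margin between the two rewards, which is part of why over-optimization happens: the policy can inflate that margin in ways humans would not actually prefer.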
1 Speech models
A summary of speech models: both Meta and Google have models based on speech units.
Text is actually a special kind of compression for audio, but it drops expressive information such as tone and emotion. So one possible solution is to combine both text and speech encoders.
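To make "speech units" concrete, here is a hedged sketch of the common recipe: cluster frame-level features from a self-supervised speech encoder (e.g. HuBERT) and treat the cluster ids as discrete tokens. The random features below are only a stand-in to keep the snippet runnable.

```python
import numpy as np
from sklearn.cluster import KMeans

# In practice, frames come from a self-supervised speech encoder
# (e.g. HuBERT); random features are a placeholder here.
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 768))   # (num_frames, feature_dim)

# Each cluster id becomes a discrete "speech unit" token.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)

# Collapse consecutive duplicates, as unit-based language models often do.
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```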
2 Speaker Diarization
This is used to distinguish different speakers, i.e., to determine who spoke when in a recording.
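As one concrete (hedged) example, the open-source pyannote.audio pipeline does exactly this; the audio path below is a placeholder, and gated models may require an access token.

```python
from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline (may require an auth token).
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("audio.wav")  # placeholder path

# Print "who spoke when" segments.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```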
3 Training
- It’s reported that OpenAI used over 1M hours of YouTube video for pre-training.
- The model could also pick up some background music along the way, which could be a feature instead of a bug.
- 1M hours × 60 min/hour × 100 tokens/min = 6 billion tokens, while Llama 3 used 15 trillion tokens, which is 2500× more (see the quick check after this list).
- Use GPT models as initialization
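A quick back-of-the-envelope check of the token math in the list above:

```python
hours = 1_000_000                    # ~1M hours of audio
tokens = hours * 60 * 100            # 60 min/hour, ~100 tokens/min of speech
print(f"{tokens / 1e9:.0f}B tokens")                      # 6B
print(f"{15e12 / tokens:.0f}x fewer than Llama 3's 15T")  # 2500x
```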
4 Speaking and listening
The challenge is to tell when the AI should listen and when it should speak. A possible solution is to have the model continuously emit live responses such as "be quiet" or "speak".
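A minimal sketch of that idea: classify each incoming audio chunk into "listen" or "speak". The should_speak function here is a hypothetical stand-in (a simple energy threshold) for what would really be a learned turn-taking model.

```python
import numpy as np

def should_speak(audio_chunk: np.ndarray) -> bool:
    """Hypothetical turn-taking decision: a real system would use a
    learned model; a silence (low-energy) threshold stands in here."""
    return float(np.mean(audio_chunk ** 2)) < 1e-4  # silence -> our turn

# Streaming loop: one live decision per chunk.
rng = np.random.default_rng(0)
stream = [rng.normal(0, 0.05, 1600),   # user talking -> keep listening
          rng.normal(0, 0.001, 1600)]  # silence -> start speaking
for chunk in stream:
    print("speak" if should_speak(chunk) else "listen")
```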
5 Look
Attention is applied to previously seen images kept in memory.
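A hedged sketch of that mechanism: the current query cross-attends over embeddings of previously seen images stored in a memory buffer. All dimensions and tensors below are made up for illustration.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

memory = torch.randn(1, 10, embed_dim)  # embeddings of 10 past images
query = torch.randn(1, 1, embed_dim)    # current step's query

# Cross-attention: the query reads from the image memory.
context, weights = attn(query, memory, memory)
print(context.shape, weights.shape)     # (1, 1, 256) and (1, 1, 10)
```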