Note: still under construction as a personal side project; see the TODOs.
Automatically generate LNG-like content from LNG streaming data (version 1 demo)
Note: if any procedure is unclear, feel free to raise an issue and ping me!
- Setup Environment
# Create your environment with conda
$ conda env create -f environment.yml
$ conda activate lng_ai
# Create `.env` and place your keys in it like so
export yt_api_key=...
export OPENAI_API_KEY=...
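The downloader and fine-tuning scripts presumably read these keys from the environment; a minimal sketch of that lookup, failing fast when a key is missing (the helper name `load_required_keys` is hypothetical, not part of the repository):

```python
import os

def load_required_keys():
    """Fetch the API keys the pipeline needs, failing fast if any is missing."""
    keys = {}
    for name in ("yt_api_key", "OPENAI_API_KEY"):  # names from the README above
        value = os.environ.get(name)
        if value is None:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys
```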
- Prepare Data (download the zip file or execute the commands below)
$ python3 download_audio_files.py
$ python3 transcribe_audio_files.py
$ python3 prepare_dataset.py --repetitive_word_threshold 0.1
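`prepare_dataset.py` presumably emits prompt/completion pairs in the JSONL shape that the legacy OpenAI fine-tuning endpoint expects. A sketch, under that assumption, of how consecutive transcript sentences might be paired (both function names are hypothetical; the 3-sentence context follows the tip in the Notes section):

```python
import json

def to_jsonl_records(sentences, context_size=3):
    """Pair each window of consecutive sentences (the prompt) with the
    sentence that follows it (the completion)."""
    records = []
    for i in range(len(sentences) - context_size):
        prompt = "".join(sentences[i:i + context_size])
        completion = sentences[i + context_size]
        records.append({"prompt": prompt, "completion": completion})
    return records

def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```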
- Model Training & Interaction
# Fine-tune from scratch
$ python3 fine_tune_openai_model.py --mode 0 --jsonl_dataset_path $JSONL_DATASET_PATH
$ python3 fine_tune_openai_model.py --mode 0 --jsonl_dataset_path jsonl_dataset/jsonl_dataset_50_percent_29608.jsonl
# View fine-tuned models (including corresponding training history)
$ python3 fine_tune_openai_model.py --mode 1
# Test fine-tuned models
$ python3 fine_tune_openai_model.py --mode 2 --model_name $MODEL_NAME --num_of_sentences_generated $NUM_OF_SENTENCES_GENERATED
$ python3 fine_tune_openai_model.py --mode 2 --model_name babbage:ft-personal-2023-03-31-16-53-00 --num_of_sentences_generated 10
# View training process
$ python3 fine_tune_openai_model.py --mode 3 --model_id $MODEL_ID
$ python3 fine_tune_openai_model.py --mode 3 --model_id ft-YYivAE5wK5tEGjKhJblhimCq
Features
- Download audio files via the YouTube Data API
- Transcribe audio into transcripts via the OpenAI Whisper API (candidate evaluation)
- Process transcripts into a valid JSONL training dataset (OpenAI dataset preparation)
- Check data integrity of the training dataset
- Fine-tune an OpenAI model
- Interact with the fine-tuned model and verify its results
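A sketch of what the data-integrity checker mentioned above might verify (the helper name and the exact checks are assumptions, not the repository's actual implementation):

```python
import json

def check_jsonl_integrity(path):
    """Verify every line is valid JSON with non-empty prompt/completion fields.

    Returns a list of (line_number, problem) tuples; empty means the file is clean.
    """
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append((lineno, "invalid JSON"))
                continue
            for field in ("prompt", "completion"):
                if not record.get(field):
                    errors.append((lineno, f"missing or empty {field!r}"))
    return errors
```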
Notes
Dataset Preparation Tricks
- Filter out transcripts that contain an over-repetitive word (e.g., >10%)
- For instance, if the word "哈哈" accounts for 18% of a transcript, that transcript is excluded from the training dataset.
- Note that since we rely on OpenAI Whisper for transcription, some transcripts can be problematic (e.g., repetitive words).
- Use more than one sentence (e.g., 3) per prompt to make chat completions more coherent
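The over-repetition filter could be as simple as comparing the most frequent word's share of the transcript against `--repetitive_word_threshold`; a sketch under that assumption (the function name is hypothetical):

```python
from collections import Counter

def is_over_repetitive(words, threshold=0.1):
    """True if any single word exceeds `threshold` of all words in the transcript."""
    if not words:
        return False
    counts = Counter(words)
    top_share = counts.most_common(1)[0][1] / len(words)
    return top_share > threshold

# e.g. a transcript where "哈哈" makes up 18% of the words would be excluded
```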
Model
- 1-year data (2022-2023)
- 4 fine-tuning iterations
- OpenAI Babbage model