Note: still under construction as a personal side project; see the TODOs.
Automatically generate LNG-like content from LNG streaming data (version 1 demo)
Note: if any procedure is unclear, feel free to raise an issue and ping me!
- Setup Environment
# Create your environment with conda
$ conda env create -f environment.yml
$ conda activate lng_ai
# Create `.env` and place your keys in it like so
export yt_api_key=...
export OPENAI_API_KEY=...
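The downloader and fine-tuning scripts presumably read these keys from the environment; a minimal sketch of that lookup, failing fast when a key is missing (the helper name `load_required_keys` is hypothetical, not part of the repository):

```python
import os

def load_required_keys():
    """Fetch the API keys the pipeline needs, failing fast if any is missing."""
    keys = {}
    for name in ("yt_api_key", "OPENAI_API_KEY"):  # names from the README above
        value = os.environ.get(name)
        if value is None:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys
```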
- Prepare Data (download the zip file or execute the commands below)
$ python3 download_audio_files.py
$ python3 transcribe_audio_files.py
$ python3 prepare_dataset.py --repetitive_word_threshold 0.1
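`prepare_dataset.py` presumably emits prompt/completion pairs in the JSONL shape that the legacy OpenAI fine-tuning endpoint expects. A sketch, under that assumption, of how consecutive transcript sentences might be paired (both function names are hypothetical; the 3-sentence context follows the tip in the Notes section):

```python
import json

def to_jsonl_records(sentences, context_size=3):
    """Pair each window of consecutive sentences (the prompt) with the
    sentence that follows it (the completion)."""
    records = []
    for i in range(len(sentences) - context_size):
        prompt = "".join(sentences[i:i + context_size])
        completion = sentences[i + context_size]
        records.append({"prompt": prompt, "completion": completion})
    return records

def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```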
- Model Training & Interaction
# Fine-tune from scratch
$ python3 fine_tune_openai_model.py --mode 0 --jsonl_dataset_path $JSONL_DATASET_PATH
$ python3 fine_tune_openai_model.py --mode 0 --jsonl_dataset_path jsonl_dataset/jsonl_dataset_50_percent_29608.jsonl
# View fine-tuned models (including corresponding training history)
$ python3 fine_tune_openai_model.py --mode 1
# Test fine-tuned models
$ python3 fine_tune_openai_model.py --mode 2 --model_name $MODEL_NAME --num_of_sentences_generated $NUM_OF_SENTENCES_GENERATED
$ python3 fine_tune_openai_model.py --mode 2 --model_name babbage:ft-personal-2023-03-31-16-53-00 --num_of_sentences_generated 10
# View training process
$ python3 fine_tune_openai_model.py --mode 3 --model_id $MODEL_ID
$ python3 fine_tune_openai_model.py --mode 3 --model_id ft-YYivAE5wK5tEGjKhJblhimCq
Features
- Download audio files via the YouTube Data API
- Transcribe audio into transcripts via the OpenAI Whisper API (candidate evaluation)
- Process transcripts into a valid JSONL training dataset (OpenAI dataset preparation)
- Check data integrity of the training dataset
- Fine-tune an OpenAI model
- Interact with the fine-tuned model and verify its results
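A sketch of what the data-integrity checker mentioned above might verify (the helper name and the exact checks are assumptions, not the repository's actual implementation):

```python
import json

def check_jsonl_integrity(path):
    """Verify every line is valid JSON with non-empty prompt/completion fields.

    Returns a list of (line_number, problem) tuples; empty means the file is clean.
    """
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append((lineno, "invalid JSON"))
                continue
            for field in ("prompt", "completion"):
                if not record.get(field):
                    errors.append((lineno, f"missing or empty {field!r}"))
    return errors
```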
Notes
Dataset Preparation Tricks
- Filter out transcripts that contain an over-repetitive word (e.g., >10%)
- For instance, if the word "哈哈" accounts for 18% of a transcript, that transcript is excluded from the training dataset.
- Note that since we rely on OpenAI Whisper for transcription, some transcripts can be problematic (e.g., repetitive words).
- Use more than one sentence (e.g., 3) per prompt to make chat completions more coherent
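The over-repetition filter could be as simple as comparing the most frequent word's share of the transcript against `--repetitive_word_threshold`; a sketch under that assumption (the function name is hypothetical):

```python
from collections import Counter

def is_over_repetitive(words, threshold=0.1):
    """True if any single word exceeds `threshold` of all words in the transcript."""
    if not words:
        return False
    counts = Counter(words)
    top_share = counts.most_common(1)[0][1] / len(words)
    return top_share > threshold

# e.g. a transcript where "哈哈" makes up 18% of the words would be excluded
```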
Model
- 1-year data (2022-2023)
- 4 fine-tuning iterations
- OpenAI Babbage model