Introduction

📈 CFGPT: Chinese Financial Assistant with Large Language Model

Introduction

CFGPT is a family of large language models for Chinese financial scenarios. We perform continual pretraining on collected and cleaned Chinese financial corpora (CFData-pt), and then conduct supervised fine-tuning with knowledge-intensive financial instruction data (CFData-sft), enabling general-purpose LLMs to obtain stronger capabilities in financial text understanding, financial information extraction, financial generation, and risk analysis.

CFGPT currently includes three major versions:

Version	Size	Base Model	Status
CFGPT1	7B	InternLM 7B	Part of the model weights have been open-sourced
CFGPT2	7B & 20B	InternLM2 7B & 20B	Related models and evaluation results have been released
CFGPT3	8B & 32B	Qwen3 8B & 32B	Training code has been added; model weights are not open-sourced yet

CFGPT1 is based on InternLM 7B and includes a continual-pretrained model, a LoRA supervised fine-tuned model, and a full-parameter supervised fine-tuned model:
- CFGPT1-pt-7B: model weights after continual pretraining on Chinese financial corpora.
- CFGPT1-sft-7B-LoRA: LoRA adapter weights trained based on CFGPT1-pt-7B.
- CFGPT1-sft-7B-Full: full-parameter SFT weights trained based on CFGPT1-pt-7B.
CFGPT2 is based on InternLM2, covering both 7B and 20B scales. It integrates modules such as retrieval augmentation, fact checking, compliance checking, and risk monitoring, and achieves strong performance on multiple financial tasks.
- CFGPT2-7B
- CFGPT2-20B
CFGPT3 is based on Qwen3-8B and Qwen3-32B. It follows the training route of "financial continual pretraining + financial supervised fine-tuning" and expands the continual-pretraining data. The repository currently includes training and evaluation code for CFGPT3-8B; the model weights of CFGPT3-8B and CFGPT3-32B are not open-sourced yet.

We also released CFBenchmark, which is used to evaluate the text-processing capability of large language models in the Chinese financial market. This repository keeps CFGPT training code, evaluation code, CFData samples, and typical application cases, making it easier for researchers to reproduce the training pipeline and understand the data format.

The following figure provides an overview of the CFGPT training pipeline:

Quick Start

1. Prepare the code and environment

Clone the repository, create a Python environment, and activate it:

git clone https://github.com/TongjiFinLab/CFGPT.git
cd CFGPT
conda create -n env_name python=3.10
source activate env_name
pip install -r requirements.txt

2. Select a model version

The currently open-sourced models are mainly from the CFGPT1 series. CFGPT3-8B and CFGPT3-32B are currently at the stage where training and evaluation code is released, while model weights are not open-sourced yet.

Model	Type	Link
CFGPT1-pt-7B	Continual-pretrained model	Hugging Face
CFGPT1-sft-7B-LoRA	LoRA supervised fine-tuned model	Hugging Face
CFGPT1-sft-7B-Full	Full-parameter supervised fine-tuned model	Hugging Face
CFGPT2-7B	Supervised fine-tuned model	Hugging Face
CFGPT2-20B	Supervised fine-tuned model	Hugging Face
CFGPT3-8B / CFGPT3-32B	Qwen3-based financial models	Not open-sourced yet

3. Use CFGPT1-sft-7B-LoRA

from transformers import AutoModel, AutoTokenizer
from peft import PeftModel
base_model = 'TongjiFinLab/CFGPT1-pt-7B'
lora_weights = 'TongjiFinLab/CFGPT1-sft-7B-LoRA'
device_map = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    device_map=device_map,
)
model = model.eval()
inputs = tokenizer("""你是一名金融从业者，请对这篇新闻进行情感分析。请从（中性、积极、消极）中选取答案。新闻内容：挖贝快讯：特步国际发布2023年第二季度中国内地业务营运状况，披露截至2023年6月30日止3个月零售销售实现高双位数同比增长(包括线上线下渠道)，零售折扣水平约七五折。同时，2022年7月MSCI首次予以特步ESG评级，一年后评级表现即迎来提升。明晟MSCI上调特步ESG评级，由“BB”升至“BBB”。\n回答：""", return_tensors='pt').to(device_map)
pred = model.generate(**inputs, max_new_tokens=64, do_sample=False, repetition_penalty=1.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True).split('回答：')[1])

4. Use CFGPT1-sft-7B-Full

from transformers import AutoModel, AutoTokenizer
base_model = 'TongjiFinLab/CFGPT1-sft-7B-Full'
device_map = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
model = model.eval()
inputs = tokenizer("""你是一名金融从业者，请对这篇新闻进行情感分析。请从（中性、积极、消极）中选取答案。新闻内容：挖贝快讯：特步国际发布2023年第二季度中国内地业务营运状况，披露截至2023年6月30日止3个月零售销售实现高双位数同比增长(包括线上线下渠道)，零售折扣水平约七五折。同时，2022年7月MSCI首次予以特步ESG评级，一年后评级表现即迎来提升。明晟MSCI上调特步ESG评级，由“BB”升至“BBB”。\n回答：""", return_tensors='pt').to(device_map)
pred = model.generate(**inputs, max_new_tokens=64, do_sample=False, repetition_penalty=1.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True).split('回答：')[1])

More usage details are available in ./code/test

5. Train CFGPT3-8B

The CFGPT3-8B code includes continual pretraining and full-parameter supervised fine-tuning. Before running it, please update model paths, data paths, cache paths, and GPU/DeepSpeed configurations according to your local environment.

# Data processing and training for continual pretraining
cd code/train/Qwen3-8B/pretrain
python process_data.py
bash start.sh

# Supervised fine-tuning
cd ../full-sft
bash start.sh

Use Cases

Data

In this repository, we share samples of CFData:

CFData: ./data

CFData includes a pretraining dataset (CFData-pt) and a supervised fine-tuning dataset (CFData-sft). The pretraining dataset contains Chinese financial data and analytics, together with a smaller subset of general-purpose text, totaling 584 million documents and 141 billion tokens. The supervised fine-tuning dataset is designed for six different financial tasks, covering multiple aspects of financial analysis and decision-making, with 1.5 million instruction pairs and 1.5 billion tokens in total.

Continual Pretraining

CPT for CFGPT1 & CFGPT2

The pretraining dataset contains 591 million documents and 193 billion tokens, including six sub-datasets:

CFData-CP (6.24%): 39 thousand corporate prospectuses with 13 billion tokens;
CFData-CA (12.28%): 6 million corporate announcements with 17 billion tokens;
CFData-RR (2.51%): 392 thousand research reports with 3 billion tokens;
CFData-FN (18.70%): 82 million financial news articles with 26 billion tokens;
CFData-SM (60.15%): 495 million social media records with 84 billion tokens;
CFData-Wiki (0.09%): 255 thousand Wikipedia entries with 137 million tokens.

We sampled a financial text sub-corpus from CFData-pt for further pretraining on InternLM-7B. This sub-corpus contains about 13.7 billion tokens from a large amount of Chinese financial data and analytics, along with a small amount of general-purpose text, including announcements, research reports, social media content, financial news articles, and Wikipedia. These data were mainly collected by ourselves.

CPT for CFGPT3

For CFGPT3 training, we added a portion of new data. The details are as follows:

Financial (70.17%)
- news_rp_origin_text (21.73%)
- fin_news_2022 (20.95%)
- fin_announcement (13.61%)
- fin_articles (6.91%)
- fin_news_2025 (6.33%)
- fin_books (0.36%)
- fin_exam (0.26%)
- fin_reports (0.01%)
General (29.83%)
- CLUECorpusSmall (18.76%)
- UNv1-0 (3.15%)
- ParaCrawl-v9 (1.86%)
- zhwiki_2023 (1.84%)
- news-crawl (1.64%)
- csl (1.26%)
- translation2019zh (0.83%)
- WikiMatrix (0.43%)
- news-commentary (0.05%)

The total amount is about 28.4 billion tokens. The continual pretraining of the CFGPT3 system models uses all of these data. These data mainly include financial data such as news, announcements, research reports, financial articles, professional books, financial exams, and financial reports. In addition, a small portion of general corpora is also included. All of the above data were collected by ourselves.

Supervised Fine-Tuning

The supervised fine-tuning dataset contains 1.6 million instruction pairs and 1.5 billion tokens, including six financial tasks:

CFData-SA (5.69%): 120 thousand instances with 86 million tokens for sentiment analysis;
CFData-RS (50.60%): 369 thousand instances with 765 million tokens for report summarization;
CFData-ED (22.69%): 490 thousand instances with 343 million tokens for event detection;
CFData-TD (12.37%): 369 thousand instances with 187 million tokens for topic decomposition;
CFData-QA (0.39%): 12 thousand instances with 6 million tokens for question answering;
CFData-SP (8.27%): 212 thousand instances with 125 million tokens for stock movement prediction.

We use high-quality domain-specific data to achieve financial domain adaptation through supervised fine-tuning. The dataset includes six financial datasets that reflect different aspects of financial analysis and decision-making, including sentiment analysis, event detection, report summarization, topic decomposition, question answering, and stock movement prediction.

CFData-sft provides a large amount of financial-domain text information, allowing FinLLMs to learn from different sources.

Considering practical requirements, we reorganized these financial supervised fine-tuning datasets into ten tasks.

The details are as follows:

Task	Task Description	Dataset	Size
Sentiment	Identify sentiment associated with financial documents	CFData-SA	13K
Summary	Generate a content summary based on the provided financial document	CFData-RS	18K
Risk	Generate risk alerts based on the provided financial document	CFData-RS	20K
Suggestion	Generate investment recommendations based on the provided financial document	CFData-RS	18K
Event	Identify event categories associated with financial documents	CFData-ED	12K
Industry	Identify industry categories associated with financial documents	CFData-ED	14K
Company	Identify company names associated with financial documents	CFData-ED	12K
Product	Identify product names associated with financial documents	CFData-ED	21K
Exam	Answer true-false questions related to finance	CFData-QA	16K
Stock	Predict future stock movement	CFData-SP	15K

For more information about CFData, researchers can refer to the sample records in CFData.

Code

This repository keeps the key training and evaluation code for CFGPT. Overall, the CFGPT training pipeline includes data preprocessing, continual pretraining, supervised fine-tuning, model generation testing, and downstream evaluation. Different versions correspond to different base models and script directories. For detailed parameters, please refer to the .py, .sh, .yml, and ds_config.json files in the corresponding directories.

Code Structure

code/
├── train/
│   ├── InternLM/
│   │   └── pretrain/          # Continual pretraining code for CFGPT1/2 InternLM-series models
│   └── Qwen3-8B/
│       ├── pretrain/          # Continual pretraining and data processing code for CFGPT3-8B
│       └── full-sft/          # Full-parameter supervised fine-tuning code for CFGPT3-8B
├── test/
│   └── eval-generate.py       # Generation test example
└── utils/                     # Common components for data processing, loss, and trainer

Continual Pretraining

Continual pretraining is used to adapt general-purpose base models to Chinese financial corpora. This stage mainly includes tokenization, long-text packing, dataset saving, and distributed training.

InternLM-series training code is located at:

code/train/InternLM/pretrain

CFGPT3-8B training code is located at:

code/train/Qwen3-8B/pretrain

Taking CFGPT3-8B as an example, the running process is as follows:

cd code/train/Qwen3-8B/pretrain
python process_data.py
bash start.sh

Here, process_data.py processes raw jsonl text into 2048-token training blocks, qwen3_8b_pt_train.py performs continual pretraining, and start.sh configures environment variables and launches distributed training. Before running, please update model paths, data paths, cache paths, CUDA paths, and the number of GPUs according to your local environment.

Supervised Fine-Tuning

Supervised fine-tuning further improves the model's performance on financial question answering, financial information extraction, report summarization, risk alerts, investment suggestions, and other instruction-following tasks.

The original CFGPT1 version includes both LoRA SFT and full-parameter SFT. Here we use lora-bf16 as an example. The training script is located under ./code/train/lora:

deepspeed --include localhost:6,7 --master_port 60005 lora_bf_16_parallel_train.py --config lora_bf_16_parallel_train.yml > lora_bf_16_parallel_train.log 2>&1

CFGPT3-8B currently provides full-parameter SFT training code:

code/train/Qwen3-8B/full-sft

Running example:

cd code/train/Qwen3-8B/full-sft
bash start.sh

Here, qwen3_8b_sft.py loads the continual-pretrained model, constructs ChatML-format training text, and launches SFT training. Training data paths, evaluation data paths, DeepSpeed configuration paths, and output paths should be modified according to the actual machine environment.

Generation Test

The generation test script is located at:

code/test/eval-generate.py

This script is used to quickly check the model's generation quality on financial tasks. Model loading paths and inference parameters can be modified in the script.

Evaluation

The evaluation results of CFGPT2 are shown below.

C-Eval

Model	Size	STEM	Social Science	Humanities	Others	Average	Average(hard)
GPT-4	-	67.1	77.6	64.5	67.8	68.7	54.9
ChatGPT	175B	52.9	61.8	50.9	53.6	54.4	41.4
InternLM-7B	7B	48.0	67.4	55.4	45.8	52.8	37.1
ChatGLM2-6B	6B	48.6	60.5	51.3	49.8	51.7	37.1
Qwen-7B	7B	52.8	74.1	63.1	55.2	59.6	41.0
Qwen-14B	14B	65.7	85.4	75.3	68.4	72.1	53.7
Baichuan-7B	7B	38.2	52.0	46.2	39.3	42.8	31.5
Baichuan-13B	13B	47.0	66.8	57.3	49.8	53.6	36.7
Baichuan2-13B-Chat	13B	48.4	70.5	60.3	55.0	56.6	37.9
InternLM2-7B	7B	52.3	71.9	64.9	61.0	60.8	38.8
InternLM2-20B	20B	56.1	75.7	62.6	62.4	63.0	46.3
CFGPT2-7B	7B	56.7	76.4	63.9	63.0	63.5	43.2
CFGPT2-20B	20B	64.6	80.8	72.1	68.9	69.2	49.9

FinEval

Model	Size	Finance	Economy	Accounting	Certificate	Average
GPT-4	-	71.0	74.5	59.3	70.4	68.6
ChatGPT	175B	59.3	61.6	45.2	55.1	55.0
InternLM-7B	7B	49.0	49.2	40.5	49.4	47.1
ChatGLM2-6B	6B	46.5	46.4	44.5	51.5	47.4
Qwen-Chat-7B	7B	51.5	52.1	44.5	53.6	50.5
Qwen-7B	7B	54.5	54.4	50.3	55.8	53.8
Baichuan-7B-Chat	7B	44.9	41.5	34.9	45.6	42.0
Baichuan-13B-Chat	13B	51.6	51.1	41.7	52.8	49.4
InternLM2-7B	7B	54.2	54.0	43.5	55.4	51.9
InternLM2-20B	20B	57.3	58.9	47.4	58.6	55.5
CFGPT2-7B	7B	62.6	63.9	58.9	66.0	62.9
CFGPT2-20B	20B	64.0	64.9	62.1	67.9	64.8

CFBenchmark-Basic

Model	Size	Company	Product	R.Avg	Sector	Event	Sentiment	C.Avg	Summary	Risk	Suggestion	G.Avg	Avg
HUMAN	-	0.931	0.744	0.838	0.975	0.939	0.912	0.942	1.000	1.000	1.000	1.000	0.927
ChatGPT	20B	0.797	0.198	0.498	0.453	0.458	0.425	0.455	0.593	0.541	0.771	0.635	0.529
ERNIE-Bot	260B	0.807	0.300	0.533	0.408	0.350	0.186	0.315	0.715	0.590	0.716	0.673	0.507
ERNIE-Bot-4	-	0.819	0.417	0.618	0.418	0.358	0.375	0.384	0.721	0.629	0.718	0.689	0.564
Falcon-7B	7B	0.671	0.168	0.420	0.169	0.132	0.250	0.184	0.302	0.301	0.246	0.283	0.296
Falcon-7B-chat	7B	0.582	0.046	0.314	0.112	0.142	0.153	0.135	0.307	0.299	0.258	0.288	0.246
bloomz-7B1	7B	0.765	0.166	0.465	0.252	0.154	0.394	0.267	0.451	0.371	0.462	0.428	0.387
bloomz-7Bt1-mt	7B	0.751	0.157	0.454	0.087	0.182	0.380	0.216	0.425	0.379	0.396	0.400	0.357
Qwen-7B	7B	0.780	0.357	0.569	0.480	0.335	0.379	0.398	0.750	0.505	0.713	0.656	0.541
Qwen-Chat-7B	7B	0.763	0.360	0.562	0.400	0.367	0.265	0.344	0.548	0.307	0.379	0.411	0.439
Qwen-14B	14B	0.805	0.421	0.613	0.481	0.350	0.385	0.405	0.754	0.608	0.717	0.693	0.570
Qwen-Chat-14B	14B	0.814	0.442	0.628	0.382	0.400	0.350	0.377	0.732	0.478	0.736	0.649	0.551
ChatGLM2-6B	6B	0.747	0.313	0.530	0.285	0.300	0.357	0.314	0.657	0.454	0.671	0.594	0.479
Baichuan2-7B-Base	7B	0.672	0.340	0.506	0.342	0.490	0.480	0.437	0.739	0.619	0.751	0.703	0.549
Baichuan2-7B-Chat	7B	0.757	0.402	0.579	0.425	0.475	0.323	0.408	0.725	0.648	0.732	0.702	0.563
Baichuan2-13B-Base	13B	0.781	0.330	0.555	0.436	0.496	0.477	0.470	0.725	0.503	0.747	0.658	0.561
Baichuan2-13B-Chat	13B	0.797	0.314	0.556	0.472	0.507	0.387	0.455	0.739	0.634	0.746	0.706	0.572
InternLM-7B	7B	0.612	0.233	0.423	0.266	0.311	0.328	0.302	0.378	0.336	0.379	0.364	0.363
InternLM-7B-Chat	7B	0.632	0.261	0.447	0.272	0.364	0.399	0.345	0.363	0.270	0.353	0.329	0.374
InternLM-20B	20B	0.809	0.358	0.583	0.500	0.427	0.417	0.448	0.706	0.653	0.728	0.695	0.575
InternLM-20B-Chat	20B	0.488	0.362	0.425	0.323	0.327	0.370	0.340	0.706	0.578	0.762	0.662	0.476
CFGPT1-stf-LoRA	7B	0.820	0.414	0.617	0.569	0.729	0.769	0.689	0.745	0.584	0.609	0.646	0.650
CFGPT1-sft-Full	7B	0.836	0.476	0.656	0.700	0.808	0.829	0.779	0.798	0.669	0.808	0.758	0.731
CFGPT2-7B	7B	0.834	0.470	0.652	0.644	0.750	0.793	0.729	0.801	0.692	0.790	0.761	0.714
CFGPT2-20B	20B	0.891	0.501	0.696	0.722	0.825	0.865	0.806	0.825	0.727	0.823	0.792	0.755

OpenFinData

Model	Size	Knowledge	Caluation	Explanation	Identification	Analysis	Compliance	Average
ERNIE-Bot-3.5	-	78.0	70.4	82.1	75.3	77.7	36.7	70.0
ERNIE-Bot-4	-	87.3	73.6	84.3	77.0	79.1	37.3	73.1
InternLM-7B	7B	65.3	45.8	71.4	62.5	59.2	37.2	56.9
ChatGLM2-6B	6B	62.4	37.2	70.8	59.2	58.3	38.7	54.4
Qwen-Chat-7B	7B	71.3	40.5	71.4	58.6	51.3	40.0	55.5
Qwen-Chat-14B	14B	78.0	57.6	75.6	71.6	59.3	40.6	63.8
Baichuan2-7B-Chat	7B	46.2	37.0	76.5	60.2	55.0	28.7	50.6
Baichuan2-13B-Chat	13B	69.3	39.5	75.3	65.7	62.0	31.3	57.2
InternLM2-7B	7B	70.2	39.9	73.4	62.8	61.4	39.5	57.8
InternLM2-20B	20B	76.4	52.6	76.3	66.2	63.9	42.1	62.9
CFGPT2-7B	7B	81.9	62.8	75.2	71.3	64.1	68.2	70.5
CFGPT2-20B	20B	84.6	66.5	78.1	75.9	66.0	71.9	73.8

Acknowledgements

CFGPT refers to the following open-source projects during development. We thank the researchers and developers of these projects.

InternLM: https://github.com/InternLM/InternLM
Qwen: https://github.com/QwenLM/Qwen
Firefly: https://github.com/yangjianxin1/Firefly
FinGPT: https://github.com/AI4Finance-Foundation/FinGPT

To-Do List

Build downstream applications with CFGPT, such as CFAPP.
Construct more comprehensive training tasks and corresponding datasets.
Continuously improve CFGPT's capabilities on more complex financial tasks.

License

The code of CFGPT follows the Apache License 2.0. The use of open-sourced models should also comply with the licenses of the corresponding base models and the usage terms of the training data. CFGPT1/CFGPT2 models follow the license requirements of base models such as InternLM / InternLM2. CFGPT3 is trained based on Qwen3, and the model weights of CFGPT3-8B and CFGPT3-32B are not open-sourced yet. If you find any potential risk or violation, please contact us.

Thanks To Our Contributors:

Citation

If you find CFGPT useful for your research, please cite the following papers:

@article{li2023cfgpt,
  title={CFGPT: Chinese financial assistant with large language model},
  author={Li, Jiangtong and Bian, Yuxuan and Wang, Guoxuan and Lei, Yang and Cheng, Dawei and Ding, Zhijun and Jiang, Changjun},
  journal={arXiv preprint arXiv:2309.10654},
  year={2023}
}

@article{li2024ra,
  title={RA-CFGPT: Chinese financial assistant with retrieval-augmented large language model},
  author={Li, Jiangtong and Lei, Yang and Bian, Yuxuan and Cheng, Dawei and Ding, Zhijun and Jiang, Changjun},
  journal={Frontiers of Computer Science},
  volume={18},
  number={5},
  pages={185350},
  year={2024},
  publisher={Springer}
}

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
cases		cases
code		code
data		data
figs		figs
.gitignore		.gitignore
LICENSE		LICENSE
README-zh.md		README-zh.md
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

📈 CFGPT: Chinese Financial Assistant with Large Language Model

Introduction

Contents

Quick Start

Use Cases

Data

Continual Pretraining

CPT for CFGPT1 & CFGPT2

CPT for CFGPT3

Supervised Fine-Tuning

Code

Code Structure

Continual Pretraining

Supervised Fine-Tuning

Generation Test

Evaluation

C-Eval

FinEval

CFBenchmark-Basic

OpenFinData

Acknowledgements

To-Do List

License

Thanks To Our Contributors:

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

📈 CFGPT: Chinese Financial Assistant with Large Language Model

Introduction

Contents

Quick Start

Use Cases

Data

Continual Pretraining

CPT for CFGPT1 & CFGPT2

CPT for CFGPT3

Supervised Fine-Tuning

Code

Code Structure

Continual Pretraining

Supervised Fine-Tuning

Generation Test

Evaluation

C-Eval

FinEval

CFBenchmark-Basic

OpenFinData

Acknowledgements

To-Do List

License

Thanks To Our Contributors:

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages