	---
	base_model:
	- microsoft/codebert-base
	datasets:
	- devngho/the-stack-llm-annotations-v2
	language:
	- code
	library_name: transformers
	license: mit
	metrics:
	- f1
	---

	# devngho/code_edu_classifier-v3-microsoft_codebert-base

	This model is [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) with a classifier head. It is designed to evaluate the educational value of code, similar to the [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) but focused on code. The training data is the [devngho/the-stack-llm-annotations-v2](https://huggingface.co/datasets/devngho/the-stack-llm-annotations-v2) dataset, which contains samples extracted from [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup) and scored with [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct).

	This research was supported with Cloud TPUs from Google's TPU Research Cloud [(TRC)](https://sites.research.google/trc/about/).⚡

	- Developed by: devngho
	- Language(s): code
	- License: mit
	- Base model: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
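
A minimal inference sketch for the classifier described above. This card does not say whether the head emits a single regression score (as the fineweb-edu classifier does) or six class logits, nor whether the checkpoint loads through `AutoModelForSequenceClassification`, so all of that is assumed here. The helper names `score_from_logits` and `score_code` and the 512-token truncation limit are illustrative choices, not part of the released model.

```python
from typing import List


def score_from_logits(logits: List[float]) -> int:
    """Map one sample's raw head output to an integer score in 0..5."""
    if len(logits) == 1:
        # Assumption: regression-style head, so round and clip to the label range.
        return int(min(5, max(0, round(logits[0]))))
    # Assumption: classification-style head, so pick the highest-scoring class.
    return max(range(len(logits)), key=lambda i: logits[i])


def score_code(snippet: str) -> int:
    """Score one code snippet; downloads the model on first call."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "devngho/code_edu_classifier-v3-microsoft_codebert-base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Assumption: truncate to 512 tokens, the usual limit for CodeBERT inputs.
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return score_from_logits(logits)
```

Calling `score_code("def add(a, b):\n    return a + b")` would return an integer between 0 and 5 if the assumptions above hold for this checkpoint.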

	## Training details

	- learning_rate: 3e-4 (cosine)
	- warmup_ratio: 0.1
	- batch_size: 2048(512*4)
	- optimizer: adamw(b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)
	- duration: 4h 41m
	- steps: 6080
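
The schedule implied by the settings above can be sketched as follows. The card only states "cosine" with a warmup ratio, so the linear warmup shape and the decay to zero at the final step are assumptions, and `lr_at` is a hypothetical helper rather than the training code.

```python
import math

PEAK_LR = 3e-4       # learning_rate from the list above
WARMUP_RATIO = 0.1   # warmup_ratio from the list above
TOTAL_STEPS = 6080   # steps from the list above


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)  # 608 steps
    if step < warmup_steps:
        # Assumption: warmup rises linearly from 0 to the peak rate.
        return PEAK_LR * step / warmup_steps
    # Assumption: cosine decay from the peak rate down to 0 at the last step.
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 608 and falls to half the peak at step 3344, midway through the decay phase.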

	## Training hardware

	TPU v4-8

	## Performance

	```
	Validation Report:
	              precision    recall  f1-score   support

	           0       0.80      0.06      0.10        72
	           1       0.62      0.40      0.48       835
	           2       0.61      0.62      0.61      2722
	           3       0.48      0.72      0.58      1891
	           4       0.62      0.02      0.05       623
	           5       0.00      0.00      0.00         1

	    accuracy                           0.55      6144
	   macro avg       0.52      0.30      0.30      6144
	weighted avg       0.58      0.55      0.52      6144

	Confusion Matrix:
	[[   4   36   30    2    0    0]
	 [   1  330  464   40    0    0]
	 [   0  157 1684  881    0    0]
	 [   0    5  516 1361    9    0]
	 [   0    0   71  537   15    0]
	 [   0    0    0    1    0    0]]
	```

	When the scores are binarized at a threshold of 3 (3 or higher versus below 3), the F1 score is about 0.72.
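
The binary figure can be reproduced from the confusion matrix above. This sketch assumes rows are true labels and columns are predicted labels (the row sums match the support column, which is consistent with that reading) and treats "3 or higher" as the positive class. `binary_f1` is an illustrative helper, not code from the training pipeline.

```python
# Rows are true labels 0..5, columns are predicted labels 0..5,
# copied from the confusion matrix above.
CONFUSION = [
    [4, 36, 30, 2, 0, 0],
    [1, 330, 464, 40, 0, 0],
    [0, 157, 1684, 881, 0, 0],
    [0, 5, 516, 1361, 9, 0],
    [0, 0, 71, 537, 15, 0],
    [0, 0, 0, 1, 0, 0],
]


def binary_f1(matrix, threshold=3):
    """F1 for the 'score >= threshold' class, from a multi-class confusion matrix."""
    n = len(matrix)
    # True and predicted labels both at or above the threshold.
    tp = sum(matrix[t][p] for t in range(threshold, n) for p in range(threshold, n))
    # True label below the threshold, predicted at or above it.
    fp = sum(matrix[t][p] for t in range(threshold) for p in range(threshold, n))
    # True label at or above the threshold, predicted below it.
    fn = sum(matrix[t][p] for t in range(threshold, n) for p in range(threshold))
    return 2 * tp / (2 * tp + fp + fn)


print(round(binary_f1(CONFUSION), 4))  # prints 0.7174
```

With 1923 true positives, 923 false positives, and 592 false negatives, precision is about 0.68 and recall about 0.76, which gives the reported F1 of roughly 0.72.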