waiyiaisg committed on
Commit 9f9c28a · verified · 1 Parent(s): d3b95ae

Update README.md

Files changed (1): README.md (+5 -5)

README.md CHANGED

For tokenisation, the model employs the default tokenizer used in Llama 3 8B Instruct.

We evaluated Llama-SEA-LION-v2-8B-IT on both general language capabilities and instruction-following capabilities.

#### General Language Capabilities
For the evaluation of general language capabilities, we employed the [SEA-HELM (also known as BHASA) evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).

Note: SEA-HELM is implemented using prompts to elicit answers in a strict format. For all tasks, the model is expected to provide an answer tag from which the answer is automatically extracted. For tasks where options are provided, the answer should comprise one of the pre-defined options. The scores for each task are normalised to account for baseline performance due to random chance.
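
The chance-baseline normalisation mentioned in the note above can be sketched as a linear rescaling. This is a minimal illustration under that assumption; `normalise` is a hypothetical helper, not part of the SEA-HELM codebase, and the exact formula used by the benchmark may differ:

```python
def normalise(raw_score: float, random_baseline: float) -> float:
    """Rescale a raw score (0-100) so that random-chance performance
    maps to 0 and a perfect score maps to 100; below-chance scores
    clip to 0. Illustrative only."""
    scaled = (raw_score - random_baseline) / (100.0 - random_baseline) * 100.0
    return max(0.0, scaled)

# Example: a 4-option multiple-choice task has a 25% random baseline,
# so a raw accuracy of 62.5% normalises to 50.0.
print(normalise(62.5, 25.0))  # 50.0
```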

The evaluation was done **zero-shot** with native prompts on a sample of 100-1000 instances for each dataset.

#### Instruction-following Capabilities
Since Llama-SEA-LION-v2-8B-IT is an instruction-following model, we also evaluated it on instruction-following capabilities with two datasets: SEA-IFEval (based on [IFEval](https://arxiv.org/abs/2311.07911)) and SEA-MTBench (based on [MT-Bench](https://arxiv.org/abs/2306.05685)).

As these two datasets were originally in English, the linguists and native speakers on the team worked together to filter, localize and translate the datasets into the respective target languages to ensure that the examples remained reasonable, meaningful and natural.
 
 
**SEA-IFEval**

SEA-IFEval evaluates a model's ability to adhere to constraints provided in the prompt.

**SEA-MTBench**

SEA-MTBench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model, i.e. the average of the win rates across the categories (Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction). A tie is given a score of 0.5.
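
The tie scoring and category averaging described above can be sketched as follows. `weighted_win_rate` is a hypothetical helper for illustration, not the actual judging harness:

```python
def weighted_win_rate(results_by_category: dict[str, list[str]]) -> float:
    """Average the per-category win rates against the baseline model.
    Each outcome is 'win' (1.0), 'tie' (0.5) or 'loss' (0.0).
    Hypothetical helper for illustration only."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    per_category = [
        sum(score[o] for o in outcomes) / len(outcomes)
        for outcomes in results_by_category.values()
    ]
    # Weight every category equally, regardless of how many examples it has.
    return sum(per_category) / len(per_category)

print(weighted_win_rate({
    "Math": ["win", "loss"],    # category win rate 0.5
    "Writing": ["win", "tie"],  # category win rate 0.75
}))  # 0.625
```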

For more details on Llama-SEA-LION-v2-8B-IT benchmark performance, please refer to the SEA-HELM leaderboard: https://leaderboard.sea-lion.ai/

### Usage
Llama-SEA-LION-v2-8B-IT can be run using the 🤗 Transformers library:
```python
import transformers
import torch