
Commit 7e1b87f (2 parents: 9e3c39a + 192ea6a)

Merge pull request #1235 from 545999961/master
update mteb eval

83 files changed: 131 additions & 8 deletions

Note: this is a large commit, so some file contents (including some file names) are hidden by default.

FlagEmbedding/evaluation/mteb/arguments.py (3 additions & 3 deletions)

```diff
@@ -21,6 +21,6 @@ class MTEBEvalArgs(AbsEvalArgs):
     use_special_instructions: bool = field(
         default=False, metadata={"help": "Whether to use specific instructions in `prompts.py` for evaluation. Default: False"}
     )
-    use_special_examples: bool = field(
-        default=False, metadata={"help": "Whether to use specific examples in `examples` for evaluation. Default: False"}
-    )
+    examples_path: str = field(
+        default=None, metadata={"help": "Use specific examples in the path. Default: None"}
+    )
```

FlagEmbedding/evaluation/mteb/examples.py (0 additions & 1 deletion)

This file was deleted.

FlagEmbedding/evaluation/mteb/runner.py (2 additions & 3 deletions)

```diff
@@ -160,10 +160,9 @@ def run(self):
             except:
                 logger.logger.info(f"No instruction found for {task_name}")

-            if self.eval_args.use_special_examples:
+            if self.eval_args.examples_path is not None:
                 try:
-                    eg_pairs = examples_dict[task_name]
-                    self.retriever.set_examples(eg_pairs)
+                    eg_pairs = json.load(open(os.path.join(self.eval_args.examples_path, task_name + '.json')))
                 except:
                     logger.logger.info(f"No examples found for {task_name}")
```
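The runner change above replaces the hard-coded examples module with a per-task lookup: for each task, it loads `<task_name>.json` from the directory given by `examples_path` via `json.load`, so each file should parse as a single JSON document. Below is a minimal sketch of that convention; the task name, directory, and the shape of the pair list are assumptions (the query/pos fields mirror the example files added in this commit), not the repository's documented API.

```python
import json
import os
import tempfile

# Hypothetical examples directory; in real use this is whatever you pass
# as --examples_path. The task name here is made up for illustration.
examples_path = tempfile.mkdtemp()
task_name = "ExampleTask"

# Query/positive pairs in the style of the files added by this commit.
eg_pairs = [
    {"query": "What gauges the effects of data contamination?",
     "pos": "We also undertake a systematic study of data contamination ..."},
]

# Write the per-task file where the runner expects to find it:
# <examples_path>/<task_name>.json, as a single JSON document.
with open(os.path.join(examples_path, task_name + ".json"), "w") as f:
    json.dump(eg_pairs, f)

# Runner-side lookup, mirroring the changed line in runner.py.
loaded = json.load(open(os.path.join(examples_path, task_name + ".json")))
print(len(loaded))  # → 1
```

Note that the runner wraps this lookup in a try/except, so a missing or unparsable file for a task is logged ("No examples found for ...") rather than raised.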

examples/evaluation/README.md (1 addition & 1 deletion)

```diff
@@ -115,7 +115,7 @@ For MTEB, we primarily use the official [MTEB](https://github.com/embeddings-ben
 - **`tasks`**: Tasks to evaluate. Default: None
 - **`task_types`**: The task types to evaluate. Default: None
 - **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False
-- **`use_special_examples`**: Whether to use specific examples in `examples.py` for evaluation. Default: False
+- **`examples_path`**: Use specific examples in the path. Default: None

 Here is an example for evaluation:
```

New file (name hidden): 3 additions & 0 deletions
{"query": "So, which AI model did the best on the MMMU benchmark according to Yue and his team back in 2023?", "pos": "MMMU (val) Gemini Ultra (0-shot) GPT-4V (0-shot)\nMaj@32 pass@1 pass@1\nArt & Design 74.2 70.0 65.8\nBusiness 62.7 56.7 59.3\nScience 49.3 48.0 54.7\nHealth & Medicine 71.3 67.3 64.7\nHumanities & Social Science 78.3 78.3 72.5\nTechnology & Engineering 53.0 47.1 36.7\nOverall 62.4 59.4 56.8\nTable 8|Gemini Ultra performance on the MMMU benchmark (Yue et al., 2023) per discipline."}
{"query": "The GSPMD partitioner, part of the XLA compiler, is responsible for dividing the training step calculation.", "pos": "The GSPMD partitioner (Xu et al., 2021) in the XLA compiler\npartitions the training step computation, and the MegaScale XLA compiler (XLA, 2019) pass statically\nschedules appropriate collectives so that they maximally overlap with the computation with very little\nvariation in step time.\nMaintaining a high goodput2at this scale would have been impossible using the conventional\napproach of periodic checkpointing of weights to persistent cluster storage. For Gemini models, we\ninstead made use of redundant in-memory copies of the model state, and on any unplanned hardware\nfailures, we rapidly recover directly from an intact model replica."}
{"query": "What's the impact of where you live and your social status on how well AI image labeling tech works?", "pos": "Thoughwedo\nnot see large discrepancies across different groups, we note that this metric is imperfect as the human\nreference captions could be inherently biased. Additionally, we perform a zero-shot classification style\nevaluation with the Dollarstreet dataset (Rojas et al., 2022) to measure discrepancies in performance\nacross images which come from different geographic locations. As is seen in previous work, we find\nthat models work less effectively for images from lower socioeconomic regions and regions outside\nNorth America and Europe. This is an area where we need further research and work to improve in\nfuture iterations of our models.\nIn addition to comparing performance on tasks across groups, we also consider how people are\ndescribed in captions."}

New file (name hidden): 3 additions & 0 deletions
{"query": "What gauges the effects of data contamination?", "pos": "We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models\non datasets such as Common Crawl, which can potentially include content from test datasets simply because such\ncontent often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify\nits distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most\ndatasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these\ndatasets or we note them with an asterisk, depending on the severity."}
{"query": "What strategies did the United States employ to convince Pakistan to exercise its influence over the Taliban?", "pos": "Direct\npressure on the Taliban had proved unsuccessful. As one NSC staff note\nput it, \"Under the Taliban, Afghanistan is not so much a state sponsor\nof terrorism as it is a state sponsored by terrorists.\" In early 2000,\nthe United States began a high-level effort to persuade Pakistan to use\nits influence over the Taliban. In January 2000, Assistant Secretary\nof State Karl Inderfurth and the State Department’s counterterrorism\ncoordinator, Michael Sheehan, met with General Musharraf in Islamabad,\ndangling before him the possibility of a presidential visit in March as a\nreward for Pakistani cooperation. Such a visit was coveted by Musharraf,\npartly as a sign of his government’s legitimacy."}
{"query": "What does carrying rotten potatoes symbolize?", "pos": "The children started complaining about the\ntrouble loudly.\nThen Mrs. Smith told them why she asked them to play the game. She\nsaid,\"This is exactly the situation when you carry your hatred for somebody\ninside your heart. The terrible smell of the hatred will pollute your\nheart and you will carry something unnecessary with you all the time. If\nyou cannot stand the smell of the rotten potatoes for just two weeks, can\nyou imagine how heavy it would be to have the hatred in your heart for your\nlifetime? So throw away any hatred from your heart, and you’ll be really\nhappy.\"\nQ: Which of the following is True according to the passage?\nA: If a kid hated four people,he or she had to carry four potatoes."}

New file (name hidden): 3 additions & 0 deletions
{"query": "Could you elucidate on the values of temperature and top-p that are utilized for pass@1 scores?", "pos": "8 62.8\n13B 18.3 60.2 30.6 69.0\n34B 22.6 77.2 33.0 76.1\n70B29.9 89.0 45.0 81.4\nTable 21: Code generation results on Human-Eval and MBPP . We report 0-shot and 3-shot results for\nHuman-Eval and MBPP respectively. For pass@100 and pass@80 scores, we use a temperature of 0.8 and\ntop-p=0.95. For pass@1 scores, we use a temperature of 0.1 and top- p=0.95.\n49"}
{"query": "What do high safety scores and low helpfulness ratings suggest?", "pos": "Here we show more evidence and\nqualitative results to manifest this tension. Figure32 are two scatter plots of helpfulness and safety reward\nmodel scores on the safety test set for safe and unsafe responses. The tension can be observed at the bottom\nright corner (i.e., high safety score but low helpfulness score) in the safe response plot (left) and the top left\ncorner (i.e., low safety score but high helpfulness score) in the unsafe response plot (right). We also list two\nqualitative examples where safety and helpfulness reward models don’t agree with each other in Table 35."}
{"query": "The process of carefully adjusting precautions relies on using challenging stimuli together with protected displays to make its operation run more smoothly.", "pos": "4.2 Safety Fine-Tuning\nIn this section, we describe our approach to safety fine-tuning, including safety categories, annotation\nguidelines,and the techniques we use to mitigate safety risks. We employ a process similar to the general\nfine-tuning methods as described in Section 3, with some notable differences related to safety concerns.\nSpecifically, we use the following techniques in safety fine-tuning:\n1.Supervised Safety Fine-Tuning : We initialize by gathering adversarial prompts and safe demonstra-\ntions that are then included in the general supervised fine-tuning process (Section 3.1). This teaches\nthe model to align with our safety guidelines even before RLHF,and thus lays the foundation for\nhigh-quality human preference data annotation."}

New file (name hidden): 3 additions & 0 deletions
{"query": "What are the pre-training challenges for large language models?", "pos": "To make this survey more self-contained, we present the\ndetailed formulations for these configurations in Table 6.\nNormalization Methods. Training instability is a challeng-\ning issue for pre-training LLMs. To alleviate this issue,\nnormalization is a widely adopted strategy to stabilize the\ntraining of neural networks. In the vanilla Transformer [22],\nLayerNorm [256] is employed. Recently, several advanced\nnormalization techniques have been proposed as alterna-\ntives to LayerNorm, e.g., RMSNorm, and DeepNorm.\n•LayerNorm. In the early research, BatchNorm [265] is\na commonly used normalization method. However, it is\ndifficult to deal with sequence data of variable lengths and\nsmall-batch data."}
{"query": "Language learning models seriously struggle to grasp complex symbols when they're thrown in scenarios they don't know jack about.", "pos": "For an example of\nthe out-of-domain test, LLMs could only see the examples\nwith two words in context, but it requires LLMs to concate-\nnate the last letters of three or more words. Typically, the\naccuracy of the generated symbols is adopted to evaluate\nthe performance of LLMs on these tasks. Thus, LLMs need\nto understand the semantic relations among the symbolic\noperations and their composition in complex scenarios.\nHowever, under the out-of-domain setting, as LLMs have\nnot seen the complex compositions of symbolic operations\nand rules ( e.g., twice the number of operations in context\nexamples), it is hard for LLMs to capture their accurate\nmeanings."}
{"query": "Could you shed some light on the two primary ways that LLMs employ demonstrations as discussed in document 493?", "pos": "How LLMs Perform ICL? At the inference stage, researchers\nfocus on analyzing how the ICL capability operates based\non given demonstrations since no explicit learning or updat-\ning is involved. According to the discussion in [493], there\nare two main ways for LLMs to utilize demonstrations: task\nrecognition and task learning.\n•Task recognition. In the first way, LLMs recognize the\ntask from demonstrations and utilize the prior knowledge\nobtained from pre-training to solve new test tasks. A Proba-\nbly Approximately Correct (PAC) framework [494] has been\nproposed to assess the learnability of ICL."}

New file (name hidden): 3 additions & 0 deletions
{"query": "Why is it believed the universe began at a particular time?", "pos": "According to a number of earlycosmologies and the Jewish/Christian/Muslim tradition, the universe started at a finite, and not very distant,time in the past. One argument for such a beginning was the feeling that it was necessary to have “First Cause”to explain the existence of the universe. (Within the universe, you always explained one event as being causedby some earlier event, but the existence of the universe itself could be explained in this way only if it had somebeginning.) Another argument was put forward by St. Augustine in his book The City of God. He pointed out\nthat civilization is progressing and we remember who performed this deed or developed that technique."}
{"query": "Could you elucidate on the intricate procedure of stellar constitution?", "pos": "Andeven then it was a long time before the implications of the theory for massive stars were understood.To understand how a black hole might be formed, we first need an understanding of the life cycle of a star. A star isformed when a large amount of gas (mostly hydrogen) starts to collapse in on itself due to its gravitational attraction. Asit contracts, the atoms of the gas collide with each other more and more frequently and at greater and greater speeds –the gas heats up. Eventually, the gas will be so hot that when the hydrogen atoms collide they no longer bounce offeach other, but instead coalesce to form helium. The heat released in this reaction, which is like a controlled hydrogenbomb explosion, is what makes the star shine."}
{"query": "Black hole existence evidence?", "pos": "the body that has collapsed must be lost when a black hole is formed, because afterward all we can possibly measureabout the body is its mass and rate of rotation. The significance of this will be seen in the next chapter.Black holes are one of only a fairly small number of cases in the history of science in which a theory was developed ingreat detail as a mathematical model before there was any evidence from observations that it was correct. Indeed, thisused to be the main argument of opponents of black holes: how could one believe in objects for which the onlyevidence was calculations based on the dubious theory of general relativity?"}

New file (name hidden): 3 additions & 0 deletions
{"query": "So, like, what's the big deal about the top species from the bigger groups talked about in Chapter 4?", "pos": "In our second and fourth chapters, on Variation and on Natural Selection, I have attempted\nto show that it is the widely ranging, the much diffused and common, that is the dominant species\nbelonging to the larger genera, which vary most. The varieties, or incipient species, thus produced\nultimately become converted, as I believe, into new and distinct species; and these, on the principle\nof inheritance, tend to produce other new and dominant species. Consequently the groups which\nare now large, and which generally include many dominant species, tend to go on increasing\nindefinitely in size."}
{"query": "Identify the unique species in the Chthamalinae subfamily of sessile cirripedes and the location of its fossil discovery.", "pos": "I suspect that but few of\nthe very many animals which live on the beach between high and low watermark are preserved.\nFor instance, the several species of the Chthamalinae (a sub-family of sessile cirripedes) coat the\nrocks all over the world in infinite numbers: they are all strictly littoral, with the exception of a\nsingle Mediterranean species, which inhabits deep water and has been found fossil in Sicily,\nwhereas not one other species has hitherto been found in any tertiary formation: yet it is now\nknown that the genus Chthamalus existed during the chalk period. The molluscan genus Chiton\noffers a partially analogous case."}
{"query": "Why are there flaws in the geological record?", "pos": "Nor is their rarity surprising, when we remember how large a proportion of the\nbones of tertiary mammals have been discovered either in caves or in lacustrine deposits; and that\nnot a cave or true lacustrine bed is known belonging to the age of our secondary or palaeozoic\nformations.\nBut the imperfection in the geological record mainly results from another and more important cause\nthan any of the foregoing; namely, from the several formations being separated from each other by\nwide intervals of time. When we see the formations tabulated in written works, or when we follow\nthem in nature, it is difficult to avoid believing that they are closely consecutive."}

0 commit comments
