Add challenge set

This commit is contained in:
rlm
2023-05-05 20:05:50 -07:00
parent 2c0936ea91
commit 7cd7a50a94
6 changed files with 11 additions and 0 deletions
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,11 @@
"question","answer",
"What are the limitations of task-specific fine-tuning?", "First, the need for a large dataset of labeled examples for every new task limits the applicability of language models. Second, high capacity models tend to over-fit on narrow fine-tuning datasets and do not generalize well outside of them. Third, humans do not require large supervised datasets to learn most language tasks. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.",
"What is in-context learning?","In-context learning is an approach to meta-learning, which means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task when given examples. This involves absorbing many skills and tasks within the parameters of the model.",
"On what NLP tasks does GPT3 report state-of-the-art performance using zero or few shot learning relative to fine-tuned benchmarks?","GPT3 achieves 71.2% on TriviaQA in the few-shot setting, which is state of the art relative to fine-tuned models operating in the same closed-book setting.",
"What are the pros and cons of fine-tuning, zero-shot learning, and few-shot learning?","Fine-tuning involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. It benefits from strong performance on many benchmarks, but requires a new large dataset for every task. Few shot learning gives the model a few demonstrations of the task at inference time as conditioning, but no weight updates are done. It benefits from a major reduction in the need for task-specific data. But results from this method have so far been much worse than state-of-the-art fine-tuned models. In zero-shot learning, the model is only given a natural language instruction describing the task without any examples. It is the most convent and potentially robust approach, but the most challenges (especially for tasks that are difficult to describe).",
"How is the batch size increased for the GPT3 models?","The batch size is increased linearly from a small value (32k tokens) to the full value of 3.2M token over the first 2 billion tokens of training.",
"How does RETRO perform retrieval in terms of search and latency?", " For each chunk, RETRO will retrieve its approximate k-nearest neighbours from a key-value database using the L2 distance on BERT embeddings. It uses the SCaNN library to query the approximate nearest neighbours in O(log𝑇) time."
"What scaling law does the Chinchilla paper propose and how does Chinchilla compare to Gopher?", "The paper fits a scaling law for loss L, as a function of model size N and data size D. Based on the losses of over 400 models, the paper suggests that large models should be substantially smaller and therefore trained much longer than is currently done. They verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens, which is 4x smaller than Gopher."
"How do the LLaMA model compare to prior benchmarks, such as PALM, Chinchilla, and GPT-3?","LLaMA is trained only on publicly available data, making the work compatible with open-sourcing. LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. The 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B."
"How did the LLaMA models draw inspiration from GPT3, PaLM, GPTNeo, or Chinchilla?","Like GPT3, LLaMA normalizes the input of each transformer sub-layer using RMSNorm. Like PaLM, they replace the ReLU non-linearity with the SwiGLU activation function. Like GPTNeo, they remove the absolute positional embeddings, and instead, add rotary positional embeddings. The general approach was inspired by the Chinchilla scaling laws: LLaMA-13B outperforms GPT3, but can be run on a single GPU."
"How does Gato embed multi-modal inputs?" , "Tokens belonging to text, discrete or continuous-valued observations or actions for any time-step are embedded via a lookup table into a learned vector embedding space. Tokens belonging to image patches for any time-step are embedded using a single ResNet block to obtain a vector per patch."