GPT-J is the Open Source Alternative to GPT-3

EleutherAI aims to break open Microsoft's monopoly over transformer-based language models with its new GPT-J model.


Overview:
  • EleutherAI’s GPT-J vs OpenAI’s GPT-3
  • EleutherAI claims new NLP model approaches GPT-3-level performance
Review:
  • Can’t Access GPT-3? Here’s GPT-J — Its Open-Source Cousin
  • Fun and Dystopia With AI-Based Code Generation Using GPT-J-6B
Twitter Thread:
  • Max Woolf
GPT-J-6B:
  • GitHub

GPT-3 Generated Summary:

  • EleutherAI is a research group focused on AI alignment, scaling, and open-source AI research
  • The group has released two GPT-Neo models, with 1.3 billion and 2.7 billion parameters respectively
  • Microsoft has exclusive access to GPT-3’s source code as part of a larger agreement between Microsoft and OpenAI
  • EleutherAI is supported by Google and CoreWeave (cloud computing providers)
  • EleutherAI has built The Pile, an 825-gigabyte (GB) language-modelling dataset curated from sources including arXiv, GitHub, Wikipedia, StackExchange, and HackerNews
  • Now it has launched GPT-J, one of the largest models EleutherAI has released to date
  • GPT-J is a 6-billion-parameter model trained on The Pile
  • GPT-J is comparable in performance to the GPT-3 variant of similar size (6.7 billion parameters)
  • Because GPT-J was trained on GitHub (7 percent) and StackExchange (5 percent) data, it is better than GPT-3 175B at writing code; on other tasks, however, it is significantly worse
  • GPT-J is a fully end-to-end trainable Transformer LM on the same scale as 6.7B GPT-3, and EleutherAI says it achieves state-of-the-art zero-shot performance among publicly available models.
  • The model was trained on 400 billion tokens from The Pile, a dataset of roughly 800 GB of text.
  • Efficient attention (linear, local, sliding-window, etc.) was not used, both for simplicity and because it would not have significantly improved throughput at this scale.
  • The dimension of each attention head was set to 256, more than in the comparably sized GPT-3. “This noticeably improved the throughput with minimal performance degradation,” said Komatsuzaki.
  • The team made two minor architectural changes to GPT-J: rotary position embeddings for slightly better performance, and placing the attention and feedforward layers in parallel for decreased communication, as sketched below.
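
To make those two architectural changes more concrete, here is a simplified, hypothetical PyTorch sketch of a decoder block with rotary position embeddings and the attention and feedforward layers computed in parallel, using GPT-J-like dimensions (model width 4096, 16 heads of dimension 256). It is not EleutherAI's actual Mesh-Transformer-JAX code, and it omits details such as GPT-J applying rotary embeddings to only part of each head.

```python
import math
import torch
import torch.nn as nn


def apply_rotary(x, base=10000):
    """Rotary position embedding: rotate channel pairs by position-dependent
    angles so attention scores depend on relative token offsets.
    x: (batch, heads, seq, head_dim) with an even head_dim."""
    _, _, seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    angles = torch.outer(torch.arange(seq, device=x.device).float(), inv_freq)
    cos, sin = angles.cos()[None, None], angles.sin()[None, None]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class ParallelBlock(nn.Module):
    """Decoder block where attention and feedforward both read one shared
    LayerNorm output and are added to the residual stream together,
    instead of being applied sequentially."""

    def __init__(self, d_model=4096, n_heads=16, d_ff=16384):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads  # 256 with GPT-J-like sizes
        self.ln = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        bsz, seq, _ = x.shape
        h = self.ln(x)  # a single LayerNorm feeds both branches

        # Attention branch, with rotary embeddings applied to queries and keys.
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (
            t.view(bsz, seq, self.n_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        q, k = apply_rotary(q), apply_rotary(k)
        causal = torch.triu(
            torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(causal, float("-inf"))
        attn = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(bsz, seq, -1)
        attn = self.proj(attn)

        # Feedforward branch runs on the same h, not on x + attention output.
        ff = self.mlp(h)

        # Both residual contributions are added at once.
        return x + attn + ff
```

Because both compute-heavy branches consume the same normalized input, they can be scheduled together under model parallelism, which is the intuition behind the "decreased communication" claim.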

Advantages of GPT-J:

  1. It is open source and hence can be used by anyone, including commercial organisations like Microsoft.
  2. It was built with the Mesh Transformer JAX framework and can be used for both training and inference (a minimal inference sketch follows this list).
  3. Its training data includes GitHub and StackExchange, which makes it better at writing code than the GPT-3 175B model, though worse at other tasks such as reading comprehension or translation.
  4. GPT-J has been built to be compatible with both TPUs and GPUs (via CoreWeave).
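
Because the weights are public, GPT-J can be tried in a few lines of code. Below is a minimal inference sketch assuming the Hugging Face Transformers library and its hosted EleutherAI/gpt-j-6B checkpoint; this is a common community route to the weights, not EleutherAI's own Mesh-Transformer-JAX stack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision: roughly 12 GB of weights
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Code-flavored prompt, since code generation is one of GPT-J's strengths.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same sketch also runs in float32 on CPU, but then the 6 billion parameters need roughly 24 GB of memory and generation is much slower.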