High Alpha Ideas

LLMs and Proprietary Datasets

Large language models (LLMs) like GPT-4 and PaLM 2 contain hundreds of billions of parameters and are available to anyone in the world – for a fee. The wide availability of these so-called “foundation models” means that differentiation and competitive advantage for businesses building on top of them will come in the form of datasets that can be used to improve product offerings or be licensed to others. This gives rise to a few interesting questions that we’re exploring at High Alpha:

  • How will businesses access these datasets to incorporate them into their own fine-tuned models or applications?
  • What enterprise security and data privacy considerations need to be taken into account?
  • Will “give-to-get” and the notion of “data collaboratives” proliferate?
  • Will data be licensed one-time, in perpetuity, or only when consumed by an application?


If users contributed to a dataset, how will they be compensated for their contributions, and which attribution models will support fair payment?

  • Shutterstock’s Contributor Fund offers one possibility, but paying contributors in proportion to the number of records they added to a dataset is a flawed attribution model along several dimensions.
  • Over the long term, will the value of proprietary data converge to zero as foundation models are trained on larger training datasets and synthetic data supplements proprietary data?
  • Will datasets be called through a marketplace API at the time a prompt is submitted, or will they be incorporated into a model’s embeddings ahead of time?
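To make the last question concrete, here is a minimal sketch of the second pattern – a dataset embedded ahead of time and queried locally at prompt time. Everything here is an assumption for illustration: the toy character-frequency “embedding” stands in for a real embedding model, and the sample records are invented. The first pattern would instead replace the local index lookup with an HTTP call to a marketplace endpoint when the prompt arrives.

```python
import math

# Toy "embedding": a normalized character-frequency vector. A real system
# would call an embedding model here; this stand-in keeps the sketch runnable.
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Pattern 2: the licensed dataset is embedded once, ahead of time,
# into a local index. (Records below are hypothetical examples.)
RECORDS = ["quarterly churn benchmarks", "seed stage valuations", "sales hiring plans"]
INDEX = [(record, embed(record)) for record in RECORDS]

def retrieve(prompt: str) -> str:
    """Return the dataset record most similar to the prompt's embedding."""
    q = embed(prompt)
    return max(INDEX, key=lambda pair: cosine(q, pair[1]))[0]

print(retrieve("churn benchmark data"))  # → quarterly churn benchmarks
```

The trade-off the bullet points at: embedding the dataset up front makes retrieval fast and offline but means the buyer holds a derived copy of the data, while a prompt-time marketplace API keeps the data with the licensor and enables per-query metering.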


Are you a future co-founder or potential early customer interested in joining us to turn this idea into a business? Get in touch below or learn more about our studio process.