
How Cloudera is bringing LLMs to its data lakehouse

Cloudera is not building its own LLMs; rather, it is making it easier for enterprises to use LLMs to gain insights from data that organizations already have in a data lakehouse.

Cloudera already has a catalog of reference architectures that it provides to its users; existing use cases have included AI models for customer churn and fraud analytics. Now the company is expanding with architectures for conversational AI and LLMs. Venkatesh explained that CDP users can select the new LLM reference architecture from the catalog and have it installed in their environment in a few minutes.

The training approach that Cloudera is embracing is what is known as zero-shot learning, where an existing LLM can quickly benefit from an existing data source. The initial set of LLMs that Cloudera is integrating with are open-source models that can run entirely inside the Cloudera platform. Venkatesh noted that by running the LLM in the same platform as the data, organizations can ensure that no data ever leaves the enterprise’s purview and no external API calls are being made. He emphasized that keeping data under tight control is critical for some enterprises.

The intersection between vector databases and Cloudera’s data lakehouse platform

Part of the Cloudera LLM reference architecture is the integration of open-source vector databases into the stack. Venkatesh said that Cloudera is enabling its users to choose which open-source vector database to use. Among the options are Milvus, Weaviate and Qdrant.

Data lakehouse technology relies on object storage, which Venkatesh said is often a great way for organizations to store unstructured and semi-structured data. To work with AI, that data needs to be organized with a vector database.

“You really need a database engine that can take a semantic search query, run it in vector space, and return the most relevant results back to you,” he said.

Venkatesh emphasized that creating a vector database for an LLM deployment with Cloudera does not mean enterprises are duplicating data, with one set in the lakehouse and another in the vector database. Rather than duplicating data, a vector database provides a functional index of the data as vectors.

How LLMs are the logical path forward from Big Data

When Cloudera got started in 2008, Big Data, in the form of the open-source Hadoop project, was the company’s foundation. The Big Data market has shifted over the years into the data lakehouse space, where organizations use query engines, typically SQL-based, for data analytics on data stored in cloud object storage repositories.

Venkatesh now sees LLMs as the next logical step on the path forward from Big Data.

“A bunch of us came to work in Big Data, not because we were all excited about SQL, but to look at fundamentally different ways to analyze data,” Venkatesh said.

He explained that Big Data created a pyramid-like approach for data analytics, where the Big Data resides at the bottom and only a small amount of data could be analyzed at the top. With LLMs, that pyramid structure has flattened out, with significantly more data available for analysis, and easier methods.

“What I see with LLMs and the new wave of AI is an era where you can now analyze all the data at the topmost layer and instead of querying with just SQL or Spark, it’s English or natural language queries,” Venkatesh said.
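To make the vector database idea that Venkatesh describes more concrete, here is a minimal Python sketch of a vector index that points back to objects in lakehouse storage instead of copying them. It is an illustration only, not Cloudera's implementation or the API of Milvus, Weaviate or Qdrant. The `IndexEntry` class, the `search` function, the `s3://example-bucket/...` paths and the three-dimensional vectors are all hypothetical; in a real deployment an embedding model would produce the vectors.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexEntry:
    # Hypothetical pointer to the original object in lakehouse storage.
    # The index keeps this reference, not a copy of the document itself.
    object_uri: str
    # Toy embedding; a real system would get this from an embedding model.
    vector: list


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index, query_vector, top_k=2):
    """Score every entry against the query and return the top_k matches."""
    scored = [
        (cosine_similarity(entry.vector, query_vector), entry.object_uri)
        for entry in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Assumed sample data: made-up paths and hand-written vectors.
index = [
    IndexEntry("s3://example-bucket/docs/churn_report.txt", [0.9, 0.1, 0.0]),
    IndexEntry("s3://example-bucket/docs/fraud_cases.txt", [0.1, 0.9, 0.1]),
    IndexEntry("s3://example-bucket/docs/hr_policy.txt", [0.0, 0.2, 0.9]),
]

# Stand-in for the embedding of a natural language question.
query_vector = [0.8, 0.2, 0.0]

results = search(index, query_vector, top_k=2)
for score, uri in results:
    print(f"{score:.3f}  {uri}")
```

The search returns references to the most relevant objects, which could then be fetched from the lakehouse and handed to an LLM as context. Production vector databases replace the linear scan shown here with approximate nearest-neighbor indexes so that queries scale to large collections.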
