How to Leverage Large Language Models Over Structured and Unstructured Data in Enterprise Applications

The Challenge of Diverse Data in Enterprises

In today's data-driven world, enterprises are inundated with a mix of structured and unstructured data, housed in various repositories such as Snowflake, Google Drive, or Salesforce. This data arrives in varied formats like CSV, PDF, and Word documents, and it presents unique challenges, particularly when documents like PDFs contain complex tables where structured and unstructured data are intertwined.

Building a Large Language Model (LLM) App in This Environment

At Ritual, we've mastered the art of leveraging LLMs to augment knowledge work. The key lies in a systematic approach to handling diverse data types. Here's how you can do it:

  1. Data Connectors: Start by deploying data connectors to pull data efficiently from each source (Snowflake, Google Drive, Salesforce, and so on) into one place for processing.

  2. Data Processing: This step involves several critical sub-steps (a combined extract-chunk-embed sketch follows this list):

    • PDF Extraction: Handling PDFs can be challenging. Different OCR models offer varying results, especially when extracting data from tables, so it's advisable to experiment with multiple OCR services before committing to one.

    • Chunking: It's often beneficial to break documents down into smaller segments, so that each retrieved chunk fits comfortably in the LLM's context window and stays topically focused. For structured data, we suggest leaving it in your data lake or warehouse rather than chunking it.

    • Embedding: In this crucial phase, the embedding strategy is chosen based on your data and use cases. The strategy for embedding Python APIs, for instance, would differ from that for Word documents.

  3. Doc Retrievers and Vector Stores: Before involving the LLM, a retrieval step is essential: it sources the content most relevant to the user's query so that content can be placed in the LLM's context. Here, you have two main options:

    • Option 1 - Metadata Filters: Ideal for scenarios with a significant amount of unstructured data. Metadata filters in document retrieval systems narrow the search to pertinent documents using fields like author, date, or tags. When integrated with vector stores, they enhance retrieval precision by marrying semantic relevance to specific metadata criteria (see the metadata-filter sketch after this list).

    • Option 2 - Text-to-SQL: Recommended for cases with substantial structured data in databases. Here, the LLM generates SQL queries that are executed against your data sources (see the text-to-SQL sketch after this list).
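
To make the data-processing sub-steps concrete, here is a minimal extract-chunk-embed sketch. It assumes text-based PDFs that the open-source pypdf library can read (scanned documents would need an OCR service instead) and uses a sentence-transformers model; the file name, chunk size, and model choice are illustrative, not recommendations.

```python
# Minimal extract -> chunk -> embed sketch. Assumes text-based PDFs;
# scanned documents would go through an OCR service instead.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap; tune both per corpus."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

chunks = chunk(extract_text("report.pdf"))  # "report.pdf" is a placeholder
embeddings = model.encode(chunks)  # one vector per chunk, ready for a vector store
```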
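
For Option 1, the metadata-filter sketch below pairs semantic search with a metadata constraint, using the open-source Chroma vector store as a stand-in; the collection name, documents, and metadata fields are hypothetical.

```python
# Option 1 sketch: vector search constrained by metadata filters.
# Chroma is a stand-in vector store; the metadata fields and
# documents are hypothetical.
import chromadb

client = chromadb.Client()  # in-memory instance for illustration
collection = client.create_collection("enterprise_docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Q3 revenue grew 12% year over year.",
        "The onboarding policy was updated in March.",
    ],
    metadatas=[
        {"department": "finance", "year": 2023},
        {"department": "hr", "year": 2023},
    ],
)

# Semantic relevance AND metadata criteria: only finance documents are searched.
results = collection.query(
    query_texts=["How did revenue change last quarter?"],
    n_results=1,
    where={"department": "finance"},
)
print(results["documents"])
```

The `where` clause is what marries semantic relevance to metadata criteria: only documents whose metadata matches become candidates for the nearest-neighbor search.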
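
For Option 2, here is a text-to-SQL sketch using the OpenAI Python client, with SQLite standing in for a warehouse. The model name, schema, and prompt are assumptions to replace with your own, and generated SQL should always be validated and sandboxed before execution in production.

```python
# Option 2 sketch: LLM-generated SQL executed against a structured source.
# SQLite stands in for your warehouse; model name and schema are illustrative.
import sqlite3
from openai import OpenAI

SCHEMA = "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);"

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EMEA", 120.0), (2, "APAC", 80.0)])

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def text_to_sql(question: str) -> str:
    """Ask the model for a single read-only SQL query over the schema."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": f"Given this SQLite schema:\n{SCHEMA}\n"
                        "Reply with one SELECT statement only."},
            {"role": "user", "content": question},
        ],
    )
    sql = response.choices[0].message.content.strip()
    # Strip markdown fences the model may add around the query.
    return sql.removeprefix("```sql").removeprefix("```").removesuffix("```").strip()

sql = text_to_sql("What is the total order amount per region?")
print(conn.execute(sql).fetchall())  # sandbox generated SQL in production
```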

An orchestrator should ideally manage these processes, generating either SQL queries or vector store requests with appropriate metadata filters depending on the chosen option, as sketched below.
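
Here is a minimal sketch of such an orchestrator, assuming an LLM-based router and stubbed backends standing in for the two option sketches above; the routing prompt is a hypothetical heuristic, not a prescribed design.

```python
# Orchestrator sketch: route each question to the text-to-SQL path or the
# vector-store path. The router prompt and backend stubs are illustrative;
# in practice they would call the option sketches above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_text_to_sql(question: str) -> str:
    return f"[structured path] {question}"    # stub for the text-to-SQL sketch

def run_vector_retrieval(question: str) -> str:
    return f"[unstructured path] {question}"  # stub for the metadata-filter sketch

def route(question: str) -> str:
    """Ask the model which backend suits the question: 'sql' or 'vector'."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Reply 'sql' if the question needs aggregates or "
                        "figures from a database, otherwise reply 'vector'."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip().lower()

def answer(question: str) -> str:
    if route(question) == "sql":
        return run_text_to_sql(question)
    return run_vector_retrieval(question)

print(answer("What was total revenue per region last quarter?"))
```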

User Interface and Operations

The end user, whether within the enterprise or an external consumer, need not concern themselves with the underlying data complexities. They can interact seamlessly through interfaces like a custom ChatGPT-like interface or bots integrated into platforms like Teams or Slack.

The Advantage of End-to-End LLMOps Platforms

With an end-to-end LLMOps platform, data scientists can experiment with various embedding, chunking, LLM, and vector store strategies, and select the best fit for their specific use case in a matter of hours.
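
As a sketch of what that experimentation loop can look like, assuming a small labeled set of query-to-expected-passage pairs, the loop below grid-searches chunk sizes and embedding models and scores each configuration by retrieval hit rate; every dataset and model name here is illustrative.

```python
# Experimentation sketch: compare chunking/embedding configurations by
# retrieval hit rate on a small labeled set. All names are illustrative.
from itertools import product
from sentence_transformers import SentenceTransformer, util

# (query, phrase the retrieved chunk should contain): hypothetical labels
EVAL_SET = [("What was Q3 revenue growth?", "revenue grew 12%")]
CORPUS = "…your document text…"  # placeholder corpus

def chunk(text, size, overlap):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

for size, model_name in product([500, 1000],
                                ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]):
    model = SentenceTransformer(model_name)
    chunks = chunk(CORPUS, size, size // 5)
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)
    hits = 0
    for query, expected in EVAL_SET:
        query_vec = model.encode(query, convert_to_tensor=True)
        best = util.cos_sim(query_vec, chunk_vecs).argmax().item()
        hits += expected in chunks[best]  # did the top chunk contain the answer?
    print(f"size={size} model={model_name} hit_rate={hits / len(EVAL_SET):.2f}")
```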

The Debate: Build vs. Buy

Some teams opt to build each component in-house, but this approach is increasingly seen as reinventing the wheel. The argument for using best-of-breed components for each task does hold merit, but it often introduces development and integration complexity. The right approach depends on your unique business context and the expertise of your AI team.

The example of Microsoft Office illustrates how an integrated suite of tools can outperform individual applications. In the realm of AI, this integration becomes even more crucial: companies are beginning to realize the power of combining embedding-based retrieval with LLMs.

Conclusion

The landscape of enterprise AI is rapidly evolving, with LLMs introducing more complex architectures into an already intricate business and regulatory framework. The key to success lies in choosing a strategy that aligns with your organization's unique needs and strengths, allowing you to bring your LLM application to market quickly and focus on what truly sets your organization apart.