How to Output Image Content in a Text-Image Knowledge Base
Learn how to implement text-image mixed display effects in Dify knowledge base, outputting high-quality RAG content with images
Note
Original Article
一个无人问津的小站 - Lesson 04: How to Output Image Content in DIFY Text-Image Knowledge Base
First, don't expect to get good results by simply feeding a few PDFs to the embedding model. Good results often require you to understand your actual needs, organize high-quality materials, and understand the operating principles of RAG.
If you want to use DIFY to produce answers that mix text and images, refer to this tutorial.
Image Storage Solutions
When outputting mixed text and images in the DIFY knowledge base retrieval process, the key is image storage. Currently, there are two solutions:
- Store images on a remote server, so text-image mixing actually loads images from the server
- Store images in Word documents, where DIFY automatically generates URL paths for remote access when parsing the Word file
This tutorial focuses on the second approach, which allows you to quickly implement text-image mixing effects without additional server costs and domain configuration.
Content Organization
First, organize your knowledge base content into a Word document. If you encounter parsing errors when processing Word files, you can first paste your content into a Feishu (Lark) document, then use Feishu's export feature to download it as a Word file. In effect, Feishu normalizes your content into a more standard Word format and embeds the images in the Word file rather than referencing external links.
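One way to confirm that an exported file really embeds its images is to look inside it: a .docx file is a ZIP archive, and embedded pictures are stored under word/media/. The following is a minimal sketch of such a check; the helper name list_embedded_images is hypothetical and not part of DIFY or Feishu.

```python
import zipfile


def list_embedded_images(docx_file):
    """Return the archive paths of images embedded in a .docx file.

    `docx_file` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        # Embedded pictures are stored under word/media/ inside the archive.
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
```

Calling list_embedded_images on your exported file should return one entry per picture. An empty list suggests the images are still external links and will probably not survive the import.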
When organizing your document, try to use two line breaks as separators, which helps DIFY's default segment identifier correctly recognize paragraphs. Of course, you can also use special identifiers and modify them later in the DIFY configuration. For example, I use two line breaks here, which corresponds to \n\n.
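To get a feel for how a two-line-break separator divides a document before importing it, you can approximate the split locally. This is only a rough sketch, not DIFY's actual segmentation logic (which also applies settings such as a maximum segment length). The function name split_segments and the IMAGE_URL placeholders are made up for illustration.

```python
def split_segments(text, separator="\n\n"):
    """Split text on the separator and drop empty segments."""
    return [part.strip() for part in text.split(separator) if part.strip()]


# Two blocks separated by a blank line; each keeps its image with its text.
sample = (
    "Step 1: open the settings page.\n![image](IMAGE_URL_1)"
    "\n\n"
    "Step 2: click Save.\n![image](IMAGE_URL_2)"
)

for index, segment in enumerate(split_segments(sample), start=1):
    print(f"--- segment {index} ---")
    print(segment)
```

Keeping each image in the same block as the text it illustrates means both end up in the same segment, so retrieval returns them together.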
Knowledge Base Configuration
After downloading the Word file, you can import it into the DIFY knowledge base for processing. Pay attention to the segment identifier configuration to ensure it matches the separator you chose. Then click the preview button to check how each block is segmented. In my case, the preview on the right matched the segmentation I expected.
For the embedding model and rerank model, simply select the models provided by SiliconFlow (硅基流动).
After saving, wait a moment for the embedding to complete. At this point, we can directly use retrieval testing to see the text-image effect.
Q&A Process Design
Next, we can insert a knowledge retrieval node in the chatflow and select the knowledge base content we just added.
Then, add an LLM node to further process the retrieved content, prompting the LLM to maintain the text-image mixed format, preventing the model from automatically filtering out image information.
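The instruction given to the LLM node might look like the sketch below, paired with a small check you can run on a test answer to see whether any image was dropped. Several things here are assumptions rather than details from the original article: that images reach the model as Markdown image syntax, that {{#context#}} is the placeholder your LLM node uses for the retrieved content, the prompt wording itself, and the helper name missing_images.

```python
import re

# Matches Markdown image syntax such as ![alt](url); the group captures the URL.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")

# Hypothetical system prompt; {{#context#}} is assumed to be replaced by the
# retrieved knowledge base content before the model sees it.
SYSTEM_PROMPT = """Answer the question using only the context below.
The context may contain images written as Markdown, for example ![image](URL).
Keep every relevant image exactly as written, in its original position
relative to the surrounding text. Do not rewrite, shorten, or drop image URLs.

Context:
{{#context#}}
"""


def missing_images(context, answer):
    """Return image URLs that appear in the context but not in the answer."""
    answer_urls = set(IMAGE_PATTERN.findall(answer))
    return [url for url in IMAGE_PATTERN.findall(context) if url not in answer_urls]
```

An answer does not have to reuse every image in the retrieved context, so a non-empty result is a prompt to review the output rather than proof of an error.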
Finally, you'll be able to achieve a text-image mixed display effect.