Sharing knowledge is an important aspect of Levi9’s culture, and by organizing Meetups that bring together the IT community, we take the opportunity to share the knowledge and experience we’ve gained. During our birthday month in April, we organized a Meetup titled “AI-Driven Software Development: the good, the bad, and the ugly,” where Stefan Atanasković, a Python Developer at Levi9, and Zoran Nikić, a Test Lead at Levi9, shared their experiences.
In front of about 100 external guests, Stefan and Zoran each shared their experience of developing Levi9’s internal Chatbot application from their own perspective. In this text, we’ll summarize the Meetup and the lessons learned while working on the Chatbot application.
Phases of Developing the Internal ChatBot Application
Stefan walked the audience through the different phases of developing our internal Chatbot application. There were seven phases in total:
- Analysis and Planning: Initially, we carefully analyzed the requirements of our users and identified the key functionalities that our Chatbot should have. This step was crucial for defining the knowledge domain and setting clear development goals.
- Technology Selection: After the analysis, we decided to use OpenAI services due to their widespread usability and reliability. Additionally, we chose a Retrieval-Augmented Generation (RAG) approach, because it combines search over our own data with text generation.
- Knowledge Base Preparation: The next step was to prepare our internal knowledge base for use with the Chatbot. This included organizing and structuring the data, as well as optimizing the text to reduce the number of tokens and facilitate communication with the OpenAI service.
- RAG Pattern Implementation: Implementing the RAG pattern required careful consideration of how to divide our knowledge base into smaller parts for more efficient token usage. We used algorithms like recursive character splitting to ensure the text parts were semantically connected and to prevent information loss (a minimal chunking sketch follows this list).
- Addressing Challenges: During development, we encountered several challenges, including token limit restrictions when communicating with the OpenAI service, the accuracy of responses, and security risks such as prompt injection. Each of these challenges required a unique approach, with solutions such as a minimum similarity threshold, a limit on the number of fragments included in the prompt, and the OpenAI Moderations API to filter suspicious queries (a sketch of these safeguards also follows the list).
- Testing and Iteration: After implementation, we conducted extensive testing to check the functionality, accuracy, and security of our Chatbot. We used the feedback received to iteratively improve performance and user experience.
- Production Deployment and Maintenance: Finally, after successful testing, the Chatbot was deployed to production. We regularly monitor its performance and respond to user feedback to ensure continuous optimization and maintenance.
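To make the chunking step concrete, here is a minimal sketch of recursive character splitting with token-aware sizing. It assumes the tiktoken and LangChain libraries; the chunk size, overlap, and file name are illustrative, not the values used in our project.

```python
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token counting shows how much of the model's context window a chunk consumes.
encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

# Recursive character splitting tries to break on paragraphs first, then
# sentences, then words, so chunks stay semantically coherent.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,               # illustrative values, not the project's settings
    chunk_overlap=100,             # overlap guards against losing information at boundaries
    length_function=count_tokens,  # measure size in tokens rather than characters
)

with open("internal_docs.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks, largest = {max(map(count_tokens, chunks))} tokens")
```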
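And here is a sketch of the safeguards mentioned above: filtering suspicious queries through the OpenAI Moderations API and applying a minimum similarity threshold together with a cap on prompt fragments. The threshold, the cap, and the shape of the scored fragments are hypothetical placeholders, not our exact implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MIN_SIMILARITY = 0.75  # hypothetical threshold, tuned per knowledge base
MAX_FRAGMENTS = 4      # hypothetical cap on fragments included in the prompt

def is_suspicious(question: str) -> bool:
    """Filter prompt-injection attempts and abusive input via the Moderations API."""
    result = client.moderations.create(input=question)
    return result.results[0].flagged

def select_fragments(scored_fragments: list[tuple[str, float]]) -> list[str]:
    """Keep only fragments above the similarity floor, limited to MAX_FRAGMENTS.

    `scored_fragments` is assumed to come from a vector search that returns
    (text, similarity) pairs sorted by descending similarity.
    """
    relevant = [text for text, score in scored_fragments if score >= MIN_SIMILARITY]
    return relevant[:MAX_FRAGMENTS]
```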
ChatBot Testing
How do you test a chatbot? This is a crucial question. Zoran Nikić, Test Lead at Levi9, explained that the basic principle of testing is communicating with the application by asking questions and receiving answers. However, it’s key to understand that testing isn’t limited to the user interface: the application can also be exercised directly through its API.
Testing through the API allows for better control and understanding of the response generation process, enabling the analysis of context and the size of articles being forwarded.
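As an illustration, a test through the API might look like the following sketch. The endpoint URL and the response fields (`answer`, `context_fragments`, `score`) are hypothetical; the real internal API will differ.

```python
import requests

# Hypothetical endpoint and response shape, for illustration only;
# the real internal API is not public and will differ.
CHATBOT_URL = "https://chatbot.example.internal/api/ask"

def ask(question: str) -> dict:
    """Send a question to the chatbot service and return the parsed JSON reply."""
    response = requests.post(CHATBOT_URL, json={"question": question}, timeout=30)
    response.raise_for_status()
    return response.json()

result = ask("How do I request a new laptop?")
print("Answer:", result["answer"])

# The advantage over UI testing: the API response can expose which fragments
# were retrieved and how much context was forwarded to the model.
for fragment in result["context_fragments"]:
    print(f"- {fragment['source']} (similarity {fragment['score']:.2f})")
```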
When creating a list of questions for testing, Zoran emphasized the importance of a diverse list that covers various aspects of the application: questions targeting different parts of the documentation, negative scenarios, and even questions generated directly from the documentation using GPT, as in the sketch below.
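A sketch of that last idea: generating candidate test questions from a documentation excerpt with a single chat-completion call. The model name and the prompt wording are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()

def generate_questions(doc_excerpt: str, n: int = 5) -> list[str]:
    """Ask the model to propose test questions answerable from the given excerpt."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works; this choice is illustrative
        messages=[
            {"role": "system",
             "content": f"Write {n} questions a user could ask that are answered "
                        "by the documentation below. One question per line."},
            {"role": "user", "content": doc_excerpt},
        ],
    )
    return completion.choices[0].message.content.strip().splitlines()
```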
Verification of responses is a crucial step in testing. Zoran highlighted that, while it’s possible to recognize correct answers if we are familiar with the documentation, the question arises of how to verify responses when we are not experts in all fields. Here, two options come into play: obtaining predefined answers from experts or using GPT to generate responses based on the documentation.
Analysis of results is the next important part of the testing process. Analysis reveals why certain issues with responses occur: the cause may lie in the application’s system prompt, in documentation that lacks a clear answer, or in “hallucinations” of the LLM.
Automation of testing is also important. Zoran explained how they used an additional API call directly to GPT to verify responses and facilitate the testing process.
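A minimal sketch of that verification step: a separate GPT call acts as a test oracle, comparing the chatbot’s answer against a reference. The model choice and the PASS/FAIL protocol are our illustrative assumptions, not the exact implementation.

```python
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, reference: str, chatbot_answer: str) -> bool:
    """Use a separate GPT call as a test oracle for the chatbot's answer."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You are a strict test oracle. Reply with exactly PASS if "
                        "the candidate answer is factually consistent with the "
                        "reference answer, otherwise reply FAIL."},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Reference answer: {reference}\n"
                        f"Candidate answer: {chatbot_answer}"},
        ],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```

In a test suite, the reference answer can come either from a domain expert or from GPT itself generating it from the documentation, as described earlier.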
After describing the testing process, Zoran shared the challenges the team had faced. The examples he cited pointed to problems with the accuracy of responses, especially when they are generated from documentation that is imprecise or incomplete. He also highlighted issues with vectorization and context in testing, emphasizing the importance of file normalization and prompt quality. Finally, all attendees saw a short demo that rounded out their understanding of the Levi9 ChatBot.
The conclusion is that there is no universal solution or ideal practice for testing chatbots. Each project requires a tailored approach, and testing in production and result analysis are continuous processes. Through experience and iteration, teams can adapt their practices and resolve challenges they encounter in testing AI-based applications.
We are glad that we had the opportunity through this Meetup to share the challenges and solutions we experienced while developing our internal Chatbot application at Levi9, and to participate in an active discussion with the audience and hear their insights. We believe this gathering was beneficial for everyone, and we look forward to the next opportunity to share knowledge.
Reason #612: We love to share our knowledge.