Trying to build an internal-use AI chatbot trained on internal data sources such as Confluence, Jira, etc. Any tips on tools to use?

Hi,

There are a lot of tools for building customer-facing chatbots that scan a webpage, load additional PDFs, etc., but I would like to build an internal-use chatbot assistant that is trained on sensitive internal data, so I would need APIs to push the training data and teach the AI the specifics of the business. Any suggestions on tools to use so I don't have to reinvent the wheel?

You can make a custom GPT and upload your data to it if you have a subscription.

Or you could do RAG with private data.

Here's a tutorial from DataStax, but there are other platforms and ways to do RAG.
https://docs.datastax.com/en/astra-serverless/docs/vector-search/chatbot-quickstart.html
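To make the RAG suggestion concrete, here is a minimal sketch in the LangChain style. The vector store (Chroma), model names, and chunking parameters are illustrative placeholders, not the DataStax tutorial's exact stack:

```python
# Minimal RAG sketch (assumes langchain, langchain-openai, chromadb installed
# and an OpenAI API key set; swap in whatever vector store / LLM you use).
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# 1. Chunk your internal documents (here: a plain list of strings).
docs = ["...page text exported from Confluence...", "...a Jira ticket..."]
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents(docs)

# 2. Embed and index the chunks in a local vector store.
store = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 3. Answer questions by retrieving the top-k chunks into the prompt.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
print(qa.invoke({"query": "How do we deploy service X?"})["result"])
```

The point is that the private data never trains the model; it is embedded, indexed, and retrieved into the prompt at question time.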


Thanks, Nick, for the offer. I already built a RAG-based chatbot as a proof of concept to see if the idea could work, and also came to understand that there is way too much to do for the project to be financially feasible. That's why I am looking for some ready-made tools for this.

Like I said, a custom GPT is potentially a good solution that's super easy to configure. I haven't tested it much, but you can upload your own data.

Copilot Studio can be "trained" on SharePoint sites and publicly available sites. As of right now, I have no way of trying out hooking Atlassian products to it as a data source (they require a login).

From a technical standpoint it should be easily doable by forwarding an API key; however, I am not aware whether this is implemented (yet).

There are other solutions available on the Atlassian Marketplace. I'm just not sure if that's the way you want to go (Microsoft seems to protect your data, whereas GPT is rather unclear).

For what it's worth, Copilot Studio is the best pretrained thing available right now, as far as I can tell. It's a pretrained model (GPT) that hooks into your data sources and lets you interact with them. You can then deploy the chat to many different interfaces, e.g. a Teams user. If it's not what you're looking for, keep an eye on it anyway; they will surely build on top of it.

Thanks, Matz. We are testing Copilot Studio, but yeah, we would need something we can train over an API. So we have to wait until they add that, or maybe there is something like that already; that's why I asked 🙂


The problem with using RAG for this is that you're not really training the LLM, and any RAG context you feed the LLM eats up the context window, which reduces the effectiveness of the response as the prompt grows. The OP is looking for tools to incrementally train an LLM so it retains the knowledge. I think LoRAs that get incrementally built, with periodic merging into an off-the-shelf open-source base like Llama 2 or Mistral, would fit that better.
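For what that LoRA idea could look like in practice, here is a hedged sketch using Hugging Face PEFT; the base model, target modules, and ranks are illustrative assumptions, not a tested recipe:

```python
# Sketch of the incremental-LoRA idea with Hugging Face PEFT.
# Model name, ranks, and training data are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Attach a LoRA adapter: only these small low-rank matrices are trained,
# so the base weights (and their general knowledge) stay frozen.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()

# ... train the adapter on the new internal documents (Trainer/TRL) ...

# Periodically fold the adapter into the base weights and start a
# fresh adapter for the next batch of documents.
merged = model.merge_and_unload()
merged.save_pretrained("./checkpoints/merged-step-1")
```

Merging periodically keeps the knowledge in the weights rather than in the prompt, which is exactly the trade-off against RAG's context-window cost described above.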

I did consider fine-tuning an LLM, yes, and just supplementing it with RAG to be able to show exact sources/links with additional information. However, I have a major challenge: the materials for fine-tuning are not in English but Latvian, and I think that rules fine-tuning out. So I think I need to stick with an LLM with a sufficiently big context window. And then again, I could try different approaches, but I am looking for ready-made frameworks/tools so I don't have to build all the boilerplate myself.


Are there any Latvian base-model LLMs? Alternatively, have you considered translating the material from Latvian to English when calculating the weights?

Unfortunately, Latvian is a small language, so… I have not found any. And translating would mean also translating the requests, and that would almost guarantee that the specific industry-related questions would get "lost in translation".


Confluence: Confluence | šŸ¦œļøšŸ”— Langchain
Jira: Jira | šŸ¦œļøšŸ”— Langchain
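For example, the Confluence loader can be wired up in a few lines. This is a rough sketch; the URL and credentials below are placeholders, and the exact constructor kwargs vary between langchain_community versions:

```python
# Pulling Confluence pages into LangChain documents (hedged sketch).
from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url="https://yourcompany.atlassian.net/wiki",
    username="me@example.com",
    api_key="YOUR_ATLASSIAN_API_TOKEN",
    space_key="ENG",            # which space to export
    include_attachments=False,
    limit=50,                   # pages per API request
)
docs = loader.load()            # -> list of Documents for your RAG index
```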

(NLLB-200: Meta AI Research Topic - No Language Left Behind)

Approach 1:

Assuming the user query is in English and the base LLM works well with English, have the documents translated to English via NLLB. It does carry over the semantic meaning of the sentence. If you find a better model, you can go ahead and use it instead.

If your user query is in Latvian, convert the user query to English, then do RAG over the data you have converted to English.
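A rough sketch of that translate-then-RAG step using NLLB-200 via the transformers pipeline; the distilled 600M checkpoint and the FLORES-200 code lvs_Latn for Standard Latvian are from the NLLB model card, while the example query is made up:

```python
# Hedged sketch: translate Latvian text to English with NLLB-200
# before indexing (documents) and before querying (user questions).
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="lvs_Latn",   # Standard Latvian, FLORES-200 code
    tgt_lang="eng_Latn",
)

# Translate both the documents at index time and the query at ask time,
# then run your normal English RAG pipeline over the results.
query_lv = "Kā mēs izvietojam pakalpojumu X?"
query_en = translator(query_lv, max_length=512)[0]["translation_text"]
print(query_en)
```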

Approach 2 (if you really, really, really don't want to use RAG):
Do a frankenmerge or a weight merge of NLLB and a base model such as Mistral.
Lazy Merge Kit (Colab)
Mergekit (GitHub)


I suggest using plain Python to write the code that downloads your assets as web resources; that should work for both Confluence and other web datasets. Keep the dataset component as its own function that copies the content into a local directory, then focus on naming and indexing.

The RAG pattern can be added afterwards, but sometimes a static content locator that re-searches the static content you replicated (a copy-and-modify pattern, like the brain's model making) is enough. This lets you update retrieval whenever you like. Use your own logic to put content into context and to select the parts to "not care" about; dropping extraneous info gives much richer output when you place your selected content into context alongside your prompt, especially for cartesian-product questions and generations, since AI is best at sticking with a single topic when iterating.

Here are open-source examples using a similar pattern: 🩺🔍 Care Team Finder - a Hugging Face Space by awacke1, and this one gets content: 📄📂WebDataDownload - a Hugging Face Space by awacke1.

I feel the best architecture for this is ultimately RAG, but thinking creatively about how you rotate your acquired content datasets into context is the key to making it work the way you want.
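As a hedged sketch of that download-first approach against Confluence Cloud's REST API (base URL, space key, and credentials are placeholders):

```python
# Pull pages via the Confluence Cloud REST API and save them locally,
# one JSON file per page, so indexing/RAG can be layered on later.
import json
import pathlib

import requests

BASE = "https://yourcompany.atlassian.net/wiki"
AUTH = ("me@example.com", "YOUR_ATLASSIAN_API_TOKEN")
OUT = pathlib.Path("confluence_dump")
OUT.mkdir(exist_ok=True)

url = f"{BASE}/rest/api/content"
params = {"spaceKey": "ENG", "expand": "body.storage", "limit": 25}
while url:
    resp = requests.get(url, params=params, auth=AUTH, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    for page in data["results"]:
        # One file per page; the page id doubles as a crude index key.
        (OUT / f'{page["id"]}.json').write_text(json.dumps(page))
    # Follow the pagination link until no more pages remain.
    nxt = data.get("_links", {}).get("next")
    url = f"{BASE}{nxt}" if nxt else None
    params = None  # the "next" link already carries the query string
```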
