With the growing popularity of Large Language Models (LLMs) and applications such as ChatGPT, I see more and more misconceptions about using these models in production-level systems. In this article, I try to raise awareness of them among new developers and entrepreneurs.
Myth #1: Prompt Engineering will solve my problems
Several studies focus on prompting techniques and how they improve the model's reasoning. For example, adding "let's think step by step" to a prompt can increase accuracy by several percent.
While these techniques allow for micro-optimizations and teach us more about how the models were trained, they also feed the fallacy that a "perfect prompt" exists (speculation). Hundreds of prompt libraries out there try to lure you into the same belief.
So, what is the alternative? Prompt Chaining.
For example, you can ask for a recipe for Pastéis de Nata and then pass the output to another prompt as input, this time asking the model to criticize the recipe. Once you have the critique, you can chain it to the next prompt, asking the model to improve the recipe based on that critique. Finally, you can validate that the recipe matches the user's taste.
Does it remind you of something? Yes. It is the way humans tend to solve problems, and it seems that we are reinventing nature. But that's a topic for another article.
The resulting quality would never be possible with a single prompt, even the most ideal one.
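The recipe chain described above could be sketched roughly like this. It is a minimal sketch, not a definitive implementation: `llm` is a hypothetical callable that takes a prompt and returns the model's text (plug in whatever client you use), and `chain_recipe`, `fake_llm`, and the prompt wording are my own illustrative assumptions.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns the model's text.
LLM = Callable[[str], str]


def chain_recipe(llm: LLM, dish: str, user_taste: str) -> str:
    """Run a four-step chain: draft -> critique -> improve -> validate."""
    draft = llm(f"Write a recipe for {dish}.")
    critique = llm(f"Criticize this recipe and list its weaknesses:\n{draft}")
    improved = llm(
        "Improve the recipe using the critique.\n"
        f"Recipe:\n{draft}\n\nCritique:\n{critique}"
    )
    verdict = llm(
        f"The user's taste: {user_taste}\n"
        f"Does this recipe fit it? Answer YES or NO, then explain.\n{improved}"
    )
    # Return the improved recipe only if the validation step approves it.
    if not verdict.strip().upper().startswith("YES"):
        raise ValueError(f"Recipe failed validation: {verdict}")
    return improved


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("Write a recipe"):
        return "Draft recipe"
    if prompt.startswith("Criticize"):
        return "Too sweet"
    if prompt.startswith("Improve"):
        return "Improved recipe with less sugar"
    return "YES, it matches."


if __name__ == "__main__":
    print(chain_recipe(fake_llm, "Pastéis de Nata", "not too sweet"))
```

Each step is a small, focused prompt, so you can test, log, and replace the steps independently.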
Myth #2: I have enough context to put all my data into the prompt
The ever-growing context window of models (GPT-4 Turbo allows for 128K tokens!) makes it tempting to put all the data into one prompt. This is a very easy way to interact with models, but it has several bad side effects: processing time increases, the LLM gives different weight to the beginning, middle, and end of the context, the quality of the response drops, and the costs increase astronomically for a production-level system.
While it could be suitable for fast prototypes and, maybe, for some very specific use cases, such logic is really hard to scale for the reasons listed above.
To overcome these limitations, we can use several strategies that mostly fall under the classic "divide and conquer" approach.
Sometimes we tend to pass even simple tasks to LLMs, but we should remember that an LLM call is a very expensive operation. Identify what can be done without LLMs and extract it into code that follows formal logic.
We should not forget that ML models existed before LLMs. While the latter are great at general reasoning, classical ML offers production-level performance and cost-effectiveness in classification, tagging, analysis, simple decision-making, etc. It is much easier today to train your own model, even if you are not a data scientist. You can even use LLMs to generate training examples for your simpler ML models!
Then, depending on your use case, you can offload your data to a vector database with embeddings or fine-tune your model. Choose the vector DB if you want the LLM to provide you with exact data, and fine-tuning if you want to change the behavior of the LLM.
The last, but maybe the simplest in terms of implementation, is the "prompt chaining" mentioned in the previous myth. You can split your prompts and data into different requests.
These are only some of the strategies, and you can find even more. The main point is to always be aware that, in most cases, putting all your context into one prompt is not the best idea for scaling (meaning you will need to refactor the code sooner than you think).
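As an illustration of extracting work into plain code, here is a minimal sketch in which a well-defined task (finding email addresses) is handled by a regular expression and only everything else goes to the model. The function names, the keyword check, and the simplified email pattern are my own assumptions, and `llm` is again a hypothetical prompt-to-text callable.

```python
import re
from typing import Callable

# Simplified pattern for illustration; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Plain regex: no model call is needed for a well-defined pattern."""
    return EMAIL_RE.findall(text)


def answer(question: str, text: str, llm: Callable[[str], str]) -> str:
    """Use formal logic when the task allows it; call the LLM only otherwise."""
    if "email" in question.lower():
        return ", ".join(extract_emails(text))
    return llm(f"{question}\n\n{text}")


if __name__ == "__main__":
    sample = "Contact ana@example.com or joe.b@mail.org."
    print(answer("List the email addresses", sample, llm=lambda p: "(model reply)"))
```

The cheap path costs nothing per request and always returns the same result for the same input, which is not something you get from a model call.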
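To make the vector database idea more concrete, here is a toy sketch of retrieval by embedding similarity. Everything in it is a stand-in: `embed` just counts words where a real system would call an embedding model, and the in-memory list plays the role of the vector DB. The point is only that the prompt receives the few most relevant documents instead of all the data.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Put only the retrieved documents into the prompt, not the whole data set."""
    context = "\n".join(top_k(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "refund policy lasts 30 days",
        "shipping takes 5 days",
        "our office is in Lisbon",
    ]
    print(build_prompt("how long is the refund policy", docs, k=1))
```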
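One possible way to split data across requests is to process it chunk by chunk and then combine the partial results. The sketch below assumes a summarization task and a hypothetical `llm` callable; splitting by word count is a simplification I chose for illustration, since real limits are measured in tokens.

```python
from typing import Callable


def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(llm: Callable[[str], str], text: str, max_words: int = 200) -> str:
    """Summarize each chunk in its own request, then merge the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        return ""
    partial = [llm(f"Summarize:\n{chunk}") for chunk in chunks]
    if len(partial) == 1:
        return partial[0]  # A single chunk needs no merge step
    return llm("Combine these summaries into one:\n" + "\n".join(partial))
```

Each request stays small, so latency and cost per call stay predictable, at the price of one extra call to merge the results.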
Myth #3: LLMs are always there to help and are deterministic
Running LLMs in production is not the easiest task. As a result, platforms such as HuggingFace and ChatGPT are overloaded and constantly unstable, not to mention that they keep evolving and their output changes with every new update. Probably, one day we will reach more stable ground, but at this point the hype is being driven by greed. Expect that, if your main user journey depends on an LLM, there will be seconds, minutes, and hours when it is not available.
So either crack the code yourself and host a stable LLM on your own cloud resources, or provide users with a fallback for when the LLM is not available or changes its behavior.
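A fallback could look roughly like the sketch below: retry the call a few times with a growing delay, and return a prepared answer if the model stays unavailable. The `with_fallback` helper, its parameters, and the `llm` callable are hypothetical names of mine, not part of any library.

```python
import time
from typing import Callable


def with_fallback(
    llm: Callable[[str], str],
    prompt: str,
    fallback: str,
    retries: int = 2,
    delay_seconds: float = 0.5,
) -> str:
    """Try the LLM a few times; return the fallback if it stays unavailable."""
    for attempt in range(retries + 1):
        try:
            return llm(prompt)
        except Exception:
            # Assumption: any exception means "unavailable". Real code should
            # catch the specific error types raised by its LLM client.
            if attempt < retries:
                time.sleep(delay_seconds * (2 ** attempt))  # exponential backoff
    return fallback
```

The fallback does not have to be a static string: it could be a cached answer, a simpler model, or a plain non-AI version of the same user journey.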