Introduction
Since the introduction of AlphaGo by Google DeepMind in 2016, the performance of artificial intelligence (AI) systems has improved significantly, driven by advances in computing power. The development of graphics processing units (GPUs) optimized for parallel computation has enabled substantial progress in computationally intensive fields such as natural language processing (NLP). The emergence of the transformer model in 2017 marked a pivotal milestone, ushering in a new era of NLP research [1].
In NLP, the transformer model processes sentences primarily through two key steps: tokenization and embedding. Tokenization is the process of breaking a sentence into smaller units, called tokens, which may include words, characters, punctuation marks, or numbers. Embedding follows tokenization, converting these tokens into vector representations that the model can interpret and process. Through this structured representation, the model captures contextual meanings and relationships between words, thereby enabling it to probabilistically predict the next word or sentence [1].
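The two steps above can be sketched in miniature. The following toy Python example uses a naive whitespace tokenizer and a randomly initialized embedding table purely for illustration; real transformer models use learned subword vocabularies and trained embedding matrices.

```python
import random

# Toy whitespace tokenizer; real transformers use learned subword vocabularies.
def tokenize(sentence):
    return sentence.lower().replace(",", " ,").replace(".", " .").split()

random.seed(0)
EMB_DIM = 4          # toy embedding dimension; real models use hundreds or thousands
vocab = {}           # token -> integer id
embeddings = {}      # token -> dense vector

def embed(tokens):
    """Map each token to an integer id and a dense vector representation."""
    vectors = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            embeddings[tok] = [random.uniform(-1, 1) for _ in range(EMB_DIM)]
        vectors.append(embeddings[tok])
    return vectors

tokens = tokenize("Treat asymptomatic bacteriuria in pregnant women.")
vectors = embed(tokens)
print(tokens)           # the sentence broken into tokens, punctuation included
print(len(vectors[0]))  # each token becomes a 4-dimensional vector
```

From these vector sequences, the model's attention layers compute contextual relationships between tokens; the embedding step shown here is only the entry point of that process.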
Building on advancements in transformer architecture, large language models (LLMs) such as ChatGPT [2], Vicuna, LLaMA [3], and LLaMA2 [4] have emerged. Among these, LLaMA2 has been made available for commercial use, driving active research and development across countries, institutions, companies, and developers.
Recently, it was reported that ChatGPT, one of the leading LLMs, achieved a passing score on the United States Medical Licensing Examination [5], raising concerns about the potential of AI to replace human professionals in specialized fields. However, LLMs such as ChatGPT operate fundamentally based on probabilistic word-prediction mechanisms. Therefore, achieving a passing score on an examination does not necessarily indicate that these models possess the comprehensive capabilities required to replace human experts.
Although models such as ChatGPT may not yet be capable of fully replacing human experts, the rapid advancement of AI technology highlights the necessity of proactively integrating LLMs and related tools into the medical domain. Establishing an environment that facilitates early adoption and adaptation is essential to keep pace with these technological shifts.
Therefore, this study aims to explore the potential applications of LLMs in the medical domain by incorporating LangChain technology [6].
Methods
Study materials
In this study, two clinical guidelines were selected to evaluate whether LLMs could interpret and apply them effectively in the medical field. The selected guidelines were: (1) the “2018 Guidelines for Antibiotic Use in Urinary Tract Infections” published by the Korea Disease Control and Prevention Agency (KDCA) [7], and (2) the “Preliminary Recommendations for the Management of Long COVID” published by the Korean Society of Infectious Diseases [8].
These guidelines were chosen for the following reasons. First, documents issued by government-affiliated agencies, such as the KDCA, are typically written in precise and consistent language, which may facilitate comprehension by LLMs. Second, the long coronavirus disease (COVID) guidelines were selected because they contain few tables and figures, allowing direct analysis by LLMs without additional preprocessing.
The following is a list of questions extracted from each guideline:
- Does treating asymptomatic bacteriuria prevent the development of symptomatic urinary tract infection or perinatal complications in pregnant women?
- In non-pregnant women, does the treatment of asymptomatic bacteriuria prevent the development of symptomatic urinary tract infection?
- In women residing in nursing homes, does treatment of asymptomatic bacteriuria prevent the development of symptomatic urinary tract infection?
- In women with diabetes, does treatment of asymptomatic bacteriuria prevent the development of symptomatic urinary tract infection?
- In patients with spinal cord injury, does treating asymptomatic bacteriuria prevent the development of symptomatic urinary tract infection?
- In patients with indwelling urinary catheters, does treatment of asymptomatic bacteriuria prevent the development of symptomatic urinary tract infection?
- Does treating asymptomatic bacteriuria prevent infectious complications in patients undergoing urological procedures likely to cause mucosal bleeding (e.g., transurethral resection of the prostate)?
- What are the non-carbapenem antibiotic treatment options for uncomplicated acute pyelonephritis caused by extended-spectrum beta-lactamase-producing organisms?
- Should empirical antibiotics be used in adult patients suspected of having emphysematous pyelonephritis, and if so, which antibiotics should be used?
- In patients with acute bacterial prostatitis, does the addition of medications such as alpha-blockers enhance treatment efficacy?
- In patients with acute bacterial prostatitis complicated by a prostatic abscess, does aspiration and drainage improve treatment outcomes?
- Under what circumstances should post-acute COVID-19 syndrome (long COVID) be suspected?
- In patients with post-acute COVID-19 syndrome, is thromboprophylaxis necessary?
- Do patients with post-acute COVID-19 syndrome require general rehabilitation or pulmonary rehabilitation?
- How should persistent respiratory symptoms in post-acute COVID-19 syndrome be managed?
- How should olfactory and gustatory dysfunction in post-acute COVID-19 syndrome be managed?
- How should fatigue in post-acute COVID-19 syndrome be managed?
- How should headache or cognitive symptoms in post-acute COVID-19 syndrome be treated?
- How should psychological or psychiatric symptoms in post-acute COVID-19 syndrome be treated?
- Is steroid therapy beneficial in patients with post-acute COVID-19 syndrome?
- What specific considerations are required for the diagnosis and treatment of post-acute COVID-19 in children and adolescents?
- Does COVID-19 vaccination influence the development of post-acute COVID-19 syndrome?
Experimental environment
All experiments in this study were conducted using the following hardware and software configurations:
Processor: Intel® Core™ i7-10700K (Intel Corporation, USA), 8 cores, 3.8 GHz
GPU: NVIDIA® GeForce RTX 3090 Ti (NVIDIA Corporation, USA), 24 GB VRAM, 10,496 CUDA cores
RAM: Samsung DDR4 128 GB, 3200 MHz, Quad Channel (Samsung Electronics Co., Ltd., Korea)
Storage: Samsung 980 Pro NVMe SSD, 1 TB (Samsung Electronics Co., Ltd., Korea)
Operating System: Windows 11 Pro (build 22621) (Microsoft Corporation, USA)
Software and libraries: oobabooga/text-generation-webui [9]
Large Language Models (LLMs): KoVicuna [10] (KoAlpaca Project, Korea), WizardVicuna [11] (LMStudio Community, USA), LLaMA 2 [12] (Meta AI, USA), and ChatGPT-3.5 [13] (OpenAI, USA).
The experimental environment was configured as described above for the following reasons. First, because hospitals and clinics typically operate in closed-network environments that cannot reach web-based models such as ChatGPT, the highest-tier consumer GPU was selected to support fully local computation. Second, the available VRAM of this GPU constrained the size of the deployable models, limiting the study to 7 B and 13 B parameter models.
Token optimization using LangChain
Contemporary LLMs process inputs by tokenizing the text, with each token requiring computational resources during inference. Consequently, there are technical limits on the number of tokens an LLM can handle in a single interaction. For instance, ChatGPT-3.5 is known to have a limit of approximately 4,096 tokens per request. In this study, the number of tokens was restricted to 2,048 to balance performance and response latency.
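A minimal sketch of enforcing such a token budget is shown below. A naive whitespace word count stands in for the model's actual subword tokenizer (which typically yields more tokens than words), and the 2,048 limit mirrors the budget used in this study.

```python
MAX_TOKENS = 2048  # budget used in this study; real tokenizers count subwords, not words

def count_tokens(text):
    # Naive whitespace proxy for a subword tokenizer, for illustration only.
    return len(text.split())

def truncate_to_budget(text, budget=MAX_TOKENS):
    """Keep only as many leading tokens as the budget allows."""
    words = text.split()
    if len(words) <= budget:
        return text
    return " ".join(words[:budget])

doc = "guideline " * 5000              # a document far exceeding the budget
clipped = truncate_to_budget(doc)
print(count_tokens(clipped))           # 2048
```

Naive truncation like this discards everything beyond the budget, which is precisely why a retrieval step that selects only the relevant portion of a document is preferable for long clinical guidelines.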
Although such limitations can hinder the analysis of lengthy or complex documents, this study employed LangChain technology to address token constraints [6]. By leveraging LangChain, larger texts can be processed and responses generated efficiently, even within a closed-network environment with limited computing resources (Fig. 1).
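The retrieval workflow that LangChain provides can be sketched without the library itself: the guideline is split into chunks, each chunk is embedded, and only the chunk most similar to the question is passed to the LLM, keeping the prompt within the token budget. In this self-contained sketch, a bag-of-words embedding and cosine similarity stand in for LangChain's embedding models and vector stores, and the sample guideline text is invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Bag-of-words term frequencies; LangChain would use a learned embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def split_into_chunks(document, chunk_size=50):
    """Split a long document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(question, chunks):
    """Return the chunk most similar to the question (top-1 retrieval)."""
    q = embed(question)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

guideline = (
    "Asymptomatic bacteriuria in pregnant women should be treated to prevent "
    "pyelonephritis and perinatal complications. "
    "Patients with long COVID and persistent fatigue should receive graded activity guidance."
)
chunks = split_into_chunks(guideline, chunk_size=12)
context = retrieve("Should asymptomatic bacteriuria be treated in pregnant women?", chunks)
prompt = f"Answer using only the following context:\n{context}\n\nQuestion: ..."
print(context)  # only the relevant chunk enters the prompt
```

Because only the retrieved chunk reaches the model, a guideline of arbitrary length can be queried within a fixed token budget; the quality of the answer then hinges on whether retrieval surfaces the right chunk.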
Results
This study utilized three offline-accessible models, KoVicuna [10], WizardVicuna [11], and LLaMA2 [12], and the online-accessible ChatGPT model, all released by August 2023. Using LangChain technology, the two selected clinical guidelines were converted into vector databases. Each model was then queried to generate responses to key clinical questions derived from these guidelines.
The four models demonstrated answer accuracy rates ranging from 27.3% to 72.7% and response times (i.e., the time from submitting a query to receiving a complete answer) ranging from 18.3 to 71.9 seconds when responding to key questions extracted from the clinical guidelines. Among them, ChatGPT, which had the largest number of parameters, achieved both the highest accuracy and the shortest response time (Fig. 2).
In general, models with a larger number of parameters, typically measured in billions (B), tend to exhibit improved performance [14]. This trend was also observed in the present study: ChatGPT-3.5, which is estimated to contain approximately 175 B parameters, achieved the highest answer accuracy. Similarly, the WizardVicuna 13 B model outperformed the KoVicuna 7 B model, further supporting the correlation between model size and performance.
Notably, the LLaMA2 7 B model demonstrated accuracy comparable to that of the WizardVicuna 13 B model. This can be attributed to differences between their underlying base models. WizardVicuna is based on Vicuna, which in turn is built upon LLaMA1. According to Meta, “LLaMA 2 pretrained models are trained on 2 trillion tokens and have double the context length than LLaMA 1” [15]; this represents approximately 40% more pre-training tokens than its predecessor, and the doubled context window enables the model to process substantially more tokens at once. Additionally, LLaMA2 was fine-tuned on approximately one million prompts, further enhancing its performance.
Therefore, when comparing the performance of LLaMA2 7 B with that of WizardVicuna 13 B, it can be inferred that a sufficiently trained LLM, based on a high volume of high-quality pre-training data, such as LLaMA2 7 B, can generate meaningful responses in the medical domain even with fewer parameters. This finding reinforces the importance of the quality and scale of pre-training data in determining model performance [16].
Discussion
LLMs are no longer confined to corporations and research institutions; the widespread availability of tools such as ChatGPT signals a rapidly approaching reality in which these technologies are accessible to the general public. Enabled by advances in AI and the growing availability of cloud-computing resources, LLMs are being adopted at an increasing pace, driving innovation across a range of disciplines.
This study evaluated the performance of LLMs released between February and August 2023 in resource-constrained environments, focusing on their potential clinical applications. By prompting the models with questions derived from official medical guidelines, we identified both their strengths and limitations.
However, the use of LangChain for context retrieval produced inconsistent results. Because retrieval performance depends on embedding quality, relevant information was sometimes not returned, leading to incomplete or irrelevant outputs.
In addition, differences in model size significantly affected both accuracy and usability. Smaller models, such as LLaMA2 7 B, frequently produced partial or superficial answers, whereas larger models, such as WizardVicuna 13 B, generated more comprehensive outputs but required substantially longer inference times. This trade-off between performance and computational burden has been observed in other evaluations of LLMs for medical and technical tasks [17].
Furthermore, hallucinations (the generation of factually incorrect or unsupported content) were consistently observed. For example, one model recommended interventions that were not included in the source guidelines. Such phenomena have been widely reported as critical barriers to the adoption of LLMs in medicine [18]. In clinical settings, hallucinations may mislead physicians and patients, highlighting the need for continuous monitoring, validation, and strict alignment with authoritative sources.
In addition to these technical limitations, practical and ethical issues must also be considered. Response times exceeding one minute for larger models may disrupt outpatient workflows, particularly in patient-facing specialties such as family medicine. Concerns about privacy, data security, and regulatory compliance remain central to the clinical deployment of LLMs [19].
Taken together, our findings suggest that current LLMs, when deployed on a single high-end personal computer, can provide clinically relevant information but remain insufficient to replace physicians. Instead, they may serve as supportive tools for augmenting communication, knowledge retrieval, and decision-making. To enable safe integration into healthcare, further research should establish standardized evaluation frameworks, ensure robust privacy protection, and develop safeguards to minimize hallucinations and other risks.