Operating large language models in-house

At Home

Article from ADMIN 88/2025
An internal AI server is an interesting way to retain data sovereignty. We show you how to set up an in-house AI server on your hardware and use it in parallel with AI services such as ChatGPT in the cloud.

Operating your own artificial intelligence (AI) server in your data center offers a number of advantages over cloud services. One decisive factor is retaining complete control over sensitive company data: It never leaves your network, which improves data security and helps you comply with strict data protection requirements, especially in highly regulated industries. Moreover, an in-house AI server delivers consistent performance without depending on an Internet connection or external providers. Data processing latency is reduced, which is particularly beneficial for computationally intensive tasks such as image or speech analysis.

Another advantage is the ability to customize your hardware and software environments. You can scale and configure your servers individually to meet the specific requirements of your AI applications, without being restricted by standardized services from cloud providers. In the long term, an in-house server can also prove to be more cost efficient, because regular billing for cloud services is eliminated, and the infrastructure can be fully amortized. Being independent of price adjustments or service conditions imposed by external providers also gives you financial and operational peace of mind.

Hardware Requirements

The equipment for your large language model (LLM) environment depends on the requirements and the number of users, but the choice of graphics processing unit (GPU) is crucial for AI workloads: GPUs such as the NVIDIA A100 or the newer H100 are the market leaders because they are specifically optimized for deep learning and machine learning. These GPUs support technologies such as tensor cores, which specialize in computing neural networks, and offer a massive speed boost in terms of training and inference.

The H100 is based on the Hopper architecture and offers significant performance gains with lower power consumption compared with the previous generation, making it a good choice for performance-intensive AI applications. In addition to the high-end GPUs, NVIDIA also offers more cost-effective options that are particularly suitable for smaller projects or entry-level AI applications. These include the GPUs in the NVIDIA RTX series, which were originally developed for gaming and professional visualization but are also suitable for AI because of their good compute power and CUDA compatibility.

With 24GB of GDDR6X memory and up to 82.6 teraflops (TFLOPs) of single-precision 32-bit floating point (FP32) performance, the NVIDIA RTX 4090 offers a good platform for AI workloads, especially for inference-based applications or for training smaller models. The price for the RTX 4090 is currently about $2,500 (EUR2,000, £1,800), depending on the provider and availability. The NVIDIA RTX 4080 is a slightly cheaper option that comes with 16GB of GDDR6X memory and an FP32 performance of up to 49TFLOPs at a cost of around $1,500 (EUR1,200, £1,300). The NVIDIA RTX 3060 and RTX 3070/3090 are even cheaper, but with significant limitations in terms of computing power and memory. These GPUs are priced at between $300 and $1,200 (EUR300 and EUR500, £250 and £740) and only make sense in development environments.
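As a rough rule of thumb for matching models to GPU memory (ignoring context and runtime overhead): Model weights occupy about two bytes per parameter at 16-bit precision and roughly half a byte per parameter with 4-bit quantization. An 8-billion-parameter model such as Llama 3.1 8B therefore needs around 8 x 2 = 16GB of VRAM in FP16 but only 4-5GB when quantized, which is why it runs comfortably on a 24GB card such as the RTX 4090, whereas a 70-billion-parameter model (roughly 35GB even at 4 bits) does not fit on a single consumer GPU.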

In addition to the CPU and GPU, the server should have enough RAM to process large volumes of data quickly. In many cases, 256GB or more RAM makes sense, especially when it comes to large neural networks or processing big data. The storage architecture is also crucial: NVMe SSDs offer the speed you need for fast access to training data and models. Finally, the network connection should be powerful enough to transfer high volumes of data efficiently, ideally with support for 100Gbps Ethernet or InfiniBand.

For test and development environments, you can also use Apple hardware with an M2, M3, or M4 processor. These processors have integrated AI functions and can certainly keep up with the smaller NVIDIA GPUs. The M2 chip introduced significant improvements in terms of machine learning capability, particularly through the further development of the Neural Engine, which is capable of handling complex AI tasks. The M3 chip builds on these advances, with a focus on greater integration of AI tasks into everyday applications, and the M4 chip offers even more compute power. One advantage of Apple devices is the significantly lower power consumption compared with servers that have powerful NVIDIA GPUs. A Mac mini with an M2 Pro is definitely an interesting option for newcomers, especially because the software presented in this article also runs on macOS. Users access the server over the network from a web browser, so it usually doesn't matter which operating system is installed on the server itself.

An AI Team

At the heart of the AI server presented here is Ollama [1], an open source infrastructure platform that simplifies the deployment, management, and execution of LLMs. The platform offers the ability to package and deploy AI models efficiently, much like Docker does for container applications. Administrators can use these packages to run their own AI servers. Popular LLMs available through Ollama include Llama 3.1, Phi-3, Mistral, and Gemma 2, which means you can specify which LLMs are available on the server, install them in parallel, and give users a choice.
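The Docker parallel extends beyond deployment: Once Ollama is installed (as described below), you can derive your own packaged model variants from a plain-text Modelfile, much as you would build an image from a Dockerfile. As a minimal sketch, with a hypothetical helpdesk model name and an example system prompt:

# Modelfile: derive a custom variant from a base model
FROM llama3.1
# lower temperature for more deterministic answers
PARAMETER temperature 0.3
# fixed role for the assistant (example prompt)
SYSTEM "You are an internal IT helpdesk assistant. Answer concisely."

The commands ollama create helpdesk -f Modelfile and ollama run helpdesk then build and start the custom variant.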

During the installation (on Ubuntu 24.04 for this example), you can integrate several LLMs that are available completely free of charge on an open source basis. Another advantage is that the source code is open and can therefore be examined for backdoors and issues.

The user interface is Open WebUI [2], which is optimized for use with LLMs and works with various web servers. You can use Open WebUI for speech-to-text scenarios as well as for creating texts or images and analyzing data. The user interface is similar to that of ChatGPT or other AI services. You can upload documents to the server and query the AI on the basis of these documents. The software also lets you implement rights management to control access to the AI service. Open WebUI has its own user administration, but a connection to Active Directory is not currently possible. Of course, you could set up authentication with OAuth [3], but I am getting a little ahead of myself, so I'll take a look at setting up the software environment first.

Installing Ollama

Once Ubuntu 24.04 LTS is in place on the server, you can proceed to set up Ollama by simply running the installation script provided by the developers. Ideally, you will be carrying out the configuration and installation in a sudo shell:

curl -fsSL https://ollama.com/install.sh | sh

During the process, Ollama identifies the existing graphics adapter and integrates the matching drivers. This important step is the only way to ensure that the AI services use the GPU; otherwise, the server's CPU is used, which entails significant performance hits. The script's output shows whether the hardware has been identified correctly. To update Ollama on the server later, just run the command again; you do not need to stop any services first, because the script handles this itself, although you will need to restart the machine when done. You can check the results in a terminal window or type

service ollama status

to use SSH to check that the service is working correctly. Once Ollama is available, you can type

ollama pull llama3.1

(e.g., to integrate the Llama LLM). You can also download other models [4] in the same way. Llama 3.1 is ideal for getting started because it roughly corresponds to the functions of OpenAI's GPT models. The process naturally takes some time because the LLMs are several gigabytes in size, and Ollama downloads them completely to the server. Llama has its own hardware requirements, which are listed in detail online [5].
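To verify that the GPU was detected and keep track of the models installed locally, a few quick checks are useful (assuming an NVIDIA card with the standard drivers):

nvidia-smi            # lists the GPU, driver version, and current VRAM use
ollama list           # shows all locally installed models and their sizes
journalctl -u ollama  # displays the service log (e.g., while a model loads)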

In addition to Llama, CodeGemma is also interesting because it lets developers create code and search for information in the software development field. If you want to work with images, LLaVA is the most exciting option because it can analyze image content. As mentioned, all data always remains on the local server for all models used; Ollama does not phone home to the Internet. However, you can enable options that let users access the Internet or open connections to OpenAI LLMs; these connections are disabled by default.
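The models are not limited to interactive use: Ollama also exposes a REST API on port 11434 of the local server, which other applications on your network can call. A minimal test with curl might look like this (the prompt is just an example):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize the advantages of an in-house AI server in one sentence.",
  "stream": false
}'

Setting stream to false returns the complete answer as a single JSON response instead of a token stream.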

After downloading the first LLM, you should check that it works; in this example, you run Llama by typing:

ollama run llama3.1

From the command line, enter a first prompt to test the results; you can exit the interactive session by typing /bye. If everything works, you can install the web interface for the AI server.
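As a preview of that step: The Open WebUI project documents a Docker-based installation that connects to the Ollama instance running on the same host. A typical invocation looks like this (the port mapping and container name can be adapted):

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

The web interface is then available on port 3000 of the server.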
