Architecting Stateful Large Language Model Simulations for Infrastructure-less Cybersecurity Education

MOUH MARIAM

Team name
Learnifiers
Team members (First name, LAST NAME, University)
Mariam, MOUH, ENSIASD Ibn Zohr - Fatima, BADDAZ, ENSIASD Ibn Zohr - Nouhila, HADOUDER, ENSIASD Ibn Zohr - Sara, EL-ATEIF, ENSIASD Ibn Zohr (supervisor)
What area does your use case primarily fall under?
Training / education / pedagogy
The AI use case you are working on
We surveyed 30 students in CS, Engineering, and Business and found that 100% use AI, but 74% only read outputs or copy and paste them without really thinking. At the same time, 92% still struggle to connect theory with practice. Students also face challenges understanding complex concepts, staying focused, managing time, and finding clear guidance or resources. This suggests that AI is creating an unhealthy dependency in which students learn passively instead of actively reasoning.
Why this use case matters
This situation deserves attention because how deeply students understand their studies directly influences their academic success, their confidence, and even their future careers. When students do not learn in depth, gaps in their knowledge accumulate and overwhelm them, which undermines their concentration. It also raises the question of the future of AI in academia and how we can benefit from it in a positive way. To address the problem, we would like to develop an AI tutor that builds problem-solving and critical-thinking skills through Socratic questioning, problem decomposition, and first-principles thinking. Instead of giving answers, it starts from real problems, works through them step by step, and offers hints and questions to the student. It also gives exercises along the way and checks understanding to make sure that students have actually learned the concepts.
Your team's motivation and learning objectives
We are motivated by this project because we are students ourselves, so we are directly affected by this problem and genuinely want to solve it. The project would help students enjoy learning while improving their skills. We also want to support the integration of AI in education, as many institutions still reject or discourage its use. We have experienced the failures of AI ourselves: copying answers without truly understanding them, being unable to solve similar problems later, and even feeling guilty about using it. That is why we want to build a system that follows the student's reasoning, detects confusion, and guides them step by step toward real understanding. We are also motivated to grow technically in prompt engineering, LLM integration, and designing a user experience that shifts students from passive consumers to active problem solvers.
Your initial contribution
1. What is the situation or context you are addressing?
A motivated cybersecurity student in a low-resource setting cannot get hands-on technical training because the required infrastructure is too costly for most institutions in developing areas. The affected group includes students and institutions worldwide. It ranges from learners in developing countries in sub-Saharan Africa, Southeast Asia, North Africa, and Latin America, who deal with poor infrastructure and limited connectivity, to underfunded community colleges, vocational training centers, and non-profit bootcamps in developed countries that do not have the budget for adequate lab environments. Data shows the extent of the issue: 46% of institutions worldwide lack cybersecurity labs, and only 11% have facilities that meet basic training standards (Catota et al., 2019). Managed lab platforms charge $60 to $120 per user each year before compute costs. The U.S. Cyber Range costs up to $240 per student annually. Learners resort to workarounds such as using free-tier accounts, sharing credentials, or watching static tutorials, or they quit. The market therefore offers either costly, functional solutions or free options that lack educational value. Nothing available is at once affordable, interactive, practical, and well guided (U.S. Cyber Range, 2025; ACI Learning, 2025).

2. What is your critical analysis of this situation?
The main issue behind this crisis is not a shortage of talent but a failure in the existing teaching model. VM-based labs are expensive and rely heavily on connectivity. Cloud alternatives shift some of these barriers but do not fully remove them (Khan & Mohamed, 2025). LLMs seem to provide a solution: testing shows an 80% success rate on beginner offensive security tasks at a cost of just a few cents per session.
However, the Bandit CTF study reveals that although GPT-4o achieves 80% success on isolated tasks, it struggles with tasks that require continuity across steps. The cause is a problem called "Temporal Decay": the model loses track of changes in the environment once the conversation history exceeds the context window (Shi et al., 2025). The financial impact of doing nothing is significant: the cost gap is 20 to 100 times, and it often decides whether a low-resource institution can offer a cybersecurity program at all (U.S. Cyber Range, 2025). A key issue that others have overlooked is the need to separate environmental state from the language model itself. The STaDS study shows that LLMs only perform well in structured environments when decisions rely on clear external knowledge instead of on recalling information from within the context (Liang et al., 2024). There are also hidden challenges: errors in tool syntax can teach incorrect habits (ACL Anthology, 2025), and LLMmap fingerprinting can identify the underlying model with 95% accuracy in just eight interactions, which makes the system vulnerable to targeted attacks (Pasquini et al., 2025).

3. What perspectives were discussed and how were they debated?
Three perspectives drove every major decision. The technical perspective required reliable state persistence without corruption over long sessions. The learner perspective insisted that any friction or inaccuracy could permanently disengage students who have no human tutor for support. The business perspective needed the token-based cost model to handle real exploratory usage patterns without exceeding the budget. Three approaches were seriously considered and then rejected. Context-window state management was discarded because the context fills predictably, attention degrades, and costs rise linearly. Docker containerization was rejected because it was still too heavy for a mid-range smartphone on mobile data.
Pre-scripted decision-tree simulation was eliminated because its branching logic collapses under unexpected inputs, which students learning offensive security would inevitably produce (Khan & Mohamed, 2025; Shi et al., 2025). The biggest conflict arose over accuracy requirements. One side accepted 95% syntactic accuracy as good enough for education, while the other argued that a 5% error rate is not an acceptable tolerance in cybersecurity education; it represents systematic corruption of the student's professional mental model. The disagreement was resolved by recognizing that accuracy in technical education is a correctness constraint, not a performance metric. Consequently, the RAG grounding layer became mandatory without exception. All decisions were assessed against four criteria: accuracy, accessibility, pedagogical depth, and cost at scale. This approach led to a genuine synthesis: the external Shadow State Database (SSD) from the technical perspective, mandatory RAG grounding from the learner perspective, sliding-window memory from the business perspective, and the Goal Tree hint system combining all three (Hao et al., 2023).

4. What contribution are you proposing, and under what conditions could it be implemented?
The proposed contribution is an infrastructure-less cybersecurity lab in which a Large Language Model (LLM) acts as the simulator. It is supported by a Shadow State Database that tracks all changes to ports, files, processes, and permissions independently of the context window. This solves the Temporal Decay problem and allows multi-turn simulations. The system works in a five-stage loop:
* An Input Parser examines each command and decides whether it changes the state or only requests information.
* The Shadow State Database updates or queries the virtual environment, depending on the command.
* A RAG layer grounds outputs in man pages and tool documentation, which helps prevent the LLM from making things up.
* A Socratic Evaluator compares student progress against a Goal Tree and gives calibrated hints instead of answers.

A working prototype needs API access to a cutting-edge LLM, a vector store indexed with FAISS, and a goal tree authoring interface. It can run on a smartphone over low-bandwidth text connections. The risks are dealt with in the following ways:
* A default-deny policy for tool calls protects against injection.
* Prompt obfuscation is used against LLMmap fingerprinting.
* The RAG layer helps against hallucination.
* OpenTelemetry tracing is used against cost overruns.

Success will be measured in five ways:
* The accuracy of the simulated outputs.
* The cost per student, per month.
* The rate of task completion, measured with LSA cosine scoring.
* The time it takes to become competent, compared with cohorts using Virtual Machines (VMs).
* The latency of system responses, measured over low-bandwidth connections.

References
ACI Learning. (2025). ITPro: Plans and pricing for individuals. https://www.acilearning.com/individuals/pricing/
ACL Anthology. (2025). From capabilities to performance: Evaluating key functional properties of LLM architectures in penetration testing. https://aclanthology.org/2025.emnlp-main.802.pdf
Catota, F. E., Morgan, M. G., & Sicker, D. C. (2019). Cybersecurity education in a developing nation: the Ecuadorian environment. Journal of Cybersecurity, 5(1), tyz001. https://doi.org/10.1093/cybsec/tyz001
Graesser, A. C., et al. (2001). Intelligent tutoring systems with conversational dialogue. AI Magazine. https://www.researchgate.net/publication/220017573
Hao, S., et al. (2023). Reasoning with language model is planning with world model. Proceedings of EMNLP 2023. https://aclanthology.org/2023.emnlp-main.507/
Khan, A., & Mohamed, A. (2025). Optimizing cybersecurity education: a comparative study of on-premises and cloud-based lab environments using AWS EC2. Computers, 14(8), 297. https://doi.org/10.3390/computers14080297
LFAI & Data. (2025). Demo to production: An open source architecture for reliable AI agents. https://lfaidata.foundation/communityblog/2025/11/25/demo-to-production-an-open-source-architecture-for-reliable-ai-agents/
Liang et al. (2024). Evaluating LLM understanding via structured tabular decision simulations. arXiv. https://arxiv.org/abs/2511.10667
Pasquini, D., et al. (2025). LLMmap: Fingerprinting for large language models. Proceedings of the 34th USENIX Security Symposium. https://arxiv.org/html/2407.15847v4
Shi et al. (2025). Autonomous penetration testing: Solving capture-the-flag challenges with LLMs. arXiv. https://arxiv.org/html/2508.01054
U.S. Cyber Range. (2025). Pricing. https://www.uscyberrange.org/pricing/
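
Illustrative sketch
The Input Parser and Shadow State Database stages described in section 4 could look roughly like the Python sketch below. It is a minimal illustration under stated assumptions, not the team's implementation: the names ShadowState, classify, handle, and STATE_CHANGING are hypothetical, only three commands (touch, rm, chmod) are modeled, and the LLM, RAG, and Socratic Evaluator stages are left out. The point it shows is that state lives outside the conversation history, so it cannot be lost when the context window fills.

```python
import shlex
from dataclasses import dataclass, field

# Hypothetical list of commands this sketch treats as state-changing.
STATE_CHANGING = {"touch", "rm", "chmod"}


@dataclass
class ShadowState:
    """External record of the simulated machine, kept outside the LLM context."""

    files: dict[str, str] = field(default_factory=dict)  # path -> permission mode
    open_ports: set[int] = field(default_factory=set)

    def apply(self, argv: list[str]) -> None:
        """Apply a state-changing command to the virtual environment."""
        cmd, args = argv[0], argv[1:]
        if cmd == "touch":
            for path in args:
                self.files.setdefault(path, "644")
        elif cmd == "rm":
            for path in args:
                self.files.pop(path, None)
        elif cmd == "chmod" and len(args) >= 2:
            mode, *paths = args
            for path in paths:
                if path in self.files:
                    self.files[path] = mode

    def snapshot(self) -> str:
        """Compact text view of the state, meant to be placed in the LLM prompt."""
        listing = ", ".join(f"{p} ({m})" for p, m in sorted(self.files.items()))
        ports = ", ".join(str(p) for p in sorted(self.open_ports))
        return f"files: {listing or 'none'}; open ports: {ports or 'none'}"


def classify(command: str) -> tuple[str, list[str]]:
    """Input Parser: label a command as state-changing or informational."""
    argv = shlex.split(command)
    is_change = bool(argv) and argv[0] in STATE_CHANGING
    return ("state_changing" if is_change else "informational"), argv


def handle(command: str, state: ShadowState) -> str:
    """Update the shadow state if needed, then return the current snapshot."""
    kind, argv = classify(command)
    if kind == "state_changing":
        state.apply(argv)
    # In a full loop, this snapshot and retrieved documentation would be
    # passed to the LLM, which would render the terminal output.
    return state.snapshot()


state = ShadowState()
handle("touch notes.txt", state)
handle("chmod 600 notes.txt", state)
print(handle("ls -l", state))  # files: notes.txt (600); open ports: none
```

Because every turn rebuilds the prompt from the snapshot, the prompt size depends on the size of the simulated environment and not on the length of the session.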