Sreekar Reddy Edulapalli, ISM CyberRAG Portfolio

Opening stance

Get the measurement right first, then make it inspectable.

I came into the project pushing for the team to take measurement seriously before we built anything new. That one decision shaped the rest of it. The 100-question dataset, the fixed model and embeddings across sprints, the cross-sprint comparison, and the Pipeline Explorer that lets anyone look at how the system reached an answer without opening a notebook.

"I kept emphasising that our project should not just be technically interesting, but also measurable and clearly explainable across sprints. This influenced the way we framed the final project as an MLOps-style improvement story rather than just a one-time chatbot." AT1 Reflective Journal, weeks 1 to 3

Project snapshot

What ISM CyberRAG is, in a paragraph.

The Australian Information Security Manual is Australian government cyber security guidance published by the Australian Signals Directorate. Our corpus used 25 PDF chapters with 1,073 unique control IDs, structured by applicability levels and Essential Eight maturity. It is rigorous, and it is hard to use quickly. Users such as IRAP assessors, security architects, GRC analysts, and students need a faster way to reach the right controls without losing the source evidence.

ISM CyberRAG is a deployed natural-language interface to that corpus. A user asks a question. The system retrieves the controls that actually apply, cites them, and refuses politely when a question lands outside the corpus. Three sprints of measured iteration, one fixed evaluation set, one deployed product at a public URL.

Stack: Python · FastAPI · Supabase pgvector · Groq Llama 3.1 8B · nomic-embed-text · RAGAS · ClearML · Docker · HF Spaces
Live: esreekarreddy-ism-cyberrag.hf.space
Repo: studiobuilders/ism-cyberrag
Timeline: 13 weeks, 3 sprints, March to May 2026
Team: Sreekar · Chandan · Ruben
Course: AI Studio (AT5, 100 marks)

Problem framing

Why the ISM is hard to search, and what a RAG system helps with.

Rubric map: problem design + project design proposal

The lookup problem

Twenty-five PDFs, more than a thousand control IDs, applicability levels (Not Classified, Official, Protected, Secret, Top Secret), Essential Eight maturity tiers, revision numbers. The structure is precise. That is what makes it both useful for experts and slow for everyone else.

The people who use it (IRAP assessors, security architects, GRC analysts) often have to grep through the documents to find the right controls for a question that does not arrive in ISM vocabulary.

What we built, and what we did not

A natural-language interface that returns the right controls with citations, reducing the lookup friction. Three retrieval improvements across sprints. A two-stage guardrail that refuses out-of-scope questions politely. A deployed product anyone can use.

What I think the project contributes. The individual techniques are not new on their own. In our public search, what we did not find was an evaluated end-to-end ISM-specific system carrying this stack together: control-boundary chunking, hybrid search with cross-encoder reranking, multi-query expansion, a two-stage guardrail calibrated against real rerank-score data, and a streaming Pipeline Explorer over SSE. I would defend this as negative search evidence for our integration, not proof that nobody has built anything similar.

What this is not: a replacement for a qualified assessor, implementation advice, or a substitute for reading the ISM when a real decision is at stake. The framing is "make the document searchable", not "make the document optional".

Role rotation as a growth arc

Three sprints, three roles, three different lessons.

Every team member did Product Owner, Data Scientist, and Data Engineer once. I did them in that order. The interesting bit is not the role names. It is what each one taught me that I did not know going in.

Sprint 1 · Product Owner

Owning the measurement backbone

Lesson: under-scoped measurement costs more than under-scoped features. A feature can be reworked next sprint; a flaky measurement story is harder to recover from.

Sprint 2 · Data Scientist

Building the thing I had been measuring

Lesson: teammate validation feels slower in the moment but prevents a much more expensive fix after the report is written.

Sprint 3 · Data Engineer

Making it real and inspectable

Lesson: "CI green" is not the same as "live verified." Build status and live status are different signals; report them separately.

How the rotation worked for the whole team

Sprint	Product Owner	Data Engineer	Data Scientist
Sprint 1	Sreekar	Chandan	Ruben
Sprint 2	Chandan	Ruben	Sreekar
Sprint 3	Ruben	Sreekar	Chandan

Architecture

The seven-stage retrieval pipeline.

Rubric map: POC, V0.1 and V0.2 design, shown through the final Sprint 3 architecture

Pre-filter → query embedding → multi-query expansion → hybrid search (BM25 + pgvector via RRF) → cross-encoder reranking → rerank-score guardrail → LLM generation with cited ISM controls. One diagram, one pipeline, three sprints of improvements layered onto it.

ISM CyberRAG Sprint 3 high-level architecture diagram

Sprint 3 high-level design · seven pipeline stages with deployment surface

Personal contribution, by sprint

What I owned in each sprint.

Rubric map: POC implementation, V0.1 implementation, V0.2 implementation, and team-role reflection

I expanded the evaluation dataset from 20 to 100 questions after a week-6 tutor suggestion, structured across five difficulty categories, and manually verified every ground-truth answer against the source ISM PDFs, fixing seven inconsistencies during verification. That dataset is the measurement backbone for the entire project; every cross-sprint comparison uses the same questions.

evaluations/eval_questions.json: 100-question dataset, five categories, manually verified
src/evaluation.py: RAGAS wrapper, separate evaluation LLM, local-Ollama judge path
ClearML task c32673341b364cf78c52a12992a3a6e4: Sprint 1 baseline with parameter snapshot
notebooks/sprint1_poc.ipynb: runnable POC reproducing baseline end to end

"I am more comfortable with the evaluation and measurement side of the project than with the infrastructure side. Sprint 1 played to that strength." AT2 Reflective Journal

I rewrote the chunker to segment at ISM control boundaries (chunks dropped from 900 to 643, every chunk tied to a control_id), integrated a cross-encoder reranker (ms-marco-MiniLM-L-6-v2), and added answer_similarity because the standard RAGAS metrics can make correctly refused out-of-scope answers look worse than they are. I also tracked max_rerank_score per question; that calibration data became Sprint 3's rerank-threshold guardrail.

src/chunking.py: ISM-aware chunker, complete rewrite of Sprint 1's fixed-size approach
src/reranking.py: cross-encoder rerank module, ms-marco-MiniLM-L-6-v2
src/evaluation.py changes: answer_similarity metric, max_rerank_score tracking
notebooks/sprint2_development.ipynb: sections 3, 5, 8, 8.5, 9, chunking through to the Sprint 1 vs Sprint 2 comparison
docs/sprint-2/SPRINT2_PIPELINE_REPORT.md + ClearML task 379669d5c8ca47d083bce53ab9b815fc

"Splitting text at arbitrary character boundaries discards structural information that the document author built in. Parsing the actual control line format and using it to segment text produced more meaningful retrieval units, which is why context precision improved more than the other metrics." AT3 Reflective Journal

I delivered the deployment surface: Dockerfile (python:3.11-slim, CPU-only PyTorch wheels, pre-cached model weights), two GitHub Actions workflows (CI on every push, CD on merge to main), the Pipeline Explorer page driven by Server-Sent Events, the Evaluations dashboard, and the Supabase Row Level Security policy. The live URL is the evidence that this moved beyond a local demo: esreekarreddy-ism-cyberrag.hf.space.

Dockerfile: CPU-only torch wheels, pre-cached model weights, non-root UID 1000, uvicorn on :7860
.github/workflows/ci.yml + deploy.yml: lint, smoke import, Docker build; rsync to HF Space on merge
app/templates/pipeline.html + /pipeline/stream: seven SSE cards animate as the pipeline runs
app/templates/evaluations.html: three-sprint comparison + chart grid
database/sprint3_rls.sql: Row Level Security so the publishable key cannot write to the corpus

"The CI/CD workflows took shape through a sequence of small failures and fixes rather than through me sitting down and writing a complete workflow file in one go." AT4 Reflective Journal

Live product

What the deployed app looks like.

Rubric map: communication of project results + deployed V0.2 evidence

The deployed app has three tabs. Search ISM returns a cited answer alongside the retrieved chunk cards. The Pipeline Explorer streams the seven pipeline stages as they happen, so a non-technical viewer can audit the system without opening a notebook. The Evaluations tab embeds the cross-sprint comparison directly.

Search ISM tab. Cited answer on the left, retrieved chunk cards with rerank scores on the right.

Cited control IDs, applicability metadata, ISM source attribution per chunk

Measurement story

Three sprints on the same 100 questions, same generator setup, same judge.

Rubric map: project results, cross-sprint evidence, and evaluation methodology

0.83 Sprint 3 faithfulness

+0.10 vs Sprint 2 · +0.15 vs Sprint 1 baseline · target met

Metric	S1	S2	S3	S3 target	Result
Faithfulness	0.6834	0.7341	0.8351	> 0.78	Met (+0.10)
Answer Relevancy	0.7216	0.7678	0.9078	> 0.82	Met (+0.14)
Context Precision	0.7885	0.8598	0.8590	> 0.85	Met
Context Recall	0.8224	0.8659	0.9249	> 0.91	Met (+0.06)
Answer Similarity	n/a	0.9057	0.9179	> 0.93	Not met (−0.012)

sprint3_ragas_metrics: final five-metric snapshot

sprint3_guardrail_outcomes: two-stage refusal classification

sprint3_oos_threshold_calibration: rerank-score distribution drove the −5.0 cutoff

hpo_workflow_diagram: ClearML controller with child tasks

Before / after: the two highest-leverage retrieval changes

Sprint 1 · Fixed-size chunking

900 chunks at 1000 chars with 200-char overlap. Control boundaries get sliced through arbitrarily. Retrieval has to guess where the control starts.

Sprint 2 · ISM-aware chunking

643 chunks segmented at Control: ISM-XXXX; Revision: X boundaries. Each chunk maps to a logical unit with structured metadata (control_id, applicability, Essential Eight level).

Context precision
+7.1 pp
0.7885 → 0.8598

Sprint 1 · Cosine top-5

Pure vector retrieval over pgvector. Misses exact control IDs and specific terminology when the user types them verbatim.

Sprint 3 · Hybrid + rerank + multi-query

BM25 (GIN index) + pgvector merged via RRF at k=50, top 10 reranked by cross-encoder, multi-query expansion deduplicates by chunk ID across three variants.

Context recall
+10 pp
0.8224 → 0.9249

Why the cross-sprint comparison is defensible

Same 100 questions, same Llama 3.1 8B generator setup, same nomic-embed-text embeddings, same local-Ollama RAGAS judge across all three sprints. That makes the deltas more comparable than a one-off final score. I added answer_similarity in Sprint 2 because the standard RAGAS metrics can return NaN or 0.0 on correctly worded refusals, which makes out-of-scope handling hard to read from headline averages. The Sprint 3 evaluation fills correctly refused OOS rows to 1.0 on faithfulness, answer_relevancy and context_recall. We logged that in the release notes as a methodology choice, not buried as a metric tweak.

Three design decisions

Three calls we made, what we rejected, and why.

SSE for the Pipeline Explorer

REJECTED · WebSockets

Data flow is one-way (server to browser) and sequential. SSE works on plain HTTP without a protocol upgrade, and I used the Fetch streaming reader so the app could send the question as POST body and parse event frames directly.

Regex + rerank-score threshold

REJECTED · Trained classifier

It avoids an extra model call. The −5.0 cutoff is calibrated from Sprint 2's actual rerank-score distribution (OOS averaged −8.4, in-scope averaged positive), so the threshold is grounded in our data, not a blind hyperparameter guess.

Fill-to-1.0 on refusal NaN

REJECTED · Drop NaN rows

Preserves the denominator and avoids making correctly worded refusals look like failed answers. Logged as an explicit methodology choice in release notes rather than buried as a metric tweak.

HPO sweep

The sweep was inconclusive, and that is what we reported.

Late in Sprint 3 we wired a clearml.automation.HyperParameterOptimizer sweep over OOS_RERANK_THRESHOLD values of {−7, −6, −5, −4, −3} on a 30-question subset, with the composite objective averaged across the five RAGAS metrics.

The objective came out flat across the band, spread 0.026, with run-to-run variance of 0.028 between two identical −5.0 runs. The OOS block rate was 1.0 at every threshold because the pre-filter caught the OOS rows first.

The conclusion was not "−7.0 beats −5.0." It was "this sample size cannot distinguish thresholds in that range." We kept −5.0 because the Sprint 2 score distribution showed a narrow boundary around −4.75, and false-refusing hard in-scope ISM questions would be worse than letting the pre-filter handle obvious OOS cases.

What this told us

When the variance between identical runs is bigger than the spread across configurations, the sweep is not evidence. We wrote that into the audit trail and kept the threshold the Sprint 2 score distribution already pointed to.

Three moments that changed how I work

Three incidents that changed how I worked after them.

The pattern across these three incidents is the same: moving private work into inspectable work, in three different shapes. They are kept together because the lesson compounds across roles, not because the situations were similar.

Sprint 1 · PO · Week 6

Asking the team to trust a larger evaluation set

The tutor flagged 20 evaluation questions as too few for a defensible three-sprint comparison: per-category variance would dominate the headline numbers, and any cross-sprint deltas would be drowned out. The honest part of my response was admitting I had under-scoped the evaluation set during planning. I asked Chandan and Ruben to accept a larger measurement task close to the sprint deadline, took the work onto myself, expanded to 100 questions across five categories, and manually verified every ground-truth answer.

→ I stopped treating evaluation as a private PO task and started documenting dataset shape, assumptions, and known errors so the team could inspect them. I now treat the cost of an under-scoped measurement artefact as bigger than the cost of an under-scoped feature, because features can be reworked but a flaky measurement story is harder to recover from.

Sprint 2 · DS · Week 8

Accepting validation feedback on the chunker

The first ISM-aware regex worked on the sample controls I tested by hand, but Chandan's validation pass against the PDFs surfaced sections where the control-line format deviated from my assumed pattern. My first reaction was frustration; the validation was exactly what the role rotation was supposed to create.

→ I tightened the chunking rules, reran the evaluation, and described the edge cases in the pipeline report rather than pretending the first version was complete.

Sprint 3 · DE · Mid-sprint

"CI green, prod stale"

A deploy that passed CI but did not actually update the live Hugging Face Space. The rsync source path included its own destination (workflow looped until timeout); the HF README was missing required frontmatter. I had been treating green workflow status as evidence that the release was live.

→ I separated build status from live verification, used the live URL as the source of truth, and added smoke checks for one in-scope and one guardrail query before declaring a release ready.

What the rotation taught me about myself

Three patterns I noticed in how I work.

1

"Did it work?" comes more naturally to me than "how do I build it?"Sprint 1 played to that strength; Sprint 2 forced me to implement and was harder; Sprint 3 as Data Engineer brought me back closer to the "did it work?" frame, but with the deployment surface as the unit of work.

2

I am better at incremental debugging than greenfield design.From AT4: the CI/CD workflows took shape through a sequence of small failures and fixes, not through one complete file written from scratch. Fine when feedback is fast, has limits when it is not.

3

I tend to think too broadly at the start.From AT1: I get excited about many possible features, which makes the first version of an idea larger than what is realistic. Managed by deferring scope decisions to the sprint backlog rather than the planning meeting.

What I would do differently next time

→

Treat deployment as a parallel epic from day one.Almost every problem in Sprint 3's deployment work would have been a smaller problem a week earlier.

→

Define data contracts between teammates earlier.I assumed Chandan and Ruben understood the exact format I needed from retrieval for evaluation; we had to do back-and-forth to align. Being explicit about contracts earlier would have saved time.

→

Time-box when I can feel I am over-polishing.The Sprint 2 chunker edge cases took longer than they should have because I kept iterating instead of shipping a good-enough version and moving on.

Environmental + social impact

What we actually did to reduce the project footprint.

Rubric map: environmental impacts + social impacts, with citations

Environmental

LLM inference (Groq Llama 3.1 8B). Sticking with 8B not 70B; pre-filter blocks obvious off-topic before query expansion or final generation; rerank-threshold guardrail blocks weak-evidence before final generation; tight system-prompt instructions hold token count down. Sources: Patterson et al. (2021); Luccioni et al. (2023).
Embedding + pgvector search. Small embedding model (nomic-embed-text, 137M parameters); one-off corpus embedding at ingestion not per query; HNSW index for sub-linear search. Sources: Strubell et al. (2019); Schwartz et al. (2020).
Deployment + CI/CD. HF Space on CPU Basic not GPU; short CI workflow (lint, smoke import, Docker build); cached model weights in the image so cold starts do not redownload them. Source: Lannelongue et al. (2021).

Social

Wider access to ISM guidance. A student or generalist IT worker can ask a plain question and receive a cited answer, without needing to know the document structure. Limit: reduces lookup friction, does not replace a qualified assessor. Source: Australian Signals Directorate (2025).
Over-reliance risk. Fluent answers invite over-trust in security decisions. Response: system prompt requires citations from retrieved context, two-stage guardrail refuses out-of-scope questions, Pipeline Explorer exposes the path from query to evidence to answer. Sources: Bender et al. (2021); Bucinca et al. (2021).
Accessibility of technical language. Natural-language search lets a generalist ask first and inspect the cited ISM control second. Limit: the app answers in English only. Sources: Australian Government Style Manual (n.d.); Australian Signals Directorate (2025).

MLOps maturity

CI, CD, experiment tracking, HPO audit trail.

CI

.github/workflows/ci.yml: lint, smoke import, Docker build on every push

CD

.github/workflows/deploy.yml: rsync to HF Space git remote on merge to main, evaluation PNGs copied into image

Experiment tracking

ClearML tasks for S1, S2, S3 with parameter snapshots, scalar series, per-question DataFrames as artefacts

HPO audit trail

ClearML controller + child tasks over OOS_RERANK_THRESHOLD; outputs in evaluations/sprint-3/hpo/

Tech stack

What the project runs on.

Python 3.11 FastAPI Supabase pgvector Groq · Llama 3.1 8B nomic-embed-text v1.5 ms-marco-MiniLM-L-6-v2 RAGAS ClearML Docker GitHub Actions Hugging Face Spaces Server-Sent Events BM25 + RRF HNSW index Jinja2 Ollama (local judge)

Skills snapshot

Each skill linked to the file or artefact that demonstrates it.

RAG pipeline designsrc/retrieval.py · src/reranking.py

Document-aware chunkingsrc/chunking.py

Hybrid search (BM25 + vector)Supabase hybrid_search() RPC

Cross-encoder rerankingms-marco-MiniLM-L-6-v2 integration

Evaluation methodology100-question dataset + answer_similarity

Experiment trackingClearML tasks across three sprints

Hyperparameter optimisationclearml.automation.HyperParameterOptimizer sweep

ContainerisationDockerfile, CPU-only torch, model pre-cache

CI/CD.github/workflows/ci.yml + deploy.yml

Cloud deploymentHF Spaces Docker SDK, port 7860

Streaming HTTP (SSE)/pipeline/stream + Fetch ReadableStream

Database securitySupabase RLS · database/sprint3_rls.sql

Prompt engineeringcitation-required system prompt + two-stage guardrail

Frontend (vanilla)FastAPI + Jinja2 + vanilla CSS, no JS framework

Technical writingSPRINT2_PIPELINE_REPORT, deployment guide, release notes

Product ownershipSprint 1 PRD, 100-question dataset scoping

Honest limits + what's next

Known limits.

Cold-start latency on HF CPU Basic. First request after idle takes longer because the container has to warm up. Mitigated for the defence demo by warming the container five minutes before, not eliminated.
English only. We write in Australian English to match the ISM. The app does not handle other languages.
Answer similarity short of target by 0.012. 0.9179 against a 0.93 target. We did not over-fit to close the gap.
Single-domain corpus. The system is calibrated for the ISM. Generalisation to other security frameworks (NIST, ISO 27001) would require re-chunking, re-embedding, and re-tuning the guardrail.
HPO sample size. The 30-question subset was too small to distinguish OOS_RERANK_THRESHOLD values in the {−7..−3} band. Larger sweep would be the natural next step.

Evidence ledger

Tickets, artefacts, identifiers, all in one place.

Sprint	Role	Tickets	Pts	Key artefacts	ClearML
Sprint 1	Product Owner	RAG-8, 9, 10, 11, 12, 25, 28, 29, 30, 33	12 + 2	eval_questions.json · src/evaluation.py · sprint1_poc.ipynb	c32673…
Sprint 2	Data Scientist	RAG-37, 38, 39, 40, 41, 42, 43	17	src/chunking.py · src/reranking.py · sprint2_development.ipynb · SPRINT2_PIPELINE_REPORT.md	379669…
Sprint 3	Data Engineer	RAG-71, 72, 73, 74, 75, 76	14	Dockerfile · workflows/ci.yml · workflows/deploy.yml · pipeline.html · evaluations.html · sprint3_rls.sql	sprint-3/

Evidence links

Live system	esreekarreddy-ism-cyberrag.hf.space
GitHub repository	github.com/studiobuilders/ism-cyberrag
Sprint 1 ClearML task	c32673341b364cf78c52a12992a3a6e4
Sprint 2 ClearML task	379669d5c8ca47d083bce53ab9b815fc
Sprint 3 evaluation outputs	evaluations/sprint-3/
Sprint 3 HPO outputs	evaluations/sprint-3/hpo/

References

The literature this portfolio relies on.

Australian Signals Directorate. (2025). Information security manual (December 2025). Cyber.gov.au. https://www.cyber.gov.au/ism
Australian Government Style Manual. (n.d.). Clear language and writing style. https://www.stylemanual.gov.au/writing-and-designing-content/clear-language-and-writing-style
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623). https://doi.org/10.1145/3442188.3445922
Bucinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1), Article 188. https://doi.org/10.1145/3449287
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated evaluation of retrieval augmented generation. arXiv. https://arxiv.org/abs/2309.15217
Lannelongue, L., Grealey, J., & Inouye, M. (2021). Green algorithms: Quantifying the carbon footprint of computation. Advanced Science, 8(12), Article 2100707. https://doi.org/10.1002/advs.202100707
Luccioni, A. S., Viguier, S., & Ligozat, A.-L. (2023). Estimating the carbon footprint of BLOOM, a 176B parameter language model. Journal of Machine Learning Research, 24(253), 1-15. https://www.jmlr.org/papers/v24/23-0069.html
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 220-229). https://doi.org/10.1145/3287560.3287596
Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., & Dean, J. (2021). Carbon emissions and large neural network training. arXiv. https://arxiv.org/abs/2104.10350
Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM, 63(12), 54-63. https://doi.org/10.1145/3381831
Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3645-3650). https://doi.org/10.18653/v1/P19-1355