Create AI Azure Architect

This file provides information on Data Curation:

  1. Create dataset
    This step crawls and scrapes the data sources.
    A Jupyter notebook with steps to scrape data from five different Azure resources sites. It uses Firecrawl for the scraping. The API kept failing, so I ended up scheduling the crawls and downloading the results directly from the Firecrawl site. The files were zipped and uploaded to HF at vicpada/AzureResources.
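A minimal sketch of what the crawling step could look like, assuming the Firecrawl Python SDK (`FirecrawlApp`, `crawl_url`) and a `FIRECRAWL_API_KEY` environment variable; the notebook ultimately downloaded finished crawls from the Firecrawl site instead, so the API call here is illustrative only. The filename helper is a hypothetical addition for saving pages to disk.

```python
import hashlib
import os
import re

def page_filename(url: str) -> str:
    """Derive a stable, filesystem-safe filename for a crawled page.

    The slug keeps the URL readable; the short hash keeps it unique.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")[:80]
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}.md"

# Hypothetical crawl driver -- only runs when an API key is configured.
# The exact Firecrawl SDK surface may differ by version, and the start
# URL below is an assumption, not one of the five actual sites.
if os.environ.get("FIRECRAWL_API_KEY"):
    from firecrawl import FirecrawlApp  # pip install firecrawl-py

    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
    result = app.crawl_url("https://learn.microsoft.com/azure/architecture/")
    for page in result.get("data", []):
        with open(page_filename(page["metadata"]["sourceURL"]), "w") as fh:
            fh.write(page.get("markdown", ""))
```

The deterministic filenames mean re-running a crawl overwrites pages in place instead of duplicating them.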

  2. Process files
    This step prepares the JSONL files.
    A Jupyter notebook with the steps to download the zip file and create the JSONL files, one per site. For each Firecrawl entry it extracts a title, cleans the content, adds the token count, skips very small or extremely large files, and generates a deterministic id that we can use for evaluation. In the next step, the whole document is also included as context for documents with a token count under 8,000.

  3. Add context
    This step creates the chunks, adds context, and saves them into PKL files.
    A Jupyter notebook to download the JSONL files and build the documents for our RAG application. It uses SentenceSplitter with a chunk size of 800 and no overlap. It also adds a short context to each chunk to situate it within the overall document, improving search retrieval of the chunk. For each JSONL file a PKL file is created and uploaded to HF. It uses gpt-4.1-nano to generate the chunk context, both for cost savings and to finish processing in a timely manner. One site from the data sources was excluded because processing all of it was getting very expensive.

  4. Finetune embedding
    This step finetunes the embedding model.
    A Jupyter notebook with steps to download the PKL files and train an embedding model. Only 10,000 chunks were used, for time and cost savings, as there were more than 400,000 nodes. generate_qa_embedding_pairs is used to generate the training and validation data. The base embedding model is local:BAAI/bge-small-en-v1.5. Hit Rate and MRR were used for evaluation, and the finetuned model is uploaded to HF. Note: finetuning and running the embedding model locally is much cheaper than using an online service; with a large number of nodes and queries, it makes a big difference in the time it takes to process the nodes.
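Picking 10,000 of the 400,000+ nodes is worth doing with a fixed seed so the subset is reproducible; a small sketch (the seed value is an arbitrary assumption, not from the notebook):

```python
import random

def sample_nodes(node_ids: list[str], k: int = 10_000, seed: int = 42) -> list[str]:
    """Deterministically sample k node ids for finetuning.

    Seeding makes the 10,000-node subset reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.sample(node_ids, min(k, len(node_ids)))

# The sampled nodes would then feed generate_qa_embedding_pairs to build
# the train/validation sets used to finetune BAAI/bge-small-en-v1.5.
```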

  5. Create test data
    Create test data for evaluating the model.
    A Jupyter notebook with steps to download the embedding model and the PKL files, then use generate_question_context_pairs to generate evaluation data. It uses Gemini-2.0-flash to generate the eval data, and the generated JSON file is uploaded to HF.
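A sketch of how the generated (question, node) pairs could be assembled into the eval JSON. The queries/corpus/relevant_docs layout below mirrors the shape llama_index's retrieval evaluators consume, but the notebook's exact schema is an assumption; the Gemini call itself is omitted.

```python
import json

def build_eval_dataset(qa_pairs):
    """Assemble (question, node_id, node_text) triples into a
    queries/corpus/relevant_docs layout (layout assumed)."""
    queries, corpus, relevant_docs = {}, {}, {}
    for i, (question, node_id, node_text) in enumerate(qa_pairs):
        qid = f"q{i}"
        queries[qid] = question
        corpus[node_id] = node_text
        relevant_docs[qid] = [node_id]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}

def save_eval_dataset(path, dataset):
    """Serialize the eval dataset to the JSON file uploaded to HF."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(dataset, fh, ensure_ascii=False, indent=2)
```

Keeping node ids deterministic (from step 2) is what lets `relevant_docs` stay valid across pipeline reruns.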

  6. Create Vector
    Create the VectorDB and evaluate it.
    A Jupyter notebook with steps to generate the VectorDB using the local embedding model. It calculates the "hit_rate", "mrr", "precision", "recall", "ap" and "ndcg" metrics. The vector store is saved as a PKL file and uploaded to HF.

  7. Reranking
    Evaluate Cohere Rerank.
    A Jupyter notebook with steps to download the VectorDB, the embedding model, and the evaluation dataset. It then evaluates the retrieval results with and without Cohere reranking, setting the path for the final solution.
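A sketch of the rerank-and-compare idea: first-stage candidates from the VectorDB are reordered by relevance scores, and the same metrics are computed on both orderings. The Cohere call is guarded behind an API key and the model name and query are assumptions; `apply_rerank` itself is a hypothetical helper, not the notebook's code.

```python
import os

def apply_rerank(candidates: list[dict], scores: list[float], top_n: int = 5) -> list[dict]:
    """Reorder first-stage retrieval candidates by rerank scores.

    `candidates` are {"id": ..., "text": ...} dicts from the VectorDB;
    `scores` holds one relevance score per candidate.
    """
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]

# Illustrative candidates standing in for a real retriever's output.
candidates = [
    {"id": "n1", "text": "AKS node pools and VM sizing guidance."},
    {"id": "n2", "text": "Choosing between AKS and Container Apps."},
    {"id": "n3", "text": "AKS cluster autoscaler configuration."},
]

# Hypothetical Cohere call -- only runs when an API key is configured,
# and the model name is an assumption.
if os.environ.get("COHERE_API_KEY"):
    import cohere

    co = cohere.Client(os.environ["COHERE_API_KEY"])
    response = co.rerank(model="rerank-english-v3.0",
                         query="How do I size an AKS cluster?",
                         documents=[c["text"] for c in candidates],
                         top_n=2)
```

Running the step-6 metrics on the pre- and post-rerank orderings is what tells you whether Cohere belongs in the final pipeline.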


LLM Pipeline