
📁 Utils Directory Guide - Format_Resume.py Focus

🎯 REQUIRED FILES for Format_Resume.py (10 out of 11 files)

Format_Resume.py uses OpenAI GPT-4o as its primary extractor and HF Cloud as its backup. These are the files it depends on:

utils/
├── 🎯 CORE EXTRACTION SYSTEM (Format_Resume.py dependencies)
│   ├── hybrid_extractor.py      # ⭐ REQUIRED - Main orchestrator (direct import)
│   ├── openai_extractor.py      # ⭐ REQUIRED - OpenAI GPT-4o (PRIMARY method)
│   ├── hf_cloud_extractor.py    # ⭐ REQUIRED - HF Cloud API (BACKUP method)
│   ├── ai_extractor.py          # ⭐ REQUIRED - Alternative HF AI (fallback)
│   ├── hf_extractor_simple.py   # ⭐ REQUIRED - Simple HF (fallback)
│   └── extractor_fixed.py       # ⭐ REQUIRED - Regex fallback (last resort)
│
├── 🏗️ DOCUMENT PROCESSING (Format_Resume.py dependencies)
│   ├── builder.py               # ⭐ REQUIRED - Resume document generation with header/footer preservation
│   └── parser.py                # ⭐ REQUIRED - PDF/DOCX text extraction (direct import)
│
└── 📊 REFERENCE DATA (Required for fallback system)
    └── data/                    # ⭐ REQUIRED - Used by extractor_fixed.py fallback
        ├── job_titles.json      # ⭐ REQUIRED - Job title patterns for regex extraction
        └── skills.json          # ⭐ REQUIRED - Skills matching for spaCy extraction

🔗 Dependency Chain for Format_Resume.py

pages/Format_Resume.py
├── utils/hybrid_extractor.py (DIRECT IMPORT - orchestrator)
│   ├── utils/openai_extractor.py (PRIMARY GPT-4o - best accuracy)
│   ├── utils/hf_cloud_extractor.py (BACKUP - good accuracy)
│   ├── utils/ai_extractor.py (alternative backup)
│   ├── utils/hf_extractor_simple.py (simple backup)
│   └── utils/extractor_fixed.py (regex fallback) → uses data/job_titles.json & data/skills.json
├── utils/builder.py (DIRECT IMPORT - document generation with template preservation)
└── utils/parser.py (DIRECT IMPORT - file parsing)

🎯 File Purposes for Format_Resume.py

βœ… REQUIRED - Core Extraction System

File Purpose When Used Priority
hybrid_extractor.py Main entry point - orchestrates all extraction methods Always (Format_Resume.py imports this) πŸ”΄ CRITICAL
openai_extractor.py PRIMARY AI - OpenAI GPT-4o extraction with contact info When use_openai=True (best results) 🟠 PRIMARY
hf_cloud_extractor.py BACKUP AI - Hugging Face Cloud API extraction When OpenAI fails or unavailable 🟑 BACKUP
ai_extractor.py Alternative AI - HF AI models extraction Alternative backup method 🟒 FALLBACK
hf_extractor_simple.py Simple AI - Simplified local processing When cloud APIs fail 🟒 FALLBACK
extractor_fixed.py Reliable fallback - Regex-based extraction with spaCy When all AI methods fail πŸ”΅ LAST RESORT

✅ REQUIRED - Document Processing

| File | Purpose | When Used | Priority |
|------|---------|-----------|----------|
| builder.py | Document generation - Creates formatted Word docs with preserved headers/footers | Always (Format_Resume.py imports this) | 🔴 CRITICAL |
| parser.py | File parsing - Extracts raw text from PDF/DOCX files | Always (Format_Resume.py imports this) | 🔴 CRITICAL |

✅ REQUIRED - Reference Data

| File | Purpose | When Used | Priority |
|------|---------|-----------|----------|
| data/job_titles.json | Job title patterns - Used by extractor_fixed.py for regex matching | When all AI methods fail (fallback) | 🟡 BACKUP |
| data/skills.json | Skills database - Used by extractor_fixed.py for spaCy skill matching | When all AI methods fail (fallback) | 🟡 BACKUP |
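
As a rough illustration of how a regex tier could use these two reference files, the sketch below matches job titles and skills against plain term lists. The JSON layout (a flat list of strings per file) and the helper names `build_pattern` and `match_terms` are assumptions made for this example; the real extractor_fixed.py, including its spaCy-based skill matching, may work differently.

```python
import json
import re


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any term as a whole word."""
    # Longest terms first so "Senior Software Engineer" wins over "Software Engineer".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def match_terms(text, terms):
    """Return the known terms found in text, in order of first appearance."""
    if not terms:
        return []
    canonical = {term.lower(): term for term in terms}
    found = []
    for hit in build_pattern(terms).finditer(text):
        term = canonical[hit.group(0).lower()]
        if term not in found:
            found.append(term)
    return found


# Inline stand-ins for data/job_titles.json and data/skills.json (assumed flat lists).
job_titles = json.loads('["Software Engineer", "Senior Software Engineer", "Data Analyst"]')
skills = json.loads('["Python", "Java", "React", "C++"]')

resume_text = "John Doe\nSenior Software Engineer\nPython, Java, React"
print(match_terms(resume_text, job_titles))
print(match_terms(resume_text, skills))
```

The whole-word lookarounds keep "Java" from matching inside "JavaScript", which is one common source of noisy skill lists.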

❌ NOT NEEDED - Other Features

File Purpose Why Not Needed
screening.py Resume evaluation, scoring, candidate screening Used by TalentLens.py, not Format_Resume.py

🚀 Format_Resume.py Extraction Flow

1. User uploads resume → parser.py extracts raw text
2. hybrid_extractor.py orchestrates extraction:
   ├── Try openai_extractor.py (PRIMARY GPT-4o - best accuracy)
   ├── If fails → Try hf_cloud_extractor.py (BACKUP - good accuracy)
   ├── If fails → Try ai_extractor.py (alternative backup)
   ├── If fails → Try hf_extractor_simple.py (simple backup)
   └── If all fail → Use extractor_fixed.py (regex fallback) → uses data/*.json
3. builder.py generates formatted Word document with preserved template headers/footers
4. User downloads formatted resume with Qvell branding and proper formatting
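
The tiered flow in step 2 can be sketched as a loop over extractors that stops at the first usable result. This is a minimal illustration, not the actual hybrid_extractor.py: `extract_with_fallback`, `failing_api`, and `regex_tier` are hypothetical names, and the real orchestrator may decide success differently.

```python
def extract_with_fallback(text, extractors):
    """Try each (name, extractor) pair in order and return the first non-empty result."""
    errors = {}
    for name, extractor in extractors:
        try:
            result = extractor(text)
        except Exception as exc:  # any failure moves on to the next tier
            errors[name] = str(exc)
            continue
        if result:
            # Record which tier produced the data, as the test script later reads it.
            return {**result, "extraction_method": name}
        errors[name] = "empty result"
    return {"extraction_method": "none", "errors": errors}


# Hypothetical stand-ins for the real extractor modules.
def failing_api(text):
    raise RuntimeError("API key missing")


def regex_tier(text):
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {"name": lines[0]} if lines else {}


tiers = [("openai", failing_api), ("hf_cloud", failing_api), ("regex", regex_tier)]
result = extract_with_fallback("John Doe\nSoftware Engineer", tiers)
print(result["extraction_method"])
```

Collecting the per-tier errors instead of discarding them makes it easier to see why a cheaper tier ended up being used.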

🏗️ Document Builder Enhancements

builder.py has been enhanced to preserve the template when generating documents:

Header/Footer Preservation

  • ✅ Preserves Qvell logo and branding in header
  • ✅ Maintains footer address (6001 Tain Dr. Suite 203, Dublin, OH, 43016)
  • ✅ Eliminates blank pages by clearing only body content
  • ✅ Preserves image references to prevent broken images

Content Generation Features

  • ✅ Professional Summary extraction and formatting
  • ✅ Skills table with 3-column layout
  • ✅ Professional Experience with job titles, companies, dates
  • ✅ Career Timeline chronological job history
  • ✅ Education and Training sections
  • ✅ Proper date formatting (e.g., "February 2017 – Present")
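
Two of these features are easy to illustrate in isolation: the date range format and the 3-column skills layout. The helpers below (`format_date_range`, `to_rows`) are hypothetical and only show the shape of the output; they are not taken from builder.py.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def format_date_range(start, end=None):
    """Format a job's dates like 'February 2017 – Present' (no end date means current)."""
    def label(d):
        return f"{MONTHS[d.month - 1]} {d.year}"

    return f"{label(start)} – {label(end) if end else 'Present'}"


def to_rows(skills, columns=3):
    """Split a flat skill list into table rows, padding the last row with blanks."""
    rows = [skills[i:i + columns] for i in range(0, len(skills), columns)]
    if rows and len(rows[-1]) < columns:
        rows[-1] = rows[-1] + [""] * (columns - len(rows[-1]))
    return rows


print(format_date_range(date(2017, 2, 1)))
print(to_rows(["Python", "Java", "React", "SQL"]))
```

A fixed month list is used instead of `strftime("%B")` so the output does not change with the system locale.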

📊 File Usage Statistics

  • Total utils files: 11
  • Required for Format_Resume.py: 10 files (91%)
  • Not needed for Format_Resume.py: 1 file (9%)

🧹 Cleanup Recommendations

If you want to minimize the utils folder for Format_Resume.py only:

Keep These 10 Files:

utils/
├── hybrid_extractor.py      # Main orchestrator
├── openai_extractor.py      # OpenAI GPT-4o (primary)
├── hf_cloud_extractor.py    # HF Cloud (backup)
├── ai_extractor.py          # HF AI (fallback)
├── hf_extractor_simple.py   # Simple HF (fallback)
├── extractor_fixed.py       # Regex (last resort)
├── builder.py               # Document generation with template preservation
├── parser.py                # File parsing
└── data/
    ├── job_titles.json      # Job title patterns for regex fallback
    └── skills.json          # Skills database for spaCy fallback

Can Remove This 1 File (if only using Format_Resume.py):

utils/
└── screening.py             # Only used by TalentLens.py

💡 Best Practices for Format_Resume.py

  1. Always use hybrid_extractor.py as your main entry point
  2. Set environment variables for best results:
    • OPENAI_API_KEY for OpenAI GPT-4o (primary)
    • HF_API_TOKEN for Hugging Face Cloud (backup)
  3. Use this configuration in Format_Resume.py:
    data = extract_resume_sections(
        resume_text, 
        prefer_ai=True, 
        use_openai=True,      # Try OpenAI GPT-4o first (best results)
        use_hf_cloud=True     # Fallback to HF Cloud (good backup)
    )
    
  4. Template preservation is automatic - headers and footers are maintained
  5. The fallback system means extraction still returns a result when the AI tiers fail
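
Points 2 and 3 can be combined by deriving the flags from whichever credentials are set. This is a sketch under assumptions: `extraction_flags` is a hypothetical helper, and it assumes that passing `use_openai=False` or `use_hf_cloud=False` makes the orchestrator skip that tier.

```python
import os


def extraction_flags(env=None):
    """Build keyword arguments for extract_resume_sections from available credentials."""
    env = os.environ if env is None else env
    return {
        "prefer_ai": True,  # same value as the configuration shown in this guide
        "use_openai": bool(env.get("OPENAI_API_KEY")),
        "use_hf_cloud": bool(env.get("HF_API_TOKEN")),
    }


flags = extraction_flags()
print(flags)
# Intended use inside Format_Resume.py (not executed here):
# data = extract_resume_sections(resume_text, **flags)
```

An empty variable is treated the same as a missing one, so a blank `OPENAI_API_KEY` does not trigger a doomed API call.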

🔧 Recent System Improvements

Header/Footer Preservation (Latest Fix)

  • Problem: Template headers and footers were being lost during document generation
  • Solution: Conservative content clearing that preserves document structure
  • Result: Qvell branding and footer address now properly maintained

Extraction Quality Enhancements

  • OpenAI GPT-4o Integration: Primary extraction method with structured prompts
  • Contact Info Extraction: Automatic email, phone, LinkedIn detection
  • Skills Cleaning: Improved filtering to remove company names and broken fragments
  • Experience Structuring: Better job title, company, and date extraction
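
For a sense of what contact detection involves, here is a small regex-based sketch. It is illustrative only: the patterns and the `extract_contact_info` helper are assumptions, cover common US-style phone formats, and are not the prompts or code used by openai_extractor.py.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
LINKEDIN_RE = re.compile(
    r"(?:https?://)?(?:www\.)?linkedin\.com/in/[A-Za-z0-9_-]+", re.IGNORECASE
)


def extract_contact_info(text):
    """Return the first email, phone number, and LinkedIn URL found, or None for each."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(0) if match else None

    return {
        "email": first(EMAIL_RE),
        "phone": first(PHONE_RE),
        "linkedin": first(LINKEDIN_RE),
    }


sample = "John Doe\njohn.doe@example.com | (614) 555-0123 | linkedin.com/in/johndoe"
print(extract_contact_info(sample))
```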

Fallback System Reliability

  • JSON Dependencies: job_titles.json and skills.json required for regex fallback
  • Quality Validation: Each extraction method is validated before acceptance
  • Graceful Degradation: System never fails completely, always produces output
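
The quality validation step might look something like the following. The required keys and the `is_acceptable` and `first_acceptable` helpers are hypothetical; the actual checks in hybrid_extractor.py are not documented here and may use different criteria.

```python
REQUIRED_SECTIONS = ("name", "skills", "experience")  # hypothetical result schema


def is_acceptable(result, min_skills=1):
    """Accept a result only if every required section is present and non-empty."""
    if not isinstance(result, dict):
        return False
    if any(not result.get(key) for key in REQUIRED_SECTIONS):
        return False
    return len(result["skills"]) >= min_skills


def first_acceptable(candidates):
    """Return the first (method, result) pair that passes validation, else None."""
    for method, result in candidates:
        if is_acceptable(result):
            return method, result
    return None


candidates = [
    ("openai", None),                                  # call failed
    ("hf_cloud", {"name": "John Doe", "skills": []}),  # incomplete result
    ("regex", {"name": "John Doe", "skills": ["Python"], "experience": ["Software Engineer"]}),
]
print(first_acceptable(candidates))
```

Validating before acceptance means a tier that returns a partial answer is treated like a failure, so the next tier still gets a chance.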

🧪 Testing Format_Resume.py Dependencies

# Test all required components for Format_Resume.py
from utils.hybrid_extractor import extract_resume_sections, HybridResumeExtractor
from utils.builder import build_resume_from_data
from utils.parser import parse_resume

# Test extraction with all fallbacks
sample_text = "John Doe\nSoftware Engineer\nPython, Java, React"
result = extract_resume_sections(sample_text, prefer_ai=True, use_openai=True, use_hf_cloud=True)

# Test document building with template preservation
template_path = "templates/blank_resume.docx"
doc = build_resume_from_data(template_path, result)

print("✅ All Format_Resume.py dependencies working!")
print(f"✅ Extraction method used: {result.get('extraction_method', 'unknown')}")
print(f"✅ Headers/footers preserved: {len(doc.sections)} sections")

🎯 System Architecture Summary

The Format_Resume.py system now provides:

  1. Robust Extraction: 5-tier fallback system (OpenAI β†’ HF Cloud β†’ HF AI β†’ HF Simple β†’ Regex)
  2. Template Preservation: Headers, footers, and branding maintained perfectly
  3. Quality Assurance: Each extraction method validated for completeness
  4. Professional Output: Properly formatted Word documents with consistent styling
  5. Reliability: System never fails completely, always produces usable output

The utils directory analysis shows 10 out of 11 files are needed for Format_Resume.py functionality! 🎯

Recent improvements ensure perfect template preservation and reliable extraction quality. ✨