# Backend Code Generation Model - Setup & Usage Guide


## Installation & Setup


### 1. Install Dependencies
```bash
pip install torch transformers datasets pandas numpy aiohttp requests
pip install accelerate  # For faster training
```


### 2. Set Environment Variables
```bash
# Optional: GitHub token for collecting real repositories
export GITHUB_TOKEN="your_github_token_here"

# For GPU training (if available)
export CUDA_VISIBLE_DEVICES=0
```
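
In the Python pipeline itself, the token can be read from this environment variable instead of being hard-coded. A minimal sketch, assuming the `github_token` config key used in the Quick Start below (a missing token is assumed to simply disable GitHub collection):

```python
import os

# Pull the optional GitHub token from the environment (assumption: the pipeline
# skips GitHub collection when the token is None).
config = {
    'base_model': 'microsoft/DialoGPT-medium',
    'output_dir': './models/backend_code_model',
    'github_token': os.environ.get('GITHUB_TOKEN'),
}
```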


### 3. Directory Structure
```
backend-ai-trainer/
├── training_pipeline.py        # Main pipeline code
├── data/
│   ├── raw_dataset.json        # Collected training data
│   └── processed/              # Preprocessed data
├── models/
│   ├── backend_code_model/     # Trained model output
│   └── checkpoints/            # Training checkpoints
└── evaluation/
    ├── test_cases.json         # Test scenarios
    └── results/                # Evaluation results
```
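
If the pipeline does not create these folders on first run, a small helper (a sketch, not part of `training_pipeline.py`) can set up the layout:

```python
from pathlib import Path

# Create the directory layout shown above; safe to re-run.
for folder in [
    'data/processed',
    'models/backend_code_model',
    'models/checkpoints',
    'evaluation/results',
]:
    Path(folder).mkdir(parents=True, exist_ok=True)
```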


## Quick Start


### Option A: Full Automated Pipeline
```python
import asyncio
from training_pipeline import TrainingPipeline

config = {
    'base_model': 'microsoft/DialoGPT-medium',
    'output_dir': './models/backend_code_model',
    'github_token': 'your_token_here',  # Optional
}

pipeline = TrainingPipeline(config)
asyncio.run(pipeline.run_full_pipeline())
```


### Option B: Step-by-Step Execution


#### Step 1: Collect Training Data
```python
from training_pipeline import DataCollector
import asyncio

collector = DataCollector()

# Collect from GitHub (requires token)
github_queries = [
    'express api backend',
    'fastapi python backend',
    'django rest api',
    'nodejs backend server',
    'flask api backend'
]

asyncio.run(collector.collect_github_repositories(github_queries, max_repos=100))

# Generate synthetic examples
collector.generate_synthetic_examples(count=500)

# Save dataset
collector.save_dataset('training_data.json')
```


#### Step 2: Preprocess Data
```python
from training_pipeline import DataPreprocessor

preprocessor = DataPreprocessor()
processed_examples = preprocessor.preprocess_examples(collector.collected_examples)
training_dataset = preprocessor.create_training_dataset(processed_examples)

print(f"Created dataset with {len(training_dataset)} examples")
```


#### Step 3: Train Model
```python
from training_pipeline import CodeGenerationModel

model = CodeGenerationModel('microsoft/DialoGPT-medium')
model.fine_tune(training_dataset, output_dir='./trained_model')
```


#### Step 4: Generate Code
```python
# Generate a complete backend application
generated_code = model.generate_code(
    description="E-commerce API with user authentication and product management",
    framework="fastapi",
    language="python"
)

print("Generated Backend Application:")
print("=" * 50)
print(generated_code)
```


## Training Configuration Options


### Model Selection
```python
# Lightweight for testing
config['base_model'] = 'microsoft/DialoGPT-small'

# Balanced performance
config['base_model'] = 'microsoft/DialoGPT-medium'

# High quality (requires more resources)
config['base_model'] = 'microsoft/DialoGPT-large'
```
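
If you are unsure whether your hardware can handle a given base model, you can check its parameter count before committing to training. A quick sketch using the standard `transformers` loading API:

```python
from transformers import AutoModelForCausalLM

# Load the candidate base model and report its size as a rough resource estimate.
candidate = AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-medium')
print(f"Parameters: {candidate.num_parameters() / 1e6:.0f}M")
```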


### Training Parameters
```python
training_config = {
    'num_epochs': 5,          # More epochs improve fit but risk overfitting
    'batch_size': 4,          # Adjust based on GPU memory
    'learning_rate': 5e-5,    # Conservative learning rate
    'max_length': 2048,       # Maximum token length
    'warmup_steps': 500,      # Learning rate warmup
    'save_steps': 1000,       # Checkpoint frequency
}
```
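
Under the hood, fine-tuning with `transformers` typically maps these values onto a `TrainingArguments` object. A hedged sketch of that mapping (the exact keys `CodeGenerationModel.fine_tune` expects may differ):

```python
from transformers import TrainingArguments

# Translate the dictionary above into Hugging Face TrainingArguments
# (illustrative only; the pipeline may do this internally).
training_args = TrainingArguments(
    output_dir='./models/checkpoints',
    num_train_epochs=training_config['num_epochs'],
    per_device_train_batch_size=training_config['batch_size'],
    learning_rate=training_config['learning_rate'],
    warmup_steps=training_config['warmup_steps'],
    save_steps=training_config['save_steps'],
)
```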


### Framework Coverage
The pipeline supports these backend frameworks:


**Node.js Frameworks:**
- Express.js - Most popular Node.js framework
- NestJS - Enterprise-grade framework
- Koa.js - Lightweight alternative


**Python Frameworks:**
- FastAPI - Modern, high-performance API framework
- Django - Full-featured web framework
- Flask - Lightweight and flexible


**Go Frameworks:**
- Gin - HTTP web framework
- Fiber - Express-inspired framework
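
To confirm coverage after training, you can ask the model for the same application in each supported framework and inspect the output. A small sketch using the `generate_code` call shown earlier (the framework/language pairs are examples, not an exhaustive list):

```python
# Generate the same app across frameworks to spot-check coverage.
targets = [
    ('express', 'javascript'),
    ('nestjs', 'javascript'),
    ('fastapi', 'python'),
    ('django', 'python'),
    ('gin', 'go'),
]

for framework, language in targets:
    code = model.generate_code(
        description="Simple notes API with CRUD endpoints",
        framework=framework,
        language=language,
    )
    print(f"--- {framework} ({language}) ---\n{code[:300]}\n")
```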


## Evaluation & Testing


### Automatic Quality Assessment
```python
from training_pipeline import ModelEvaluator

evaluator = ModelEvaluator()

# Test specific code generation
generated_code = model.generate_code(
    description="User authentication API with JWT tokens",
    framework="express",
    language="javascript"
)

# Get quality scores
quality_scores = evaluator.evaluate_code_quality(generated_code, "javascript")
print(f"Syntax Correctness: {quality_scores['syntax_correctness']:.2f}")
print(f"Completeness: {quality_scores['completeness']:.2f}")
print(f"Best Practices: {quality_scores['best_practices']:.2f}")
```


### Comprehensive Benchmarking
```python
test_cases = [
    {
        'description': 'REST API for task management with user authentication',
        'framework': 'express',
        'language': 'javascript'
    },
    {
        'description': 'GraphQL API for social media platform',
        'framework': 'fastapi',
        'language': 'python'
    },
    {
        'description': 'Microservice for payment processing',
        'framework': 'gin',
        'language': 'go'
    }
]

benchmark_results = evaluator.benchmark_model(model, test_cases)
print("Overall Performance:", benchmark_results)
```


## Advanced Usage


### Custom Data Sources
```python
from training_pipeline import CodeExample

# Add your own training examples
custom_examples = [
    {
        'description': 'Custom API requirement',
        'requirements': ['Custom feature 1', 'Custom feature 2'],
        'framework': 'fastapi',
        'language': 'python',
        'code_files': {
            'main.py': '# Your custom code here',
            'requirements.txt': 'fastapi\nuvicorn'
        }
    }
]

# Add to training data
collector.collected_examples.extend([CodeExample(**ex) for ex in custom_examples])
```


### Fine-tuning on Specific Domains
```python
# Focus training on specific application types
domain_specific_queries = [
    'microservices architecture',
    'api gateway implementation',
    'database orm integration',
    'authentication middleware',
    'rate limiting api'
]

asyncio.run(collector.collect_github_repositories(domain_specific_queries))
```


### Export Trained Model
```python
# Save model for deployment
model.model.save_pretrained('./production_model')
model.tokenizer.save_pretrained('./production_model')

# Load for inference
from transformers import AutoModelForCausalLM, AutoTokenizer

production_model = AutoModelForCausalLM.from_pretrained('./production_model')
production_tokenizer = AutoTokenizer.from_pretrained('./production_model')
```
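
Once loaded, the exported model can be queried directly with the standard `transformers` generation API, bypassing the pipeline wrapper. A minimal sketch (the prompt format is an assumption; match whatever template `CodeGenerationModel` used during training):

```python
# Hypothetical prompt format -- use the same template the pipeline trained with.
prompt = "Description: User authentication API with JWT tokens\nFramework: express\nCode:"

inputs = production_tokenizer(prompt, return_tensors="pt")
outputs = production_model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=production_tokenizer.eos_token_id,
)
print(production_tokenizer.decode(outputs[0], skip_special_tokens=True))
```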


## Troubleshooting


### Common Issues


**1. Out of Memory Errors**
```python
# Reduce batch size
config['per_device_train_batch_size'] = 1
config['gradient_accumulation_steps'] = 4

# Use gradient checkpointing
config['gradient_checkpointing'] = True
```


**2. Slow Training**
```python
# Enable mixed precision (if GPU supports it)
config['fp16'] = True

# Speed up data loading with more worker processes
config['dataloader_num_workers'] = 4
```


**3. Poor Code Quality**
```python
# Increase training data diversity
collector.generate_synthetic_examples(count=1000)

# Extend training duration
config['num_train_epochs'] = 10
```


### Performance Optimization


**For CPU Training:**
```python
config['dataloader_pin_memory'] = False
config['per_device_train_batch_size'] = 1
```


**For GPU Training:**
```python
config['fp16'] = True
config['dataloader_pin_memory'] = True
config['per_device_train_batch_size'] = 4
```
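
You can also select between the two profiles automatically based on the hardware PyTorch detects. A small sketch:

```python
import torch

# Pick the CPU or GPU profile depending on available hardware.
if torch.cuda.is_available():
    config.update({
        'fp16': True,
        'dataloader_pin_memory': True,
        'per_device_train_batch_size': 4,
    })
else:
    config.update({
        'fp16': False,
        'dataloader_pin_memory': False,
        'per_device_train_batch_size': 1,
    })
```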


## Expected Results


After training on ~500-1000 examples, you should expect:


- **Syntax Correctness**: 85-95%
- **Code Completeness**: 80-90%
- **Best Practices**: 70-85%
- **Framework Coverage**: All major Node.js and Python frameworks
- **Generation Speed**: 2-5 seconds per application


## Continuous Improvement


### Regular Retraining
```python
# Schedule weekly data collection
import asyncio
import time

import schedule

def update_training_data():
    asyncio.run(collector.collect_github_repositories(['new backend trends']))

schedule.every().week.do(update_training_data)

# schedule only fires jobs while a loop keeps polling for pending work
while True:
    schedule.run_pending()
    time.sleep(3600)
```


### A/B Testing Different Models
```python
models_to_compare = [
    'microsoft/DialoGPT-medium',
    'microsoft/DialoGPT-large',
    'gpt2-medium'
]

for base_model in models_to_compare:
    model = CodeGenerationModel(base_model)
    results = evaluator.benchmark_model(model, test_cases)
    print(f"{base_model}: {results}")
```


## Next Steps


1. **Start Small**: Begin with synthetic data and 100-200 examples
2. **Add Real Data**: Integrate GitHub repositories gradually
3. **Evaluate Regularly**: Monitor quality metrics after each training session
4. **Expand Frameworks**: Add support for new frameworks as needed
5. **Production Deploy**: Export model for API deployment (see the serving sketch below)
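
For the deployment step, one common pattern is to wrap the exported model in a small HTTP service. A minimal sketch using FastAPI (the endpoint name and request shape are illustrative, not part of the pipeline; it assumes a trained `CodeGenerationModel` is loaded as `model` at startup):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    description: str
    framework: str = "fastapi"
    language: str = "python"

@app.post("/generate")
def generate(req: GenerateRequest):
    # Delegate to the trained model loaded elsewhere in this module.
    code = model.generate_code(
        description=req.description,
        framework=req.framework,
        language=req.language,
    )
    return {"code": code}
```

Run it with `uvicorn serve:app --port 8000` (assuming the snippet lives in `serve.py`).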


This pipeline provides a complete foundation for building your own backend code generation AI. The modular design allows you to customize and extend each component based on your specific needs.