Add ROCm 6.2 VRAM optimization for AMD GPUs (8-16GB)

This commit adds comprehensive ROCm 6.2 support and VRAM optimization for AMD GPUs, specifically targeting systems with 8-16GB VRAM. Changes: - Updated webui.sh to use ROCm 6.2 instead of 5.7 for AMD GPUs - Added webui-user-rocm62.sh: Optimized launch script with: * PyTorch ROCm 6.2 installation command * PYTORCH_HIP_ALLOC_CONF for memory fragmentation prevention * Optimized command-line flags (--medvram, --opt-split-attention, etc.) * Detailed inline documentation - Added ROCM_VRAM_OPTIMIZATION.md: Comprehensive 400+ line guide covering: * Launch configuration and environment variables * WebUI settings optimization * Generation settings for different VRAM amounts * ControlNet optimization techniques * Recommended workflows for quality and performance * Extensive troubleshooting section * Performance benchmarks - Added README_ROCM.md: Quick start guide for ROCm setup Key optimizations: - Memory fragmentation prevention via expandable_segments - Optimal command-line arguments for 16GB VRAM - Two-phase workflow (generate at 512x512, upscale separately) - ControlNet low VRAM mode configuration - Batch processing best practices Benefits: - Prevents OOM errors on 16GB VRAM GPUs - Improved stability for long generation sessions - Better quality outputs through optimized workflows - Faster iteration with recommended settings
2026-01-31 04:42:09 -08:00 · 2025-11-15 05:02:29 +00:00 · 2025-11-15 05:02:29 +00:00 · 5e65b0e6a9
commit 5e65b0e6a9
parent 82a973c043
4 changed files with 963 additions and 2 deletions
--- a/README_ROCM.md
+++ b/README_ROCM.md
@ -0,0 +1,295 @@
+# ROCm Setup Guide for Stable Diffusion WebUI
+
+This guide helps you set up and optimize Stable Diffusion WebUI for AMD GPUs using ROCm 6.2.
+
+## Quick Start
+
+### 1. Copy the Optimized Launch Configuration
+
+```bash
+cp webui-user-rocm62.sh webui-user.sh
+```
+
+### 2. Launch the WebUI
+
+```bash
+./webui.sh
+```
+
+The launch script will automatically:
+- Install PyTorch with ROCm 6.2 support
+- Configure optimal VRAM settings for 16GB GPUs
+- Set up memory management to prevent fragmentation
+
+### 3. Configure WebUI Settings
+
+After the WebUI starts, navigate to **Settings → Optimizations** and configure:
+
+- **Cross attention optimization:** `Doggettx` (default)
+- **Enable quantization in K samplers:** ✓ Enabled
+- **Token merging ratio:** `0.5`
+
+See the [VRAM Optimization Guide](ROCM_VRAM_OPTIMIZATION.md) for detailed configuration instructions.
+
+---
+
+## System Requirements
+
+### Supported AMD GPUs
+
+- **RX 6000 Series** (Navi 2): RX 6700 XT, 6800, 6800 XT, 6900 XT
+- **RX 7000 Series** (Navi 3): RX 7600, 7700 XT, 7800 XT, 7900 XT, 7900 XTX
+- **RX 5000 Series** (Navi 1): RX 5700 XT (with limitations)
+
+### Recommended VRAM
+
+- **Minimum:** 8GB VRAM
+- **Recommended:** 16GB VRAM
+- **Optimal:** 24GB VRAM
+
+### Software Requirements
+
+- **ROCm:** 6.2 or newer
+- **Python:** 3.10 or 3.11
+- **Linux:** Ubuntu 22.04, Fedora 38+, or Arch Linux
+
+---
+
+## Installation
+
+### Option 1: Automatic Setup (Recommended)
+
+The `webui.sh` script automatically detects AMD GPUs and installs ROCm 6.2 support.
+
+```bash
+# Clone the repository (if not already done)
+git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
+cd stable-diffusion-webui
+
+# Copy the optimized configuration
+cp webui-user-rocm62.sh webui-user.sh
+
+# Launch (will install dependencies automatically)
+./webui.sh
+```
+
+### Option 2: Manual Setup
+
+If you need manual control over the installation:
+
+```bash
+# Create and activate virtual environment
+python3 -m venv venv
+source venv/bin/activate
+
+# Install PyTorch with ROCm 6.2
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
+
+# Install Stable Diffusion WebUI requirements
+pip install -r requirements_versions.txt
+
+# Set environment variables
+export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
+
+# Launch with optimized flags
+python launch.py --skip-torch-cuda-test --medvram --opt-split-attention --no-half-vae
+```
+
+---
+
+## Configuration Files
+
+### `webui-user-rocm62.sh`
+
+Pre-configured launch script with optimal settings for AMD GPUs with 16GB VRAM.
+
+**Key settings:**
+- PyTorch with ROCm 6.2
+- Memory fragmentation prevention
+- VRAM-optimized command-line arguments
+
+### `ROCM_VRAM_OPTIMIZATION.md`
+
+Comprehensive guide covering:
+- WebUI settings optimization
+- Generation settings for different VRAM amounts
+- ControlNet optimization
+- Workflows for best quality
+- Troubleshooting common issues
+
+---
+
+## Command-Line Arguments Explained
+
+The optimized configuration uses these flags:
+
+```bash
+--skip-torch-cuda-test    # Skip CUDA test (we're using ROCm/HIP)
+--medvram                 # Optimized for 8-16GB VRAM
+--opt-split-attention     # Reduces VRAM usage during attention
+--no-half-vae             # Prevents VAE errors with full precision
+```
+
+### For Different VRAM Amounts
+
+**16GB VRAM (Recommended):**
+```bash
+--skip-torch-cuda-test --medvram --opt-split-attention --no-half-vae
+```
+
+**8GB VRAM:**
+```bash
+--skip-torch-cuda-test --lowvram --opt-split-attention --no-half-vae
+```
+
+**6GB VRAM or less:**
+```bash
+--skip-torch-cuda-test --lowvram --opt-split-attention --no-half-vae --opt-channelslast
+```
+
+---
+
+## Recommended Generation Settings
+
+### For 16GB VRAM
+
+**Safe Mode (Fast, No Errors):**
+- Resolution: 512x512
+- Hires fix: OFF
+- Batch size: 1
+- VRAM usage: ~4-6GB
+
+**Quality Mode (Best Results):**
+- Resolution: 512x512
+- Hires fix: ON (1.5x upscale)
+- Hires steps: 10
+- VRAM usage: ~8-12GB
+
+**With ControlNet:**
+- Resolution: 512x512
+- Hires fix: OFF
+- ControlNet units: 1-2 maximum
+- Low VRAM mode: ON
+- VRAM usage: ~6-10GB
+
+See the [VRAM Optimization Guide](ROCM_VRAM_OPTIMIZATION.md) for detailed workflows and settings.
+
+---
+
+## Verification
+
+### Check PyTorch ROCm Installation
+
+```bash
+source venv/bin/activate
+python -c "import torch; print('ROCm available:', torch.cuda.is_available()); print('ROCm version:', torch.version.hip)"
+```
+
+**Expected output:**
+```
+ROCm available: True
+ROCm version: 6.2.x
+```
+
+### Monitor VRAM Usage
+
+```bash
+watch -n 1 rocm-smi
+```
+
+Or check current usage:
+```bash
+rocm-smi --showmeminfo vram
+```
+
+---
+
+## Troubleshooting
+
+### Out of Memory Errors
+
+If you encounter OOM errors:
+
+1. **Reduce resolution:** 768x768 → 512x512
+2. **Disable Hires fix** or reduce upscale ratio
+3. **Use more aggressive flags:**
+   ```bash
+   export COMMANDLINE_ARGS="--skip-torch-cuda-test --lowvram --opt-split-attention --no-half-vae"
+   ```
+
+### Black Images or Artifacts
+
+Ensure `--no-half-vae` is in your command-line arguments.
+
+### Slow Generation
+
+- Use `--medvram` instead of `--lowvram` for 16GB VRAM
+- Reduce sampling steps to 20
+- Try faster samplers: DPM++ 2M, Euler a
+
+### Model Loading Errors
+
+Verify PyTorch installation:
+```bash
+source venv/bin/activate
+python -c "import torch; print(torch.cuda.is_available())"
+```
+
+If it returns `False`, reinstall PyTorch:
+```bash
+pip uninstall torch torchvision torchaudio
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
+```
+
+For more troubleshooting, see the [VRAM Optimization Guide](ROCM_VRAM_OPTIMIZATION.md#troubleshooting).
+
+---
+
+## Performance Tips
+
+1. **Two-Phase Workflow:**
+   - Generate at 512x512 without Hires fix (fast)
+   - Upscale separately using img2img or Extras tab (best quality)
+
+2. **ControlNet Best Practices:**
+   - Use only 1-2 units at a time
+   - Enable Low VRAM mode
+   - Disable Hires fix when using ControlNet
+
+3. **Batch Processing:**
+   - Use `Batch count` instead of `Batch size`
+   - Keep resolution at 512x512 for batches
+
+4. **Memory Management:**
+   - Restart WebUI after 50-100 generations
+   - Use "Unload SD checkpoint" when switching models
+
+---
+
+## Additional Resources
+
+- **[VRAM Optimization Guide](ROCM_VRAM_OPTIMIZATION.md)** - Comprehensive optimization guide
+- **[ROCm Documentation](https://rocm.docs.amd.com/)** - Official AMD ROCm docs
+- **[SD WebUI Wiki](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki)** - Official WebUI documentation
+- **[AMD GPU Support](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs)** - WebUI AMD GPU guide
+
+---
+
+## Summary
+
+✅ **Key Points:**
+- Use ROCm 6.2 for best compatibility
+- Enable `expandable_segments:True` to prevent memory fragmentation
+- Use `--medvram` for 16GB VRAM
+- Start with 512x512, upscale separately for quality
+- Enable ControlNet Low VRAM mode
+
+❌ **Avoid:**
+- Batch size > 1 (use Batch count instead)
+- Hires fix with 2x upscale on 16GB VRAM
+- More than 2 ControlNet units simultaneously
+- Direct generation at resolutions > 768x768
+
+---
+
+**For detailed configuration and workflows, see [ROCM_VRAM_OPTIMIZATION.md](ROCM_VRAM_OPTIMIZATION.md)**
--- a/ROCM_VRAM_OPTIMIZATION.md
+++ b/ROCM_VRAM_OPTIMIZATION.md
@ -0,0 +1,539 @@
+# ROCm 6.2 VRAM Optimization Guide for AMD GPUs
+
+This guide provides comprehensive instructions for optimizing Stable Diffusion WebUI on AMD GPUs with ROCm 6.2, specifically targeting systems with 8-16GB VRAM.
+
+## Table of Contents
+
+1. [Quick Start](#quick-start)
+2. [Launch Configuration](#launch-configuration)
+3. [WebUI Settings Optimization](#webui-settings-optimization)
+4. [Generation Settings](#generation-settings)
+5. [ControlNet Optimization](#controlnet-optimization)
+6. [Recommended Workflows](#recommended-workflows)
+7. [Troubleshooting](#troubleshooting)
+
+---
+
+## Quick Start
+
+### 1. Setup Launch Script
+
+Copy the optimized ROCm 6.2 configuration:
+
+```bash
+cp webui-user-rocm62.sh webui-user.sh
+```
+
+### 2. Launch WebUI
+
+```bash
+./webui.sh
+```
+
+### 3. Configure WebUI Settings
+
+Navigate to **Settings → Optimizations** and apply the recommended settings (see below).
+
+---
+
+## Launch Configuration
+
+### Environment Variables
+
+The following environment variables are set in `webui-user-rocm62.sh`:
+
+#### PyTorch with ROCm 6.2
+
+```bash
+export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2"
+```
+
+**Purpose:** Installs PyTorch compiled with ROCm 6.2 support for AMD GPUs.
+
+#### HIP Memory Allocation
+
+```bash
+export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
+```
+
+**Purpose:** Prevents memory fragmentation, which is critical for stable VRAM usage and avoiding OOM (Out of Memory) errors.
+
+### Command Line Arguments
+
+```bash
+export COMMANDLINE_ARGS="--skip-torch-cuda-test --medvram --opt-split-attention --no-half-vae"
+```
+
+#### Flag Explanations
+
+| Flag | Purpose | VRAM Impact |
+|------|---------|-------------|
+| `--skip-torch-cuda-test` | Skip CUDA test (using ROCm/HIP instead) | N/A |
+| `--medvram` | Optimized for 8-16GB VRAM, moves models between GPU/CPU as needed | **High** - Critical for 16GB |
+| `--opt-split-attention` | Reduces VRAM usage during attention computation | **Medium** - Saves 1-2GB |
+| `--no-half-vae` | Uses full precision for VAE to prevent errors | **Low** - Prevents artifacts |
+
+#### Alternative Configurations
+
+**For GPUs with less than 8GB VRAM:**
+
+```bash
+export COMMANDLINE_ARGS="--skip-torch-cuda-test --lowvram --opt-split-attention --no-half-vae"
+```
+
+**For maximum compatibility (slower but most stable):**
+
+```bash
+export COMMANDLINE_ARGS="--skip-torch-cuda-test --lowvram --opt-split-attention --no-half-vae --opt-channelslast"
+```
+
+**With xformers (if installed):**
+
+```bash
+export COMMANDLINE_ARGS="--skip-torch-cuda-test --medvram --xformers --no-half-vae"
+```
+
+---
+
+## WebUI Settings Optimization
+
+Navigate to **Settings → Optimizations** in the WebUI and configure:
+
+### Recommended Settings
+
+| Setting | Value | Notes |
+|---------|-------|-------|
+| **Cross attention optimization** | `Doggettx` or `xformers` | Doggettx is default and works well |
+| **Enable quantization in K samplers** | ✓ Enabled | Reduces VRAM usage |
+| **Token merging ratio** | `0.5` | Merges similar tokens to save memory |
+| **Pad prompt/negative prompt** | ✓ Enabled | Recommended for consistency |
+
+### Optional Advanced Settings
+
+| Setting | Value | Effect |
+|---------|-------|--------|
+| **Token merging ratio for hires** | `0.5` | Saves VRAM during hires fix |
+| **Always discard next-to-last sigma** | ✓ Enabled | Minor VRAM savings |
+
+---
+
+## Generation Settings
+
+### Safe Mode (No Errors, 16GB VRAM)
+
+**Best for:** Testing prompts, finding good seeds, general use
+
+```
+Width: 512
+Height: 512
+Batch count: 1
+Batch size: 1
+Hires fix: ☐ Disabled
+Sampling steps: 20-30
+```
+
+**VRAM Usage:** ~4-6GB
+
+---
+
+### Quality Mode (With Hires Fix, 16GB VRAM)
+
+**Best for:** Final high-quality outputs
+
+```
+Width: 512
+Height: 512
+Batch count: 1
+Batch size: 1
+Hires fix: ✓ Enabled
+  Upscale by: 1.5 (avoid 2.0 on 16GB)
+  Hires steps: 10
+  Denoising strength: 0.4
+  Upscaler: Latent or R-ESRGAN 4x+
+Sampling steps: 20-30
+```
+
+**VRAM Usage:** ~8-12GB
+
+**⚠️ Warning:** Using `Upscale by: 2.0` may cause OOM errors on 16GB VRAM.
+
+---
+
+### Portrait Mode (512x768)
+
+**Best for:** Character portraits
+
+```
+Width: 512
+Height: 768
+Batch count: 1
+Batch size: 1
+Hires fix: ☐ Disabled (enable only after finding good seed)
+Sampling steps: 20-30
+```
+
+**VRAM Usage:** ~6-8GB
+
+---
+
+### Landscape Mode (768x512)
+
+**Best for:** Scenery, backgrounds
+
+```
+Width: 768
+Height: 512
+Batch count: 1
+Batch size: 1
+Hires fix: ☐ Disabled (enable only after finding good seed)
+Sampling steps: 20-30
+```
+
+**VRAM Usage:** ~6-8GB
+
+---
+
+## ControlNet Optimization
+
+When using ControlNet extensions, additional VRAM optimizations are necessary.
+
+### ControlNet Settings
+
+Navigate to **Settings → ControlNet** and configure:
+
+| Setting | Value | Purpose |
+|---------|-------|---------|
+| **Low VRAM mode** | ✓ Enabled | Critical for 16GB VRAM |
+| **Pixel Perfect** | ☐ Disabled | Disable during testing to save VRAM |
+| **Control Mode** | `Balanced` | Default, good balance |
+
+### Recommended Usage
+
+```
+Width: 512
+Height: 512
+Hires fix: ☐ Disabled
+Active ControlNet units: 1-2 maximum
+Batch size: 1
+```
+
+**⚠️ Warning:** Using 3+ ControlNet units simultaneously may cause OOM errors.
+
+### ControlNet with Hires Fix
+
+**Not recommended for 16GB VRAM.** If necessary:
+
+```
+Width: 512
+Height: 512
+Hires fix: ✓ Enabled
+  Upscale by: 1.25 (minimum upscale)
+  Hires steps: 5 (minimum steps)
+Active ControlNet units: 1 maximum
+Low VRAM mode: ✓ Enabled
+```
+
+---
+
+## Recommended Workflows
+
+### Workflow 1: Prompt Development (Fast)
+
+**Goal:** Find the perfect prompt and seed quickly
+
+1. **Settings:**
+   - Size: 512x512
+   - Hires fix: OFF
+   - Steps: 20
+   - Batch count: 4-8 (generate multiple images)
+
+2. **Process:**
+   - Experiment with different prompts
+   - Test various seeds
+   - Adjust CFG scale and sampling method
+
+3. **VRAM Usage:** ~4-6GB per image
+
+---
+
+### Workflow 2: High-Quality Output (Two-Phase)
+
+**Goal:** Maximum quality without VRAM errors
+
+#### Phase 1: Generation
+
+```
+Size: 512x512
+Hires fix: OFF
+Steps: 30-40
+Sampler: DPM++ 2M Karras or Euler a
+CFG Scale: 7-8
+```
+
+**Find your perfect image** with the right prompt, seed, and composition.
+
+#### Phase 2: Upscaling
+
+**Option A: Using img2img**
+
+1. Send image to img2img
+2. Settings:
+   - Resize to: 1024x1024 or 768x1152
+   - Denoising: 0.3-0.5
+   - Steps: 20-30
+   - Sampler: Same as generation
+
+**Option B: Using Extras Tab**
+
+1. Send to Extras
+2. Upscaler: R-ESRGAN 4x+ or 4x-UltraSharp
+3. Scale: 2x or 4x
+4. Optional: GFPGAN or CodeFormer for face restoration
+
+**VRAM Usage:** Phase 1: ~4-6GB, Phase 2: ~6-10GB (depends on final resolution)
+
+---
+
+### Workflow 3: ControlNet Generation
+
+**Goal:** Use ControlNet without VRAM errors
+
+1. **Initial Setup:**
+   - Size: 512x512
+   - Hires fix: OFF
+   - ControlNet units: 1-2 maximum
+   - Low VRAM: ON
+
+2. **Generate base image:**
+   - Steps: 20-30
+   - Find good composition
+
+3. **Upscale separately:**
+   - Use img2img without ControlNet
+   - Or use Extras tab
+
+**VRAM Usage:** ~6-10GB (depends on ControlNet type)
+
+---
+
+### Workflow 4: Batch Processing
+
+**Goal:** Generate multiple images efficiently
+
+**Small batches (recommended):**
+
+```
+Size: 512x512
+Batch count: 4
+Batch size: 1
+Hires fix: OFF
+```
+
+**VRAM Usage:** ~4-6GB per image (sequential)
+
+**⚠️ Avoid:**
+- `Batch size > 1` (generates simultaneously, uses much more VRAM)
+- Hires fix with batch processing
+
+---
+
+## Troubleshooting
+
+### Issue: Out of Memory (OOM) Errors
+
+**Symptoms:**
+```
+RuntimeError: HIP out of memory
+```
+
+**Solutions:**
+
+1. **Reduce image resolution:**
+   - 768x768 → 512x512
+   - 512x768 → 512x512
+
+2. **Disable Hires fix or reduce upscale:**
+   - Turn OFF Hires fix
+   - Or change `Upscale by: 2.0` → `1.5` or `1.25`
+
+3. **Use more aggressive VRAM flags:**
+   ```bash
+   export COMMANDLINE_ARGS="--skip-torch-cuda-test --lowvram --opt-split-attention --no-half-vae"
+   ```
+
+4. **Reduce ControlNet units:**
+   - Use only 1 ControlNet unit
+   - Ensure Low VRAM mode is enabled
+
+5. **Close other applications:**
+   - Close browsers, games, or other GPU-intensive apps
+   - Check `rocm-smi` to see VRAM usage
+
+---
+
+### Issue: Slow Generation Speed
+
+**Symptoms:**
+- Images take very long to generate
+- System feels sluggish
+
+**Solutions:**
+
+1. **Check if you're using the right flags:**
+   - Use `--medvram` not `--lowvram` for 16GB VRAM
+   - `--lowvram` is slower but uses less VRAM
+
+2. **Reduce sampling steps:**
+   - Try 20 steps instead of 40-50
+   - Use faster samplers: DPM++ 2M, Euler a
+
+3. **Disable Token Merging:**
+   - Settings → Optimizations → Token merging ratio: 0
+   - Token merging saves VRAM but may slow down generation
+
+4. **Check PyTorch installation:**
+   ```bash
+   python -c "import torch; print(torch.version.hip)"
+   ```
+   Should output ROCm version (e.g., `6.2.x`)
+
+---
+
+### Issue: Black Images or Artifacts
+
+**Symptoms:**
+- Generated images are black
+- Strange artifacts or noise
+
+**Solutions:**
+
+1. **Enable `--no-half-vae`:**
+   ```bash
+   export COMMANDLINE_ARGS="--skip-torch-cuda-test --medvram --opt-split-attention --no-half-vae"
+   ```
+
+2. **Try different VAE:**
+   - Settings → Stable Diffusion → SD VAE
+   - Select `None` or try a different VAE
+
+3. **Check cross attention optimization:**
+   - Settings → Optimizations → Cross attention optimization
+   - Try `Doggettx`, `sub-quadratic`, or `none`
+
+---
+
+### Issue: Model Loading Errors
+
+**Symptoms:**
+```
+Error loading model
+Couldn't load model
+```
+
+**Solutions:**
+
+1. **Verify PyTorch ROCm installation:**
+   ```bash
+   source venv/bin/activate
+   python -c "import torch; print(torch.cuda.is_available()); print(torch.version.hip)"
+   ```
+   Should output: `True` and ROCm version
+
+2. **Reinstall PyTorch with ROCm 6.2:**
+   ```bash
+   source venv/bin/activate
+   pip uninstall torch torchvision torchaudio
+   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
+   ```
+
+3. **Check model file integrity:**
+   - Re-download the model
+   - Verify SHA256 hash if available
+
+---
+
+### Issue: Memory Fragmentation
+
+**Symptoms:**
+- VRAM usage increases over time
+- OOM errors after multiple generations
+
+**Solutions:**
+
+1. **Ensure expandable segments is enabled:**
+   ```bash
+   export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
+   ```
+
+2. **Restart the WebUI periodically:**
+   - After 50-100 generations, restart the WebUI
+
+3. **Use the "Unload SD checkpoint" button:**
+   - Settings → Actions → Unload SD checkpoint to free VRAM
+   - Useful when switching between models
+
+---
+
+### Checking VRAM Usage
+
+**Monitor VRAM in real-time:**
+
+```bash
+watch -n 1 rocm-smi
+```
+
+**Check current VRAM usage:**
+
+```bash
+rocm-smi --showmeminfo vram
+```
+
+---
+
+## Performance Benchmarks
+
+Approximate generation times on AMD RX 6800/6900 XT (16GB VRAM):
+
+| Configuration | Resolution | Hires Fix | Steps | Time |
+|---------------|------------|-----------|-------|------|
+| Safe Mode | 512x512 | No | 20 | ~8-12s |
+| Safe Mode | 512x512 | No | 30 | ~12-18s |
+| Quality Mode | 512x512 → 768x768 | Yes (1.5x) | 20+10 | ~20-30s |
+| Quality Mode | 512x512 → 1024x1024 | Yes (2x) | 20+10 | ~35-50s |
+| Portrait | 512x768 | No | 20 | ~12-16s |
+| ControlNet | 512x512 | No | 20 | ~15-25s |
+
+*Times may vary based on model, sampler, and prompt complexity.*
+
+---
+
+## Additional Resources
+
+- **ROCm Documentation:** https://rocm.docs.amd.com/
+- **Stable Diffusion WebUI Wiki:** https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki
+- **AMD GPU Support:** https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs
+
+---
+
+## Summary of Key Points
+
+✅ **DO:**
+- Use `--medvram` for 16GB VRAM
+- Enable `expandable_segments:True` to prevent fragmentation
+- Start with 512x512 resolution
+- Use Hires fix with `1.5x` upscale maximum
+- Enable ControlNet Low VRAM mode
+- Generate at low resolution, upscale separately for best quality
+
+❌ **DON'T:**
+- Use `Batch size > 1` (use `Batch count` instead)
+- Use `Upscale by: 2.0` with Hires fix on 16GB VRAM
+- Enable 3+ ControlNet units simultaneously
+- Generate at 1024x1024 or higher directly
+- Forget to set `--no-half-vae` (prevents VAE errors)
+
+---
+
+**Last Updated:** 2025-11-15
+**ROCm Version:** 6.2
+**Target VRAM:** 8-16GB
--- a/webui-user-rocm62.sh
+++ b/webui-user-rocm62.sh
@ -0,0 +1,127 @@
+#!/bin/bash
+##########################################################################################
+# ROCm 6.2 Optimized Launch Script for AMD GPUs with 16GB VRAM
+# Based on best practices for VRAM optimization and memory management
+##########################################################################################
+
+# Install directory without trailing slash
+#install_dir="/home/$(whoami)"
+
+# Name of the subdirectory
+#clone_dir="stable-diffusion-webui"
+
+# ============================================================================
+# ROCm 6.2 PyTorch Installation
+# ============================================================================
+# Install PyTorch with ROCm 6.2 support
+export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2"
+
+# ============================================================================
+# PyTorch HIP Memory Allocation Configuration
+# ============================================================================
+# Prevents memory fragmentation - CRITICAL for stable VRAM usage
+export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
+
+# ============================================================================
+# Command Line Arguments for VRAM Optimization (16GB VRAM)
+# ============================================================================
+# Explanation of flags:
+#   --skip-torch-cuda-test   : Skip CUDA test (we're using ROCm/HIP)
+#   --medvram                : Optimized for 8-16GB VRAM, moves models between GPU/CPU as needed
+#   --opt-split-attention    : Reduces VRAM usage during attention computation
+#   --no-half-vae            : Prevents VAE errors by using full precision for VAE
+#
+# Additional optional flags for extreme VRAM savings (uncomment if needed):
+#   --lowvram                : For GPUs with <8GB VRAM (use instead of --medvram)
+#   --xformers               : Use xformers for additional memory optimization (requires installation)
+#   --opt-sdp-attention      : Alternative attention optimization
+export COMMANDLINE_ARGS="--skip-torch-cuda-test --medvram --opt-split-attention --no-half-vae"
+
+# ============================================================================
+# Optional: Additional VRAM Optimization Flags
+# ============================================================================
+# Uncomment the line below for more aggressive VRAM savings:
+# export COMMANDLINE_ARGS="--skip-torch-cuda-test --medvram --opt-split-attention --no-half-vae --opt-channelslast"
+
+# Uncomment for extreme low VRAM mode (<8GB):
+# export COMMANDLINE_ARGS="--skip-torch-cuda-test --lowvram --opt-split-attention --no-half-vae"
+
+# ============================================================================
+# Python and Git Configuration
+# ============================================================================
+# python3 executable
+#python_cmd="python3"
+
+# git executable
+#export GIT="git"
+
+# python3 venv without trailing slash (defaults to ${install_dir}/${clone_dir}/venv)
+#venv_dir="venv"
+
+# script to launch to start the app
+#export LAUNCH_SCRIPT="launch.py"
+
+# ============================================================================
+# Package Configuration
+# ============================================================================
+# Requirements file to use for stable-diffusion-webui
+#export REQS_FILE="requirements_versions.txt"
+
+# Fixed git repos
+#export K_DIFFUSION_PACKAGE=""
+#export GFPGAN_PACKAGE=""
+
+# Fixed git commits
+#export STABLE_DIFFUSION_COMMIT_HASH=""
+#export CODEFORMER_COMMIT_HASH=""
+#export BLIP_COMMIT_HASH=""
+
+# ============================================================================
+# Performance Tuning
+# ============================================================================
+# Uncomment to enable accelerated launch
+#export ACCELERATE="True"
+
+# Uncomment to disable TCMalloc (Thread-Caching Malloc)
+# TCMalloc improves CPU memory allocation performance
+#export NO_TCMALLOC="True"
+
+##########################################################################################
+# Usage Instructions:
+#
+# 1. Copy this file to webui-user.sh:
+#    cp webui-user-rocm62.sh webui-user.sh
+#
+# 2. Launch the WebUI:
+#    ./webui.sh
+#
+# 3. In WebUI Settings → Optimizations, configure:
+#    - Enable quantization in K samplers: ✓
+#    - Token merging ratio: 0.5
+#    - Cross attention optimization: Doggettx (should be active)
+#
+# 4. Recommended Generation Settings for 16GB VRAM:
+#
+#    Safe Mode (no errors):
+#      - Size: 512x512
+#      - Hires fix: OFF
+#      - Batch size: 1
+#
+#    Quality Mode (with upscaling):
+#      - Size: 512x512
+#      - Hires fix: ON
+#        - Upscale by: 1.5 (not 2.0)
+#        - Hires steps: 10
+#        - Denoising: 0.4
+#
+#    With ControlNet:
+#      - Size: 512x512
+#      - Hires fix: OFF
+#      - ControlNet units: max 1-2 active
+#      - Low VRAM mode: ON in ControlNet settings
+#
+# 5. Workflow for Best Quality:
+#    Phase 1 - Generation: 512x512, no hires fix → find perfect seed/prompt
+#    Phase 2 - Upscaling: Use img2img or "Send to Extras" → R-ESRGAN 4x+
+#
+##########################################################################################
--- a/webui.sh
+++ b/webui.sh
@ -153,7 +153,7 @@ case "$gpu_info" in
    *"Navi 2"*) export HSA_OVERRIDE_GFX_VERSION=10.3.0
    ;;
    *"Navi 3"*) [[ -z "${TORCH_COMMAND}" ]] && \
-         export TORCH_COMMAND="pip install torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm5.7"
+         export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2"
    ;;
    *"Renoir"*) export HSA_OVERRIDE_GFX_VERSION=9.0.0
        printf "\n%s\n" "${delimiter}"
@ -167,7 +167,7 @@ if ! echo "$gpu_info" | grep -q "NVIDIA";
 then
    if echo "$gpu_info" | grep -q "AMD" && [[ -z "${TORCH_COMMAND}" ]]
    then
-	      export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7"
+	      export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2"
    elif npu-smi info 2>/dev/null
    then
        export TORCH_COMMAND="pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu; pip install torch_npu==2.1.0"