Computer Vision in Practice: From Image Classification to Object Detection
Overview
Computer vision (CV) is one of the most active areas of artificial intelligence: it lets machines "understand" images and video. From face unlock on phones to environment perception in self-driving cars, from medical image diagnosis to industrial quality inspection, computer vision has worked its way into every corner of daily life.
This tutorial teaches the core techniques of computer vision from the ground up. Through five hands-on projects you will master image classification, object detection, image segmentation, and other key skills. Whether you are new to AI or a developer with some background, you will find practical knowledge and experience here.
You will learn:
- Core computer vision concepts and algorithms
- Building image classification models with PyTorch
- Implementing an object detection system (hands-on YOLO)
- Applying image segmentation techniques
- Face detection and recognition in practice
- Model optimization and deployment tips
Chapter 1: Computer Vision Basics
1.1 What Is Computer Vision?
Computer vision enables computers to understand and analyze images or video. Its core goal is to give machines visual understanding comparable to that of humans.
The main tasks include:
- Image classification: decide which category an image belongs to (e.g., cat/dog/car)
- Object detection: locate and classify every object in an image
- Image segmentation: assign each pixel of an image to a region
- Pose estimation: locate the keypoints of a human body or object
- Image generation: synthesize new image content
1.2 How Convolutional Neural Networks (CNNs) Work
CNNs are the foundation of computer vision. Unlike traditional fully connected networks, a CNN extracts local image features automatically through convolution.
Core components:
Convolutional layer
- Slides a filter (kernel) across the image
- Extracts local features such as edges and textures
- Shares parameters, reducing model complexity
Pooling layer
- Reduces the spatial size of feature maps
- Keeps the important information while cutting computation
- Common choices: max pooling, average pooling
Fully connected layer
- Integrates all extracted features
- Produces the final classification output
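To make the first two components concrete, here is a minimal NumPy sketch (all shapes and values are illustrative) of valid-mode convolution and max pooling, run with a vertical-edge kernel over a toy image:

```python
import numpy as np

def conv2d(img, kernel):
    # Valid-mode 2D cross-correlation: slide the kernel and take dot products
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(img, size=2):
    # Take the maximum over non-overlapping size x size blocks
    h, w = img.shape
    return img[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A vertical-edge (Sobel-like) kernel responds strongly at the boundary
img = np.zeros((6, 6)); img[:, 3:] = 1.0
edge = conv2d(img, np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]))
print(edge.max())      # 4.0 — strongest response sits on the edge
pooled = max_pool(edge)
print(pooled.shape)    # (2, 2) — spatial size halved twice
```

In a real CNN the kernels are not hand-designed like this one; they are learned by gradient descent, and pooling is interleaved between convolution stages exactly as above.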
1.3 Classic Network Architectures
LeNet-5 (1998)
- One of the earliest CNN architectures
- Used for handwritten digit recognition
- 7-layer structure: 2 conv + 2 pooling + 3 fully connected
AlexNet (2012)
- Winner of the ImageNet competition
- Introduced the ReLU activation function
- Dropout to prevent overfitting
- GPU-accelerated training
VGG (2014)
- Small 3×3 convolution kernels
- Deeper networks (16–19 layers)
- Simple, uniform structure
ResNet (2015)
- Residual connections mitigate vanishing gradients
- Enables very deep networks (100+ layers)
- Still one of the most widely used architectures
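The residual idea can be shown in a few lines: because a block computes F(x) + x, small (or zero) weights leave the identity mapping intact, which is what makes very deep stacks trainable. A NumPy sketch with illustrative shapes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x) = W2 · relu(W1 · x); the "+ x" is the skip connection
    return relu(w2 @ relu(w1 @ x) + x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# With zero weights, F(x) = 0 and the block reduces to the identity (plus ReLU):
w = np.zeros((8, 8))
print(np.allclose(residual_block(x, w, w), relu(x)))  # True
```

A plain (non-residual) block with zero weights would output all zeros instead, destroying the signal; the skip connection is what lets gradients and activations pass through untouched.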
Chapter 2: Project 1 — An Image Classification System
2.1 Project Overview
We will build a complete image classification system that recognizes 10 common object categories, covering the full pipeline of data preparation, model training, evaluation, and deployment.
Target classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
2.2 Environment Setup
# Install dependencies (seaborn is used later for the confusion matrix)
pip install torch torchvision matplotlib scikit-learn pillow seaborn
# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
import os
2.3 Data Preparation and Preprocessing
# Define preprocessing: augment the training set, only normalize the test set
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))
])
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))
])
# Load the CIFAR-10 dataset
train_dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform_train
)
test_dataset = datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transform_test
)
# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=4)
print(f"Training set size: {len(train_dataset)}")
print(f"Test set size: {len(test_dataset)}")
2.4 Building the Classification Model
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            # Conv block 1
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(0.25),
            # Conv block 2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(0.25),
            # Conv block 3
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 512),  # 32x32 input halved three times -> 4x4
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN(num_classes=10).to(device)
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")
2.5 Training the Model
# Loss function, optimizer, and learning-rate schedule
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Training loop
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    return total_loss / len(loader), 100. * correct / total

def test_epoch(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    return total_loss / len(loader), 100. * correct / total

# Start training
num_epochs = 100
best_accuracy = 0
train_history = {'loss': [], 'acc': []}
test_history = {'loss': [], 'acc': []}
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = test_epoch(model, test_loader, criterion, device)
    scheduler.step()
    train_history['loss'].append(train_loss)
    train_history['acc'].append(train_acc)
    test_history['loss'].append(test_loss)
    test_history['acc'].append(test_acc)
    if test_acc > best_accuracy:
        best_accuracy = test_acc
        torch.save(model.state_dict(), 'best_model.pth')
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1}/{num_epochs} | '
              f'Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}% | '
              f'Test Loss: {test_loss:.4f}, Acc: {test_acc:.2f}%')
print(f'\nBest test accuracy: {best_accuracy:.2f}%')
2.6 Model Evaluation and Visualization
# Load the best checkpoint
model.load_state_dict(torch.load('best_model.pth'))

# Collect predictions
def get_predictions(model, loader, device):
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for images, labels in loader:
            images = images.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.numpy())
    return np.array(all_preds), np.array(all_labels)

preds, labels = get_predictions(model, test_loader, device)
# Classification report
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
print(classification_report(labels, preds, target_names=class_names))
# Confusion matrix visualization
import seaborn as sns
cm = confusion_matrix(labels, preds)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()
2.7 Practical Tips Recap
- Data augmentation: random flips and crops noticeably improve generalization
- Batch normalization: speeds up convergence and reduces sensitivity to initialization
- Dropout: prevents overfitting; 0.5 is common for fully connected layers
- Learning-rate scheduling: StepLR lowers the rate late in training
- Model saving: save the best checkpoint, not just the last one
Chapter 3: Project 2 — An Object Detection System (YOLO)
3.1 Project Overview
Object detection must not only recognize which objects appear in an image but also localize them. YOLO (You Only Look Once) is one of the most popular real-time detection algorithms.
Application scenarios:
- Autonomous driving: detecting vehicles, pedestrians, traffic signs
- Security: intrusion detection, abnormal-behavior recognition
- Retail: product recognition, customer-flow analysis
- Healthcare: lesion detection, organ localization
3.2 How YOLO Works
YOLO reformulates detection as a regression problem:
- Grid division: split the image into an S×S grid
- Bounding-box prediction: each cell predicts B bounding boxes
- Class prediction: each cell predicts C class probabilities
- Non-maximum suppression (NMS): remove overlapping detections
Output format: [x, y, w, h, confidence, class_probs]
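The NMS step above can be sketched in a few lines. This is a minimal greedy implementation; the boxes, scores, and 0.5 IoU threshold are illustrative, and real detectors use optimized library versions:

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    # Greedily keep the highest-scoring box, drop boxes that overlap it too much
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        order = order[1:][[iou(boxes[i], boxes[j]) < thresh for j in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — the two overlapping boxes collapse to one
```

YOLO applies this per class after thresholding on confidence, which is why each object ends up with a single final box.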
3.3 Detection with YOLOv8
# Install Ultralytics
pip install ultralytics

from ultralytics import YOLO
import cv2
import matplotlib.pyplot as plt
# Load a pretrained model
model = YOLO('yolov8n.pt')  # nano version: fastest
# Inspect the model
model.info()
3.4 Detecting a Single Image
# Detect a single image
results = model('test_image.jpg')
# Inspect the results
for result in results:
    # Bounding boxes
    boxes = result.boxes
    print(f"Detected {len(boxes)} objects")
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        confidence = box.conf[0].cpu().numpy()
        class_id = int(box.cls[0].cpu().numpy())
        class_name = model.names[class_id]
        print(f"{class_name}: {confidence:.2f} [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
    # Save the annotated image
    result.save('detection_result.jpg')
3.5 Batch Detection and Video Processing
# Batch detection over multiple images
import glob
image_paths = glob.glob('images/*.jpg')
results = model(image_paths, batch=8)
# Process a video
cap = cv2.VideoCapture('input_video.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
# Create the video writer
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output_video.mp4', fourcc, fps, (width, height))
frame_count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Detect on every other frame for speed
    if frame_count % 2 == 0:
        results = model(frame, verbose=False)
        annotated_frame = results[0].plot()
    else:
        annotated_frame = frame
    out.write(annotated_frame)
    frame_count += 1
    if frame_count % 100 == 0:
        print(f"Processed {frame_count} frames")
cap.release()
out.release()
print(f"Video done: {frame_count} frames")
3.6 Training on a Custom Dataset
# Prepare the dataset (YOLO format)
# Directory layout:
# dataset/
# ├── images/
# │   ├── train/
# │   └── val/
# └── labels/
#     ├── train/
#     └── val/
# Write the dataset config file
dataset_config = """
path: /path/to/dataset
train: images/train
val: images/val
names:
  0: person
  1: car
  2: dog
  3: cat
"""
with open('custom_dataset.yaml', 'w') as f:
    f.write(dataset_config)
# Fine-tune a custom model
model = YOLO('yolov8n.pt')  # start from pretrained weights
results = model.train(
    data='custom_dataset.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,      # GPU index
    workers=4,
    optimizer='AdamW',
    patience=20,   # early stopping
    save=True,
    project='runs/detect',
    name='custom_yolov8'
)
# Evaluate the model
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.4f}")
print(f"mAP50-95: {metrics.box.map:.4f}")
3.7 Real-Time Detection
# Real-time detection from a webcam
import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
cap = cv2.VideoCapture(0)  # default camera
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Detect
    results = model(frame, verbose=False)
    annotated_frame = results[0].plot()
    # Display
    cv2.imshow('YOLO Real-time Detection', annotated_frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
3.8 Performance Optimization Tips
Choosing the right model:
- YOLOv8n: fastest, for real-time applications
- YOLOv8s/m: balanced speed and accuracy
- YOLOv8l/x: highest accuracy, for offline processing
Inference optimization:
# Export to ONNX
model.export(format='onnx')
# Export to TensorRT (NVIDIA GPUs)
model.export(format='engine')
# Export to OpenVINO (Intel CPUs)
model.export(format='openvino')
Batching: run several images through the model at once for higher throughput.
Half-precision inference:
model = YOLO('yolov8n.pt')
results = model('image.jpg', half=True)  # FP16
Chapter 4: Project 3 — Image Segmentation
4.1 Project Overview
Image segmentation assigns every pixel in an image to a category, giving pixel-level understanding. Compared with the bounding boxes of object detection, segmentation provides far more precise shape information.
Main types:
- Semantic segmentation: classify each pixel into a category (instances are not distinguished)
- Instance segmentation: distinguish different instances of the same category
- Panoptic segmentation: semantic and instance segmentation combined
Application scenarios:
- Medical imaging: tumor segmentation, organ localization
- Autonomous driving: segmenting road, vehicles, pedestrians
- Remote sensing: land-use classification
- Portraits: background blur, virtual try-on
4.2 Semantic Segmentation with DeepLabV3
import torch
from torchvision import models
from torchvision.transforms import Compose, ToTensor, Normalize, Resize
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Load the pretrained model
model = models.segmentation.deeplabv3_resnet101(weights='DEFAULT')
model.eval()
# Define preprocessing
transform = Compose([
    Resize((520, 520)),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# The pretrained torchvision segmentation models predict these 21 Pascal VOC
# classes (they were trained on a COCO subset using VOC labels)
COCO_CATEGORIES = [
    'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
    'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
    'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tv'
]
# Color map for visualization
COLORS = np.random.randint(0, 255, size=(21, 3), dtype=np.uint8)
COLORS[0] = [0, 0, 0]  # black background

def segment_image(image_path):
    # Load the image
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0)
    # Inference
    with torch.no_grad():
        output = model(input_tensor)['out'][0]
    # Per-pixel class prediction
    prediction = output.argmax(0).byte().cpu().numpy()
    # Build a color segmentation map
    segmentation_map = np.zeros((*prediction.shape, 3), dtype=np.uint8)
    for i in range(21):
        segmentation_map[prediction == i] = COLORS[i]
    # Visualize
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    axes[0].imshow(image)
    axes[0].set_title('Original')
    axes[0].axis('off')
    axes[1].imshow(prediction, cmap='tab20')
    axes[1].set_title('Segmentation Mask')
    axes[1].axis('off')
    axes[2].imshow(segmentation_map)
    axes[2].set_title('Colored Segmentation')
    axes[2].axis('off')
    plt.tight_layout()
    plt.savefig('segmentation_result.png', dpi=150)
    plt.show()
    return prediction

# Run segmentation
segment_image('test_image.jpg')
4.3 Instance Segmentation (Mask R-CNN)
from torchvision.models.detection import maskrcnn_resnet50_fpn
import cv2

# Load Mask R-CNN
model = maskrcnn_resnet50_fpn(weights='DEFAULT')
model.eval()

def instance_segmentation(image_path, threshold=0.5):
    # Load the image
    image = Image.open(image_path).convert('RGB')
    image_tensor = ToTensor()(image).unsqueeze(0)
    # Inference
    with torch.no_grad():
        prediction = model(image_tensor)[0]
    # Unpack results
    boxes = prediction['boxes'].cpu().numpy()
    labels = prediction['labels'].cpu().numpy()
    scores = prediction['scores'].cpu().numpy()
    masks = prediction['masks'].cpu().numpy()
    # Filter low-confidence detections
    valid_indices = scores > threshold
    print(f"Detected {valid_indices.sum()} instances")
    # Visualize
    image_cv = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
    for i in range(len(boxes)):
        if scores[i] < threshold:
            continue
        # Draw the bounding box
        x1, y1, x2, y2 = boxes[i].astype(int)
        cv2.rectangle(image_cv, (x1, y1), (x2, y2), (0, 255, 0), 2)
        # Add the label. Note: the pretrained Mask R-CNN uses the 91-class COCO
        # label set; indexing into the 21-class list above is a simplification,
        # so use the full COCO class list in real applications.
        label = f"{COCO_CATEGORIES[labels[i]]} {scores[i]:.2f}"
        cv2.putText(image_cv, label, (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
        # Overlay the mask
        mask = masks[i, 0] > 0.5
        color = np.random.randint(0, 255, 3)
        image_cv[mask] = image_cv[mask] * 0.5 + color * 0.5
    cv2.imwrite('instance_segmentation.jpg', image_cv)
    print("Instance segmentation result saved")
    return prediction

instance_segmentation('test_image.jpg')
4.4 Portrait Segmentation
# Use a dedicated background-removal model
from transformers import AutoModelForImageSegmentation, AutoImageProcessor

# Load the portrait segmentation model
model_name = "briaai/RMBG-1.4"
model = AutoModelForImageSegmentation.from_pretrained(model_name, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

def remove_background(image_path, output_path='output_no_bg.png'):
    # Load the image
    image = Image.open(image_path).convert('RGB')
    # Preprocess
    inputs = processor(images=image, return_tensors="pt")
    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
    # Post-process. The exact pre/post-processing API is defined by the model's
    # remote code and may differ between versions; check the model card on
    # Hugging Face for the current usage.
    result = processor.post_process_image_segmentation(
        outputs,
        threshold=0.5
    )[0]
    # Extract the foreground
    mask = result['segmentation']
    foreground = Image.new('RGBA', image.size, (0, 0, 0, 0))
    foreground.paste(image, mask=mask)
    # Save
    foreground.save(output_path)
    print(f"Background removed; saved to {output_path}")
    return foreground

remove_background('portrait.jpg')
4.5 Medical Image Segmentation
# UNet: the classic architecture for medical image segmentation
class UNet(nn.Module):
    def __init__(self, n_channels=1, n_classes=1):
        super(UNet, self).__init__()

        def double_conv(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)
            )

        self.enc1 = double_conv(n_channels, 64)
        self.enc2 = double_conv(64, 128)
        self.enc3 = double_conv(128, 256)
        self.enc4 = double_conv(256, 512)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(512, 1024)
        self.up4 = nn.ConvTranspose2d(1024, 512, 2, stride=2)
        self.dec4 = double_conv(1024, 512)
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = double_conv(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = double_conv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)
        self.final = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        # Encoder
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        # Bottleneck
        b = self.bottleneck(self.pool(e4))
        # Decoder with skip connections
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.final(d1))

# Train the medical segmentation model
def train_unet(train_loader, val_loader, num_epochs=50):
    model = UNet(n_channels=1, n_classes=1).to(device)
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    best_dice = 0
    for epoch in range(num_epochs):
        # Train
        model.train()
        train_loss = 0
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, masks)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # Validate
        model.eval()
        val_loss = 0
        dice_scores = []
        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)
                outputs = model(images)
                loss = criterion(outputs, masks)
                val_loss += loss.item()
                # Dice coefficient (epsilon in both terms to avoid division by zero)
                intersection = (outputs * masks).sum()
                union = outputs.sum() + masks.sum()
                dice = (2. * intersection + 1e-6) / (union + 1e-6)
                dice_scores.append(dice.item())
        avg_dice = np.mean(dice_scores)
        if avg_dice > best_dice:
            best_dice = avg_dice
            torch.save(model.state_dict(), 'best_unet.pth')
        print(f'Epoch {epoch+1}/{num_epochs} | '
              f'Train Loss: {train_loss/len(train_loader):.4f} | '
              f'Val Loss: {val_loss/len(val_loader):.4f} | '
              f'Dice: {avg_dice:.4f}')
    print(f'Best Dice coefficient: {best_dice:.4f}')
    return model
Chapter 5: Project 4 — Face Detection and Recognition
5.1 Project Overview
Face recognition is one of the most mature computer vision applications, widely used for phone unlock, access control, and payment verification.
System components:
- Face detection: locate the faces in an image
- Face alignment: normalize face pose
- Feature extraction: encode a face as a feature vector
- Feature matching: compare feature vectors to identify the person
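The matching step above boils down to a nearest-neighbor search in embedding space. A NumPy sketch — the 128-d vectors here are random stand-ins for the encodings a model such as dlib/FaceNet would produce, and the 0.6 tolerance mirrors the default used later:

```python
import numpy as np

# A tiny "gallery" of known embeddings (random stand-ins for real encodings)
rng = np.random.default_rng(42)
known = {"alice": rng.standard_normal(128), "bob": rng.standard_normal(128)}

def identify(query, known, tolerance=0.6):
    # Euclidean distance to every gallery embedding; accept the closest
    # match only if it falls within the tolerance
    names = list(known)
    dists = [np.linalg.norm(known[n] - query) for n in names]
    best = int(np.argmin(dists))
    return names[best] if dists[best] <= tolerance else "Unknown"

query = known["alice"] + 0.001 * rng.standard_normal(128)  # near-duplicate
print(identify(query, known))  # alice
```

The tolerance trades false accepts against false rejects: a tighter threshold rejects more impostors but also more genuine, poorly lit faces.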
5.2 Using the face_recognition Library
# Install dependencies
pip install face_recognition dlib

import face_recognition
import cv2
import numpy as np
from pathlib import Path

# Load known faces
def load_known_faces(known_faces_dir='known_faces'):
    known_encodings = []
    known_names = []
    for image_path in Path(known_faces_dir).glob('*.jpg'):
        # Load the image
        image = face_recognition.load_image_file(image_path)
        # Extract the face encoding
        encodings = face_recognition.face_encodings(image)
        if len(encodings) > 0:
            known_encodings.append(encodings[0])
            known_names.append(image_path.stem)
    print(f"Loaded {len(known_names)} faces: {known_names}")
    return known_encodings, known_names

# Real-time face recognition
def face_recognition_system(known_encodings, known_names):
    cap = cv2.VideoCapture(0)
    print("Face recognition running; press q to quit")
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Downscale for speed
        small_frame = cv2.resize(frame, (0, 0), fx=0.25, fy=0.25)
        rgb_small_frame = cv2.cvtColor(small_frame, cv2.COLOR_BGR2RGB)
        # Detect faces
        face_locations = face_recognition.face_locations(rgb_small_frame)
        face_encodings = face_recognition.face_encodings(rgb_small_frame, face_locations)
        for (top, right, bottom, left), face_encoding in zip(face_locations, face_encodings):
            # Scale coordinates back to the original frame
            top *= 4
            right *= 4
            bottom *= 4
            left *= 4
            # Match against known faces
            matches = face_recognition.compare_faces(known_encodings, face_encoding, tolerance=0.6)
            name = "Unknown"
            # Use the closest match
            face_distances = face_recognition.face_distance(known_encodings, face_encoding)
            if len(face_distances) > 0:
                best_match_index = np.argmin(face_distances)
                if matches[best_match_index]:
                    name = known_names[best_match_index]
            # Draw the bounding box and label
            color = (0, 255, 0) if name != "Unknown" else (0, 0, 255)
            cv2.rectangle(frame, (left, top), (right, bottom), color, 2)
            cv2.rectangle(frame, (left, bottom - 35), (right, bottom), color, cv2.FILLED)
            cv2.putText(frame, name, (left + 6, bottom - 6),
                        cv2.FONT_HERSHEY_DUPLEX, 0.8, (255, 255, 255), 1)
        cv2.imshow('Face Recognition', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()

# Run the system
known_encodings, known_names = load_known_faces()
face_recognition_system(known_encodings, known_names)
5.3 A Face-Based Attendance System
import pandas as pd
from datetime import datetime

class AttendanceSystem:
    def __init__(self, known_encodings, known_names):
        self.known_encodings = known_encodings
        self.known_names = known_names
        self.attendance_file = 'attendance.csv'
        self.marked_today = set()  # avoid duplicate check-ins
        # Initialize the attendance file
        if not Path(self.attendance_file).exists():
            df = pd.DataFrame(columns=['date', 'name', 'time', 'type'])
            df.to_csv(self.attendance_file, index=False)

    def mark_attendance(self, name, attendance_type='check-in'):
        today = datetime.now().strftime('%Y-%m-%d')
        time_str = datetime.now().strftime('%H:%M:%S')
        key = f"{today}_{name}_{attendance_type}"
        if key in self.marked_today:
            return False
        # Record the entry
        df = pd.read_csv(self.attendance_file)
        new_row = {
            'date': today,
            'name': name,
            'time': time_str,
            'type': attendance_type
        }
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
        df.to_csv(self.attendance_file, index=False)
        self.marked_today.add(key)
        print(f"{name} {attendance_type} recorded at {time_str}")
        return True

    def generate_report(self, start_date=None, end_date=None):
        df = pd.read_csv(self.attendance_file)
        if start_date:
            df = df[df['date'] >= start_date]
        if end_date:
            df = df[df['date'] <= end_date]
        # Aggregate per person
        report = df.groupby('name').agg({
            'date': 'count',
            'time': lambda x: f"{x.iloc[0]} - {x.iloc[-1]}"
        }).rename(columns={'date': 'days', 'time': 'time range'})
        print("\n=== Attendance Report ===")
        print(report)
        return report

    def run(self):
        cap = cv2.VideoCapture(0)
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            small_frame = cv2.resize(frame, (0, 0), fx=0.25, fy=0.25)
            rgb_small_frame = cv2.cvtColor(small_frame, cv2.COLOR_BGR2RGB)
            face_locations = face_recognition.face_locations(rgb_small_frame)
            face_encodings = face_recognition.face_encodings(rgb_small_frame, face_locations)
            for (top, right, bottom, left), face_encoding in zip(face_locations, face_encodings):
                top *= 4
                right *= 4
                bottom *= 4
                left *= 4
                matches = face_recognition.compare_faces(self.known_encodings, face_encoding)
                name = "Unknown"
                if len(matches) > 0:
                    face_distances = face_recognition.face_distance(self.known_encodings, face_encoding)
                    best_match_index = np.argmin(face_distances)
                    if matches[best_match_index]:
                        name = self.known_names[best_match_index]
                        # Automatic check-in
                        self.mark_attendance(name)
                color = (0, 255, 0) if name != "Unknown" else (0, 0, 255)
                cv2.rectangle(frame, (left, top), (right, bottom), color, 2)
                cv2.putText(frame, name, (left, bottom + 30),
                            cv2.FONT_HERSHEY_DUPLEX, 0.8, color, 2)
            cv2.imshow('Attendance System', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        cap.release()
        cv2.destroyAllWindows()

# Run the attendance system
attendance = AttendanceSystem(known_encodings, known_names)
attendance.run()
5.4 Liveness Detection (Anti-Photo-Spoofing)
# Simple liveness detection based on blink detection
def eye_aspect_ratio(eye_points):
    """Eye aspect ratio (EAR)"""
    A = np.linalg.norm(eye_points[1] - eye_points[5])
    B = np.linalg.norm(eye_points[2] - eye_points[4])
    C = np.linalg.norm(eye_points[0] - eye_points[3])
    return (A + B) / (2.0 * C + 1e-6)

def blink_detection(frame, landmarks):
    """Detect a blink from 68-point facial landmarks"""
    # Eye keypoints (dlib 68-landmark indexing)
    left_eye = landmarks[36:42]
    right_eye = landmarks[42:48]
    left_ear = eye_aspect_ratio(left_eye)
    right_ear = eye_aspect_ratio(right_eye)
    avg_ear = (left_ear + right_ear) / 2
    # EAR < 0.25 indicates closed eyes
    return avg_ear < 0.25

def liveness_check(video_path='liveness_check.mp4', duration=5):
    """Liveness check: ask the user to blink"""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(fps * duration)
    blink_count = 0
    is_blinking = False
    for i in range(frame_count):
        ret, frame = cap.read()
        if not ret:
            break
        # Detect faces and landmarks
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        face_locations = face_recognition.face_locations(rgb_frame)
        if len(face_locations) > 0:
            # Landmark extraction is omitted in this simplified example;
            # a real system would use dlib.shape_predictor or MediaPipe here
            pass
        cv2.putText(frame, f"Please blink ({duration - i/fps:.1f}s)",
                    (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.putText(frame, f"Blinks: {blink_count}",
                    (10, 70), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow('Liveness Check', frame)
        cv2.waitKey(1)
    cap.release()
    cv2.destroyAllWindows()
    # Require at least two blinks to pass
    passed = blink_count >= 2
    print(f"Liveness check {'passed' if passed else 'failed'}; blinks: {blink_count}")
    return passed
Chapter 6: Project 5 — Image Generation and Style Transfer
6.1 Project Overview
Generative AI is one of today's hottest areas. We will use pretrained models for image generation and style transfer to create unique artwork.
Application scenarios:
- Art: style transfer, image generation
- Data augmentation: generating training data
- Image editing: inpainting, outpainting, transformation
- Entertainment: filters, effects
6.2 Neural Style Transfer
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image
import matplotlib.pyplot as plt

# Use VGG19 as a frozen feature extractor
class VGGFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights='DEFAULT').features
        # Layers used for the content and style representations
        self.content_layers = ['21']  # conv4_2
        self.style_layers = ['0', '5', '10', '19', '28']  # style at several depths
        # Freeze VGG: only the generated image will be optimized
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.features = nn.Sequential(*list(vgg.children())[:30])

    def forward(self, x):
        content_features = []
        style_features = []
        for name, layer in self.features._modules.items():
            x = layer(x)
            if name in self.content_layers:
                content_features.append(x)
            if name in self.style_layers:
                style_features.append(self.gram_matrix(x))
        return content_features, style_features

    def gram_matrix(self, x):
        """Gram matrix (style representation)"""
        b, c, h, w = x.size()
        features = x.view(b * c, h * w)
        G = torch.mm(features, features.t())
        return G.div(b * c * h * w)

# Style transfer driver
class StyleTransfer:
    def __init__(self, content_path, style_path, device='cuda'):
        self.device = device
        self.model = VGGFeatureExtractor().to(device).eval()
        # Load images
        self.content = self.load_image(content_path)
        self.style = self.load_image(style_path)
        # Initialize the generated image from the content image
        self.generated = self.content.clone().requires_grad_(True)
        # The optimizer updates only the generated image
        self.optimizer = torch.optim.Adam([self.generated], lr=0.003)

    def load_image(self, path):
        transform = transforms.Compose([
            transforms.Resize((512, 512)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])
        image = Image.open(path).convert('RGB')
        return transform(image).unsqueeze(0).to(self.device)

    def compute_loss(self, content_features, style_features,
                     gen_content, gen_style,
                     content_weight=1.0, style_weight=1e6):
        # Content loss
        content_loss = 0
        for cf, gf in zip(content_features, gen_content):
            content_loss += nn.functional.mse_loss(gf, cf)
        # Style loss
        style_loss = 0
        for sf, gf in zip(style_features, gen_style):
            style_loss += nn.functional.mse_loss(gf, sf)
        return content_weight * content_loss + style_weight * style_loss

    def transfer(self, num_iterations=300, content_weight=1.0, style_weight=1e6):
        # Content targets come from the content image, style targets from the style image
        with torch.no_grad():
            content_features, _ = self.model(self.content)
            _, style_features = self.model(self.style)
        # Optimize the generated image
        for i in range(num_iterations):
            self.optimizer.zero_grad()
            gen_content, gen_style = self.model(self.generated)
            loss = self.compute_loss(
                content_features, style_features,
                gen_content, gen_style,
                content_weight, style_weight
            )
            loss.backward()
            self.optimizer.step()
            if (i + 1) % 50 == 0:
                print(f"Iteration {i+1}/{num_iterations}, Loss: {loss.item():.2f}")
        return self.generated

    def save_result(self, output_path='style_transfer_result.jpg'):
        # Denormalize and save (detach before converting to an image)
        image = self.generated.detach().cpu().squeeze(0)
        image = image * torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1) + \
                torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
        image = image.clamp(0, 1)
        result = transforms.ToPILImage()(image)
        result.save(output_path)
        print(f"Result saved to {output_path}")
        return result

# Run style transfer
st = StyleTransfer('content.jpg', 'style.jpg')
result = st.transfer(num_iterations=300)
st.save_result()
6.3 Generating Images with Stable Diffusion
# Install dependencies
pip install diffusers transformers accelerate torch

from diffusers import StableDiffusionPipeline
import torch
# Load the model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True
)
pipe = pipe.to("cuda")
# Generate an image
prompt = "a futuristic city at sunset, cyberpunk style, highly detailed, 8k"
negative_prompt = "blurry, low quality, distorted, ugly"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512
).images[0]
image.save('generated_city.png')
print("Image generated")
# Batch generation
prompts = [
    "a cute cat sitting on a windowsill, sunlight, cozy atmosphere",
    "a medieval castle on a mountain, fantasy art, dramatic lighting",
    "a robot in a forest, nature and technology, peaceful scene"
]
for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f'generated_{i+1}.png')
    print(f"Generated image {i+1}/{len(prompts)}")
6.4 Image Inpainting
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
import numpy as np

# Load the inpainting model
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Prepare the image and mask
def create_mask(image_path, mask_area):
    """Create an inpainting mask"""
    image = Image.open(image_path).convert('RGB')
    mask = Image.new('L', image.size, 0)  # black background
    # Paint the region to be repainted in white
    from PIL import ImageDraw
    draw = ImageDraw.Draw(mask)
    draw.rectangle(mask_area, fill=255)
    return image, mask

# Run inpainting
image, mask = create_mask('photo.jpg', (100, 100, 300, 300))
prompt = "beautiful landscape, mountains and lake, natural scenery"
result = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]
result.save('inpainted_result.png')
print("Inpainting done")
6.5 Image Super-Resolution
from PIL import Image
import torch
import cv2
from torchvision import transforms

# Baseline: plain interpolation (a learned model follows below)
def super_resolution(image_path, scale_factor=4):
    """Naive upscaling via bicubic interpolation"""
    # Load the low-resolution image
    lr_image = Image.open(image_path)
    lr_width, lr_height = lr_image.size
    # A real model such as ESRGAN learns to reconstruct detail; bicubic
    # interpolation here is only a baseline for comparison
    hr_image = lr_image.resize(
        (lr_width * scale_factor, lr_height * scale_factor),
        Image.BICUBIC
    )
    # Save the result
    hr_image.save('super_resolution_result.png')
    print(f"Upscaled: {lr_width}x{lr_height} -> {lr_width*scale_factor}x{lr_height*scale_factor}")
    return hr_image

# Using Real-ESRGAN (requires extra packages)
# pip install basicsr realesrgan
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

def advanced_super_resolution(image_path):
    """High-quality super-resolution with Real-ESRGAN"""
    # Define the model
    model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                    num_block=23, num_grow_ch=32, scale=4)
    # Create the upsampler
    upsampler = RealESRGANer(
        scale=4,
        model_path='experiments/pretrained_models/RealESRGAN_x4plus.pth',
        model=model,
        tile=0,        # 0 = no tiling
        tile_pad=10,
        pre_pad=0,
        half=True      # half precision
    )
    # Process the image
    img = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
    output, _ = upsampler.enhance(img, outscale=4)
    cv2.imwrite('esrgan_result.png', output)
    print("Real-ESRGAN super-resolution done")
    return output
Chapter 7: Model Optimization and Deployment
7.1 Model Quantization

import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (weights quantized ahead of time, activations at runtime).
# Note: dynamic quantization covers Linear and recurrent layers; Conv2d layers
# require static quantization instead.
quantized_model = quantize_dynamic(
    model,
    {nn.Linear},
    dtype=torch.qint8
)
# Save the quantized model
torch.save(quantized_model.state_dict(), 'quantized_model.pth')

# Static (post-training) quantization
def prepare_for_quantization(model):
    model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
    torch.quantization.prepare(model, inplace=True)
    return model

def finalize_quantization(model):
    torch.quantization.convert(model, inplace=True)
    return model

# Workflow
model_prepared = prepare_for_quantization(model)
# Run calibration data through the prepared model
# ...
model_quantized = finalize_quantization(model_prepared)
7.2 Model Pruning
from torch.nn.utils import prune

# Pruning operates per module, not on the whole model
for module in model.modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        # L1 unstructured pruning: zero out the 30% smallest weights
        prune.l1_unstructured(module, name='weight', amount=0.3)
        # Structured pruning (removing whole channels) would instead be:
        # prune.ln_structured(module, name='weight', amount=0.3, n=2, dim=0)
        # Remove the pruning re-parametrization (make pruning permanent)
        prune.remove(module, 'weight')
7.3 Exporting to ONNX
# Export the model
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
# Inference with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
# Run inference
result = session.run([output_name], {input_name: dummy_input.numpy()})
7.4 Deploying as a Web Service
# Serve with FastAPI (model, transforms, and class_names come from Chapter 2)
from fastapi import FastAPI, File, UploadFile
from PIL import Image
import io

app = FastAPI()
model.eval()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read the image (convert to RGB in case of grayscale/alpha input)
    image = Image.open(io.BytesIO(await file.read())).convert('RGB')
    # Preprocess
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    input_tensor = transform(image).unsqueeze(0)
    # Inference
    with torch.no_grad():
        output = model(input_tensor)
        prediction = torch.softmax(output, dim=1)[0]
    # Return the result
    return {
        "class": class_names[prediction.argmax().item()],
        "confidence": prediction.max().item()
    }

# Run the service:
# uvicorn app:app --host 0.0.0.0 --port 8000
Chapter 8: FAQ and Best Practices
8.1 Frequently Asked Questions
Q1: What if GPU memory runs out during training?
- Reduce the batch size
- Use gradient accumulation
- Use mixed-precision training (AMP)
- Use a smaller model architecture
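Gradient accumulation works because summed micro-batch gradients (scaled by the micro-batch fraction) equal the full-batch gradient, so you get a large effective batch while only holding a small batch in memory. A NumPy sketch of that identity with synthetic data and a linear least-squares model:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 4)), rng.standard_normal(32)
w = np.zeros(4)

def grad(Xb, yb, w):
    # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)               # full-batch gradient (batch of 32)
accum = np.zeros(4)
for i in range(0, 32, 8):          # 4 micro-batches of 8
    accum += grad(X[i:i+8], y[i:i+8], w) * (8 / 32)
print(np.allclose(accum, full))    # True
```

In PyTorch the same idea is `loss = loss / accum_steps; loss.backward()` on each micro-batch, calling `optimizer.step()` and `optimizer.zero_grad()` only once per `accum_steps` micro-batches.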
# Mixed-precision training example
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    with autocast():
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Q2: How do I deal with overfitting?
- Add more data augmentation
- Add Dropout layers
- Use L2 regularization (weight decay)
- Early stopping
- Use a simpler model
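The early-stopping bullet can be sketched in a few lines: stop when the validation loss has not improved for `patience` consecutive epochs (the loss values below are made up for illustration):

```python
def early_stop_epoch(val_losses, patience=3):
    # Return the epoch at which training would stop
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch        # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return epoch                          # no improvement for `patience` epochs
    return len(val_losses) - 1                    # never triggered

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]
print(early_stop_epoch(losses))  # 5 — three epochs after the best (epoch 2)
```

Combined with saving the best checkpoint (as in Chapter 2), you then restore the weights from the best epoch rather than the last one.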
Q3: How do I speed up inference?
- Model quantization (INT8)
- Model pruning
- Use TensorRT/OpenVINO
- Batched inference
- Use a smaller model variant
Q4: What about imbalanced datasets?
- Oversample minority classes
- Undersample majority classes
- Use a weighted loss function
- Augment minority classes more aggressively
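One common way to set the weights for a weighted loss is inversely proportional to class frequency (the "balanced" heuristic, the same one scikit-learn uses; the label counts below are made up):

```python
import numpy as np

# A toy label set with a 90/9/1 class imbalance
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
counts = np.bincount(labels)
# weight_c = n_samples / (n_classes * count_c): rare classes get large weights
weights = len(labels) / (len(counts) * counts)
print([round(float(w), 2) for w in weights])  # [0.37, 3.7, 33.33]
```

The resulting vector can be passed directly as the `weight` argument of `nn.CrossEntropyLoss` (after wrapping it in a tensor), as the snippet below does with hand-picked values.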
# Weighted loss function
class_weights = torch.tensor([1.0, 2.0, 5.0])  # set according to class frequency
criterion = nn.CrossEntropyLoss(weight=class_weights)
8.2 Best Practices Recap
Data preparation
- Ensure data quality and accurate labels
- Split train/validation/test sets sensibly
- Apply sufficient data augmentation
Model selection
- Start from pretrained models (transfer learning)
- Match model size to task complexity
- Respect the constraints of the deployment environment
Training
- Use learning-rate scheduling
- Monitor both training and validation metrics
- Save the best checkpoint
Evaluation and debugging
- Use multiple evaluation metrics
- Analyze failure cases
- Visualize intermediate results
Deployment
- Quantize and prune models
- Pick a suitable inference engine
- Monitor performance in production
Summary
Through five hands-on projects, this tutorial covered the core techniques and applications of computer vision:
- Image classification: CNN fundamentals and a complete training pipeline
- Object detection: real-time detection with YOLO
- Image segmentation: the differences and uses of semantic and instance segmentation
- Face recognition: a complete face detection and recognition system
- Image generation: style transfer and generative AI
Computer vision evolves quickly, with new models and techniques appearing constantly. Directions worth following:
- Vision Transformers (ViT): Transformers applied to vision
- Multimodal learning: models that combine vision and language (e.g., CLIP, DALL-E)
- 3D vision: point clouds, NeRF, 3D reconstruction
- Video understanding: action recognition, video generation
Suggested next steps:
- Work through each project's code; try modifying and improving it
- Enter Kaggle computer vision competitions
- Read the classic papers (ResNet, YOLO, Transformer, and so on)
- Build your own project that solves a real problem
Remember: practice is the best way to learn computer vision. Pick an application you care about and build it!
References:
- PyTorch documentation: https://pytorch.org/docs/
- Ultralytics YOLO documentation: https://docs.ultralytics.com/
- Hugging Face Diffusers: https://huggingface.co/docs/diffusers
- OpenCV tutorials: https://docs.opencv.org/
- Papers With Code: https://paperswithcode.com/