Source code: https://zh-v2.d2l.ai/d2l-zh.zip

Basic torch operations

nn.LazyLinear()

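LazyLinear is a linear layer in PyTorch that supports lazy initialization. You do not specify the number of input features when creating the layer; it is inferred from the first input passed through the layer, and the weights are initialized at that point. This is convenient when the network structure is complex or the input dimension is not yet known when the model is built. For example: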

class PositionWiseFFN(nn.Module):  #@save
    """The positionwise feed-forward network."""
    def __init__(self, ffn_num_hiddens, ffn_num_outputs):
        super().__init__()
        self.dense1 = nn.LazyLinear(ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.LazyLinear(ffn_num_outputs)

    def forward(self, X):
        return self.dense2(self.relu(self.dense1(X)))
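
To make the lazy initialization concrete, here is a minimal sketch that checks what the layer infers on its first forward pass. The batch shape (4, 10, 16) and the output size 8 are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

layer = nn.LazyLinear(8)               # only the output size is given
X = torch.randn(4, 10, 16)             # hypothetical batch with 16 input features
out = layer(X)                         # the first forward pass triggers initialization

print(layer.in_features)               # inferred from the last dimension of X
print(layer.weight.shape, out.shape)
```

The input feature count is taken from the last dimension of the first input, so any later input must keep that last dimension.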

Experimenting with transforms and visualizing the result

from torchvision import transforms
from PIL import Image
import matplotlib.pyplot as plt

# Build the augmentation pipeline
augment_trans = transforms.Compose([
    transforms.RandomPerspective(fill=1, p=1, distortion_scale=0.5),
    transforms.RandomResizedCrop((150, 224), scale=(0.7,0.9)),
])

# Load a sample image
image = Image.open("/Users/jc/Pictures/d827c8d6gy1hhmzwdkyvhj20qo0xpaig.jpg")

# Apply the augmentation pipeline
augmented_image = augment_trans(image)

# Show the original and augmented images
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Original Image")
plt.imshow(image)

plt.subplot(1, 2, 2)
plt.title("Augmented Image")
plt.imshow(augmented_image)
plt.show()

torch.isfinite

Checks numerical stability: confirms that a value is not NaN, positive infinity, or negative infinity

import torch

# Create a tensor with various values
x = torch.tensor([1.0, float('inf'), -float('inf'), float('nan'), 3.0, 4.0])

# Use torch.isfinite to check which values are finite
finite_mask = torch.isfinite(x)

print(finite_mask)  # Output: tensor([ True, False, False, False,  True,  True])
print(finite_mask.all()) # Output: tensor(False)

Sorting a vector and reordering another vector to match

import torch

# Create two vectors
values = torch.tensor([3, 1, 4, 1, 5, 9, 2, 6])
other_vector = torch.tensor([10, 20, 30, 40, 50, 60, 70, 80])

# Sort values and get the sorting indices
sorted_indices = torch.argsort(values)

# Reorder other_vector with the sorting indices
sorted_other_vector = other_vector[sorted_indices]

# Print the results
print("original values:", values)
print("sorted indices:", sorted_indices)
print("other_vector reordered by the sorted indices:", sorted_other_vector)

For str elements

import torch

# Create two vectors
values = torch.tensor([3, 1, 4, 1, 5, 9, 2, 6])
other_vector = ['apple', 'orange', 'banana', 'grape', 'pear', 'cherry', 'kiwi', 'melon']

# Sort values and get the sorting indices
sorted_indices = torch.argsort(values)

# Reorder other_vector with the sorting indices
sorted_other_vector = [other_vector[i] for i in sorted_indices]

# Print the results
print("original values:", values)
print("sorted indices:", sorted_indices)
print("other_vector reordered by the sorted indices:", sorted_other_vector)

Getting both sorted lists x_l, y_l

import torch

# Create two vectors
values = torch.tensor([3, 1, 4, 1, 5, 9, 2, 6])
other_vector = torch.tensor([10, 20, 30, 40, 50, 60, 70, 80])

# Sort values and get the sorting indices
sorted_indices = torch.argsort(values)

# Reorder both vectors with the sorting indices
sorted_other_vector = other_vector[sorted_indices]
sorted_values = values[sorted_indices]

# Print the results
print("original values:", values)
print("sorted indices:", sorted_indices)
print("other_vector reordered by the sorted indices:", sorted_other_vector)
print("values reordered by the sorted indices:", sorted_values)

Taking the top-k

import torch

pred = torch.rand((4, 5))
print(pred)
print("------------k=1------------------")
vals, indices = pred.topk(k=1, dim=1, largest=True, sorted=True)
print(indices)
print("------------k=2------------------")
vals, indices = pred.topk(k=2, dim=1, largest=True, sorted=True)
print(indices)

Probability distributions

import torch
from torch.distributions import multinomial
fair_probs = torch.ones([6]) / 6
print('fair_probs: ', fair_probs)
# fair_probs = torch.tensor([1, 0, 0, 0, 0, 0])
multinomial.Multinomial(1, fair_probs).sample()  # one trial

fair_probs:  tensor([0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667])

tensor([0., 1., 0., 0., 0., 0.])

multinomial.Multinomial(10, fair_probs).sample()  # ten trials

tensor([4., 0., 1., 3., 0., 2.])

counts = multinomial.Multinomial(10, fair_probs).sample((500,))
counts.shape

torch.Size([500, 6])

Cumulative sum: torch.cumsum(a, dim=0)

torch.pi:

import torch
torch.pi = torch.acos(torch.zeros(1)).item() * 2

Creating the identity matrix I:

import torch
torch.eye(3)

Swapping dimensions (channels)

import torch
X = torch.arange(64).reshape(2, 4, 8)
print(X.shape)
X = X.permute(1, 0, 2)
print(X.shape)

Configuring GPU or CPU

Check the number of GPUs

import torch

gpu_n = torch.cuda.device_count()
print('gpu_n: ', gpu_n)

Use GPU 0 as the compute device

if torch.cuda.device_count() > 0:
    device = torch.device(f'cuda:{0}')

Broadcasting

import torch
X1 = torch.ones(8)*2 + torch.arange(8).reshape(1, 8)
X2 = torch.ones(8)*2 + torch.arange(8).reshape(8, 1)
X1, X2

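# Broadcasting: shapes are aligned from the trailing dimension, and each pair of sizes must be equal or one of them must be 1 (so a 1-D tensor can be combined with a 2-D tensor)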
a = torch.arange(3).reshape(3, 1)
b = torch.arange(2).reshape(1, 2)
c = torch.arange(2)
print(a, b, c, sep='\n')
print('-'*30)
a * b, a * b == a * c
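
The broadcasting rule can be checked without doing any arithmetic. The sketch below uses torch.broadcast_shapes to report the resulting shape and shows a pair of shapes that cannot be broadcast; the shapes are arbitrary examples.

```python
import torch

a = torch.arange(3).reshape(3, 1)      # shape (3, 1)
c = torch.arange(2)                    # shape (2,), treated as (1, 2)
prod = a * c                           # broadcast to (3, 2)
print(prod.shape)

# torch.broadcast_shapes reports the result shape without computing anything
print(torch.broadcast_shapes((3, 1), (2,)))
print(torch.broadcast_shapes((2, 1, 4), (7, 1)))

try:
    torch.ones(3) + torch.ones(2)      # trailing sizes 3 and 2: neither is 1
except RuntimeError as err:
    print("cannot broadcast:", type(err).__name__)
```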

import torch
X = torch.randn(2,7,4)
X.sum(axis=1, keepdims=True).shape, X/X.sum(axis=1, keepdims=True)

(torch.Size([2, 1, 4]),
 tensor([[[-0.0854, -0.2663, -0.0131, -0.0275],
          [-1.4016, -0.0819, -0.1089, -0.0186],
          [ 0.6619, -1.1219,  0.0251,  0.1489],
          [-3.7513,  1.2105,  1.0315,  0.1439],
          [ 4.3160,  1.5818, -0.2686,  0.3375],
          [ 0.9566, -0.2599,  0.0430,  0.0494],
          [ 0.3037, -0.0624,  0.2909,  0.3664]],

         [[ 1.9549,  0.0095, -0.5526,  0.2624],
          [-0.4357, -0.5184,  0.6116,  0.2836],
          [ 0.0822, -0.3908,  1.1681,  0.3244],
          [-0.2784,  0.4929,  0.6904,  0.0872],
          [-0.3724, -0.2463, -0.2890, -0.3864],
          [-1.1481,  0.9092, -0.7448,  0.1903],
          [ 1.1975,  0.7439,  0.1163,  0.2385]]]))

Concatenating matrices

import torch
X, Y = torch.normal(0, 1, (3, 4)), torch.normal(0, 1, (3, 4))
torch.cat((X, Y), dim=1)

.reshape() (equivalent to .view() for contiguous tensors)

import torch
X = torch.arange(16).reshape(2,2,4)
X, X.reshape(-1, 4)  # merge the first two dimensions (stack along dim 0)

(tensor([[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7]],

         [[ 8,  9, 10, 11],
          [12, 13, 14, 15]]]),
 tensor([[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]]))

import torch
X = torch.arange(16).reshape(2,2,4)
X, X.reshape(2, -1)  # flatten the last two dimensions

(tensor([[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7]],

         [[ 8,  9, 10, 11],
          [12, 13, 14, 15]]]),
 tensor([[ 0,  1,  2,  3,  4,  5,  6,  7],
         [ 8,  9, 10, 11, 12, 13, 14, 15]]))

# .reshape(-1)
import torch
X = torch.randn(2, 6)
X.reshape(-1).shape

torch.zeros(2, 3, 4) == torch.zeros((2, 3, 4))

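reshape returns a new tensor object, but when a view is possible it shares the underlying memory with the original

import torch
a = torch.arange(12)
b = a.reshape(3, 4)
print('id(a) == id(b): ', id(a) == id(b))
print('a: ', a)
a[0] = 77
b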
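
id() only compares the Python wrapper objects, so a more direct check is to compare data pointers. The sketch below, which assumes a small integer tensor, also shows the case where reshape has to copy because the input is not contiguous.

```python
import torch

a = torch.arange(12)
b = a.reshape(3, 4)                    # contiguous input: b is a view of a
shares = a.data_ptr() == b.data_ptr()  # same underlying storage
a[0] = 77
print(shares, b[0, 0].item())          # the write through a is visible in b

t = a.reshape(3, 4).t()                # transposed, so no longer contiguous
c = t.reshape(-1)                      # a view is impossible here, so reshape copies
copied = t.data_ptr() != c.data_ptr()
print(copied)
```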

Adding a dimension

import torch
X = torch.arange(16).reshape(2,2,4)
print(X.shape)
Y = torch.unsqueeze(X, 0)
Y.shape, Y.squeeze().shape

torch.Size([2, 2, 4])

(torch.Size([1, 2, 2, 4]), torch.Size([2, 2, 4]))

Detaching a variable from autograd

X = torch.randn(2, 6)
X.requires_grad_()
print(X)
X.detach_()

argmax: index of the maximum value in each row

import torch
Y = torch.rand(2, 6)
idx = torch.tensor(range(len(Y)))
idx2 = Y.argmax(dim=1)
Y, Y.argmax(dim=1), Y[idx, idx2]

torch.numel: the total number of elements in a Tensor

import torch
a = torch.zeros(1, 4, 4)
torch.numel(a), a.size(), a.size().numel(), a.numel()

(16, torch.Size([1, 4, 4]), 16, 16)

Indexing elements

Strided selection

![[10.png]]

X[-1] and X[1:3] slice rows

import torch
X = torch.arange(9).reshape(3, 3)
X[-1], X[1:3]

Selecting specific rows and columns of a matrix

Selecting specific rows

X = torch.arange(16).reshape(4, 4)
indices = torch.tensor([2, 3])
X, X[indices]

Selecting specific columns

X = torch.arange(16).reshape(4, 4)
indices = torch.tensor([2, 3])
X, X[:, indices], X[:, indices] == X.T[indices].T

Selecting specific elements

import torch
Y = torch.rand(2, 6)
idx = torch.tensor(range(len(Y)))
idx2 = Y.argmax(dim=1)
Y, Y[idx, idx2]

Shuffling a tensor

import torch
t=torch.tensor([[1,2],[3,4]])
r=torch.randperm(2)
c=torch.randperm(2)
print(r, c)
print(r[:, None])
print(r[:, None].shape, c.shape)
t=t[r[:, None], c]

t=torch.tensor([[1,2],[3,4]])
print(t)
r=torch.randperm(2)
c=torch.randperm(2)
t=t[r][:,c]
t

X = torch.arange(9).reshape(3, 3)
r=torch.randperm(3).reshape(3, 1)
c=torch.randperm(3)
X[r, c]

X = torch.arange(9).reshape(3, 3)
r=torch.randperm(3)
c=torch.randperm(3)
X[r, c]

Element-wise operations: x + y, x - y, x * y, x / y, x ** y

Assignment

import torch
X = torch.arange(12).reshape(3, 4)
X[0:2, :] = 12
X

Reducing memory usage

# Avoid: Y = Y + X (allocates a new tensor for Y)
# Prefer: Y[:] = Y + X or Y += X (in-place operations)
import torch
X = torch.arange(12).reshape(3, 4)
Y = torch.ones(3, 4)
before = id(Y)
Y = Y + X 
print(id(Y) == before)
before = id(Y)
Y += X 
print(id(Y) == before)

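Tensors and NumPy

X.numpy() returns an array that shares memory with the tensor.

torch.tensor(A) copies the NumPy array, so the resulting tensor does not share memory with it (torch.from_numpy(A) does share memory).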

import torch
X = torch.arange(12).reshape(3, 4)
A = X.numpy()
B = torch.tensor(A)
C = B.numpy()
type(A), type(B)

A[0] = 100
B[0] = 200
C[0] = 300
A, B, X,
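
The difference between the two NumPy-to-tensor conversions can be seen in a small sketch. The array contents are arbitrary.

```python
import numpy as np
import torch

A = np.zeros(3)
shared = torch.from_numpy(A)   # shares memory with A
copied = torch.tensor(A)       # independent copy of A

A[0] = 100                     # write through the NumPy array
print(shared[0].item(), copied[0].item())
```

Only the tensor created with torch.from_numpy sees the write.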

Converting a size-1 tensor to a scalar

import torch
a = torch.tensor([3.5])
a, a.item(), float(a), int(a)

torch.exp()

import torch
x = torch.ones(3)
torch.exp(x)

torch.zeros_like(), torch.ones_like()

import torch
x = torch.ones(3)
torch.zeros_like(x), torch.ones_like(x)

.clone()

No memory sharing; the address is different

import torch
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
B = A.clone()  # allocate new memory and assign a copy of A to B
id(A) == id(B)

.sum() reduces dimensions

import torch
x = torch.arange(4, dtype=torch.float32)
x, x.sum()

import torch
X = torch.randn(2,7,4)
print(X.sum([0,1,2]))

With keepdims=True the reduced dimension is kept

import torch
X = torch.ones(2, 5, 4)
X.sum(axis=0, keepdims=True)

Matrix multiplication: torch.matmul() covers the cases handled by torch.mm() (both arguments must be 2-D matrices) and torch.mv() (matrix times vector)

import torch
X = torch.randn(4, 2)
Y = torch.randn(2, 3)
torch.mm(X, Y) == torch.matmul(X, Y)
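
As a sketch of how torch.matmul dispatches on the input dimensions, the example below compares it against torch.mm, torch.mv, and torch.bmm on random tensors. The shapes are arbitrary, and a small tolerance is used because floating-point results may differ in the last bits.

```python
import torch

M = torch.randn(4, 2)
N = torch.randn(2, 3)
v = torch.randn(2)
B1 = torch.randn(5, 4, 2)
B2 = torch.randn(5, 2, 3)

same_mm = torch.allclose(torch.matmul(M, N), torch.mm(M, N), atol=1e-6)      # 2-D x 2-D
same_mv = torch.allclose(torch.matmul(M, v), torch.mv(M, v), atol=1e-6)      # 2-D x 1-D
same_bmm = torch.allclose(torch.matmul(B1, B2), torch.bmm(B1, B2), atol=1e-6)  # batched
print(same_mm, same_mv, same_bmm)

batched = torch.matmul(B1, N)          # N is broadcast across the batch dimension
print(batched.shape)
```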

Norm: the length or magnitude of a vector; it maps a tensor to a scalar.

![[ds1.png]]
In the last line, a is a scalar

$L_2$ norm

import torch
u = torch.tensor([3.0, -4.0])
torch.norm(u)

$L_1$ norm

import torch
u = torch.tensor([3.0, -4.0])
torch.abs(u).sum()

len(X) is the size of X along dimension 0

import torch
X = torch.randn(2,3,4)
len(X)

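Automatic differentiation

backward() must be called explicitly before reading .grad:
because backward() is computationally expensive, call it only when .grad is actually needed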
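
A minimal sketch of that order of operations, using y = 2 * dot(x, x) as an illustrative function:

```python
import torch

x = torch.arange(4.0, requires_grad=True)
print(x.grad)              # None: backward() has not been called yet

y = 2 * torch.dot(x, x)    # y = 2 * sum(x_i^2)
y.backward()               # populates x.grad with dy/dx = 4x
g1 = x.grad.clone()
print(g1)

x.grad.zero_()             # gradients accumulate, so reset before reuse
x.sum().backward()
print(x.grad)              # gradient of sum(x) is all ones
```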

Generating Gaussian data

Specify the mean and standard deviation

torch.normal(0, 2, (2, 3))

Generate standard normal data

torch.randn((2, 3)), torch.randn(2, 3)

Converting data types

Specify the target type directly

# 1. Use dedicated methods such as .int(), .float()
# 2. Use methods such as .long() that name the target type explicitly
# 3. Use type_as() to convert a tensor to the type of another tensor
# Using the dedicated methods
import torch

X = torch.randn(2, 2)
print(X.type())
print(type(X))

# .long() converts the tensor to the long (int64) type
long_tensor = X.long()
print(long_tensor.type())

# .half() converts the tensor to half-precision floating point
half_tensor = X.half()
print(half_tensor.type())

# .int() converts the tensor to the int (int32) type
int_tensor = X.int()
print(int_tensor.type())

# .double() converts the tensor to the double (float64) type
double_tensor = X.double()
print(double_tensor.type())

# .float() converts the tensor to the float (float32) type
float_tensor = X.float()
print(float_tensor.type())

# .char() converts the tensor to the char (int8) type
char_tensor = X.char()
print(char_tensor.type())

# .byte() converts the tensor to the byte (uint8) type
byte_tensor = X.byte()
print(byte_tensor.type())

# .short() converts the tensor to the short (int16) type
short_tensor = X.short()
print(short_tensor.type())

Matching the type of another tensor

X = torch.randn(2, 5)
X = X.short()
y = torch.randn(2)
# key:
y.type(X.dtype)

softmax()

import torch
x = torch.randn(9)
torch.softmax(x, dim=0) # dim must be specified

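Training principles

Python floats default to float64; deep learning generally uses float32

Smaller batches tend to train better: the random noise they introduce actually helps the model.

Newton's method? It converges quickly, so why not use Newton's method (a second-order method)?

When the final samples do not fill a whole batch:

1. Drop them
2. Use them as a smaller final batch
3. Borrow samples from the next epoch to fill the batch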
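
The first two options map directly onto the drop_last flag of DataLoader. The sketch below uses a hypothetical 10-sample dataset with batch size 4; the third option has no built-in switch and would need a custom sampler.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float())  # 10 samples, hypothetical data

# Batch sizes when the short final batch is kept (option 2) or dropped (option 1)
keep = [len(batch[0]) for batch in DataLoader(dataset, batch_size=4)]
drop = [len(batch[0]) for batch in DataLoader(dataset, batch_size=4, drop_last=True)]
print(keep)
print(drop)
```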

It is more common to store each data sample as a row vector of the matrix.

Number of tensor dimensions:

x.ndim

Moving a variable to another device

X = torch.randn(2)
X.to(torch.device('cpu'))

Saving models

What is a state_dict?

# Define the model
import torch
from torch import nn, optim
import torch.nn.functional as F


class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize the model
model = TheModelClass()

# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Print the model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

# Print the optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

Saving and loading a model

Saving

Conventional file extensions: .pth or .pt

torch.save(model.state_dict(), PATH)

The matching load call:

torch.load(PATH)

Loading

model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()  # put dropout and batch normalization layers in evaluation mode; otherwise inference results will be inconsistent

Saving & loading a checkpoint for inference and/or resuming training

Saving

torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            ...
            }, PATH) # conventional extension: .tar

Loading

model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.eval() # before inference
# - or -
model.train() # before training
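
The template above leaves the model, optimizer, and path as placeholders. A runnable sketch of the same round trip might look like the following, where the nn.Linear model, the SGD settings, and the temporary file path are all stand-ins chosen for illustration.

```python
import os
import tempfile

import torch
from torch import nn, optim

model = nn.Linear(4, 2)                      # stand-in model for illustration
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step so that the optimizer has state worth saving
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.tar")
torch.save({
    "epoch": 1,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss.item(),
}, path)

# Rebuild the objects first, then restore their state
restored = nn.Linear(4, 2)
restored_opt = optim.SGD(restored.parameters(), lr=0.1, momentum=0.9)
checkpoint = torch.load(path)
restored.load_state_dict(checkpoint["model_state_dict"])
restored_opt.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
restored.train()                             # resume training; call .eval() for inference
print(start_epoch, torch.equal(restored.weight, model.weight))
```

Saving the optimizer state alongside the weights matters when the optimizer keeps per-parameter buffers (momentum here), since those are lost if only the model is saved.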

Saving multiple models

Use cases: GANs, sequence-to-sequence models, or ensembles of models

# Save
torch.save({
            'modelA_state_dict': modelA.state_dict(),
            'modelB_state_dict': modelB.state_dict(),
            'optimizerA_state_dict': optimizerA.state_dict(),
            'optimizerB_state_dict': optimizerB.state_dict(),
            ...
            }, PATH)

# Load
modelA = TheModelAClass(*args, **kwargs)
modelB = TheModelBClass(*args, **kwargs)
optimizerA = TheOptimizerAClass(*args, **kwargs)
optimizerB = TheOptimizerBClass(*args, **kwargs)

checkpoint = torch.load(PATH)
modelA.load_state_dict(checkpoint['modelA_state_dict'])
modelB.load_state_dict(checkpoint['modelB_state_dict'])
optimizerA.load_state_dict(checkpoint['optimizerA_state_dict'])
optimizerB.load_state_dict(checkpoint['optimizerB_state_dict'])

modelA.eval()
modelB.eval()
# - or -
modelA.train()
modelB.train()

Warm-starting across different models

# Save
torch.save(modelA.state_dict(), PATH)

# Load
modelB = TheModelBClass(*args, **kwargs)
modelB.load_state_dict(torch.load(PATH), strict=False) # strict=False loads the matching keys and ignores the rest
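
To see what strict=False actually skips, the sketch below loads one model's state_dict into a different model and inspects the returned key lists. ModelA, ModelB, and their layer names are hypothetical; note that strict=False only tolerates missing or extra keys, not shape mismatches on keys that do match.

```python
import torch
from torch import nn

class ModelA(nn.Module):                 # hypothetical source model
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)
        self.head_a = nn.Linear(8, 2)

class ModelB(nn.Module):                 # hypothetical target model
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)  # same name and shape as in ModelA
        self.head_b = nn.Linear(8, 5)

model_a, model_b = ModelA(), ModelB()
result = model_b.load_state_dict(model_a.state_dict(), strict=False)
print(result.missing_keys)      # keys of model_b that the state_dict did not provide
print(result.unexpected_keys)   # keys in the state_dict that model_b does not have
print(torch.equal(model_a.backbone.weight, model_b.backbone.weight))
```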

Saving and loading across devices

Save on GPU, load on CPU

# Save
torch.save(model.state_dict(), PATH)

# Load
device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))

Save on GPU, load on GPU

# Save
torch.save(model.state_dict(), PATH)

# Load
device = torch.device("cuda")
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.to(device)
# Make sure to call input = input.to(device) on any input tensors that you feed to the model

Save on CPU, load on GPU

# Save
torch.save(model.state_dict(), PATH)

# Load
device = torch.device("cuda")
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location="cuda:0"))  # Choose whatever GPU device number you want
model.to(device)
# Make sure to call input = input.to(device) on any input tensors that you feed to the model

Loading a parallel model

state_dict = torch.load(PATH).module.state_dict()
model.load_state_dict(state_dict)

torch.device operations

Check the number of GPUs

torch.cuda.device_count()

Select a device

torch.device('cpu')  # or torch.device(f'cuda:{0}') if a GPU is available

L2 Normalization

![[ds7.png]]

torch.empty

import torch
a = torch.empty(1, 2)
b = torch.zeros(1, 2)
print(a, a.shape, b)
a == b

Defining a network

import torch.nn as nn
class DotProduct_Classifier(nn.Module):
    
    def __init__(self, num_classes=1000, feat_dim=2048, *args):
        super(DotProduct_Classifier, self).__init__()
        # print('<DotProductClassifier> contains bias: {}'.format(bias))
        self.fc = nn.Linear(feat_dim, num_classes)
        
    def forward(self, x, *args):
        x = self.fc(x)
        return x, None

    
clf = DotProduct_Classifier()
for k in clf.fc.state_dict():
    print(k)
print('-')
for k in clf.state_dict():
    print(k)

list2tensor, tensor2list

# list to tensor
import torch
X = [[1, 2, 3], [4, 5, 6]]
torch.tensor(X)

# tensor to list
import torch
X = torch.arange(6).reshape(2, 3)
X, X.tolist()

Power

import torch

x = torch.arange(9)
y = torch.ones(x.shape)*2
torch.pow(x, y)

batch matrix multiplication

torch.bmm(X, Y)

import torch
X = torch.arange(8).reshape((2, 1, 4))
Y = torch.arange(24).reshape((2, 4, 3))
print(X, Y, sep='\n')
torch.bmm(X, Y), torch.bmm(X, Y).shape
# output:
# tensor([[[0, 1, 2, 3]],

#         [[4, 5, 6, 7]]])
# tensor([[[ 0,  1,  2],
#          [ 3,  4,  5],
#          [ 6,  7,  8],
#          [ 9, 10, 11]],

#         [[12, 13, 14],
#          [15, 16, 17],
#          [18, 19, 20],
#          [21, 22, 23]]])
# tensor([[[ 42,  48,  54]],

#         [[378, 400, 422]]]) torch.Size([2, 1, 3])

Repeating elements

torch.repeat_interleave

import torch
X = torch.arange(9)
print(X)
# X = torch.repeat_interleave(X, 3).reshape(-1, 3)
X = torch.repeat_interleave(X, 3).reshape(-1, 3)
X

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8])

tensor([[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2],
        [3, 3, 3],
        [4, 4, 4],
        [5, 5, 5],
        [6, 6, 6],
        [7, 7, 7],
        [8, 8, 8]])

l = [1, 2]
import torch
l = torch.tensor(l)
l.repeat((2, 1))

tensor([[1, 2],
        [1, 2]])

repeat

import torch
X = torch.arange(2)
X.repeat((3, 3))

tensor([[0, 1, 0, 1, 0, 1],
        [0, 1, 0, 1, 0, 1],
        [0, 1, 0, 1, 0, 1]])

seed

Makes the random numbers identical on every run, which helps reproducibility

import torch
torch.manual_seed(0)
print(torch.rand(1))
torch.manual_seed(0)
print(torch.rand(1))

tensor([0.4963])
tensor([0.4963])

import torch
torch.manual_seed(0)
print(torch.rand(1))
print(torch.rand(1))

tensor([0.4963])
tensor([0.7682])

Reference: https://blog.csdn.net/qq_42951560/article/details/112174334

nn.Embedding

![[ds99.png]]

grad_fn

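For leaf nodes grad_fn is usually None; only result nodes have a meaningful grad_fn, which indicates the type of gradient function. For example, in the code below y.grad_fn=<PowBackward0> and z.grad_fn=<AddBackward0>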

import torch

x = torch.rand(3, requires_grad=True)
y = x**2
z = x + x

import torch

x = torch.rand(3, requires_grad=True)
y = x**2
z = x + x
y.backward(x)

tensor([2., 2., 2.])

torch.autograd.backward

torch.autograd.backward(
		tensors, 
		grad_tensors=None, 
		retain_graph=None, 
		create_graph=False, 
		grad_variables=None)

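tensors: the tensors whose gradients are computed. In other words, these two forms are equivalent: torch.autograd.backward(z) == z.backward()
grad_tensors: used when differentiating a non-scalar tensor. It is itself a tensor whose shape normally matches the tensor above. One way to read it: treat each element of the output as a separate loss, and grad_tensors gives the weights for summing those losses.
retain_graph: normally PyTorch frees the computation graph after one backward call, so set this to True in order to call backward repeatedly on the same graph
create_graph: when set to True, it can be used to compute higher-order gradients
grad_variables: the official note says 'grad_variables' is deprecated. Use 'grad_tensors' instead. In other words, this parameter is expected to be dropped in later versions; use grad_tensors directly.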
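
The roles of grad_tensors and retain_graph can be illustrated with a short sketch. The input values and the per-element weights are arbitrary choices for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                                  # non-scalar output
print(y.grad_fn)                            # the node that produced y

weights = torch.tensor([1.0, 0.1, 0.01])    # hypothetical per-element weights
torch.autograd.backward(y, grad_tensors=weights, retain_graph=True)
first = x.grad.clone()                      # equals weights * 2x

x.grad.zero_()
y.sum().backward()                          # second pass works because the graph was retained
second = x.grad.clone()                     # equals 2x
print(first, second)
```

Without retain_graph=True in the first call, the second backward pass through y would fail because the graph buffers would already be freed.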

Reference: https://zhuanlan.zhihu.com/p/83172023

torch.nn.Embedding

Definition

Looks up rows of a matrix by index

import torch
from torch import nn

embedding = nn.Embedding(10, 3)
print(embedding)
input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
x = embedding(input)
print(x)

Embedding(10, 3)
tensor([[[-1.0888, -1.7198, -1.3069],
         [ 1.2080, -1.1661,  0.2130],
         [ 1.4148, -1.8029,  0.2960],
         [ 0.3168, -0.7697, -0.5757]],

        [[ 1.4148, -1.8029,  0.2960],
         [-1.0824,  0.4938, -0.5015],
         [ 1.2080, -1.1661,  0.2130],
         [-0.5536, -0.6143, -0.3154]]], grad_fn=<EmbeddingBackward0>)

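Note that the 2 in the first sample and the 2 in the second sample have the same embedding.
Stepping through with a debugger shows that the row of the embedding weight at index 2 equals the embedding of the 2 in the samples.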
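
The same check can be made in code rather than in a debugger. This sketch compares the layer output with a plain row lookup into embedding.weight, using the same index values as above.

```python
import torch
from torch import nn

embedding = nn.Embedding(10, 3)
idx = torch.tensor([[1, 2, 4, 5], [4, 3, 2, 9]])
out = embedding(idx)

same_as_rows = torch.equal(out, embedding.weight[idx])       # plain row lookup
same_for_2 = torch.equal(out[0, 1], embedding.weight[2])     # index 2 maps to row 2
print(out.shape, same_as_rows, same_for_2)
```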

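Negative indices are not allowed

Even if a negative index is used when constructing the embedding, negative indices still cannot be used in the input to the embedding.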

import torch
import torch.nn as nn

emb = nn.Embedding(20, 100, padding_idx=-1)
inp = torch.tensor([5, 2, 7, 12, 3])
bad_padding = torch.cat((inp, torch.tensor([-1] * 3)))
good_padding = torch.cat((inp, torch.tensor([0] * 3)))
out = emb(good_padding)
out = emb(bad_padding)  # raises an error: index out of range
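
A negative padding_idx is resolved to a non-negative index at construction time, so one workable pattern is to pad with the resolved value. A minimal sketch, assuming the same 20 x 100 embedding as above:

```python
import torch
from torch import nn

emb = nn.Embedding(20, 100, padding_idx=-1)
print(emb.padding_idx)                          # stored as a non-negative index

inp = torch.tensor([5, 2, 7, 12, 3])
pad = torch.full((3,), emb.padding_idx)         # pad with the resolved index
out = emb(torch.cat((inp, pad)))
padding_is_zero = bool((out[-3:] == 0).all())   # padding rows are all zeros
print(out.shape, padding_is_zero)
```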

torch.roll

```python
>>> x = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]).view(4, 2)
>>> x
tensor([[1, 2],
        [3, 4],
        [5, 6],
        [7, 8]])
'''Shift dim 0 down by 1; the overflow [7, 8] wraps around to the top'''
>>> torch.roll(x, 1, 0)
tensor([[7, 8],
        [1, 2],
        [3, 4],
        [5, 6]])
'''Shift dim 0 up by 1; the overflow [1, 2] wraps around to the bottom'''
>>> torch.roll(x, -1, 0)
tensor([[3, 4],
        [5, 6],
        [7, 8],
        [1, 2]])
'''With tuples, shifts and dims pair up one to one:
dim 0 shifts down by 2, and the overflow [5, 6], [7, 8] wraps around to the top;
dim 1 shifts right by 1, and the overflow [6, 8, 2, 4] wraps around to the left'''
>>> torch.roll(x, shifts=(2, 1), dims=(0, 1))
tensor([[6, 5],
        [8, 7],
        [2, 1],
        [4, 3]])
```
Reference: https://blog.csdn.net/weixin_42899627/article/details/116095067

## Assigning to a tensor with slice


```python
import torch
from torch import nn


input = torch.ones(1, 6, 5, 1)
input[:,slice(0, -1),slice(0, -4),:] = 0
print(input)
```

tensor([[[[0.],
          [1.],
          [1.],
          [1.],
          [1.]],

         [[0.],
          [1.],
          [1.],
          [1.],
          [1.]],

         [[0.],
          [1.],
          [1.],
          [1.],
          [1.]],

         [[0.],
          [1.],
          [1.],
          [1.],
          [1.]],

         [[0.],
          [1.],
          [1.],
          [1.],
          [1.]],

         [[1.],
          [1.],
          [1.],
          [1.],
          [1.]]]])

import torch
input = torch.arange(10).reshape(2, 5)
input

tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

torch.max(input), torch.min(input)

(tensor(9), tensor(0))

Saving a dataset or dataloader to disk and loading it back

dataset

Saving

import dill

with open(f"/home/featurize/work/huggingFace/data/dataset_save_{i}.pkl", "wb") as f:
    dill.dump(encoded_dataset, f)

Loading

with open(f"/home/featurize/work/huggingFace/data/dataset_save_{i}.pkl", "rb") as f:
    dataset = dill.load(f)

Concatenating multiple datasets

from torch.utils.data import ConcatDataset
dataset_l = []
with open(f"/home/featurize/work/huggingFace/data/dataset_save_{i}.pkl","rb") as f:
    dataset = dill.load(f)
    dataset_l.append(dataset)
    
encoded_dataset = ConcatDataset(dataset_l)

dataloader

# Build the dataset and dataloader to be saved
from torch.utils.data import TensorDataset, DataLoader
import torch
from sklearn.datasets import make_classification

data, target = make_classification()
# data.shape, target.shape
# ((100, 20), (100,))

batch_size=10
dataset = TensorDataset(torch.from_numpy(data), torch.from_numpy(target))
dataloader = DataLoader(dataset, shuffle=False, drop_last=True, batch_size=batch_size)

Save:

import dill

with open('./dataset_save.pkl','wb') as f:
    dill.dump(dataset, f)

with open('./dataloader_save.pkl','wb') as f:
    dill.dump(dataloader, f)

Load:

with open('./dataset_save.pkl','rb') as f:
    dataset_save = dill.load(f)

with open('./dataloader_save.pkl','rb') as f:
    dataloader_save = dill.load(f)

Compare the data:

x, y = next(iter(dataloader))
x_save, y_save = next(iter(dataloader_save))
torch.equal(x, x_save), torch.equal(y, y_save)
# (True, True)

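Another approach:

import math

import numpy as np
import torch
from torch.utils.data import DataLoader

limit = 1024**3.8  # assume a ceiling of about 4 GB => 1024 MB * 3.8 (leave headroom; do not fill memory completely)
save_path = ''  # where to store the dataloaders (defaults to the current working directory)
data = np.ones((180000,60,80)).astype('float64')  # assume the data is a NumPy array
batchsize = 128

# chunk data: pad with leading samples so the length is a multiple of batchsize
len_tmp = data.shape[0]
if len_tmp%batchsize != 0:
  less_num = batchsize-(len_tmp - int(len_tmp/batchsize) *batchsize)
  new_tmp = data[:less_num].copy()
  data = np.vstack((data,new_tmp))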

num_chunk = math.ceil(data.size / limit)
unit_portion = data.size//num_chunk
unit_element = unit_portion//data[0].size
fn = [save_path,'','dataloader.pth']
for x in range(num_chunk):
    train_loader = DataLoader(data[x*unit_element:(x+1)*unit_element], shuffle=True, batch_size=batchsize, drop_last=False)
    fn[1] = str(x)
    torch.save(train_loader, ''.join(fn))

References:
Saving and loading a dataset and dataloader: https://blog.51cto.com/u_15479825/5759638
Concatenating datasets: https://zj-image-processing.readthedocs.io/zh_CN/latest/pytorch/preprocessing/[torchvision][ConcatDataset]%E8%BF%9E%E6%8E%A5%E5%A4%9A%E4%B8%AA%E6%95%B0%E6%8D%AE%E9%9B%86/
Saving a dataloader, second approach: https://blog.csdn.net/u013302570/article/details/120353481

transforms

transforms.ColorJitter

import torchvision.transforms as transforms


# Individual settings
# Randomly change the image brightness
brightness_change = transforms.ColorJitter(brightness=0.5)
# Randomly change the image hue
hue_change = transforms.ColorJitter(hue=0.5)
# Randomly change the image contrast
contrast_change = transforms.ColorJitter(contrast=0.5)

# Combined settings
color_aug = transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5)

transform = transforms.Compose([
        brightness_change,
        hue_change,
        contrast_change,
    ])

Reference: https://blog.csdn.net/flyfish1986/article/details/108831332

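# Checking whether a GPU is available

import torch
ngpu = 1  # number of GPUs to use; assumed value, not defined in the original note
device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")
device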

device(type='cpu')

print(
    torch.version.cuda
)

None

print(
    torch.__version__
)

1.10.0

.index_select

import torch
import random
import numpy as np

x = torch.arange(16).reshape(4, 4)
print('ori_x: \n', x)
index_l = np.random.choice(4, 2).tolist()
print('index: ', index_l)
print('type(index_l): ', type(index_l))
print(
    x[index_l] == x.index_select(0, torch.tensor(index_l)),
    x[index_l] == x[torch.tensor(index_l)],
    sep='\n'
)

ori_x: 
 tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15]])
index:  [2, 0]
type(index_l):  <class 'list'>
tensor([[True, True, True, True],
        [True, True, True, True]])
tensor([[True, True, True, True],
        [True, True, True, True]])

tensor([[0, 1, 2, 3],
        [0, 1, 2, 3]])

x.index_select(1, torch.tensor(index_l))

tensor([[ 0,  0],
        [ 4,  4],
        [ 8,  8],
        [12, 12]])

@: matrix multiplication operator

X = torch.arange(20).reshape(4, 5).float()
Y = torch.randn(5, 2)
Z = X@Y
Z, Z.shape

(tensor([[ -5.4068,   4.8318],
         [ -8.5776,  10.9364],
         [-11.7484,  17.0410],
         [-14.9192,  23.1456]]),
 torch.Size([4, 2]))

torch.where

x = torch.randn(3, 2)
y = torch.ones(3, 2)
print(x)
torch.where(x>0, x, y)

tensor([[ 0.1448,  0.7330],
        [-0.6909, -0.0393],
        [ 0.8537, -0.2563]])





tensor([[0.1448, 0.7330],
        [1.0000, 1.0000],
        [0.8537, 1.0000]])

Basic pandas operations

pandas can fill in or drop missing data;
pandas data can be converted to tensors.

import os
import pandas as pd

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row is one data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

data = pd.read_csv(data_file)
print(f'data:\n {data}')
 

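inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
# numeric_only=True averages only the numeric columns, which avoids the pandas FutureWarning
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(f'inputs1:\n {inputs}')


inputs = pd.get_dummies(inputs, dummy_na=True)
print(f'inputs2:\n {inputs}')


import torch

# to_numpy(dtype=float) also handles the boolean dummy columns produced by newer pandas
X, y = torch.tensor(inputs.to_numpy(dtype=float)), torch.tensor(outputs.values)
X, y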

data:
    NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
inputs1:
    NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
inputs2:
    NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1


(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))

# 1. Drop the column with the most missing values.
count = 0
count_max = 0
labels = ['NumRooms','Alley','Price']
flag = ''
for label in labels:
    count = data[label].isna().sum()
    if count > count_max:
        count_max = count
        flag = label
if flag:
    data_new = data.drop(flag, axis=1)
    print(data_new)

Loss functions:

l2 loss

from torch import nn
loss = nn.MSELoss()

l1 loss: |y - y'|
Huber's robust loss:

![[1.png]]

Theory

Subderivatives

At non-differentiable points the value is assigned by hand
![[ds2.png]]

Gradients

![[ds3.jpg]]
![[ds4.png]]
![[ds5.png]]
![[ds6.jpg]]

ill-condition VS well-condition

In the figure below, the left is ill-conditioned and the right is well-conditioned
![[ds36.png]]

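Linear neural networks

Linear regression

Assumptions:

x and y are linearly related (y is a weighted sum of x)
Noise must be taken into account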

![[ds15.png]]
![[ds16.png]]
![[ds17.png]]
![[ds18.png]]

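Properties:

It applies an affine transformation to the raw features: the weighted sum is a linear transformation and the bias term is a translation.
There is only one extremum in this domain.

Squared error:

![[ds19.png]]
Mean loss over the whole dataset:
![[ds13.png]]
Finding the parameters:
![[ds20.png]]

Analytical solution of linear regression

Minibatch gradient descent

![[ds21.png]]
That is,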

![[ds22.png]]

Normal distribution

Formula:
![[ds23.png]]
Definition of the normal distribution:

%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l


def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x-mu)**2)


x = np.arange(-7, 7, 0.01)
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
        ylabel='p(x)', figsize=(4.5, 2.5),
        legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])

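Squared loss derived from the normal distribution

Assume the noise follows a normal distribution

![[ds24.png]]
That is,
![[ds25.png]]
![[ds26.png]]
By maximum likelihood estimation,
![[ds27.png]]
It follows that, when tuning w and b to find the optimum, minimizing the mean squared error is equivalent to maximum likelihood estimation of the linear model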

Implementation from scratch

import random
import torch
from d2l import torch as d2l


# Generate synthetic data
def synthetic_data(w, b, num_examples):
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = X @ w + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape(-1, 1)


true_w = torch.tensor([2, -3.4])
# true_w = torch.tensor([0.0, 0])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
print(features.shape, labels.shape)
print('features: ', features[0], '\nlabel: ', labels[0])

d2l.set_figsize((2.5, 1.5))
d2l.plt.scatter(features[:, 0].detach().numpy(), labels.detach().numpy(), 1)


def data_iter(batch_size, features, labels):
    num_samples = len(features)
    indices = list(range(num_samples))
    random.shuffle(indices)
    for i in range(0, num_samples, batch_size):
        indices_sub = torch.tensor(indices[i: min(i+batch_size, num_samples)])
        yield features[indices_sub], labels[indices_sub]
        

for X, y in data_iter(2, features, labels):
    pass
print(X, y, sep='\n')

w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)


def linreg(X, w, b):
    return X@w + b


def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape))**2/2
#     return (y_hat - y)**2/2


def sgd(params, lr, batch_size): 
    for param in params:
#         param -= lr * param.grad / batch_size
#         param.grad.zero_()
        # Updating a parameter from its gradient must happen outside autograd; otherwise PyTorch raises:
        # "a leaf Variable that requires grad is being used in an in-place operation."
        with torch.no_grad():
            param -= lr * param.grad / batch_size
            param.grad.zero_()
            
            
lr = 0.03
num_epochs = 3
batch_size = 10
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
#         print('test1: ', linreg(X, w, b).shape, y.shape)
        l = sum(squared_loss(linreg(X, w, b), y))
        l.backward()
        sgd([w, b], lr, batch_size)
    with torch.no_grad():
#         print('test1: ', linreg(features, w, b).shape, labels.shape)
        l_train = sum(squared_loss(linreg(features, w, b), labels)) / len(labels)
        print('epoch {}:\tloss {}'.format(epoch+1, l_train))
        
print('error in estimated w:', w - true_w.reshape(w.shape))
print('error in estimated b:', b - true_b)

d2l.set_figsize??

d2l.use_svg_display??

Concise implementation

import numpy as np
import torch
from torch import nn
from torch.utils import data
from d2l import torch as d2l

true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)
print('features.shape, labels.shape: ', features.shape, labels.shape)

batch_size = 10
dataset = data.TensorDataset(features, labels)
data_iter = data.DataLoader(dataset, batch_size, shuffle=True)
print(next(iter(data_iter)))

net = nn.Sequential(nn.Linear(2, 1))
print(net)
net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)

loss = nn.MSELoss()
lr=0.03
trainer = torch.optim.SGD(net.parameters(), lr=lr)

num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()
        trainer.step()
    train_l = loss(net(features), labels)
    print(f'epoch {epoch + 1} loss: {train_l}')
    
w = net[0].weight.data
b = net[0].bias.data
print(f'error in estimated w: {true_w - w}')
print(f'error in estimated b: {true_b - b}')

features.shape, labels.shape:  torch.Size([1000, 2]) torch.Size([1000, 1])
[tensor([[-0.3293, -2.7195],
        [ 0.1543, -1.2371],
        [-0.6956, -2.1153],
        [-1.4573, -0.7525],
        [ 1.0091, -1.0744],
        [ 2.0950, -0.1321],
        [ 0.1768,  0.0210],
        [ 0.3708,  0.6088],
        [-1.0228, -0.4681],
        [ 1.6810,  0.8739]]), tensor([[12.8028],
        [ 8.7138],
        [ 9.9941],
        [ 3.8357],
        [ 9.8562],
        [ 8.8426],
        [ 4.4883],
        [ 2.8696],
        [ 3.7497],
        [ 4.5712]])]
Sequential(
  (0): Linear(in_features=2, out_features=1, bias=True)
)
epoch 1 loss: 0.00016310506907757372
epoch 2 loss: 0.00010406416549813002
epoch 3 loss: 0.00010402813495602459
error in estimated w: tensor([[ 0.0008, -0.0005]])
error in estimated b: tensor([0.0006])

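Softmax regression

Labels

If the classes have a natural order, e.g. {infant, child, adolescent, young adult, middle-aged, elderly}, they can be encoded as {0, 1, 2, 3, 4, 5} and the task treated as a regression problem.
If the classes have no such relationship to each other, encode the labels as one-hot vectors instead.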
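
As a small illustration of one-hot labels, the sketch below uses `torch.nn.functional.one_hot`; the label values and the class count of 3 are made up for the example.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1, 2])         # integer class indices (illustrative)
one_hot = F.one_hot(labels, num_classes=3)  # shape (4, 3), one row per example
print(one_hot)
```

Each row contains a single 1 at the position of its class, so no ordering between classes is implied.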

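The raw outputs cannot be used directly as probabilities:

Their range is $\left ( -\infty, \infty \right ) $, so they do not express a level of confidence.
They do not satisfy the properties of a probability: they can be negative, and they do not sum to 1.

TODO: review entropy and cross-entropy in PRML.

Image classification dataset

The data iterator is a key component for performance.
Reducing batch_size lowers data-loading throughput.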

%matplotlib inline
import torch
import torchvision
from torch.utils import data
from torchvision import transforms
from d2l import torch as d2l

d2l.use_svg_display()

trans = transforms.ToTensor()
mnist_train = torchvision.datasets.FashionMNIST(
    root='../data', train=True, transform=trans, download=True)
mnist_test = torchvision.datasets.FashionMNIST(
    root='../data', train=False, transform=trans, download=True)

print('len(mnist_train), len(mnist_test): ', len(mnist_train), len(mnist_test))
print('mnist_train[0][0].shape: ', mnist_train[0][0].shape)
print('mnist_train[0][1]: ', mnist_train[0][1])


def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]


def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
    figsize = (num_cols*scale, num_rows*scale)
    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        if torch.is_tensor(img):
            ax.imshow(img.numpy())
        else:
            ax.imshow(img)
        ax.axes.get_xaxis().set_visible(False)        
        ax.axes.get_yaxis().set_visible(False)
        if titles:
            ax.set_title(titles[i])
    return axes


X, y = next(iter(data.DataLoader(mnist_train, batch_size=18)))
print('X.shape: ', X.shape)
show_images(X.reshape(18, 28, 28), 2, 9, titles=get_fashion_mnist_labels(y));

batch_size = 1


def get_dataloader_workers():
    return 4


train_iter = data.DataLoader(mnist_train, batch_size, shuffle=True, num_workers=get_dataloader_workers())
timer = d2l.Timer()
for X, y in train_iter:
    continue
print(f'{timer.stop():.2f} sec')


# Putting it all together
def load_data_fashion_mnist(batch_size, resize=None):
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(
        root='../data', train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root='../data', train=False, transform=trans, download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True, num_workers=get_dataloader_workers()),
           data.DataLoader(mnist_test, batch_size, shuffle=False, num_workers=get_dataloader_workers()))


train_iter, test_iter = load_data_fashion_mnist(32, 64)
for X, y in train_iter:
    print(
    X.shape, X.dtype, y.shape, y.dtype
    )
    break

Softmax regression: implementation from scratch

import torch
from IPython import display
from d2l import torch as d2l

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

num_features = 784
num_outputs = 10
W = torch.normal(0, 0.01, size=(num_features, num_outputs), requires_grad=True)
b = torch.zeros(num_outputs, requires_grad=True)


def softmax(X):
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdim=True)
    return X_exp / partition
    # Note: this implementation is naive; very large or very small entries of X make exp overflow or underflow.


X = torch.normal(0, 1, (2, 5))
X = softmax(X)
print(X, X.sum(1), sep='\n')


def net(X):
    return softmax(X.reshape(-1, W.shape[0])@W + b)


def cross_entropy(y_hat, y):
    return -torch.log(y_hat[range(len(y_hat)), y])
# Note: if y_hat is exactly 0 at the label index, log returns -inf and the loss becomes inf (no exception is raised)


def accuracy(y_hat, y):
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())


class Accumulator(object):
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + b for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, item):
        return self.data[item]


def evaluate_accuracy(net, data_iter):
    if isinstance(net, torch.nn.Module):
        net.eval()
    metric = Accumulator(2)  # number of correct predictions, total number of predictions
    for X, y in data_iter:
        metric.add(accuracy(net(X), y), y.numel())
    return metric[0]/metric[1]


def train_epoch_ch3(net, train_iter, loss, updater):
    if isinstance(net, torch.nn.Module):
        net.train()
    metric = Accumulator(3)  # sum of training loss, number of correct predictions, number of examples
    for X, y in train_iter:
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            updater.step()
            metric.add(float(l)*len(y), accuracy(y_hat, y), len(y))
        else:
            l.sum().backward()
            updater(len(y))
            metric.add(float(l.sum()), accuracy(y_hat, y), len(y))
    # Return the mean training loss and the training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]


class Animator:
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                 figsize=(3.5, 2.5)):
        if legend is None:
            legend = []
        d2l.use_svg_display()
        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        self.config_axes = lambda: d2l.set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend
        )
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        if not hasattr(y, '__len__'):
            y = [y]
        n = len(y)
        if not hasattr(x, '__len__'):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)


def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
        test_acc = evaluate_accuracy(net, test_iter)
#         print('loss: ', train_metrics[0])
        animator.add(epoch+1, train_metrics + (test_acc,))
    train_loss, train_acc = train_metrics
    assert train_loss < 0.5, train_loss
    assert train_acc <= 1 and train_acc > 0.7, train_acc
    assert test_acc <= 1 and test_acc > 0.7, test_acc


lr = 0.1

def updater(batch_size):
    return d2l.sgd([W, b], lr, batch_size)

tensor([[0.1926, 0.1602, 0.0841, 0.0954, 0.4677],
        [0.1114, 0.2495, 0.4155, 0.0890, 0.1346]])
tensor([1.0000, 1.0000])

d2l.sgd??

num_epochs = 10
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)

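# Overflow: exp(100) exceeds the float32 range and evaluates to inf
print(torch.exp(torch.ones(1)*100))
# Underflow: exp(-1000) is too small for float32 and evaluates to 0
print(torch.exp(torch.ones(1)*(-1000)))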
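
One common way around this overflow is to subtract the row maximum before exponentiating, which leaves the softmax result unchanged mathematically. The sketch below is a hedged illustration; `softmax_naive` and `softmax_stable` are names made up for the example, and the logits are chosen to trigger the overflow.

```python
import torch


def softmax_naive(X):
    # Same formula as the from-scratch softmax: exp, then normalize per row
    X_exp = torch.exp(X)
    return X_exp / X_exp.sum(1, keepdim=True)


def softmax_stable(X):
    # Subtract the per-row maximum first, so the largest exponent is exp(0) = 1
    X_shifted = X - X.max(dim=1, keepdim=True).values
    X_exp = torch.exp(X_shifted)
    return X_exp / X_exp.sum(1, keepdim=True)


logits = torch.tensor([[100.0, 0.0, -100.0],
                       [1.0, 2.0, 3.0]])
naive = softmax_naive(logits)    # first row contains nan (inf / inf)
stable = softmax_stable(logits)  # finite everywhere
print(naive, stable, sep='\n')
```

Built-in losses such as `nn.CrossEntropyLoss` take raw logits for the same reason: they combine the softmax and the logarithm in a numerically safer way.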

Softmax regression: concise implementation

import torch
from torch import nn
from d2l import torch as d2l

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))


def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)


net.apply(init_weights)
print(net)
print(net[1].weight, net[1].bias)

loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr=0.1)
num_epochs = 100
train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

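Multilayer perceptron (MLP)

A fully connected network with hidden layers

Overview

Role of hidden layers

Some data can be mapped linearly, either directly or after preprocessing.
Many mappings depend on interactions between features, which is what hidden layers are for.

Role of activation functions

However, if fully connected layers are simply stacked, the multilayer perceptron collapses into a linear model, for example: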

![[ds9.png]]

Therefore, a nonlinear activation function has to be inserted between the layers.
Many real mappings are nonlinear.
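
The collapse can be verified directly. In this sketch (layer sizes are arbitrary), two stacked `nn.Linear` layers without an activation produce exactly the same output as one affine map with weight `W2 @ W1` and bias `W2 @ b1 + b2`.

```python
import torch
from torch import nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8)
lin2 = nn.Linear(8, 3)
X = torch.randn(5, 4)

with torch.no_grad():
    stacked = lin2(lin1(X))                     # two layers, no activation in between
    W = lin2.weight @ lin1.weight               # combined weight, shape (3, 4)
    b = lin2.weight @ lin1.bias + lin2.bias     # combined bias, shape (3,)
    collapsed = X @ W.T + b                     # a single affine map

print(torch.allclose(stacked, collapsed, atol=1e-5))
```

Inserting a nonlinearity such as `nn.ReLU()` between the two layers breaks this identity, which is what gives the hidden layer its extra expressive power.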

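Properties of multilayer perceptrons:

They can capture complex relationships between features.
Even with a single hidden layer, enough neurons can approximate arbitrary functions.
In practice, deeper networks tend to learn more easily.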

Activation functions

relu

![[ds28.png]]
ReLU is not differentiable at input 0; by convention its derivative there is set to 0.

import torch
from d2l import torch as d2l

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))

(MakeHandsDirty_files/MakeHandsDirty_281_0.svg)

y.backward(torch.ones_like(x), retain_graph=True)  # retain_graph=True allows calling backward again to recompute gradients
d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

(MakeHandsDirty_files/MakeHandsDirty_282_0.svg)

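Why it works well:

In both the forward and the backward pass, ReLU either zeroes a value or lets it pass through unchanged.
This mitigates vanishing gradients.

Variant:

![[ds10.png]]
Advantage:
Even when the input is negative, some information can still pass through.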
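
A brief comparison, assuming the variant pictured above is a leaky or parameterized ReLU of the form max(0, x) + α·min(0, x); the slope 0.1 is an arbitrary choice for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])
plain = torch.relu(x)                        # negative inputs are zeroed
leaky = F.leaky_relu(x, negative_slope=0.1)  # negative inputs are scaled by 0.1
print(plain, leaky, sep='\n')
```

`nn.PReLU` follows the same formula but learns the slope as a parameter instead of fixing it.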

Example:

x = torch.randn(3,5)
print(x)
torch.relu(x)

sigmoid

[[PapersRead#概念#sigmoid|sigmoid]]

tanh

![[ds12.png]]
Properties:

Range: (-1, 1).
Approximately linear near 0.
Symmetric about the origin.

y = torch.tanh(x)
d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))

x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))

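Implementation from scratch

Design notes:

The hidden-layer width is usually a power of 2 (this is more efficient because of how memory is allocated and addressed in hardware).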

import torch
from torch import nn
from d2l import torch as d2l


def relu(X):
    A = torch.zeros_like(X)
    return torch.max(X, A)


def net(X):
    X = X.reshape(-1, num_inputs)
    H = relu(X@W1 + b1)
    return (H@W2 + b2)


batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens, requires_grad=True)*0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs, requires_grad=True)*0.01)
b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))
params = [W1, b1, W2, b2]

loss = nn.CrossEntropyLoss()

num_epochs, lr = 10, 0.1
updater = torch.optim.SGD(params, lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
d2l.predict_ch3(net, test_iter)

num_hiddens = 1024:
![[ds30.png]]
num_hiddens1 = 1024, num_hiddens2 = 256:
![[ds31.png]]

Concise implementation

import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(nn.Flatten(),
                   nn.Linear(784, 256),
                   nn.ReLU(),
                   nn.Linear(256, 10))


def init_weight(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)


net.apply(init_weight)
batch_size, lr, num_epochs = 256, 0.1, 10
loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr)

train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

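Model selection, underfitting and overfitting

Goal: fit the polynomial:
![[ds32.png]]
Each term is divided by n! to keep the gradients and loss values from becoming too large, which could otherwise cause exploding gradients.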

import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

max_degree = 20
n_train, n_test = 100, 100
true_w = np.zeros(max_degree)
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i+1)
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)  # independent noise per example, not one shared scalar

true_w, features, poly_features, labels = [torch.tensor(x, dtype=torch.float32) for x in [true_w, features, poly_features, labels]]


def evaluate_loss(net, data_iter, loss):
    metric = d2l.Accumulator(2)
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]


def train(train_features, test_features, train_labels, test_labels, num_epochs=400):
    loss = nn.MSELoss()
    input_shape = train_features.shape[-1]
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1, 1)), batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1, 1)), batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log', 
                            xlim=[1, num_epochs], ylim=[1e-8, 1e2], legend=['train', 'test'])
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch+1, (evaluate_loss(net, train_iter, loss), evaluate_loss(net, test_iter, loss)))
    print('weight: ', net[0].weight.data.numpy())
    print('true_weight: ', true_w[0:4])
    

train(poly_features[:n_train, :4], poly_features[n_train:, :4], labels[:n_train], labels[n_train:])

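# Only the first two features (a linear model): expected to underfit
train(poly_features[:n_train, :2], poly_features[n_train:, :2], labels[:n_train], labels[n_train:])

# All 20 polynomial features: expected to overfit
train(poly_features[:n_train, :], poly_features[n_train:, :], labels[:n_train], labels[n_train:])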

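Weight decay

Where regularization comes from:
  Intuitively, the function f = 0 is the simplest function,
  so we regard functions closer to 0 as less complex.
  How, then, do we measure how close a function is to 0?
  One way is the norm of its weight vector,
  so we add the weight norm to the loss:
![[ds33.png]]
Why square the L2 norm instead of using it directly?
Because the squared L2 norm is simply the sum of the squared elements, which is easy to differentiate.
Both the L1 and the L2 norm are effective and widely used.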
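
To connect the penalty with the optimizer option, the sketch below assumes the penalized loss has the form loss + (λ/2)·‖w‖² (the image is not reproduced here, so this form is an assumption). Under that assumption, one SGD step on the penalized loss matches one step of `torch.optim.SGD` with `weight_decay=λ` on the plain loss. All tensors and the values of `lr` and `lambd` are illustrative.

```python
import torch

torch.manual_seed(0)
lr, lambd = 0.1, 3.0
w_init = torch.randn(5)
x, y = torch.randn(5), torch.tensor(1.0)

# (a) Explicit penalty added to the loss, followed by a manual SGD step
w_a = w_init.clone().requires_grad_(True)
loss_a = (w_a @ x - y) ** 2 / 2 + lambd / 2 * w_a.pow(2).sum()
loss_a.backward()
with torch.no_grad():
    w_a -= lr * w_a.grad

# (b) Plain loss, with the decay handled by the optimizer
w_b = w_init.clone().requires_grad_(True)
optimizer = torch.optim.SGD([w_b], lr=lr, weight_decay=lambd)
loss_b = (w_b @ x - y) ** 2 / 2
loss_b.backward()
optimizer.step()

print(torch.allclose(w_a, w_b, atol=1e-6))
```

The gradient of (λ/2)·‖w‖² is λ·w, which is exactly the term the optimizer adds to the gradient when `weight_decay` is set.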

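L0 vs L1

L0 counts the non-zero elements. It can also be used to sparsify parameters: the more non-zero elements there are, the more the network is penalized.
L1 is the sum of the absolute values of the elements. It also induces sparsity: its gradient magnitude stays constant as a weight approaches 0, so some parameters can be driven all the way to 0.
L1 is generally used instead of L0, because L1 is the tightest convex approximation of L0 and is much easier to optimize.
Benefits of sparse features:

Feature selection: removes the interference of useless features.
Interpretability: makes it possible to see which features matter.

L1 vs L2

L2 norm (ridge regression, weight decay): shrinks the parameters but does not make them exactly 0.
Why can L1 sparsify the parameters while L2 cannot?
1) Near 0, L1 descends faster than L2: the closer a weight gets to 0, the smaller the L2 derivative and the slower the descent.
![[ds34.png]]
2) Constraint on the model space
As the figure below shows, the optimum under L1 tends to fall on a corner of the diamond, which sets some weights to exactly 0 (e.g. w1), whereas the optimum under L2 generally does not.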
![[ds35.png]]
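
The first argument can be seen from the gradients of the two penalties. In this small sketch (the weight values are arbitrary), the L1 gradient keeps magnitude 1 however small the weight is, while the L2 gradient shrinks together with the weight.

```python
import torch

w = torch.tensor([1.0, 0.1, 0.001], requires_grad=True)

# L1 penalty: sum of absolute values; gradient is sign(w)
grad_l1, = torch.autograd.grad(w.abs().sum(), w)

# Halved squared L2 penalty: gradient is w itself
grad_l2, = torch.autograd.grad(w.pow(2).sum() / 2, w)

print('L1 gradient:', grad_l1)
print('L2 gradient:', grad_l2)
```

With a fixed learning rate, the L1 term therefore keeps pushing a small weight toward 0 at constant speed, whereas the L2 push fades out before the weight reaches 0.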

Implementation from scratch:

import torch
from torch import nn
from d2l import torch as d2l

n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
true_w, true_b = torch.ones((num_inputs, 1)) * 0.01, 0.05
train_data = d2l.synthetic_data(true_w, true_b, n_train)
train_iter = d2l.load_array(train_data, batch_size)
test_data = d2l.synthetic_data(true_w, true_b, n_test)
test_iter = d2l.load_array(test_data, batch_size, is_train=False)


def init_params():
    w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    return [w, b]


def l2_penalty(w):
    return torch.sum(w.pow(2)) / 2


def l1_penalty(w):
    return torch.abs(w).sum()  # L1 norm: sum (not mean) of absolute values


def train(lambd):
    w, b = init_params()
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    num_epochs, lr = 100, 0.003
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                           xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with torch.enable_grad():
                l = loss(net(X), y) + lambd * l2_penalty(w)
            l.sum().backward()
            d2l.sgd([w, b], lr, batch_size)
        if (epoch+1)%5 == 0:
            animator.add(epoch+1, (d2l.evaluate_loss(net, train_iter, loss),
                                  d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', torch.norm(w).item())
    
    
train(lambd=0)  # no regularization: overfits

train(lambd=3)  # the penalty term reduces overfitting

Concise implementation

def train_concise(wd):
    net = nn.Sequential(nn.Linear(num_inputs, 1))
    for param in net.parameters():
        param.data.normal_()
    loss = nn.MSELoss()
    num_epochs, lr = 100, 0.003
    trainer = torch.optim.SGD([
        {'params': net[0].weight, 'weight_decay': wd},
        {'params': net[0].bias}], lr=lr)
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                           xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with torch.enable_grad():
                trainer.zero_grad()
                l = loss(net(X), y)
            l.backward()
            trainer.step()
        if (epoch + 1) % 5 == 0:
            animator.add(epoch+1, (d2l.evaluate_loss(net, train_iter, loss),
                                  d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', net[0].weight.norm().item())
    
    
train_concise(0)

train_concise(3)

train_concise(6)

train_concise(9)

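Dropout

In 2017 it was shown that a network can perfectly fit a dataset whose labels are assigned at random. Deep learning therefore has enormous representational power, and overfitting is entirely normal. The generalization behaviour of deep networks is still puzzling, and its mathematical foundations remain unresolved.
Injecting noise mitigates overfitting (1. add noise to the input; 2. add noise to the input of every layer).
How to inject the noise: in an unbiased way, so that the expected value of each perturbed activation equals its original value.
In standard dropout:

![[ds57.png]]

Common practice: use a lower dropout probability for layers closer to the input.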
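
The unbiasedness can be checked empirically. The sketch below assumes the standard rule, in which an activation h is set to 0 with probability p and scaled to h / (1 - p) otherwise; the values of `p` and `h` are illustrative.

```python
import torch

torch.manual_seed(0)
p = 0.5
h = torch.full((100_000,), 2.0)            # many copies of the same activation
mask = (torch.rand(h.shape) > p).float()   # 1 = keep, 0 = drop
h_drop = mask * h / (1 - p)                # survivors are rescaled by 1 / (1 - p)

print('kept fraction:', mask.mean().item())
print('mean after dropout:', h_drop.mean().item())
```

Each element is either 0 or 4, yet the average stays close to the original value 2, which is why no rescaling is needed at inference time.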

Implementation from scratch

import torch
from torch import nn
from d2l import torch as d2l


def dropout_layer(X, dropout_p):
    assert 0 <= dropout_p <= 1
    if dropout_p == 1:
        return torch.zeros_like(X)
    elif dropout_p == 0:
        return X
    else:
        mask = (torch.rand(X.shape) > dropout_p).float()  # keep each element with probability 1 - dropout_p
        return mask * X / (1.0 - dropout_p)


def test_dropout():
    X = torch.arange(16).reshape(4, 4)
    X1 = dropout_layer(X, 0)
    X2 = dropout_layer(X, 0.5)
    X3 = dropout_layer(X, 1)
    print(X1, X2, X3, sep='\n')


test_dropout()
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
dropout_p1, dropout_p2 = 0.2, 0.5
# dropout_p1, dropout_p2 = dropout_p2, dropout_p1


class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2, is_train=True):
        super(Net, self).__init__()
        self.num_inputs = num_inputs
        self.is_train = is_train
        self.linear1 = nn.Linear(num_inputs, num_hiddens1)
        self.linear2 = nn.Linear(num_hiddens1, num_hiddens2)
        self.linear3 = nn.Linear(num_hiddens2, num_outputs)
        self.relu = nn.ReLU()

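    def forward(self, X):
        H1 = self.relu(self.linear1(X.reshape(-1, self.num_inputs)))
        # Apply dropout only in training mode (net.train()), never under net.eval()
        if self.is_train and self.training:
            H1 = dropout_layer(H1, dropout_p1)
        H2 = self.relu(self.linear2(H1))
        if self.is_train and self.training:
            H2 = dropout_layer(H2, dropout_p2)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself
        return self.linear3(H2)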


net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)
num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

Concise implementation

net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(num_inputs, num_hiddens1),
    nn.ReLU(),
    nn.Dropout(dropout_p1),
    nn.Linear(num_hiddens1, num_hiddens2),
    nn.ReLU(),
    nn.Dropout(dropout_p2),
    nn.Linear(num_hiddens2, num_outputs)
)


def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)


net.apply(init_weights)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
num_epochs = 30
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

Deep learning computation

Layers and blocks

import torch
from torch import nn
from torch.nn import functional as F

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)
net(X)

Custom blocks

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.out = nn.Linear(256, 10)
        
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))
    

net = MLP()
X = torch.rand(2, 20)
net(X)

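Sequential block: re-implementing Sequential

The blocks are stored in self._modules because PyTorch knows to look there for sub-blocks whose parameters need to be initialized.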

class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, block in enumerate(args):
            # _modules is an ordered dict keyed by string names
            self._modules[str(idx)] = block
    
    def forward(self, X):
        for block in self._modules.values():
            X = block(X)
        return X
    

net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)
net(X)

Executing Python control flow inside a block

class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.rand_weight = torch.rand((20, 20), requires_grad=False)
        self.linear = nn.Linear(20, 20)
        
    def forward(self, X):
        X = self.linear(X)
        X = F.relu(torch.mm(X, self.rand_weight) + 1)
        X = self.linear(X)
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()
    
    
net = FixedHiddenMLP()
net(X)

Mixing and nesting sub-modules

class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                                nn.Linear(64, 32), nn.ReLU())
        self.linear = nn.Linear(32, 16)
        
    def forward(self, X):
        return self.linear(self.net(X))
    
    
fixnet = nn.Sequential(NestMLP(), nn.Linear(16, 20), FixedHiddenMLP())
X = torch.rand(2, 20)
fixnet(X)

Parameter management

Reading parameters

import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(size=(2, 4))
net(X)

print(net[2].state_dict())
print(net)

print(type(net[2].bias))
print(net[2].bias)
print(net[2].bias.data)

net[2].weight.grad is None

print([(name, param.shape) for name, param in net[0].named_parameters()])
print([(name, param.shape) for name, param in net.named_parameters()])

net.state_dict()

net.state_dict()['2.bias'].data

def block1():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), 
                         nn.Linear(8, 4), nn.ReLU())


def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add_module(f'block {i}', block1())
    return net


rgnet = nn.Sequential(block2(), nn.Linear(4, 1))
X = torch.rand(size=(2, 4))
rgnet(X)

print(rgnet)

rgnet.state_dict()

rgnet[0][1][0].bias.data

Initializing parameters

Built-in initializers

def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, mean=0, std=0.01)
        nn.init.zeros_(m.bias)

        
net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

def init_constant(m):
    if type(m) == nn.Linear:
        nn.init.constant_(m.weight, 1)
        nn.init.zeros_(m.bias)
    
    
net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

def xavier(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)
        
        
def init_42(m):
    if type(m) == nn.Linear:
        nn.init.constant_(m.weight, 42)
        

net[0].apply(xavier)
net[2].apply(init_42)
print(
    net[0].weight.data[0],
    net[2].weight.data,
    sep='\n'
)

Custom initialization

Implements:
![[ds58.png]]

def my_init(m):
    if type(m) == nn.Linear:
        print('Init', *[(name, param.shape) for name, param in m.named_parameters()][0])
        nn.init.uniform_(m.weight, -10, 10)
        m.weight.data *= m.weight.data.abs() >= 5  # keep weights with |w| >= 5, zero the rest
            
        
net.apply(my_init)
net[0].weight[:2]

Parameters can also be set directly

print(net[0].weight.shape)
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]

Shared parameters

shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))
X = torch.rand(2, 4)
net(X)
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
print(net[4].weight.data[0, 0])
print(net[2].weight.data[0] == net[4].weight.data[0])
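
A short check of what sharing means in practice, as a sketch that rebuilds the same network: both positions hold the very same `Parameter` object, it is counted once by `net.parameters()`, and the gradients from both uses accumulate in a single `.grad` tensor.

```python
import torch
from torch import nn

shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))

same_object = net[2].weight is net[4].weight       # identity, not just equal values
num_param_tensors = len(list(net.parameters()))    # shared tensors are listed once

net(torch.rand(2, 4)).sum().backward()
same_grad = net[2].weight.grad is net[4].weight.grad

print(same_object, num_param_tensors, same_grad)
```

The count is 6 rather than 8: weight and bias for the first layer, for the shared layer, and for the last layer.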

Custom layers

Layers without parameters

import torch
import torch.nn.functional as F
from torch import nn


class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, X):
        return X - X.mean()
    
    
layer = CenteredLayer()
layer(torch.arange(1, 6).float())

net = nn.Sequential(nn.Linear(8, 128), CenteredLayer())
Y = net(torch.rand(4, 8))
Y.mean()

Layers with parameters

class MyLinear(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_channels, out_channels))
        self.bias = nn.Parameter(torch.randn(out_channels))
        
    def forward(self, X):
        linear = torch.mm(X, self.weight) + self.bias  # use the Parameters directly (not .data) so gradients can flow
        return F.relu(linear)
    
    
linear = MyLinear(5, 3)
linear.weight

linear(torch.rand(2, 5))

Using the custom layer as a building block

net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
net(torch.rand(2, 64))

File I/O

Loading and saving tensors

A single tensor

import torch
from torch import nn
from torch.nn import functional as F

x = torch.arange(4)
torch.save(x, 'ds/x-file')
x2 = torch.load('ds/x-file')
x2

A list of tensors

y = torch.zeros(4)
torch.save([x, y], 'ds/x-files')
x2, y2 = torch.load('ds/x-files')
(x2, y2)

A dictionary of tensors

mydict = {'x': x, 'y': y}
torch.save(mydict, 'ds/mydict')
mydict2 = torch.load('ds/mydict')
mydict2

Loading and saving model parameters

Saving the model

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.output = nn.Linear(256, 10)
        
    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))
    
    
net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)
Y

torch.save(net.state_dict(), 'ds/mlp.params')

Loading the model

clone = MLP()
clone.load_state_dict(torch.load('ds/mlp.params'))
clone.eval()

Verify that the clone matches the original model

Y_clone = clone(X)
Y_clone == Y
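
The same round trip can be written so that it runs anywhere, as a hedged sketch: it writes to a temporary directory instead of the `ds/` folder and passes `map_location='cpu'` so that parameters saved on a GPU can still be loaded on a CPU-only machine. The `nn.Sequential` model here is a stand-in for the MLP class, not the class itself.

```python
import os
import tempfile

import torch
from torch import nn


def make_mlp():
    # Stand-in architecture with the same layer sizes as the MLP above
    return nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))


net = make_mlp()
X = torch.randn(2, 20)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mlp.params')
    torch.save(net.state_dict(), path)             # save parameters only
    state = torch.load(path, map_location='cpu')   # load them onto the CPU

clone = make_mlp()            # the architecture must be rebuilt in code
clone.load_state_dict(state)
clone.eval()

with torch.no_grad():
    same = torch.allclose(clone(X), net(X))
print(same)
```

Only the parameters are stored in the file, so the code that defines the architecture has to be available when loading.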

GPU

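torch.device('cpu') refers to the CPU and uses all of its cores.
torch.device('cuda') is equivalent to torch.device(f'cuda:{i}'), where i is the current CUDA device (0 by default). Note that torch.cuda.device is a context manager for selecting a device, not a device object.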

import torch
from torch import nn

torch.device('cpu'), torch.device('cuda')

torch.cuda.device_count()

# Try to use the i-th GPU; fall back to the CPU if it does not exist
def try_gpu(i=0):
  if torch.cuda.device_count() >= i + 1:
    return torch.device(f'cuda:{i}')
  else:
    return torch.device('cpu')


def try_all_gpu():
  devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
  return devices if devices else [torch.device('cpu')]  # always return a list


try_all_gpu(), try_gpu()

By default, tensors are created on the CPU

x = torch.tensor([1, 2, 3])
x.device

To operate on two tensors, both must be on the same device.

X = torch.ones(2, 3, device=try_gpu(0))
Y = torch.ones(2, 3, device=torch.device('cpu'))
print(X, Y, sep='\n')
# print(X + Y)  # raises an error when X and Y are on different devices
Z = Y.cuda(0)  # equivalent to Z = Y.cuda(); requires a GPU
X + Z

Neural networks and GPUs

net = nn.Sequential(nn.Linear(3, 1))
net = net.to(device=try_gpu())
X = torch.ones(2, 3, device=try_gpu())
net(X)

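net[0].weight.data.device

Moving data carelessly can significantly reduce performance. A typical mistake: computing the loss of every minibatch on the GPU and printing it each time. This triggers the global interpreter lock and stalls all GPUs.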
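
One way to avoid that pattern, sketched below under the assumption of a simple training-style loop with random data: keep the running loss as a tensor on the device and copy it to the host once at the end, instead of printing or calling `.item()` on every minibatch. The code falls back to the CPU when no GPU is present.

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = nn.Linear(3, 1).to(device)
loss_fn = nn.MSELoss(reduction='sum')

running_loss = torch.zeros((), device=device)  # accumulator stays on the device
num_examples = 0
for _ in range(5):
    X = torch.randn(4, 3, device=device)       # illustrative random minibatch
    y = torch.randn(4, 1, device=device)
    with torch.no_grad():
        running_loss += loss_fn(net(X), y)     # no host transfer here
    num_examples += y.numel()

mean_loss = running_loss.item() / num_examples  # single device-to-host copy
print(f'mean loss: {mean_loss:.4f}')
```

The single `.item()` call at the end is the only point where the host has to wait for the device.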

Convolutional neural networks

Image convolution

import torch
from torch import nn
# from d2l import torch as d2l


def corr2d(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))  # output width depends on the kernel width w
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i][j] = (X[i: i+h, j: j+w] * K).sum()
    return Y


X = torch.arange(9).reshape(3, 3)
K = torch.arange(4).reshape(2, 2)
corr2d(X, K)

Custom convolutional layer

class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))
    
    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

Learning an edge-detection kernel

Ground truth:

X = torch.ones(6, 8)
X[:, 2:6] = 0
print("X: ", X)
K = torch.tensor([[1.0, -1.0]])
Y = corr2d(X, K)
print('Y: ', Y)

Learning:

conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
X = X.reshape(1, 1, 6, 8)
Y = Y.reshape(1, 1, 6, 7)
lr = 3e-2

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    conv2d.weight.data -= lr * conv2d.weight.grad
    if i % 2 == 0:
        print(f'iter {i + 1}, loss: {l.sum():.3f}')
        
print(conv2d.weight.data)

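Padding and stride

Padding

Input: $n_h \times n_w$
Kernel: $k_h \times k_w$
Total padding in height and width (both sides combined): $p_h , p_w$
Output: $(n_h-k_h+p_h+1) \times (n_w-k_w+p_w+1)$

Why kernel sizes are usually odd:

The same amount of padding can be added on both sides.
If the output has the same size as the input, Y[i, j] is computed from the window centred on X[i, j].

Convolution after padding (showing that the input and output sizes match):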

import torch
from torch import nn


def comp_conv2d(conv2d, X):
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    Y = Y.reshape(Y.shape[2:])
    return Y


conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.randn(8, 8)
comp_conv2d(conv2d, X).shape

Stride

![[ds59.png]]

conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
X = torch.randn(8, 8)
comp_conv2d(conv2d, X).shape

conv2d = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
X = torch.randn(8, 8)
comp_conv2d(conv2d, X).shape

In practice we usually have $p_h = p_w$ and $s_h = s_w$.
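
The output sizes above can be predicted before running the layer. The sketch assumes the usual formula ⌊(n - k + p + s) / s⌋ per dimension, where p is the total padding; since `nn.Conv2d` takes the padding per side, p is twice the `padding` argument. The helper `conv_out_size` is a name made up for this example.

```python
import torch
from torch import nn


def conv_out_size(n, k, pad_each_side, s):
    """Output length along one dimension: floor((n - k + p + s) / s)."""
    p = 2 * pad_each_side  # total padding, both sides combined
    return (n - k + p + s) // s


conv = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
X = torch.randn(1, 1, 8, 8)

predicted = (conv_out_size(8, 3, 0, 3), conv_out_size(8, 5, 1, 4))
actual = tuple(conv(X).shape[2:])
print(predicted, actual)
```

With stride 1 the formula reduces to n - k + p + 1, the expression used in the padding section.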

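Multiple input and output channels

With $c_i$ input channels, $c_o$ output channels, and a kernel of height $k_h$ and width $k_w$, the convolution kernel has shape $c_o \times c_i\times k_h\times k_w$.

Multiple input channels, single output channel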

import torch
from d2l import torch as d2l


def corr2d_multi_in(X, K):
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))


X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
print(X.shape, K.shape)
corr2d_multi_in(X, K)

torch.Size([2, 3, 3]) torch.Size([2, 2, 2])





tensor([[ 56.,  72.],
        [104., 120.]])

Multiple input channels, multiple output channels

def corr2d_multi_in_out(X, K):
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)
    

K_ = torch.stack((K, K+1, K+2), 0)
print(K.shape, K_.shape)
res = corr2d_multi_in_out(X, K_)
res, res.shape

torch.Size([2, 2, 2]) torch.Size([3, 2, 2, 2])





(tensor([[[ 56.,  72.],
          [104., 120.]],

         [[ 76., 100.],
          [148., 172.]],

         [[ 96., 128.],
          [192., 224.]]]),
 torch.Size([3, 2, 2]))

$1\times 1$ convolutional layer

Purpose: adjust the number of channels.

def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape(c_i, h*w)
    K = K.reshape(c_o, c_i)
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))


X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
torch.abs(Y1-Y2) < 1e-6

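Pooling layer

Purpose:
1. Reduce the convolutional layer's sensitivity to position.
2. Optionally reduce the height and width.
By default, a pooling window of size k uses a stride of k; this can be set manually.
With multiple channels, pooling is applied to each channel separately, unlike a convolutional layer, which sums over the input channels.
A pooling layer therefore has the same number of output channels as input channels.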

import torch
from torch import nn
from d2l import torch as d2l


def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros(X.shape[0]-p_h+1, X.shape[1]-p_w+1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i:i+p_h, j:j+p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i:i+p_h, j:j+p_w].float().mean()
    return Y


X = torch.randint(0, 10, (3, 3))
print(X)
pool2d(X, (2, 2), 'max'), pool2d(X, (2, 2), 'avg')

Padding and stride

By default, a pooling window of size k uses a stride of k.

X = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
X

pool2d = nn.MaxPool2d(3)
pool2d(X), pool2d(X).shape

Setting the padding and stride manually:

pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

Multiple channels

X = torch.cat((X, X+1), 1)
X, X.shape

pool2d = nn.MaxPool2d(3, padding=1, stride=2)
res = pool2d(X)
res, res.shape
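
To see what the default stride and the manual padding/stride settings do to the 4x4 input, here is a minimal pure-Python sketch of max pooling on nested lists. `max_pool2d` is a hypothetical helper written for illustration, not the PyTorch function, and it pads with negative infinity so that padded cells never win the max.

```python
def max_pool2d(X, k, stride=None, padding=0):
    """Max pooling over a 2-D list of numbers with a k x k window."""
    stride = k if stride is None else stride  # default: stride equals window size
    h, w = len(X), len(X[0])
    neg_inf = float('-inf')
    # Pad with -inf so padded cells never win the max.
    padded = [[neg_inf] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = X[i][j]
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    return [[max(padded[i * stride + a][j * stride + b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


X = [[4 * i + j for j in range(4)] for i in range(4)]  # 0..15 in a 4x4 grid
default_stride = max_pool2d(X, 3)                  # stride defaults to 3
custom = max_pool2d(X, 3, stride=2, padding=1)     # padding 1, stride 2
print(default_stride)  # [[10]]
print(custom)          # [[5, 7], [13, 15]]
```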

LeNet

import torch
from torch import nn
from d2l import torch as d2l


net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16*5*5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10)
)


X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape: ', X.shape)


batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

def evaluate_accuracy_gpu(net, data_iter, device=None):
    """Compute the accuracy of a model on a dataset, on the model's device."""
    if isinstance(net, nn.Module):
        net.eval()  # switch to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # metric: number of correct predictions, number of predictions
    metric = d2l.Accumulator(2)
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]


def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_normal_(m.weight)
    net.apply(init_weights)
    print('training on ', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batchs = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l*X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] /metric[2]
            if (i + 1) % (num_batchs // 5) == 0 or i == num_batchs - 1:
                animator.add(epoch + (i + 1) / num_batchs, (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on {str(device)}')


lr, num_epochs = 0.9, 10
train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

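Modern convolutional neural networks

In the lowest layers of the network, the model learns feature extractors that resemble traditional filters.
![[ds60.png]]
AlexNet was the first to show that learned features can outperform hand-designed features.

AlexNet

Sigmoid: when the model is poorly initialized and the sigmoid output is close to 0 or 1, the gradient is almost 0, so backpropagation can barely update the parameters.
ReLU: the gradient is always 1 on the positive interval, which makes training more robust to the choice of initialization.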
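
The difference in gradients is easy to see numerically. The sketch below uses only the standard library; the helper names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); at most 0.25, reached at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Gradient of max(0, x): 1 on the positive side, 0 on the negative side
    return 1.0 if x > 0 else 0.0


for x in (0.5, 5.0, 10.0):
    print(f'x={x:5.1f}  sigmoid grad={sigmoid_grad(x):.6f}  relu grad={relu_grad(x):.1f}')
```

The sigmoid gradient shrinks quickly as the pre-activation grows, while the ReLU gradient stays at 1 for any positive input.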

import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(6400, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 10)
)

X = torch.randn(1, 1, 224, 224)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
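
The `6400` input size of the first linear layer follows from the layer settings. A minimal sketch that traces the spatial size for a 224x224 input, assuming the standard output-size rule with per-side padding; `out_size` is a hypothetical helper.

```python
def out_size(n, k, s=1, p=0):
    """Output length for input n, kernel k, stride s, per-side padding p."""
    return (n + 2 * p - k) // s + 1


n = 224
n = out_size(n, 11, s=4, p=1)   # 11x11 conv, stride 4, padding 1
n = out_size(n, 3, s=2)         # 3x3 max pool, stride 2
n = out_size(n, 5, p=2)         # 5x5 conv, padding 2
n = out_size(n, 3, s=2)         # 3x3 max pool, stride 2
for _ in range(3):              # three 3x3 convs with padding 1 keep the size
    n = out_size(n, 3, p=1)
n = out_size(n, 3, s=2)         # 3x3 max pool, stride 2
flat_features = 256 * n * n     # 256 channels after the last conv
print(n, flat_features)  # 5 6400
```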

batch_size = 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)

lr, num_epochs = 0.01, 10
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

VGG

![[ds61.png]]
The authors found experimentally that deep, narrow convolutions work better than shallow, wide ones.
Implementation of VGG-11:

import torch
from torch import nn
from d2l import torch as d2l


def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)


def vgg11(conv_arch):
    conv_blks = []
    in_channels = 1
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels

    return nn.Sequential(
        *conv_blks, nn.Flatten(),
        nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 10)
    )


conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
net = vgg11(conv_arch)
X = torch.randn(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.__class__.__name__, 'output shape:\t', X.shape)

# VGG-11 is computationally heavy, so we shrink the number of channels
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg11(small_conv_arch)

lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

NiN

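NiN removes the fully connected layers, which are prone to overfitting, and this greatly reduces the number of model parameters. In practice, though, it can increase training time.
The last block has as many output channels as there are classes, and a global average pooling layer produces the output.
It uses blocks built from several 1X1 convolutional layers. This amounts to applying a fully connected layer across channels at each pixel position, followed by a nonlinearity, which adds more nonlinearity to the model.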

![[ds62.png]]

import torch
from torch import nn
from d2l import torch as d2l


def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU()
    )


net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2),
    nn.Dropout(0.5),
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten()
)

X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)

lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

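GoogLeNet

It adopts the idea from NiN of chaining blocks in series.
Four parallel branches make up one Inception block.
The channel counts of the branches in Inception were found through extensive experiments.
It was for a time the most effective model.

Architecture of the Inception block:
![[ds63.png]]

GoogLeNet architecture: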
![[ds64.png]]

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


class Inception(nn.Module):
    def __init__(self, in_channels, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        return torch.cat((p1, p2, p3, p4), dim=1)


# Stage 1 of the network
b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b2 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 192, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b3 = nn.Sequential(
    Inception(192, 64, (96, 128), (16, 32), 32),
    Inception(256, 128, (128, 192), (32, 96), 64),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b4 = nn.Sequential(
    Inception(480, 192, (96, 208), (16, 48), 64),
    Inception(512, 160, (112, 224), (24, 64), 64),
    Inception(512, 128, (128, 256), (24, 64), 64),
    Inception(512, 112, (144, 288), (32, 64), 64),
    Inception(528, 256, (160, 320), (32, 128), 128),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b5 = nn.Sequential(
    Inception(832, 256, (160, 320), (32, 128), 128),
    Inception(832, 384, (192, 384), (48, 128), 128),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten()
)
net = nn.Sequential(b1, b2, b3, b4, b5, nn.Linear(1024, 10))

X = torch.rand(size=(1, 1, 96, 96))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)

lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
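
Because the four branches are concatenated along the channel dimension, the output channel count of each Inception block is the sum of the branch outputs, and it must equal the `in_channels` of the next block. The sketch below checks this for the configurations used in `b3`, `b4`, and `b5`; `inception_out_channels` is a hypothetical helper.

```python
def inception_out_channels(c1, c2, c3, c4):
    # Branch outputs are concatenated along the channel dimension.
    return c1 + c2[1] + c3[1] + c4


# (in_channels, c1, c2, c3, c4), copied from b3, b4, and b5
configs = [
    (192, 64, (96, 128), (16, 32), 32),
    (256, 128, (128, 192), (32, 96), 64),
    (480, 192, (96, 208), (16, 48), 64),
    (512, 160, (112, 224), (24, 64), 64),
    (512, 128, (128, 256), (24, 64), 64),
    (512, 112, (144, 288), (32, 64), 64),
    (528, 256, (160, 320), (32, 128), 128),
    (832, 256, (160, 320), (32, 128), 128),
    (832, 384, (192, 384), (48, 128), 128),
]
outs = [inception_out_channels(*cfg[1:]) for cfg in configs]
# Max pooling between stages keeps the channel count, so the chain must match.
chain_ok = all(out == nxt[0] for out, nxt in zip(outs, configs[1:]))
print(outs, chain_ok)
```

The last value, 1024, is the input size of the final `nn.Linear(1024, 10)`.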

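Batch normalization

Overview.

Batch normalization is applied to a layer's output before the activation function.
Motivation:
1. Even after normalizing the input, a deep model can still be numerically unstable near the output, because the outputs of intermediate layers vary widely.
2. Drift in the intermediate variables can hinder convergence. So the intermediate outputs are normalized repeatedly.
batch norm + residual blocks -> training networks with more than 100 layers
If the batch size is 1, every intermediate output becomes 0 after subtracting the batch mean, so nothing can be learned.
Some studies relate batch norm to Bayesian priors to explain why batch sizes of 50 to 100 suit batch norm best.
It can also act as a regularizer.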

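Fully connected layer

Compute the mean and variance of each feature:
![[ds65.png]]
Normalize:
![[ds66.png]]
Two learnable parameters, the scale γ and the shift β, let the model decide whether to normalize and by how much:
![[ds67.png]]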
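
The three steps (per-feature statistics, normalization, scale and shift) can be sketched in pure Python for a tiny batch. This is an illustrative sketch, not the d2l implementation; `batch_norm_1d` and the value of `eps` are assumptions made for the example.

```python
import math


def batch_norm_1d(batch, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the batch, then scale and shift."""
    m = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(m)]
    for j in range(n_features):
        col = [row[j] for row in batch]
        mean = sum(col) / m
        var = sum((v - mean) ** 2 for v in col) / m
        for i in range(m):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out


batch = [[1.0, 2.0], [3.0, 6.0]]
y = batch_norm_1d(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
# With a single sample, x - mean is 0 for every feature.
single = batch_norm_1d([[1.0, 2.0]], gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(y)
print(single)
```

Each column of `y` is close to -1 and 1, and `single` is all zeros, which illustrates why a batch size of 1 carries no signal through batch norm.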

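Convolutional layer

Suppose the intermediate output has shape m×n×p×q (batch size × channels × height × width).
Then, for each channel, all m×p×q elements share the same mean, variance, scale, and shift parameters (all scalars).

At prediction time:

We do not want a sample to be processed differently depending on the batch it is in.
So during training we keep moving averages of the mean and variance and use them for prediction.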
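
A minimal sketch of the moving-average update, assuming the convention used by the from-scratch `batch_norm` function in these notes (`momentum` weights the old value). `update_moving` is a hypothetical helper name.

```python
def update_moving(moving, batch_stat, momentum=0.9):
    # The old estimate keeps weight `momentum`; the new batch statistic gets the rest.
    return momentum * moving + (1.0 - momentum) * batch_stat


moving_mean = 0.0
history = []
for batch_mean in [2.0, 2.0, 2.0]:
    moving_mean = update_moving(moving_mean, batch_mean)
    history.append(moving_mean)
print(history)  # approaches 2.0 step by step: about 0.2, 0.38, 0.542
```

Note that `nn.BatchNorm1d` and `nn.BatchNorm2d` define `momentum` the other way round (it weights the new batch statistic), so the two values are not interchangeable.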

Implementation from scratch

import torch
from torch import nn
from d2l import torch as d2l


def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    if not torch.is_grad_enabled():
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        if X.ndim == 2:
            mean = X.mean(dim=0)
            var = ((X - mean)**2).mean(dim=0)
        else:
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean)**2).mean(dim=(0, 2, 3), keepdim=True)
        X_hat = (X - mean) / torch.sqrt(var + eps)
        moving_mean = moving_mean * momentum + mean * (1 - momentum)
        moving_var = moving_var * momentum + var * (1 - momentum)
    Y = X_hat * gamma + beta
    return Y, moving_mean, moving_var


class batchNorm(nn.Module):
    def __init__(self, n_features, n_dims):
        super().__init__()
        if n_dims == 2:
            self.shape = (1, n_features)
        elif n_dims == 4:
            self.shape = (1, n_features, 1, 1)
        else:
            raise ValueError('n_dims must be 2 (fully connected) or 4 (convolutional)')
        # Moving statistics are plain tensors, not learnable parameters
        self.moving_mean = torch.zeros(self.shape)
        self.moving_var = torch.ones(self.shape)
        # Learnable shift (beta) and scale (gamma)
        self.beta = nn.Parameter(torch.zeros(self.shape))
        self.gamma = nn.Parameter(torch.ones(self.shape))
        self.eps = 1e-5
        self.momentum = 0.9

    def forward(self, X):
        assert X.ndim in (2, 4)
        # gamma and beta are parameters and move with net.to(device);
        # only the moving statistics have to be moved by hand
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean, self.moving_var,
            self.eps, self.momentum)
        return Y


net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), batchNorm(6, 4), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), batchNorm(16, 4), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(16*4*4, 120), batchNorm(120, 2), nn.Sigmoid(),
    nn.Linear(120, 84), batchNorm(84, 2), nn.Sigmoid(),
    nn.Linear(84, 10)
)


lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

Concise implementation

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(256, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
    nn.Linear(84, 10))


d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

ResNet

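Design principle

In the figure below, a larger area means a more complex function class.
With non-nested function classes, a more complex class may not contain the optimum and can end up farther from it (F3 is closest to the optimum $f^*$).
With nested function classes, a more complex class contains every solution of the simpler ones, so adding complexity is reasonable (F6 is closest to the optimum $f^*$).

![[ds69.png]]

The residual block is designed so that the added layers can represent the identity mapping, i.e. f(x)=x.
To learn the identity mapping, it is enough to set the weights and biases of the added layers to 0.
So adding a new residual block lets the network do at least as well as it did without the block.
This lets the network adjust its own complexity automatically.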
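
A toy sketch of the identity argument: if the added layer computes $g(x)$ and the block outputs $g(x) + x$, then zero weights and biases give back $x$ unchanged. The scalar linear map and the name `residual_block` are assumptions made for illustration; the real block also contains batch norm and ReLU.

```python
def residual_block(x, weight, bias):
    # g(x) is the added layer (a scalar linear map here); the shortcut adds x back.
    g = [weight * v + bias for v in x]
    return [gv + v for gv, v in zip(g, x)]


x = [0.5, 1.0, 2.0]
identity_out = residual_block(x, weight=0.0, bias=0.0)
print(identity_out)  # [0.5, 1.0, 2.0]
```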

![[ds70.png]]

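Architecture

A residual block contains two convolutional layers with the same number of output channels. Both are 3X3 convolutional layers, each followed by batch norm.
To halve the height and width, set stride=2 in the first convolutional layer.

![[ds71.png]]

A ResNet block consists of several residual blocks. To halve the height and width, the first convolutional layer of the first residual block uses stride=2.
Because it follows a max-pooling layer with stride=2, the first block keeps the number of channels and the height and width unchanged.
Each later block doubles the output channels and halves the height and width. On the shortcut path, a 1X1 convolutional layer adjusts the channels and the height and width.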
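
The channel and size pattern described above can be traced for a 96x96 input, the size used for training in these notes. A minimal sketch, assuming four blocks as in the code of this section; `out_size` is a hypothetical helper.

```python
def out_size(n, k, s=1, p=0):
    """Output length for input n, kernel k, stride s, per-side padding p."""
    return (n + 2 * p - k) // s + 1


size = 96
size = out_size(size, 7, s=2, p=3)      # 7x7 conv, stride 2, 64 channels
size = out_size(size, 3, s=2, p=1)      # 3x3 max pool, stride 2
channels = 64
stages = [(channels, size)]             # first block keeps channels and size
for _ in range(3):                      # each later block
    channels *= 2                       # doubles the channels
    size = out_size(size, 3, s=2, p=1)  # and halves height and width
    stages.append((channels, size))
print(stages)  # [(64, 24), (128, 12), (256, 6), (512, 3)]
```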

Code:

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


class Residual(nn.Module):
    def __init__(self, input_channels, output_channels, use_1x1conv=False, strides=1):
        super(Residual, self).__init__()
        self.conv1 = nn.Conv2d(input_channels, output_channels, kernel_size=3, stride=strides, padding=1)
        self.conv2 = nn.Conv2d(output_channels, output_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, output_channels, kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(output_channels)
        self.bn2 = nn.BatchNorm2d(output_channels)

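    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        # Project the shortcut with the 1x1 convolution when shapes differ
        if self.conv3 is not None:
            X = self.conv3(X)
        # Always add the shortcut; without it the block is not residual
        Y += X
        return F.relu(Y)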

b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


def resnet_block(input_channels, num_channels, num_residuals,
                 first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            b = Residual(input_channels, num_channels, use_1x1conv=True, strides=2)
        else:
            b = Residual(num_channels, num_channels)
        blk.append(b)
    return nn.Sequential(*blk)


b2 = resnet_block(64, 64, 2, first_block=True)
b3 = resnet_block(64, 128, 2)
b4 = resnet_block(128, 256, 2)
b5 = resnet_block(256, 512, 2)
b6 = nn.Sequential(nn.AdaptiveAvgPool2d((1,1)), nn.Flatten(), nn.Linear(512, 10))
net = nn.Sequential(b1, b2, b3, b4, b5, b6)

lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

DenseNet

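Overview

Main difference from ResNet:
1. ResNet: the block output is added to the input.
2. DenseNet: the block output is concatenated (cat) with the input along the channel dimension.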

![[ds72.png]]

Connection pattern:

![[ds73.png]]

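Network structure:
1. The network consists of several dense blocks; each block consists of several 3X3 convolutional layers that keep the height and width unchanged.
2. A transition block is inserted between adjacent dense blocks; it uses a 1X1 convolutional layer to reduce the number of channels and an average pooling layer to halve the height and width.
3. The beginning and end are the same as in ResNet.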
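
Since every convolution in a dense block appends its output to the input, the channel count grows inside each block and is halved by each transition block. The sketch below traces the channel count using the settings from the code in this section (64 initial channels, growth rate 32, four dense blocks of four convolutions each).

```python
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]

trace = []
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    # Each conv in a dense block adds growth_rate channels.
    num_channels += num_convs * growth_rate
    trace.append(('dense', num_channels))
    # A transition block halves the channels, except after the last dense block.
    if i != len(num_convs_in_dense_blocks) - 1:
        num_channels //= 2
        trace.append(('transition', num_channels))
print(trace)
```

The final count, 248, is the input size of the last linear layer.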

Code:

import torch
from torch import nn
from d2l import torch as d2l


def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=3, stride=1, padding=1)
    )


class DenseBlock(nn.Module):
    def __init__(self, num_conv, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        self.blk = []
        for i in range(num_conv):
            self.blk.append(conv_block(input_channels + i * num_channels, num_channels))
        self.blk = nn.Sequential(*self.blk)

    def forward(self, X):
        for block in self.blk:
            Y = block(X)
            X = torch.cat((X, Y), dim=1)
        return X


def transition_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2)
    )


b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, padding=3, stride=2), nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)

num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
blk = []
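for i, num_convs in enumerate(num_convs_in_dense_blocks):
    blk.append(DenseBlock(num_convs, num_channels, growth_rate))
    # Each conv in the dense block adds growth_rate channels
    num_channels += num_convs * growth_rate
    # Halve the channels between dense blocks, but not after the last one
    if i != len(num_convs_in_dense_blocks) - 1:
        blk.append(transition_block(num_channels, num_channels // 2))
        num_channels //= 2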

net = nn.Sequential(
    b1, *blk,
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
    nn.Linear(num_channels, 10)
)


lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

Recurrent neural networks

RNN

![[ds8.png]]

# Count token frequencies:
import collections
tokens = [1, 2, 1, 1, 1, 3]
collections.Counter(tokens)

Counter({1: 4, 2: 1, 3: 1})

# Create one-hot encodings
import torch
from torch.nn import functional as F
x1 = F.one_hot(torch.tensor([1, 3, 5]), 8)
print('x1: ', x1, x1.shape)
# or
x = torch.arange(10).reshape((2, 5))
x2 = F.one_hot(x, 10)
print('x2: ', x2)
print('x2.shape: ', x2.shape)

x3 = F.one_hot(x.T, 10)
print('x3: ', x3)
print('x3.shape: ', x3.shape)

x1:  tensor([[0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0]]) torch.Size([3, 8])
x2:  tensor([[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]],

        [[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]])
x2.shape:  torch.Size([2, 5, 10])
x3:  tensor([[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]],

        [[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]],

        [[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]],

        [[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],

        [[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]])
x3.shape:  torch.Size([5, 2, 10])

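Gated recurrent unit (GRU)

Motivation (mitigating vanishing or exploding gradients):

We want a memory cell that retains important early information.
We want to skip irrelevant tokens.
A sequence may contain a logical break in the middle, so we want to be able to reset the internal state.

The reset gate helps capture short-term dependencies.
The update gate helps capture long-term dependencies.

$R_{t}$ is the reset gate and $Z_{t}$ is the update gate.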
![[ds96.png]]

How the reset gate acts
![[ds97.png]]

How the update gate acts
![[ds98.png]]

![[6.png]]
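
The role of the two gates can be sketched with a scalar GRU step. The equations live in the figures, so this sketch assumes the d2l convention $H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$; the parameter names in the dict `p` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One GRU step with scalar input x and scalar hidden state h."""
    r = sigmoid(p['w_xr'] * x + p['w_hr'] * h + p['b_r'])   # reset gate
    z = sigmoid(p['w_xz'] * x + p['w_hz'] * h + p['b_z'])   # update gate
    # Candidate state: the reset gate scales how much of the old state is used.
    h_tilde = math.tanh(p['w_xh'] * x + p['w_hh'] * (r * h) + p['b_h'])
    # Update gate near 1 keeps the old state; near 0 takes the candidate.
    return z * h + (1.0 - z) * h_tilde


params = dict(w_xr=0.5, w_hr=0.5, b_r=0.0,
              w_xz=0.5, w_hz=0.5, b_z=0.0,
              w_xh=1.0, w_hh=1.0, b_h=0.0)
h_keep = gru_step(1.0, 0.3, dict(params, b_z=50.0))      # z close to 1
h_replace = gru_step(1.0, 0.3, dict(params, b_z=-50.0))  # z close to 0
print(h_keep, h_replace)
```

With the update gate saturated at 1 the old state 0.3 is carried through unchanged, which is how early information can survive many steps.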

LSTM

Definition of the input gate, forget gate, and output gate

Candidate memory

Memory cell

Hidden state

![[7.png]]
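
The four pieces listed above fit together as in the scalar sketch below. It assumes the standard LSTM formulation (sigmoid gates, tanh candidate, $C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$, $H_t = O_t \odot \tanh(C_t)$); the parameter names in the dict `p` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, p):
    """One LSTM step with scalar input x, hidden state h, and memory cell c."""
    i = sigmoid(p['w_xi'] * x + p['w_hi'] * h + p['b_i'])   # input gate
    f = sigmoid(p['w_xf'] * x + p['w_hf'] * h + p['b_f'])   # forget gate
    o = sigmoid(p['w_xo'] * x + p['w_ho'] * h + p['b_o'])   # output gate
    c_tilde = math.tanh(p['w_xc'] * x + p['w_hc'] * h + p['b_c'])  # candidate memory
    c_new = f * c + i * c_tilde      # memory cell
    h_new = o * math.tanh(c_new)     # hidden state
    return h_new, c_new


# All weights zero; the biases alone close the input gate and open the other two.
p = {name: 0.0 for name in
     ('w_xi', 'w_hi', 'w_xf', 'w_hf', 'w_xo', 'w_ho', 'w_xc', 'w_hc', 'b_c')}
p.update(b_i=-50.0, b_f=50.0, b_o=50.0)
h_new, c_new = lstm_step(x=1.0, h=0.0, c=0.7, p=p)
print(h_new, c_new)
```

With the forget gate open and the input gate closed, the memory cell keeps its value 0.7 and the hidden state is `tanh(0.7)`.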

Deep recurrent neural networks

![[8.png]]

vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
device = d2l.try_gpu()
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)

Bidirectional recurrent neural networks

![[9.png]]

Encoder-Decoder framework

Structure

![[ds38.png]]
where
![[ds39.png]]
![[ds40.png]]
![[ds41.png]]

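Applications:

Text processing:
1. Machine translation: Source is a Chinese sentence, Target is an English sentence
2. Text summarization: Source is an article, Target is its summary
3. Chatbots: Source is a question, Target is an answer
Speech recognition: Source is speech, Target is text
Image captioning: Source is an image, Target is a text description

For text and speech the encoder is typically an RNN; for images the encoder is typically a CNN.

Basic modules: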

import torch
from torch import nn


class Encoder(nn.Module):
    """The base encoder interface for the encoder-decoder architecture."""
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def forward(self, X, *args):
        # Subclasses must override this method
        raise NotImplementedError


class Decoder(nn.Module):
    """The base decoder interface for the encoder-decoder architecture."""
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def init_state(self, enc_outputs, *args):
        # Turn the encoder outputs into the decoder's initial state
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError
    
    
# combine Encoder and Decoder
class EncoderDecoder(nn.Module):
    """The base class for the encoder-decoder architecture."""
    def __init__(self, encoder, decoder, **kwargs) -> None:
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)

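Details

![[ds93.png]]

The final semantic encoding from the encoder is generally used by the decoder in one of two ways:

It is used only at the first step; later steps rely on the previous hidden state. See the figure.
Every step uses both the encoder's semantic encoding and the previous hidden state. See the figure above.

When the encoder produces the final semantic encoding, all intermediate hidden states should be taken into account:
![[ds94.png]]
For the figure above: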
![[ds95.png]]

Attention

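Introduction

Attention resembles the way humans observe a scene.
![[ds37.jpeg]]
Cognitive neuroscience distinguishes two kinds:

Focused attention: deliberate, task-driven focus; active (top-down).
Saliency-based attention: triggered by external stimuli; passive (bottom-up).

The attention algorithm follows from cognitive science: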
![[ds74.png]]

Original attention: Nadaraya-Watson Kernel Regression

Generate data

# from matplotlib.pyplot import vlines
import torch
from torch import nn
from d2l import torch as d2l

n_train = 50
x_train, _ = torch.sort(torch.rand(n_train) * 5)


def f(x):
    return 2 * torch.sin(x) + x**0.8


y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
x_test= torch.arange(0, 5, 0.1)
y_truth = f(x_test)
n_test = len(x_test)

average pooling

![[ds80.png]]

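def plot_kernel_reg(y_hat: torch.Tensor):
    # Plot the true function, the prediction, and the noisy training points
    d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'],
             xlim=[0, 5], ylim=[-1, 5])
    d2l.plt.plot(x_train, y_train, 'o', alpha=0.5)
    d2l.plt.show()


# Average pooling: predict the mean of the training outputs for every query
y_hat = torch.repeat_interleave(y_train.mean(), n_test)
plot_kernel_reg(y_hat)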

![[MakeHandsDirty_511_0.svg]]

Nonparametric Attention Pooling

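Average pooling is ineffective because it ignores the inputs.
A better solution:
weight each $y_i$ according to the location of its input $x_i$ relative to the query $x$

![[ds81.png]]

i.e.
![[ds82.png]]

$\alpha$ is a probability distribution over the training examples: the weights are non-negative and sum to 1.

Set K to the Gaussian kernel:
![[ds83.png]]

so:
![[ds84.png]]

The closer a key $x_i$ is to the query $x$, the more attention its value $y_i$ receives.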
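
A minimal pure-Python sketch of this pooling for a single query, assuming the Gaussian-kernel form in which the weights are a softmax over $-(x - x_i)^2 / 2$. The names `softmax` and `nw_predict` and the toy data are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nw_predict(query, keys, values):
    """Nadaraya-Watson prediction for one query with a Gaussian kernel."""
    weights = softmax([-(query - k) ** 2 / 2 for k in keys])
    return sum(w * v for w, v in zip(weights, values)), weights


keys = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 10.0, 20.0, 30.0]
y_hat, weights = nw_predict(1.1, keys, values)
print(y_hat, weights)
```

The key at 1.0 is closest to the query 1.1, so it receives the largest weight, and the prediction is a weighted average of the values.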

X_repeat = x_test.repeat_interleave(n_train).reshape((-1, n_train))
print('X_repeat.shape, x_train.shape: ', X_repeat.shape, x_train.shape)
# Normalize over the training inputs (keys) for each test input (query)
attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
y_hat = torch.matmul(attention_weights, y_train)
plot_kernel_reg(y_hat)

d2l.show_heatmaps(attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs',
                  ylabel='Sorted testing inputs')

X_repeat.shape, x_train.shape:  torch.Size([50, 50]) torch.Size([50])

![[MakeHandsDirty_519_2.svg]]

![[MakeHandsDirty_519_3.svg]]

class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        # Shape of the output `queries` and `attention_weights`:
        # (no. of queries, no. of key-value pairs)
        queries = queries.repeat_interleave(keys.shape[1]).reshape((-1, keys.shape[1]))
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w)**2 / 2, dim=1)
        # Shape of `values`: (no. of queries, no. of key-value pairs)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)
    

# X_tile = x_train.repeat((n_train, 1))
X_tile = x_train.repeat((n_train, 1))
Y_tile = y_train.repeat((n_train, 1))
# print(x_train.shape, n_train, X_tile.shape, X_tile)
keys = X_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))
values = Y_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))

net = NWKernelRegression()
loss = nn.MSELoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=0.5)

# print('x_train.shape, keys.shape, values.shape: ', x_train.shape, keys.shape, values.shape)
animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5])
for epoch in range(5):
    trainer.zero_grad()
    l = loss(net(x_train, keys, values), y_train)
    l.sum().backward()
    trainer.step()
    print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}')
    animator.add(epoch + 1, float(l.sum()))

![[MakeHandsDirty_520_0.svg]]

keys = x_train.repeat((n_test, 1))
values = y_train.repeat((n_test, 1))
y_hat = net(x_test, keys, values).unsqueeze(1).detach()
print('y_hat.shape: ', y_hat.shape)
plot_kernel_reg(y_hat)

print(net.w)
d2l.show_heatmaps(net.attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs',
                  ylabel='Sorted testing inputs')  # TODO: how to display figures on a timer

y_hat.shape:  torch.Size([50, 1])

![[MakeHandsDirty_521_1.svg]]

Parameter containing:
tensor([27.5289], requires_grad=True)

![[MakeHandsDirty_521_3.svg]]

Attention Scoring Functions

This section covers two scoring functions:

1. Additive Attention.
2. Scaled Dot-Product Attention.

Definition:
![[ds85.png]]

Masked Softmax Operation

import math
import torch
from torch import nn
from d2l import torch as d2l


def sequence_mask(X, valid_len, value=0):
    """Mask irrelevant entries in sequences.

    Defined in :numref:`sec_seq2seq_decoder`"""
    maxlen = X.size(1)
    # mask[i, j] is True when position j lies within the valid length of row i
    mask = torch.arange((maxlen), dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X


def masked_softmax(X, valid_lens):
    """Perform softmax operation by masking elements on the last axis."""
    # `X`: 3D tensor, `valid_lens`: 1D or 2D tensor
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # On the last axis, replace masked elements with a very large negative
        # value, whose exponentiation outputs 0
        X = sequence_mask(X.reshape(-1, shape[-1]), valid_lens,
                              value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)
    
    
print(
    masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3])),
    masked_softmax(torch.rand(2, 2, 4), torch.tensor([[1, 3], [2, 4]])),
    sep = f"\n {'-'*60} \n"
)

tensor([[[0.3798, 0.6202, 0.0000, 0.0000],
         [0.4504, 0.5496, 0.0000, 0.0000]],

        [[0.4211, 0.3013, 0.2777, 0.0000],
         [0.2478, 0.2652, 0.4870, 0.0000]]])
 ------------------------------------------------------------ 
tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
         [0.2348, 0.4178, 0.3475, 0.0000]],

        [[0.5791, 0.4209, 0.0000, 0.0000],
         [0.2989, 0.1669, 0.3208, 0.2134]]])

Additive Attention

![[ds88.png]]
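The formula is:
![[ds87.png]]
This is equivalent to concatenating the query and key and feeding them into an MLP with a single hidden layer of h units that outputs a single real number.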

class AdditiveAttention(nn.Module):
    """Additive attention."""
    def __init__(self, key_size, query_size, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # After dimension expansion, shape of `queries`: (`batch_size`, no. of
        # queries, 1, `num_hiddens`) and shape of `keys`: (`batch_size`, 1,
        # no. of key-value pairs, `num_hiddens`). Sum them up with
        # broadcasting
        features = queries.unsqueeze(2) + keys.unsqueeze(1)
        features = torch.tanh(features) # (2, 1, 10, 8)
        # There is only one output of `self.w_v`, so we remove the last
        # one-dimensional entry from the shape. Shape of `scores`:
        # (`batch_size`, no. of queries, no. of key-value pairs)
        scores = self.w_v(features).squeeze(-1) # (2, 1, 10)
        self.attention_weights = masked_softmax(scores, valid_lens)
        # Shape of `values`: (`batch_size`, no. of key-value pairs, value
        # dimension)
        return torch.bmm(self.dropout(self.attention_weights), values)
    
    
queries, keys = torch.normal(0, 1, (2, 1, 20)), torch.ones((2, 10, 2))
# The two value matrices in the `values` minibatch are identical
values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(
    2, 1, 1)
valid_lens = torch.tensor([2, 6])

attention = AdditiveAttention(key_size=2, query_size=20, num_hiddens=8,
                              dropout=0.1)
attention.eval()
print(
    attention(queries, keys, values, valid_lens)
)

d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')

out: tensor([[[ 2.0000, 3.0000, 4.0000, 5.0000]], [[10.0000, 11.0000, 12.0000, 13.0000]]], grad_fn=<BmmBackward0>)

![[MakeHandsDirty_529_1.svg]]

Scaled Dot-Product Attention

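Functional form:
![[ds89.png]]
Why divide by $\sqrt{d}$: if the components of the query and key are independent with zero mean and unit variance, their dot product has zero mean and variance $d$. Dividing by $\sqrt{d}$ keeps the variance of the score at 1 regardless of $d$, so the softmax is not pushed into saturated regions with tiny gradients.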
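
A quick numerical check of the usual variance argument for $\sqrt{d}$: for queries and keys with independent standard normal components, the dot product has variance of about $d$, and dividing by $\sqrt{d}$ brings it back to about 1. The sketch uses only the standard library; the sample size and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)


def dot_products(d, n_samples=2000):
    """Dot products of random query/key pairs with standard normal entries."""
    out = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d)]
        k = [random.gauss(0.0, 1.0) for _ in range(d)]
        out.append(sum(a * b for a, b in zip(q, k)))
    return out


d = 64
raw = dot_products(d)
scaled = [s / math.sqrt(d) for s in raw]
var_raw = statistics.pvariance(raw)
var_scaled = statistics.pvariance(scaled)
print(f'variance of q.k: {var_raw:.1f}, after dividing by sqrt(d): {var_scaled:.2f}')
```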

For computational efficiency:
![[ds91.png]]
Formula:
![[ds92.png]]

class DotProductAttention(nn.Module):
    """Scaled dot product attention."""
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # Shape of `queries`: (`batch_size`, no. of queries, `d`)
    # Shape of `keys`: (`batch_size`, no. of key-value pairs, `d`)
    # Shape of `values`: (`batch_size`, no. of key-value pairs, value
    # dimension)
    # Shape of `valid_lens`: (`batch_size`,) or (`batch_size`, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Swap the last two dimensions of `keys` with `transpose(1, 2)`
        scores = torch.bmm(queries, keys.transpose(1,2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
    
    
queries = torch.normal(0, 1, (2, 1, 2))
attention = DotProductAttention(dropout=0.5)
attention.eval()
attention(queries, keys, values, valid_lens)

d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')

(MakeHandsDirty_files/MakeHandsDirty_533_0.svg)

Bahdanau Attention

soft Attention

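The semantic encoding C above contributes equally to every output, which is unreasonable.
![[ds42.png]]
For example, when translating "Tom chase Jerry" and generating the output for "Jerry", the source words should contribute with weights such as (Tom, 0.3), (Chase, 0.2), (Jerry, 0.5). The Encoder-Decoder structure should therefore be:
![[ds43.png]]
That is:
![[ds44.png]]
![[ds45.png]]
![[ds46.png]]
![[ds47.jpg]]
How these different weights are generated (note: in the figure, h1, h2, and h3 are each scored against $H_{i-1}$, and the scores are then normalized together):
![[ds48.png]]
In NLP, attention is treated as an alignment model (the alignment probability between the words of the input sentence and the target word being generated).
Soft attention has two variants: the plain mode (Key = Value = X) and the key-value mode (Key != Value).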
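
The weighting described above can be sketched in plain Python. This is a minimal illustration, not the model in the figures: the dot-product scoring function, the 2-d vectors, and the names `softmax` and `soft_attention` are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(query, keys, values):
    """Score the query against every key, then average the values by weight."""
    # Assumption: dot-product scoring; other scoring functions are possible.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return weights, context


# Hypothetical 2-d encoder states for "Tom", "Chase", "Jerry".
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, 1.0]  # hypothetical decoder state H_{i-1}

# Plain mode: Key = Value = X.
weights, context = soft_attention(query, keys=X, values=X)
print(weights, context)
```

In the key-value mode the same function is called with a `values` list that differs from `keys`.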

hard Attention

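Two approaches:

Pick the input with the highest probability.
Sample at random from the attention distribution.

Drawback: because the selection is made by taking the maximum or by random sampling, it is not differentiable, so reinforcement learning is needed for training.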
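
The two selection rules can be sketched as follows. The function names and the toy weights (reused from the "Tom chase Jerry" example) are illustrative assumptions. Both functions return a single chosen item instead of a weighted sum, which is why no gradient flows back to the weights.

```python
import random


def hard_attention_argmax(weights, values):
    """Pick the value whose attention weight is largest."""
    index = max(range(len(weights)), key=lambda i: weights[i])
    return index, values[index]


def hard_attention_sample(weights, values, rng):
    """Pick one value at random, with probability given by its weight."""
    index = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return index, values[index]


weights = [0.3, 0.2, 0.5]
words = ["Tom", "Chase", "Jerry"]

print(hard_attention_argmax(weights, words))  # (2, 'Jerry')
print(hard_attention_sample(weights, words, random.Random(0)))
```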

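The essence of the attention idea

![[ds49.png]]

Three ways to compute similarity:
![[ds50.png]]

From the modeling above, we can see that the idea behind attention is simple: the phrase "weighted sum" sums it up, and the greatest truths are the simplest. As a loose analogy, a person learning a new language goes through roughly four stages: rote memorization (learning grammar and building a feel for the language by reading and reciting) -> grasping the essentials (understanding the core meaning of a simple conversation by catching the key words in a sentence) -> integrated understanding (following contextual references and the connections behind the language in complex conversations, and being able to generalize from one case to others) -> reaching the peak (large amounts of immersive practice).
This mirrors the development of attention: the RNN era was the rote-memorization stage, attention models learned to grasp the essentials, the transformer reached integrated understanding with strong representation-learning ability, and GPT and BERT then accumulated practical experience through large-scale multi-task learning, with a large jump in capability.
Why is attention so effective? Because it lets the model grasp the essentials and integrate what it has learned.
Source: Alibaba Tech (阿里技术)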

self Attention

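That is, the encoder attends to the encoder, and the decoder attends to the decoder.
Google Translate uses it.
For example, it captures relations between syntactic features (making … more difficult):
![[ds51.jpeg]]
It also captures relations between semantic features (its refers to Law):
![[ds52.jpeg]]

CNNs and RNNs cannot handle long sequences well:
In figure (a) below, the encoding is local because it is based on n-grams; an RNN is also effectively a local encoder because of problems such as vanishing gradients.
How to address this:

Add more layers, so that a deep network fuses features from distant positions.
Use fully connected layers.

Self-attention procedure:
Let X = [x1, ..., xN] denote the N inputs; linear transformations of X give the query vector sequence, the key vector sequence, and the value vector sequence: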
![[ds56.webp]]
![[ds55.webp]]
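
A minimal sketch of this procedure in PyTorch, under stated assumptions: inputs are stored as rows of X (one row per x_n), the three projection matrices `W_q`, `W_k`, `W_v` are random placeholders, and scaled dot-product scoring is used. The figures are not reproduced here, so the exact form in them may differ.

```python
import math

import torch


def self_attention(X, W_q, W_k, W_v):
    """Project X into queries, keys, and values, then attend X to itself."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Each row of `scores` compares one query with every key.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights


torch.manual_seed(0)
N, d_in, d_k, d_v = 5, 8, 4, 6  # hypothetical sizes
X = torch.randn(N, d_in)
W_q = torch.randn(d_in, d_k)
W_k = torch.randn(d_in, d_k)
W_v = torch.randn(d_in, d_v)

H, A = self_attention(X, W_q, W_k, W_v)
print(H.shape, A.shape)  # torch.Size([5, 6]) torch.Size([5, 5])
```

Every output row depends on all N inputs through a single matrix product, which is how self-attention reaches distant positions without stacking many layers.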

Visual attention

![[ds53.png]]
The figure below shows where attention falls on the image while each underlined word is generated:
![[ds54.jpeg]]

Optimizer

Adam

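Network design principles

First define the algorithm as a function, then define the layer as a class. For example, an implementation of batch norm: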

import torch
from torch import nn
from d2l import torch as d2l


# Definition of the algorithm
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to tell training mode from prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, use the moving-average mean and variance passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute the mean and variance over the
            # feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: compute the mean and variance per channel
            # (axis=1). Keep the shape of X so that broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data


# Definition of the layer
class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer, or number of
    # output channels of a convolutional layer
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters (they take part in gradient computation
        # and updates), initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that are not model parameters, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not in main memory (CPU), copy moving_mean and moving_var
        # to the device memory where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
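
The layer above keeps its running statistics as plain tensor attributes, which is why forward has to move them to the right device by hand. A possible alternative, shown here as a sketch and not as part of the original implementation, is `register_buffer`: buffers follow the module through `.to(device)` and are saved in `state_dict` without being trained. The toy class `RunningMean` is hypothetical and exists only to show the mechanism.

```python
import torch
from torch import nn


class RunningMean(nn.Module):
    """Hypothetical toy layer that tracks a moving average of its inputs."""

    def __init__(self, num_features, momentum=0.9):
        super().__init__()
        self.momentum = momentum
        # A buffer is module state that is not a trainable parameter.
        self.register_buffer("moving_mean", torch.zeros(num_features))

    def forward(self, X):
        if self.training:
            with torch.no_grad():
                batch_mean = X.mean(dim=0)
                # Same update rule as above: old * momentum + new * (1 - momentum)
                self.moving_mean.mul_(self.momentum).add_(
                    (1.0 - self.momentum) * batch_mean)
        return X - self.moving_mean


layer = RunningMean(3)
layer(torch.ones(4, 3))
print(layer.state_dict()["moving_mean"])  # tensor([0.1000, 0.1000, 0.1000])
print(len(list(layer.parameters())))      # 0
```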

When counting the layers of a network, only convolutional layers and fully connected layers are counted.


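PyTorch programming principles

When designing a network, the final container built in __init__ must be an nn.Sequential, not a plain Python list. A list does not register its layers as submodules, so moving the network's parameters to CUDA fails.

Incorrect example: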

class DenseBlock(nn.Module):
    def __init__(self, num_conv, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        self.blk = []
        for i in range(num_conv):
            self.blk.append(conv_block(input_channels + i * num_channels, num_channels))

    def forward(self, X):
        for block in self.blk:
            Y = block(X)
            X = torch.cat((X, Y), dim=1)
        return X

Correct example:

class DenseBlock(nn.Module):
    def __init__(self, num_conv, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        self.blk = []
        for i in range(num_conv):
            self.blk.append(conv_block(input_channels + i * num_channels, num_channels))
        self.blk = nn.Sequential(*self.blk)

    def forward(self, X):
        for block in self.blk:
            Y = block(X)
            X = torch.cat((X, Y), dim=1)
        return X
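
The difference between the two versions can be checked by counting registered parameters. The sketch below uses two small stand-in classes (`ListHolder` and `SequentialHolder` are hypothetical names, with `nn.Linear` layers in place of `conv_block`). `Module.to(device)` only moves what `parameters()` and `buffers()` can see, so layers kept in a plain list are left behind.

```python
import torch
from torch import nn


class ListHolder(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the layers are NOT registered as submodules.
        self.blk = [nn.Linear(4, 4), nn.Linear(4, 4)]


class SequentialHolder(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Sequential registers every layer it contains.
        self.blk = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))


def num_params(module):
    """Total number of scalar parameters the module knows about."""
    return sum(p.numel() for p in module.parameters())


print(num_params(ListHolder()))        # 0
print(num_params(SequentialHolder()))  # 40
```

`nn.ModuleList` registers layers in the same way and is an option when the layers are called in a custom loop, as in DenseBlock.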


Plotting

Heatmaps

import torch
from d2l import torch as d2l


def show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(3.5, 2.5), cmap='Reds'):
    d2l.use_svg_display()
    num_rows, num_cols = matrices.shape[0], matrices.shape[1]
    fig, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize,sharex=True, sharey=True, squeeze=False)
    for i, (row_axes, row_matrices) in enumerate(zip(axes, matrices)):
        for j, (ax, matrix) in enumerate(zip(row_axes, row_matrices)):
            pcm = ax.imshow(matrix.detach().numpy(), cmap=cmap)
            if i == num_rows - 1:
                ax.set_xlabel(xlabel)
            if j == 0:
                ax.set_ylabel(ylabel)
            if titles:
                ax.set_title(titles[j])
    fig.colorbar(pcm, ax=axes, shrink=0.6)


attention_weights = torch.eye(10).reshape(1, 1, 10, 10)
show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries')

TensorBoard

Refs: https://zhuanlan.zhihu.com/p/103630393

Common algorithms

EMA

import pandas as pd
import numpy as np
from collections import deque

def ema(d: deque, new_elem) -> None:
    """
    Append an exponentially weighted average of `new_elem` to the deque in place.
    The mean of the current window stands in for the previous average.
    :param d: deque of values (modified in place)
    :param new_elem: new observation to blend in
    :return: None
    """
    multiplier = 2 / (len(d) + 1)
    d.append(multiplier*new_elem+(1-multiplier)*sum(d)/len(d))

values = deque([9, 5, 10, 16, 5, 6, 8, 9], maxlen=8)
print(values)
ema(values, 10)
print(values)

deque([9, 5, 10, 16, 5, 6, 8, 9], maxlen=8)
deque([5, 10, 16, 5, 6, 8, 9, 8.833333333333334], maxlen=8)
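
For comparison, the textbook recursive form of EMA keeps the previous average itself instead of the window mean used above: ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}. The sketch below is a separate illustration, not a replacement for the function above; the name `ema_series` and the choice to seed the average with the first value are assumptions.

```python
def ema_series(values, alpha):
    """Return the running exponential moving average of `values`."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    averages = []
    for x in values:
        if not averages:
            # Assumption: seed the average with the first observation.
            averages.append(float(x))
        else:
            averages.append(alpha * x + (1 - alpha) * averages[-1])
    return averages


print(ema_series([9, 5, 10], alpha=0.5))  # [9.0, 7.0, 8.5]
```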


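Error log

RuntimeError: CUDA error: device-side assert triggered

This is usually caused by an input value that is out of bounds.
Check the maximum and minimum values of the labels or of the input tensors. In the case encountered here, a bounding box ran past the boundary, so a coordinate value exceeded 1000 (LayoutLMv2 classification task).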
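
A check along these lines can be run on the CPU before the batch is sent to the GPU. This is a minimal sketch: the helper name `check_range` and the example bounds are assumptions, and the valid range depends on the model (the 0 to 1000 range for coordinates is taken from the case described above).

```python
import torch


def check_range(name, tensor, low, high):
    """Raise a readable error if any value lies outside [low, high]."""
    t_min, t_max = tensor.min().item(), tensor.max().item()
    if t_min < low or t_max > high:
        raise ValueError(
            f"{name}: values span [{t_min}, {t_max}], expected [{low}, {high}]")


labels = torch.tensor([0, 2, 1])
check_range("labels", labels, 0, 2)  # passes silently

bbox = torch.tensor([[10, 20, 1005, 40]])  # one coordinate exceeds 1000
try:
    check_range("bbox", bbox, 0, 1000)
except ValueError as err:
    print(err)
```

Re-running the failing step on the CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually produces a more specific error message than the device-side assert.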

OSError: [Errno 9] Bad file descriptor

Solution:
Add:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Reference: https://stackoverflow.com/questions/73125231/pytorch-dataloaders-bad-file-descriptor-and-eof-for-workers0

[W ParallelNative.cpp:206]

[W ParallelNative.cpp:206] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
Workaround:

I have the same problem. Mac. Python 3.6 (also reproduces on 3.8). Pytorch 1.7.

It seems that with this error dataloaders don’t (or can’t) use parallel computing. You can remove the error (this will not fix the problem) in two ways.

If you can access your dataloaders, set num_workers=0 when creating a dataloader
Set environment variable export OMP_NUM_THREADS=1

Again, both solutions kill parallel computing and may slow down data loading (and therefore training). I look forward to efficient solutions or a patch in Pytorch 1.7
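
The first option might look like the sketch below. The tiny `TensorDataset` is a placeholder for the real dataset; only the `num_workers=0` argument matters here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 4 samples with 2 features each, plus labels.
features = torch.arange(8, dtype=torch.float32).reshape(4, 2)
labels = torch.tensor([0, 1, 0, 1])
dataset = TensorDataset(features, labels)

# num_workers=0 loads batches in the main process (no worker processes).
loader = DataLoader(dataset, batch_size=2, num_workers=0)

for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels)
```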

Error when installing torch with CUDA

Install command: pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
This produced an error. Running pip uninstall nvidia_cublas_cu11 then produced a different error. (Reference: https://blog.csdn.net/bcfd_yundou/article/details/129206267)
Error message: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11
Fix:
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
Reference: https://stackoverflow.com/questions/74394695/how-does-one-fix-when-torch-cant-find-cuda-error-version-libcublaslt-so-11-no

