実験記録 47c97768

StarGAN によって声質変換を試みる。

StarGAN 論文に近い形での実装をしたかったが、GPU で学習しようとした際に VRAM 不足により落ちてしまった。そこで、VRAM 内に収めるためにネットワークを縮小した。これにより表現空間が小さくなりすぎている可能性がある。

入力データ

スペクトログラム（音声データ）とラベル（教師データ）の二種類を扱う。

スペクトログラム

以下の手順で生成する。

16000Hz/1ch にリサンプリングする。
任意の 66304 サンプルを抽出し、[-1, 1] の範囲に収まるように正規化する。
- このとき、RMS が 0.5 を超える場合はサンプルを破棄する。（無音）
512 サンプルずつ重なるように、1024 サンプルで切り、hamming 窓を掛ける。
- 1024x256 のデータが得られる。
FFT を掛け、パワースペクトルを得る。データ長（1024）で割ることで、[0, 1] の範囲に収まるようになる。
各パワースペクトルの 2 番目の値から、256 個のデータを使用する。
- 1 番目の値は直流成分のため使用しない。
- スペクトルは 512 番目を境に対象になっている。左から 256 のデータを使うことで、8000 Hz で LPF を掛けたことと同じ効果が得られる。
- この時点で 256x256 のスペクトログラムになっている。
対数スケールにしてから [0, 1] に正規化する。
1. 10^-3 未満の値をすべて 10^-3 にする。
2. 対数をとる。
3. log_e10^-3 を引いて、 -log_e10^-3 で割る。これによって [0, 1] に正規化される。

ラベル

ラベルは以下の3グループ。

発話者 ID — 1421 クラス
発話者の生まれた地域 — 53 クラス
発話者の性別 — 2 クラス

全てのグループについてラベリングされているとは限らない。

入力ラベル

3グループすべての one-hot vector を concat して、更にラベリングされているグループに 1 が立っている vector を concat する。

最終的なサイズは 1421 + 53 + 2 + 3 = 1479。値は {0, 1}。

出力ラベル

3グループ全ての one-hot vector を concat する。ただし、ラベリングされていないグループはすべて値を -1 とし、学習の際に -1 だった場合は無視することにする。

最終的なサイズは 1421 + 53 + 2 = 1476。値は {-1, 0, 1}。

モデル

VRAM のサイズのため、論文の実装とは多少異なる。

Discriminator のモデル

論文の実装と異なる点は以下。

フィルタの数が全体として半分になっている。

graph TD

input("&laquo;Input&raquo;<br/>Spectrogram<br/>1 x 256 x 256<br/>[0, 1]")
conv1[Convolution<br/>n: 32, k: 4, stride: 2, pad: 1<br/>LeakyReLU]
conv2[Convolution<br/>n: 64, k: 4, stride: 2, pad: 1<br/>LeakyReLU]
conv3[Convolution<br/>n: 128, k: 4, stride: 2, pad: 1<br/>LeakyReLU]
conv4[Convolution<br/>n: 256, k: 4, stride: 2, pad: 1<br/>LeakyReLU]
conv5[Convolution<br/>n: 512, k: 4, stride: 2, pad: 1<br/>LeakyReLU]
conv6[Convolution<br/>n: 1024, k: 4, stride: 2, pad: 1<br/>LeakyReLU]

subgraph Output
    clazz[Convolution<br/>n: 1476, k: 4, stride: 1, pad: 1]
    discriminate[Convolution<br/>n: 1, k: 3, stride: 1, pad: 1]

    output-feature("&laquo;Output&raquo;<br/>Feature<br/>1024 x 4 x 4")
    output-clazz("&laquo;Output&raquo;<br/>Classification</sub><br/>1476 x 1 x 1")
    output-discriminate("&laquo;Output&raquo;<br/>Adversarial<br/>1 x 4 x 4")
end

input --> conv1
conv1 --> | 32 x 128 x 128 | conv2
conv2 --> | 64 x 64 x 64 | conv3
conv3 --> | 128 x 32 x 32 | conv4
conv4 --> | 256 x 16 x 16 | conv5
conv5 --> | 512 x 8 x 8 | conv6
conv6 --> output-feature

output-feature --> clazz
clazz --> output-clazz

output-feature --> discriminate
discriminate --> output-discriminate

Generator のモデル

論文の実装と異なる点は以下。

フィルタの数が全体として半分になっている。
Bottleneck part の数が半分になっている。
ラベルを入力に concat するのではなく、 Bottleneck part の直前で concat している。
- フィルタの数の辻褄を合わせるため、Bottleneck part の直前に Convolution Layer が挿入してある。

graph TD

subgraph Input
    input-spectrogram("&laquo;Input&raquo;<br/>Spectrogram<br/>1 x 256 x 256<br/>[0, 1]")
    input-labels("&laquo;Input&raquo;<br/>Labels<sub>input</sub><br/>1479<br/>{0, 1}")
end

subgraph Downsampling
    downsamp1[Convoluton<br/>n: 32, k: 7, stride: 1, pad: 3<br/>BatchNormalization<br/>ReLU]
    downsamp2[Convoluton<br/>n: 64, k: 4, stride: 2, pad: 1<br/>BatchNormalization<br/>ReLU]
    downsamp3[Convoluton<br/>n: 128, k: 4, stride: 2, pad: 1<br/>BatchNormalization<br/>ReLU]
end

repeat((repeat<br/>64 x 64))
concat((concat))
merge[Convolution<br/>n: 128, k: 3, stride: 1, pad: 1<br/>BatchNormalization<br/>ReLU]

subgraph Bottleneck
    bottleneck1[ResBlock<br/>n: 128, k: 3, stride: 1, pad: 1<br/>BatchNormalization<br/>ReLU]
    bottleneck2[ResBlock<br/>n: 128, k: 3, stride: 1, pad: 1<br/>BatchNormalization<br/>ReLU]
    bottleneck3[ResBlock<br/>n: 128, k: 3, stride: 1, pad: 1<br/>BatchNormalization<br/>ReLU]
end

subgraph Upsampling
    upsamp1[Deconvolution<br/>n: 64, k: 4, stride: 2, pad: 1<br/>BatchNormalization<br/>ReLU]
    upsamp2[Deconvolution<br/>n: 32, k: 4, stride: 2, pad: 1<br/>BatchNormalization<br/>ReLU]
    upsamp3[Deconvolution<br/>n: 1, k: 7, stride: 1, pad: 3<br/>sigmoid]
end

output("&laquo;Output&raquo;<br/>Spectrogram<br/>1 x 256 x 256<br/>(0, 1)")

input-spectrogram --> downsamp1
downsamp1 --> | 32 x 256 x 256 | downsamp2
downsamp2 --> | 64 x 128 x 128 | downsamp3
downsamp3 --> | 128 x 64 x 64 | concat

input-labels --> repeat
repeat --> | 1479 x 64 x 64 | concat
concat --> | 1607 x 64 x 64 | merge

merge --> | 128 x 64 x 64 | bottleneck1
bottleneck1 --> | 128 x 64 x 64 | bottleneck2
bottleneck2 --> | 128 x 64 x 64 | bottleneck3

bottleneck3 --> | 128 x 64 x 64 | upsamp1
upsamp1 --> | 64 x 128 x 128 | upsamp2
upsamp2 --> | 32 x 256 x 256 | upsamp3
upsamp3 --> output

学習

Discriminator の学習を 5 回、Generator の学習を 1 回行い、それを 1 iteration とする。

Discriminator の学習

論文の実装とほぼ同じで、Conditional GAN + WGAN-gp といった感じ。

graph LR

subgraph Input A
    input-x-real("«Input»<br/>Spectrogram<br/>1 x 256 x 256<br/>[0, 1]")
    input-label-real-out("«Input»<br/>Labels<sub>output</sub><br/>1476<br/>{-1, 0, 1}")
end

subgraph Input B
    input-x-orig("«Input»<br/>Spectrogram<br/>1 x 256 x 256<br/>[0, 1]")
    input-label-fake-in("«Input»<br/>Labels<sub>input</sub><br/>1479<br/>{0, 1}")
end

subgraph Loss 
    classify(Domain Classification Loss)
    adversarial(Adversarial Loss)
end

gen1[Generator]
dis1[Disciminator]
plus1((+))
wasserstein(Wasserstein Distance)
x-fake("Fake<br/>1 x 256 x 256<br/>[0, 1]")
cross1[Sigmoid Cross Entropy]

sampling[Sampling]
gradient[Calculating Gradient Penalty]
gradient-penalty(Gradient Penalty)
x10((x10))
plus2((+))

dis2[Discriminator]
minus1(("-"))

input-x-real --> gen1
input-label-fake-in --> gen1
gen1 --> x-fake
x-fake --> dis1
dis1 --> | Adversarial | plus1
plus1 --> wasserstein

input-x-real --> dis2
dis2 --> | Adversarial | minus1
minus1 --> wasserstein

dis1 --> | Classification | cross1
input-label-real-out --> cross1
cross1 --> classify
x-fake --> sampling
input-x-orig --> sampling
sampling --> gradient
gradient --> gradient-penalty
gradient-penalty --> x10
x10 --> plus2
wasserstein --> plus2
plus2 --> adversarial

Generator の学習

論文で使用している loss に加えて、StarGAN でいうところの Identity Mapping Loss を導入した。StarGAN では、明度反転画像を生成するような学習をしてしまうケースが報告されており、それを回避できるのではないかと考えた。

更に、Feature Matching Loss も導入しているが、なぜ導入したかは不明。

graph LR

subgraph Input A
    x-real("«Input»<br/>Spectrogram<br/>1 x 256 x 256<br/>[0, 1]")
    label-real-in("«Input»<br/>Labels<sub>input</sub><br/>1479<br/>{0, 1}")
end

subgraph Input B
    x-orig("«Input»<br/>Spectrogram<br/>1 x 256 x 256<br/>[0, 1]")
    label-fake-in("«Input»<br/>Labels<sub>input</sub><br/>1479<br/>{0, 1}")
    label-fake-out("«Input»<br/>Labels<sub>output</sub><br/>1476<br/>{-1, 0, 1}")
end

subgraph Loss
    classify(Domain Classification Loss)
    adversarial(Adversarial Loss)
    recon(Reconstruction Loss)
    identity(Identity Mapping Loss)
    feature(Feature Matching Loss)
end

gen1[Generator]
gen2[Generator]
gen3[Generator]
x-fake("Fake<br/>1 x 256 x 256<br/>[0, 1]")
recon-mse[Mean Squared Error]
feat-mse[Mean Squared Error]
ident-mse[Mean Squred Error]
x10-recon((x10))
x10-feat((x10))
x10-ident((x10))
minus(("-"))

dis-fake[Discriminator]
dis-orig[Dsicriminator]
sce[Sigmoid Cross Entropy]

x-real --> gen1
label-fake-in --> gen1
gen1 --> x-fake

x-fake --> gen2
label-real-in --> gen2
gen2 --> recon-mse
x-real --> recon-mse
recon-mse --> x10-recon
x10-recon --> recon

x-fake --> dis-fake
x-orig --> dis-orig
dis-fake --> | Feature | feat-mse
dis-orig --> | Feature | feat-mse
feat-mse --> x10-feat
x10-feat --> feature

dis-fake --> | Classification | sce
label-fake-out --> sce
sce --> classify

dis-fake --> | Adversarial | minus
minus --> adversarial

x-real --> gen3
label-real-in --> gen3

gen3 --> ident-mse
x-real --> ident-mse

ident-mse --> x10-ident
x10-ident --> identity

学習パラメータ

optimizer: RMSprop

learning rate: 1e-5

minibatch size: 2

epoch: 203

結果

学習時間: 4.32 時間

ラベルの入力を問わず、ほぼ入力スペクトログラムと同じ出力を得るようになった。

学習曲線

Wasserstein Distance Domain Classification Loss Gradient Penalty Identity Mapping / Reconstruction Loss Feature Matching Loss

感想

冷静に考えればわかることだが、この Generator モデルでは、ラベルとマージした直後から考えると、横 256px のうち左右 62px 分しか影響しない。このため、想像のような出力が得られたかったのだと考える。