1. College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, Sichuan, China
2. Sichuan Academy of Chinese Medicine Sciences, Chengdu 610041, Sichuan, China
ZHOU Hang, born in 1999, male, master's degree candidate. Research interests: image generation. Email: 1951574303@qq.com.
QING Lin-bo, born in 1982, male, professor and doctoral supervisor at Sichuan University. Research interests: image processing, pattern recognition, video communication.
HE Xiao-hai, born in 1964, male, professor. Research interests: image processing, pattern recognition, image communication.
ZHOU Hang, FANG Qing-mao, ZHANG Mei, et al. Text to Image Generation Method Based on Feature Fusion[J]. New Generation of Information Technology, DOI: 10.3969/j.issn.2096-6091.XXXX.XX.001.
In recent years, with the continuous development of generative adversarial network (GAN) technology, GANs have been widely applied to text-to-image generation. However, most existing single-stage GAN models rely only on sentence-level text descriptions and fail to fully exploit the available textual information. To address this limitation, this paper builds on a single-stage GAN and proposes a feature-fusion-based text-to-image generation method, FFGAN. On the one hand, a text-image cross-modal fusion module is constructed so that word-vector features and image features can be fused effectively, enriching the details of the generated images; on the other hand, a perceptual loss is introduced to reduce the distance between the generated and target images, yielding more realistic results. Experimental results show that FFGAN achieves an IS of 5.22±0.08 and an FID of 13.91 on the CUB dataset, and an FID of 16.97 on the COCO dataset. Extensive experiments demonstrate the superiority and effectiveness of FFGAN.
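The abstract does not detail the internal design of the text-image cross-modal fusion module, so the following PyTorch sketch only illustrates one common way to fuse word-level text features with an image feature map: word-to-pixel attention followed by a residual merge. The module name, dimensions, and fusion rule are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordImageFusion(nn.Module):
    """Illustrative word-to-pixel attention fusion (not FFGAN's actual module)."""
    def __init__(self, word_dim: int, img_channels: int):
        super().__init__()
        # Project word embeddings into the image feature space.
        self.proj = nn.Linear(word_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, word_feat: torch.Tensor) -> torch.Tensor:
        # img_feat:  (B, C, H, W) feature map from the image branch
        # word_feat: (B, T, D) per-word embeddings from the text encoder
        b, c, h, w = img_feat.shape
        words = self.proj(word_feat)                      # (B, T, C)
        pixels = img_feat.view(b, c, h * w)               # (B, C, H*W)
        attn = torch.bmm(words, pixels)                   # (B, T, H*W) word-pixel similarity
        attn = F.softmax(attn, dim=1)                     # weight the words for each pixel
        context = torch.bmm(words.transpose(1, 2), attn)  # (B, C, H*W) word context per pixel
        context = context.view(b, c, h, w)
        return img_feat + context                         # residual merge of the two modalities

# Example shapes: 18 words of dim 256 fused into a 32x32 feature map with 64 channels.
# fusion = WordImageFusion(word_dim=256, img_channels=64)
# out = fusion(torch.randn(4, 64, 32, 32), torch.randn(4, 18, 256))
```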
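Likewise, the perceptual loss mentioned in the abstract is typically computed as a distance between deep features of the generated and target images extracted by a fixed pretrained network. The minimal sketch below assumes a VGG-16 backbone, an intermediate-layer cutoff, and an L1 distance; the backbone, layer, and weighting actually used in the paper are not specified in the abstract.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Feature-space distance between generated and target images (illustrative)."""
    def __init__(self, cut: int = 16):
        super().__init__()
        # Frozen, pretrained VGG-16 truncated at an intermediate layer
        # (the backbone and cut point are assumptions, not the paper's choice).
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        self.extractor = nn.Sequential(*list(vgg.children())[:cut]).eval()
        for p in self.extractor.parameters():
            p.requires_grad = False
        self.criterion = nn.L1Loss()

    def forward(self, generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Inputs are (B, 3, H, W) images, assumed already normalized for VGG.
        # Comparing deep features penalizes perceptual rather than pixel-wise error.
        return self.criterion(self.extractor(generated), self.extractor(target))

# Typical use in the generator objective, with an assumed weight of 1.0:
# loss_G = loss_adversarial + 1.0 * PerceptualLoss()(fake_images, real_images)
```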
text-to-image generation; generative adversarial networks; cross-modal fusion; perceptual loss