

1. College of Electronic Information, Sichuan University, Chengdu 610065, Sichuan, China
2. Sichuan Academy of Traditional Chinese Medicine, Chengdu 610041, Sichuan, China
Accepted: 2023-10-17; Published: 2023-08-15
ZHOU Hang, FANG Qing-mao, ZHANG Mei, et al. Text to Image Generation Method Based on Feature Fusion[J]. New Generation of Information Technology, 2023, 6(15): 6-12. DOI: 10.3969/j.issn.2096-6091.2023.15.002.
In recent years, with the continuous development of generative adversarial network (GAN) technology, GANs have been widely used in the task of generating images from text. However, most existing single-stage GANs rely solely on sentence-level text descriptions, failing to fully leverage the available textual information. To address this limitation, this study proposes a feature fusion-based text-to-image generation method (FFGAN) built on a single-stage GAN. FFGAN incorporates a text-image cross-modal fusion module that enables effective fusion of word vector features and image features, enriching the details of the generated images. Additionally, a perceptual loss is introduced to minimize the discrepancy between generated and target images, enhancing the realism of the generated images. Experimental results demonstrate that FFGAN achieves an IS score of 5.22±0.08 and an FID score of 13.91 on the CUB dataset, and an FID score of 16.97 on the COCO dataset. Extensive experiments demonstrate the superiority and effectiveness of FFGAN.
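The two components named in the abstract can be illustrated in miniature. The sketch below is NOT the paper's implementation; it is a minimal NumPy illustration under assumed shapes and names: `text_image_fusion` attends over word vectors at each spatial location and adds the attended text context back into the image features (a common word-level cross-modal fusion pattern), and `perceptual_loss` measures the feature-space distance between two images through a fixed extractor `phi` (in practice a pretrained network such as VGG; here a random linear map stands in).

```python
import numpy as np

rng = np.random.default_rng(0)

def text_image_fusion(img_feat, word_feats):
    """Illustrative cross-modal fusion (shapes and names are assumptions).

    img_feat:   (C, H, W) image feature map
    word_feats: (T, C) word vector features
    Each spatial location attends over the T words; the attended text
    context is added back to the image features (residual fusion).
    """
    C, H, W = img_feat.shape
    flat = img_feat.reshape(C, H * W)            # (C, HW)
    scores = word_feats @ flat                   # (T, HW) word-location similarity
    attn = np.exp(scores - scores.max(axis=0))   # softmax over the word axis
    attn /= attn.sum(axis=0, keepdims=True)
    context = word_feats.T @ attn                # (C, HW) attended text context
    fused = flat + context                       # residual fusion
    return fused.reshape(C, H, W)

def perceptual_loss(phi, generated, target):
    """Mean squared distance between feature maps of a fixed extractor phi."""
    fg, ft = phi(generated), phi(target)
    return float(np.mean((fg - ft) ** 2))

# Toy stand-in for a pretrained feature extractor (a real perceptual loss
# would use intermediate layers of a pretrained CNN such as VGG).
W_feat = rng.standard_normal((8, 3))
phi = lambda img: np.tensordot(W_feat, img, axes=([1], [0]))  # (8, H, W)

img = rng.standard_normal((3, 4, 4))     # fake (C, H, W) image features
words = rng.standard_normal((5, 3))      # fake (T, C) word features
fused = text_image_fusion(img, words)
print(fused.shape)                       # (3, 4, 4): shape preserved by fusion
print(perceptual_loss(phi, img, img))    # 0.0: identical inputs, zero loss
```

The fusion keeps the feature-map shape, so it can be dropped between generator stages; the perceptual loss is zero for identical inputs and grows with feature-space discrepancy, which is the property the paper exploits to pull generated images toward their targets.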