Text-to-video model

A text-to-video model is a machine learning model that takes a natural-language description as input and generates one or more videos from it.^[1]

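Video prediction that renders realistic objects against a stable background can be performed with a sequence-to-sequence model: a recurrent neural network, joined by a connector convolutional neural network, encodes and decodes each frame pixel by pixel,^[2] and the video is generated using deep learning.^[3] Conditional generative models that draw on information already present in the text can be tested on data sets using variational autoencoders and generative adversarial networks (GANs).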
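The recurrent, frame-by-frame generation described above can be illustrated with a small sketch. It is not the method of any cited work: the names `embed_text`, `ToyTextToVideo`, and `generate` are hypothetical, the weights are random and untrained, the text encoder is a toy word-hashing scheme, and the connector convolutional network is omitted for brevity. The sketch only shows the data flow, in which a text vector initializes a recurrent state and each new frame is decoded from that state and the previous frame.

```python
import math
import random


def embed_text(prompt, dim=8):
    """Hypothetical toy text encoder: buckets words into a fixed-size unit vector."""
    vec = [0.0] * dim
    for word in prompt.lower().split():
        # Sum of code points is deterministic across runs (built-in hash() is not).
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_matrix(rows, cols, rng):
    """Random weight matrix; a real model would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyTextToVideo:
    """Untrained recurrent decoder: text vector -> sequence of tiny grayscale frames."""

    def __init__(self, text_dim=8, hidden_dim=16, height=4, width=4, seed=0):
        rng = random.Random(seed)
        self.height, self.width = height, width
        pixels = height * width
        self.w_text = make_matrix(hidden_dim, text_dim, rng)   # text -> initial state
        self.w_hh = make_matrix(hidden_dim, hidden_dim, rng)   # state -> state
        self.w_xh = make_matrix(hidden_dim, pixels, rng)       # previous frame -> state
        self.w_out = make_matrix(pixels, hidden_dim, rng)      # state -> pixels

    def generate(self, prompt, num_frames=5):
        text_vec = embed_text(prompt, len(self.w_text[0]))
        # Condition the recurrent state on the text.
        hidden = [math.tanh(h) for h in matvec(self.w_text, text_vec)]
        prev = [0.0] * (self.height * self.width)  # blank starting frame
        frames = []
        for _ in range(num_frames):
            # Recurrent update from the previous state and the previous frame.
            rec = matvec(self.w_hh, hidden)
            inp = matvec(self.w_xh, prev)
            hidden = [math.tanh(a + b) for a, b in zip(rec, inp)]
            # Decode every pixel of the next frame; sigmoid keeps values in (0, 1).
            prev = [1.0 / (1.0 + math.exp(-z)) for z in matvec(self.w_out, hidden)]
            frames.append(
                [prev[r * self.width:(r + 1) * self.width] for r in range(self.height)]
            )
        return frames


if __name__ == "__main__":
    video = ToyTextToVideo().generate("a cat runs on grass", num_frames=3)
    print(len(video), "frames of", len(video[0]), "x", len(video[0][0]), "pixels")
```

Because the weights are random, the output frames are noise; the sketch is meant only to make the conditioning and recurrence concrete. A trained system would learn the weights from paired text and video data, for example within the variational autoencoder or GAN setups mentioned above.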

See also

Sora (text-to-video model)

References

[1] Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. "Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022."
[2] "Leading India" (PDF).
[3] Narain, Rohit (29 December 2021). "Smart Video Generation from Text Using Deep Neural Networks" (in American English). Retrieved 12 October 2022.
