Is there a multi-modal model that can generate a mix of images and text when prompted with a query in text only?

Asked Dec 20 '24 at 04:34

Active Dec 20 '24 at 13:38

I am looking for a multi-modal model that can generate a mix of images and text when prompted with a query in text only.

Please find a sample expected input and output below.

Are there any models that can achieve this? Please suggest.

I have seen that NExT-GPT can generate images from a prompt. However, I was wondering whether there are models that can decide for themselves when part of a text answer is better presented as an image, and then include that image in the output.

e.g.

Input - What was the market share of smartphones in the year 2008?

Output - The market share of smartphones, broken down by manufacturer, is listed below:

Nokia: 265.6148 million (32.5% market share)
Motorola: 144.9204 million (17.7% market share)
Samsung: 103.7536 million (12.7% market share)
LG: 54.9246 million (6.7% market share)
Sony Ericsson: 51.7738 million (6.3% market share)
Siemens: 28.5906 million (3.5% market share)
Others: 166.9851 million (20.6% market share)
Total: 816.5629 million
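
To make the expected behaviour concrete, here is a minimal sketch of the "decide whether the text can be shown as an image" step. It is not a multimodal model, only a post-processing step written under my own assumptions: the function names (`extract_rows`, `should_render_as_chart`, `to_svg`), the "at least three rows summing to roughly 100%" heuristic, and the SVG layout are all hypothetical. It uses only the Python standard library.

```python
import html
import re

# Text answer copied from the sample output above (hypothetical model response).
ANSWER = (
    "Nokia: 265.6148 million (32.5% market share) "
    "Motorola: 144.9204 million (17.7% market share) "
    "Samsung: 103.7536 million (12.7% market share) "
    "LG: 54.9246 million (6.7% market share) "
    "Sony Ericsson: 51.7738 million (6.3% market share) "
    "Siemens: 28.5906 million (3.5% market share) "
    "Others: 166.9851 million (20.6% market share) "
    "Total: 816.5629 million"
)

# Matches "<name>: <units> million (<share>% market share)".
ROW = re.compile(r"([A-Za-z ]+?): ([\d.]+) million \(([\d.]+)% market share\)")


def extract_rows(text):
    """Return (label, share_percent) pairs found in the text."""
    return [(m.group(1).strip(), float(m.group(3))) for m in ROW.finditer(text)]


def should_render_as_chart(rows):
    """Assumed heuristic: chart the answer if it has 3+ shares summing to ~100%."""
    return len(rows) >= 3 and abs(sum(share for _, share in rows) - 100.0) < 1.0


def to_svg(rows, bar_height=20, scale=5):
    """Render the rows as a simple horizontal bar chart in SVG."""
    parts = []
    for i, (label, share) in enumerate(rows):
        y = i * (bar_height + 5)
        parts.append(
            f'<text x="0" y="{y + 15}" font-size="12">{html.escape(label)}</text>'
        )
        parts.append(
            f'<rect x="110" y="{y}" width="{share * scale:.1f}" '
            f'height="{bar_height}" fill="steelblue"/>'
        )
    height = len(rows) * (bar_height + 5)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="400" height="{height}">'
        + "".join(parts)
        + "</svg>"
    )


rows = extract_rows(ANSWER)
if should_render_as_chart(rows):
    svg = to_svg(rows)  # write this string to a .svg file to view the chart
```

What I am asking for is a model that makes this kind of decision itself and returns the text together with the image, rather than relying on a hand-written rule like the one above.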