Select multiple pictures and videos locally, and generate new copy based on the pictures and videos and historical copy

I want to deploy a set of offline, select local pictures or videos (maybe dozens of them) on the front-end management page, and then select the previous word copy, and then generate new picture and text word copy through the big model by identifying these pictures and referring to the word copy description style I selected. How can I do it?