Sketch & Text to 3D CAD Model (technical design)

I’m looking for a way to generate a 3D model from a hand sketch plus descriptive text (possibly with additional descriptive labels in the sketch).

In my attempts I could get the GPT-4 model to generate OpenSCAD as output files, but I’m currently not really happy with the results … too far away from something that could be used as a template for e.g. 3D printing.
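For what it’s worth, one thing that helped me get more usable OpenSCAD out of a chat model was pinning the reply format with a system prompt and then stripping the markdown fence before saving the `.scad` file. This is just a minimal sketch: the prompt wording is my own, and `extract_scad` is a hypothetical helper, not part of any library.

```python
# Sketch: constrain a chat model to emit OpenSCAD, then pull the bare code
# out of its markdown-fenced reply. Prompt wording is illustrative only.
import re

SYSTEM_PROMPT = (
    "You are a CAD assistant. Reply with a single OpenSCAD program only, "
    "inside a ```scad fenced block. Use parametric variables for all "
    "dimensions so the model can be tweaked before printing."
)

def extract_scad(reply: str) -> str:
    """Return the contents of the first fenced code block in a reply."""
    match = re.search(r"```(?:scad)?\s*\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

# Example reply, shaped like what a model typically returns:
reply = """Here is the model:
```scad
width = 20; height = 10;
cube([width, width, height]);
```"""
print(extract_scad(reply))  # just the OpenSCAD source, no fences or chatter
```

You would pass `SYSTEM_PROMPT` as the system message of whatever chat API you use, then write `extract_scad(reply)` straight to a `.scad` file for OpenSCAD to render.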

Any experiences, lessons learned, prompts, models, recommendations?


I’ll provide you with one possible workflow. This is a very bleeding-edge demo that I’ve seen, which uses ComfyUI to create six perspectives, generates normal maps, and so on to arrive at a simple 3D object with pretty low-resolution texture maps … but it’s sure to get better in the months to come. I’ve personally thought about doing something similar to what you want: trying to generate consistent scenes with image generation so that you could do photogrammetry or create an instant NeRF with instant-ngp. If you consider what Sora will be able to do, you could probably get the camera to revolve around an object, or have the object spin like a lazy Susan … when that gets released, I’m sure people will have their workflows for it.


Thanks for the link. I’ve seen this tool in another video, but the idea of using sketches as a source is cool.

The video reminded me of my results playing around with the Kinect 2.0 camera and 3D Scan some years ago. But the output of a teddy bear was very blurry.

This video might be interesting for you.
Making a NeRF movie with NVIDIA’s Instant NGP


That first video was a new one for me; being able to sweep through different changes with Picasso was pretty spectacular. I didn’t watch the full video, but saw enough to think, dang, I need to try that in the near future. And instant-ngp is something I’ve dabbled with … there’s a GitHub project that lets you view the NeRF primitives (or whatever the correct terminology is) in VR. However, it’s written against an outdated XR paradigm that isn’t used any more, so it’s on my to-do list to pass it through AI to make it work with the new XR standards … then you could make a bunch of NGPs/NeRFs and make a composition out of them. Thanks for the video link to that Picasso demo … solid find.

And I lost sight of what you wanted to do with 3D models by talking about VR uses of NeRFs/NGPs … but perhaps you could get better watertight models by sculpting within Unity. Those NGPs are super messy with their points, and the cropping options are good but won’t be great for complex shapes … I don’t even know how you could make an STL out of an NGP :joy:

Neuron dropped a new video, and I think it’s more promising than the previous one I linked to.

I want to go from text prompt to 3D file.
At the moment I am using Makehuman and Blender to make 3D figures for model trains. I have tried a few methods. Rendernet to 2D, TripoSR to 3D.
However, MakeHuman/Blender already know human shapes and rigging for poses.
Too much AI compute is being wasted; what if AI could be trained on all those BVH and human model files?

Haven’t tried anything yet, myself, but very interested.

I’ve thought about hooking an LLM into Fusion 360 or Blender’s Python interpreter: give the LLM the API docs and use it via function calling. There are probably better ways; if you can get a good Python library, you could do the whole thing in a GPT.
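The function-calling side of that idea could look something like the sketch below: expose a single tool that runs a Python snippet inside the host app. The tool name, schema, and dispatcher are all my own invention, not an established API; inside Blender you’d seed the namespace with the real `bpy` module instead of the empty dict used here.

```python
# Sketch: one "run_python" tool an LLM could call to drive a 3D app's
# Python interpreter. All names here are hypothetical.
import json

RUN_PYTHON_TOOL = {
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a Python snippet in the host 3D app "
                       "(e.g. Blender's bpy interpreter) and return its result.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}

def dispatch_tool_call(name: str, arguments: str, namespace: dict) -> str:
    """Route a tool call from the model to a local handler."""
    if name != "run_python":
        return json.dumps({"error": f"unknown tool {name}"})
    local_vars = dict(namespace)  # inside Blender: {"bpy": bpy}
    exec(json.loads(arguments)["code"], local_vars)
    # Convention (mine): the snippet stores its answer in `result`.
    return json.dumps({"result": repr(local_vars.get("result"))})

# Stand-in call, no Blender required:
out = dispatch_tool_call("run_python", json.dumps({"code": "result = 2 + 3"}), {})
print(out)  # {"result": "5"}
```

Obvious caveat: `exec` on model-generated code is as dangerous as it sounds, so you’d want this sandboxed or at least behind a confirmation step.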

There’s OpenPose as a ControlNet, which takes a source image with a human in some pose and uses that as a constraint for generating an image where the subject is more or less in the same pose. There are also single-camera mocap solutions that do a decent job of driving a humanoid character’s pose in Unity; I think MediaPipe is such a solution … haven’t looked into it for a while.

I’ve done the reverse operation: have a humanoid character in Unity, get the pose by assigning OpenPose-compliant color schemes for joints and limbs, and feed that into ComfyUI, with pretty good results. But I’m trying to figure out how to save the 3D coordinates of the joints into something that ComfyUI can use directly: a properly structured JSON file matching what the OpenPose ControlNet deduces through CV. Surely it’s possible, I just didn’t figure it out in the little bit I tried.

And you could go from text prompt to 3D file: follow Neuron’s second-to-last video to do the image-to-3D-model step, and then just use a normal workflow up to getting the source image that is the starting point of that video. The results aren’t spectacular yet, as he shows, but eventually it will probably be pretty good … with lots and lots of GPU VRAM, of course :rofl:

Also, I think generative mocap ought to be the next frontier. We’ve got images, audio, video, text, music … but no dang mocap (to my knowledge). That would come in so handy.
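On the “save joint coordinates as OpenPose-style JSON” point: the core of the format is a flat list of `(x, y, confidence)` triplets under `pose_keypoints_2d`. A minimal sketch, assuming you already have the projected 2D pixel coordinates in keypoint order; the exact fields expected vary between OpenPose versions and ComfyUI nodes (the `canvas_width`/`canvas_height` keys here are an assumption worth checking against your specific node), so treat this as a starting point, not a spec.

```python
# Sketch: flatten known 2D joint positions into an OpenPose-style JSON dict.
# Confidence is fixed at 1.0 because the points come from a rig, not from
# computer-vision detection. Field names beyond pose_keypoints_2d are
# assumptions; verify them against the ComfyUI node you feed this to.
import json

def joints_to_openpose(joints_2d, width, height):
    """joints_2d: list of (x, y) pixel coordinates in keypoint order."""
    flat = []
    for x, y in joints_2d:
        flat.extend([float(x), float(y), 1.0])
    return {
        "version": 1.3,
        "canvas_width": width,
        "canvas_height": height,
        "people": [{"pose_keypoints_2d": flat}],
    }

# Two example joints on a 512x512 canvas:
pose = joints_to_openpose([(256, 100), (256, 180)], 512, 512)
with open("pose.json", "w") as f:
    json.dump(pose, f)
```

The remaining work is the projection itself: in Unity you’d run each joint through `Camera.WorldToScreenPoint` (flipping Y, since OpenPose counts pixels from the top) before handing the list to a writer like this.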

ohhh. :astonished: Haven’t independently checked this claim, but Neuron asserts “In this tutorial I walk you through a basic SV3D workflow in ComfyUI. SV3D stands for Stable Video 3D and is now usable with ComfyUI. With SV3D in ComfyUI you can create multi-view sequences for now. In the near future there will be the possibility to create 3D models with support for defining a camera path. This will give a whole new world of possibility’s”