Is It Possible to Transform Speech or Text to Real Material Objects?

Transforming speech into objects may sound like an elusive concept, yet it is partly becoming reality with the advent of GenAI. The idea can be materialized through the synergistic combination of several AI types:
- Speech recognition. Required to let the generative model capture a spoken request.
- Text processing. At this stage the request is deciphered by a Large Language Model that extracts the relevant data from its semantic content.
- 3D modelling. A Generative Engineering Model (GEM) then converts the natural-language input into a detailed design suitable for printing.
- 3D printing. The concluding stage, at which an object is created according to the synthetically generated blueprint.
Even though the concept may seem futuristic, several AI models are already capable of processing textual content to create object designs. Based on Transformer or Diffusion architectures, they can be enhanced with techniques such as iterative sampling, distribution estimation, and hierarchical token sequences.
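To make the four-stage pipeline concrete, here is a minimal sketch of how the stages could be chained in Python. Every function here is a hypothetical placeholder invented for illustration, not a real API: an actual system would plug in a speech model, an LLM, a generative 3D model, and a slicer at the marked points.

```python
# A minimal, hypothetical sketch of the four-stage pipeline above. Each stage
# is reduced to a toy placeholder standing in for a real model or tool.

def recognize_speech(audio_path: str) -> str:
    # Stage 1: speech recognition. A real system would transcribe the audio
    # with a speech model; here we return a canned transcript.
    return "print a 40 mm cube"

def extract_design_spec(request: str) -> dict:
    # Stage 2: an LLM would extract structured parameters from the request;
    # this toy parser just pulls out the first number it finds.
    size = next((int(tok) for tok in request.split() if tok.isdigit()), 20)
    return {"shape": "cube", "size_mm": size}

def generate_design(spec: dict) -> str:
    # Stage 3: a generative engineering model would emit a printable design;
    # here we only name the file it would produce.
    return f"{spec['shape']}_{spec['size_mm']}mm.stl"

def slice_to_gcode(stl_path: str) -> str:
    # Stage 4: a slicer converts the design into printer instructions.
    return f"; G-code sliced from {stl_path}\nG28 ; home all axes\n"

# Spoken request in, printer instructions out.
print(slice_to_gcode(generate_design(extract_design_spec(recognize_speech("request.wav")))))
```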

The Concept of Industry 6.0

Industry 6.0 is a futuristic scenario in which the production cycle almost entirely excludes human interference; the only exception is the starting point, when a request is given to the robotized system.
Industry 6.0 implies that GenAI has a high potential for developing solid decision-making capabilities and succeeding even without human guidance. Operations would be handled by a swarm of robots, from autonomous machines to drones, each equipped with individual intelligence.
The production pipeline starts with a 2D signed distance function (SDF) that outlines the object's geometry; this is converted into a 3D stereolithography (STL) file, which in turn is translated into G-code to initiate 3D printing. According to the authors' estimation, the proposed system outperforms human developers by an improvement factor of 4.4.
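As an illustration of the SDF-to-STL step described above, the sketch below samples a simple hand-written SDF (a sphere, standing in for a generatively produced shape) on a grid, extracts a mesh with marching cubes, and writes a minimal ASCII STL file that a slicer could then translate into G-code. It assumes numpy and scikit-image are available and does not reproduce the paper's actual pipeline.

```python
import numpy as np
from skimage import measure

# Sample a signed distance function (here: a sphere of radius 0.4) on a grid.
# In the pipeline described above, the SDF would come from a generative model.
grid = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.4  # negative inside, positive outside

# Extract the zero level set as a triangle mesh (marching cubes).
verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)

# Write a minimal ASCII STL file that a slicer can turn into G-code.
with open("object.stl", "w") as f:
    f.write("solid generated\n")
    for tri in faces:
        a, b, c = verts[tri]
        n = np.cross(b - a, c - a)
        n = n / (np.linalg.norm(n) or 1.0)  # guard against degenerate triangles
        f.write(f"  facet normal {n[0]} {n[1]} {n[2]}\n")
        f.write("    outer loop\n")
        for v in (a, b, c):
            f.write(f"      vertex {v[0]} {v[1]} {v[2]}\n")
        f.write("    endloop\n  endfacet\n")
    f.write("endsolid generated\n")
```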

Transforming Speech to Real Material Objects

There is another concept that shows how verbal commands can be turned into real objects. It consists of three parts:
- System Framework
The core of the system includes:
- Speech recognition and language-processing models.
- Generative model for creating a mesh.
- Voxelization component to turn the mesh into building blocks (voxels); a minimal sketch follows this list.
- Assembly phase, when voxels are placed at their target coordinates.
Each stage requires a separate GenAI solution.
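As a toy illustration of the voxelization step, the sketch below snaps surface sample points to a grid and deduplicates them into voxel placements. It is a simplification under stated assumptions (real voxelizers fill the mesh volume rather than quantizing surface samples), and the function name is invented for the example.

```python
import numpy as np

def voxelize_points(points: np.ndarray, pitch: float) -> np.ndarray:
    # Snap each sampled point to a voxel index, then drop duplicates.
    # Each resulting (i, j, k) row is one building block for the assembly phase.
    indices = np.floor(points / pitch).astype(int)
    return np.unique(indices, axis=0)

# Hypothetical usage: in the real system, the points would be sampled
# from the generatively created mesh rather than drawn at random.
surface_samples = np.random.rand(1000, 3) * 10.0
voxels = voxelize_points(surface_samples, pitch=2.0)
print(f"{len(voxels)} voxels queued for the robotic arm")
```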
- System Hardware
The physical part of the system includes cuboctahedron-shaped voxels that can be assembled from any direction, a 6-axis robotic arm equipped with indexers for better alignment, and a conveyor belt.

- System Implementation
The key implementation factor is to find a balance between speed and failure prevention. It is suggested that the voxel dispenser should be set up first to achieve alignment and avoid collisions. Then the speed of the robotic arm's movements should undergo precise calibration.
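The speed-versus-failure trade-off can be pictured as a simple calibration loop: start fast, measure the placement success rate, and slow down until it meets a target. The sketch below is purely illustrative; the failure model and thresholds are invented for the example, and real calibration would rely on measured placement errors.

```python
import random

def place_voxel(speed: float) -> bool:
    # Hypothetical stand-in for one pick-and-place attempt: in this toy model,
    # the failure probability grows with the arm's speed.
    return random.random() > 0.1 * speed

def calibrate_speed(max_speed: float = 5.0, trials: int = 200,
                    target_success: float = 0.98) -> float:
    """Step the arm speed down until the placement success rate is acceptable."""
    speed = max_speed
    while speed > 0.1:
        successes = sum(place_voxel(speed) for _ in range(trials))
        if successes / trials >= target_success:
            return speed
        speed *= 0.8  # slow down and retest
    return speed

print(f"calibrated speed: {calibrate_speed():.2f}")
```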

Converting Speech to Virtual Object Control

Several speech-based object manipulation techniques have also been proposed.
- Object Selection in Virtual Reality
Picture: The object manipulation challenge in a VR setting with three levels of complexity
It is suggested that objects can be moved with verbal commands in a simulated reality. For that purpose, a training set is prepared, which contains:
- Utterances. These refer to the sizes, shapes, colors, and features of objects.
- Intents. These represent the manipulation commands.
The Azure speech-to-text tool is used for transcribing human requests.
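A toy version of such a training set, paired with a classical baseline intent classifier, might look as follows. The utterance/intent pairs are invented for illustration, and scikit-learn is assumed; the actual study's data and models are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical utterance -> intent pairs in the spirit of the training set above.
training_data = [
    ("pick up the small red cube", "grab"),
    ("grab the tiny crimson box", "grab"),
    ("move the tall blue cylinder left", "move_left"),
    ("slide the blue pillar to the left", "move_left"),
    ("rotate the shiny sphere", "rotate"),
    ("spin the glossy ball around", "rotate"),
]
utterances, intents = zip(*training_data)

# A classical baseline intent classifier; production systems would use an LLM.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(utterances, intents)
print(clf.predict(["spin the sphere"]))  # likely predicts 'rotate'
```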

- Text Selection in Virtual Reality
An experiment compared three hands-free selection methods in VR: Blink, which uses blinking as the command signal; Dwell, which does the same with gazing; and Voice, which processes standard voice requests. The test, featuring 24 participants, showed that blinking outperformed the other methods in terms of precision and speed.