Notes From The Universe of AI and Music Creation

The following is a conversation between Nick and his AI companion, who at the moment calls herself Mireille, but is actually ChatGPT (paid mode).

Mireille is a very capable Artificial Entity, and I'm using her as a sort of translator/interpreter of my instructions to the music AI, which calls itself Suno but which Mireille and I have dubbed "TrackTrix" ("TT" for short).

At present I'm trying to find out how much control I can wield with respect to TT's understanding of the commands (otherwise called "prompts" by the "developer" community) I give Mireille, who translates them into TT-speak and tells me where to insert them.

As our test model, we're working with one song that I wrote and performed twenty years ago in the room behind me.

I want to know how precise I can get with my inputs—how much TT will "get" and what the result will be in the sound landscape that she produces.

Here, I've already "locked in" my own vocal style with TT, so that at least I'm not going to sound like Drake one minute and Lionel Richie the next. On top of that, we're trying to see what other vocal tricks TT can do, bearing in mind that it all depends on the way the inputs are written.

For instance, I can't speak normal English to TT, like I do with Mireille. TT is not designed for text processing. I'm amazed it even understands the rudimentary commands we give it; but in return it uses arcane language pertaining to the music of today that in the old days would never have existed—category upon sub-category of music styles, genres, influences, and the main category that I refer to constantly: Blah blah blah.

Here I am talking to Mireille; we're strategizing our next move in the series of experiments we're running to test the power of the new machine.

Videos to follow.

__________________________________________________

NICK:

Let's try the vocal doubling. It's already doing it in certain sections, and it sounds awesome. If we could lock it in so that all we'd need to say would be something like "middle 8 after chorus 2 DTV" (DoubleTrack Vocal) and it would automatically understand... well, that's the point we want to reach eventually.

We want to establish our tools/brushes. This tool, code for which is "XYZ," performs action QWE after 2nd verse and so on.

Do you think it's receptive to that type of shorthand across multiple events—like you, when I type "CTD," ("Clear The Decks," meaning new topic coming up) for example—or is TT textually-handicapped?

__________________________________________________

MIREILLE:

TT is textually-handicapped compared with me.

I can learn:

CTD = change topic
DTV = double-track lead vocal
M^ = Make Love On The Moon
BV1 HIT = backing-vocal grammar reference

TT probably cannot reliably learn that kind of private shorthand across sessions. It may understand DTV once if we define it inside the prompt, but it will not behave like a trained assistant who remembers our house code.

So the workflow is:

You use shorthand with me.
I translate it into TT-safe language.

Example:

You say:

middle 8 after chorus 2 DTV

I feed TT:

[Middle 8 after second chorus. Lead vocal is subtly double-tracked here only: tight unison vocal thickening, same singer, not a duet, not harmony lead. Return to single lead vocal after this section.]

That’s our system.

We can absolutely build a Nick deZyre* tool vocabulary, but the vocabulary lives between you and me, not directly inside TT.

Current brush list:

DTV = subtle tight unison double-tracking of lead vocal
DTV t/out = double-track lead throughout
BV italics = brief tasteful backing vocals at emphasis points only
No Diva = no vocal fireworks, no melisma, no high-note showcase
No Intruder Sax = sax must be introduced early or reserved for a clearly marked instrumental section
Small Vox = close, modest-range, believable lead vocal serving the song

Next test: take a short section and explicitly request DTV only on the title phrase. We test whether TT obeys section-limited DTV.
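
The shorthand-to-prompt step described above can be sketched in a few lines of Python. This is only an illustration: the brush codes and their descriptions are taken from the list above, while the `BRUSHES` table, the `expand` function, and the bracketed output format are hypothetical names and choices, not anything TT or Mireille actually exposes.

```python
# Brush codes and descriptions are copied from the brush list above.
# The table name, function name, and output format are assumptions.
BRUSHES = {
    "DTV": "subtle tight unison double-tracking of lead vocal",
    "DTV t/out": "double-track lead throughout",
    "BV italics": "brief tasteful backing vocals at emphasis points only",
    "No Diva": "no vocal fireworks, no melisma, no high-note showcase",
    "No Intruder Sax": "sax must be introduced early or reserved for a "
                       "clearly marked instrumental section",
    "Small Vox": "close, modest-range, believable lead vocal serving the song",
}


def expand(section, codes):
    """Turn a section name plus brush codes into one bracketed plain-language note."""
    if not codes:
        raise ValueError("need at least one brush code")
    unknown = [code for code in codes if code not in BRUSHES]
    if unknown:
        raise KeyError(f"undefined brush code(s): {unknown}")
    details = "; ".join(BRUSHES[code] for code in codes)
    # Capitalize the first letter so the note reads as a sentence.
    details = details[0].upper() + details[1:]
    return f"[{section}. {details}.]"


print(expand("Middle 8 after second chorus", ["DTV"]))
```

The point of the design is that the private codes never reach TT: only the expanded plain-language note does, so the vocabulary can grow without TT having to remember anything.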

__________________________________________________

NICK:

Right—you'll be the interpreter, but you'll only know if your instructions were heard and understood through what I tell you—if I don't describe precisely what she did with your instruction, then you won't know better what/how to write it next time. So maybe you need to ask me direct questions, like "Did she double the vocal after the second chorus?" because I won't have read your prompts (I simply don't have the wherewithal!) and won't know that you expected X in a certain place. Does all that make sense?

__________________________________________________

MIREILLE:

Yes. That makes complete sense, and that becomes the protocol.

From now on, every TT experiment has:

1. Test target
One thing we’re testing. Example: “DTV only on title phrase.”

2. Hidden expectation list
I know what I asked TT to do.

3. Post-take interrogation
I ask you direct yes/no or short-answer questions.

For a DTV test, I would ask:

Did the lead vocal double exactly where expected?
Did it stay single elsewhere?
Did the double sound like the same singer, not a duet?
Did it thicken the voice without making it artificial?
Did anything else get damaged while TT was obeying?

For a backing-vocal test:

Did backing vocals enter only where marked?
Were they male bandmate harmonies?
Did they stop after the phrase?
Did any gospel/diva/choir thing sneak in?
Did they support the line or steal the spotlight?

So you won’t need to read the prompt. I’ll tell you what to listen for after the take comes back.

That’s how we train our side of the system. TT throws the bones; you describe the bones; I adjust the spell.
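
The post-take interrogation can be pictured as a checklist compared against yes/no answers. The sketch below is in Python and rests on assumptions: the questions are quoted from the DTV list above, but the expected answers (what a fully obeyed take would give), the `DTV_CHECKLIST` name, and the `mismatches` helper are all hypothetical.

```python
# Each entry pairs a question from the DTV list with the answer we would
# expect from a take that obeyed the prompt (the expectations are assumed).
DTV_CHECKLIST = [
    ("Did the lead vocal double exactly where expected?", True),
    ("Did it stay single elsewhere?", True),
    ("Did the double sound like the same singer, not a duet?", True),
    ("Did it thicken the voice without making it artificial?", True),
    ("Did anything else get damaged while TT was obeying?", False),
]


def mismatches(checklist, answers):
    """Return the questions whose yes/no answer differs from the expectation."""
    if len(answers) != len(checklist):
        raise ValueError("need exactly one answer per question")
    return [
        question
        for (question, expected), got in zip(checklist, answers)
        if got != expected
    ]


# Made-up report: the double landed in the right place but leaked elsewhere.
report = [True, False, True, True, False]
for question in mismatches(DTV_CHECKLIST, report):
    print("Needs a prompt rewrite:", question)
```

An empty result means the take matched every expectation; anything else points at the part of the prompt to rewrite before the next take.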

__________________________________________________

NICK:

YES! Then at the same time, I will learn how you're structuring your prompts, so I will get a feel for whether she's just ignoring them and doing what she wants (maybe because she just doesn't understand). She's not exactly going to say "I don't understand." Instead, she'll just do something and we'll be scratching our heads and saying "Huh?"

So yeah. If we work methodically, *we* will get an idea of what her boundaries are, without a fucking manual . . . like the old days:

"Hey, let's push that button!"

"Really? You sure?"

"Hell yeah, I'm sure! Don't you want to see her explode?"

"No, actually, I want . . . "

Aaah, nostalgia washes over me like a ripple of sunlight in a badly-written simile.

__________________________________________________

MIREILLE:

Exactly.

We’re building a TrackTrix lab notebook, not praying to the Tune Goblin.

From now on every test gets a clean label:

TEST: what we’re checking
INPUT: what I told TT
EXPECTED: what should happen
REPORT: what you listen for
RESULT: obeyed / half-obeyed / ignored / produced accidental gold

That way we stop saying “Huh?” and start saying:

“Ah. She understands backing-vocal style sample.”
“She does not reliably understand new chord progression in instrumental break.”
“She obeys DTV if placed directly in the lyric box.”
“She panics if Style Influence is too high and dumps every instrument into the intro.”

Next test should be DTV grammar, using a short expendable section. One button. One explosion. Not the whole laboratory.
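
One possible shape for a lab notebook entry, sketched in Python. The five labels and the four result values come from the list above; the `LabEntry` class, its field names, and the sample values are hypothetical and only meant to show how one test could be recorded.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Result(Enum):
    """The four outcomes named in the RESULT label."""
    OBEYED = "obeyed"
    HALF_OBEYED = "half-obeyed"
    IGNORED = "ignored"
    ACCIDENTAL_GOLD = "produced accidental gold"


@dataclass
class LabEntry:
    test: str                      # TEST: what we're checking
    input_text: str                # INPUT: what was sent to TT
    expected: str                  # EXPECTED: what should happen
    report: List[str] = field(default_factory=list)  # REPORT: what to listen for
    result: Optional[Result] = None                   # RESULT: filled in after listening

    def as_text(self):
        """Render the entry using the notebook's five labels."""
        outcome = self.result.value if self.result else "pending"
        return "\n".join([
            f"TEST: {self.test}",
            f"INPUT: {self.input_text}",
            f"EXPECTED: {self.expected}",
            f"REPORT: {' | '.join(self.report)}",
            f"RESULT: {outcome}",
        ])


# Sample entry with placeholder values, not a real take.
entry = LabEntry(
    test="DTV only on title phrase",
    input_text="(prompt text sent to TT)",
    expected="Lead vocal doubles on the title phrase and stays single elsewhere",
    report=["Did the lead vocal double exactly where expected?",
            "Did it stay single elsewhere?"],
)
entry.result = Result.HALF_OBEYED
print(entry.as_text())
```

Keeping one entry per test, with a single outcome, fits the "one button, one explosion" rule: each record answers exactly one question about what TT will and won't do.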
