This may be overly obvious, but I believe it needs to be said: you're going to have to write some code, and above all else, you're going to have to approach this project as many smaller pieces which you test & troubleshoot, then combine to form the final solution. Maybe I'm just reading too much into the messages you've already written, but it kinda seems like you're imagining an "easy" solution where almost everything is already done by existing code. Libraries do exist for the really tough parts, like reading the SD card and efficiently drawing on the display. But you're going to need to do quite a bit of work on this too. You need the right mindset: get the pieces working, then build up the solution from well-tested parts. If you jump into this expecting some existing code to do almost everything with only a minor edit, you're almost certainly going to end up frustrated.
The first major decision, which Theremingenieur pointed out, is whether to "play" the file at regular audio speed, or just read it directly as fast as possible. If the file is 4 minutes of music, playing it through the audio lib will take 4 minutes to draw the waveform. Maybe that's what you want? But if you'd like the waveform to appear as quickly as possible, you'll need to read the file directly. You've not said much about *why* you're doing this... we don't have the context to understand whether you need it to animate slowly as the music plays, or show up quickly like you'd see in an audio editing program. Aside from the tech details, can you see how better communication here would allow us to help you more? We're real humans with a lot of programming experience, but that expertise doesn't help much if we don't understand what you're trying to do.
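If you go the "read it directly" route, here's a rough sketch of pulling 16-bit samples out of the file a block at a time. It's written against a standard std::istream so you can test it on a PC; on the Teensy you'd read raw bytes from the SD library's File object instead. It reads in blocks because 4 minutes of 16-bit mono audio at 44.1 kHz is about 21 MB, far more than a Teensy's internal RAM. The function name readPcm16Block is made up for this example, and it assumes mono 16-bit little-endian PCM with a canonical 44-byte WAV header, which real files don't always have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>  // std::istream, plus std::istringstream for off-device testing

// Hypothetical helper (not from any library): reads up to `count` 16-bit
// little-endian PCM samples from `in` into `out`.
// Returns the number of complete samples read; 0 means end of data.
std::size_t readPcm16Block(std::istream& in, int16_t* out, std::size_t count) {
    assert(out != nullptr || count == 0);
    std::size_t n = 0;
    unsigned char b[2];
    while (n < count && in.read(reinterpret_cast<char*>(b), 2)) {
        // WAV stores each sample low byte first.
        uint16_t raw = static_cast<uint16_t>(b[0] | (b[1] << 8));
        out[n++] = static_cast<int16_t>(raw);
    }
    return n;
}

// ASSUMPTION: canonical 44-byte WAV header (mono, 16-bit). Real files can
// contain extra chunks, so a robust reader would parse the chunk list to
// find "data" instead of trusting this fixed offset.
constexpr std::streamoff kAssumedWavHeaderBytes = 44;
```

You'd seek past the header once with in.seekg(kAssumedWavHeaderBytes), then call readPcm16Block in a loop with a small buffer (a few hundred samples) until it returns 0, handing each block to whatever min/max logic you write.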
No matter how you get the data, you're also going to have to put it onto the display. Maybe you've got the ILI9341 display? Or maybe it's some other display?
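Whatever display it is, at some point each sample value has to become a pixel row. Here's a minimal sketch of that conversion, assuming the waveform fills the full display height with silence on the middle row (an ILI9341 is 240 pixels tall in landscape orientation). The function name sampleToRow is just an illustration, not from any library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts a 16-bit sample (-32768..32767) to a pixel
// row on a display `height` pixels tall. Zero (silence) lands on the middle
// row; positive samples go up (smaller y), negative samples go down.
int sampleToRow(int16_t sample, int height) {
    assert(height > 1);
    const int center = height / 2;
    // Integer scaling: full scale (+/-32768) maps to +/-center rows.
    int y = center - (static_cast<int32_t>(sample) * center) / 32768;
    // Clamp, since -32768 can otherwise land one row past the bottom.
    if (y < 0) y = 0;
    if (y > height - 1) y = height - 1;
    return y;
}
```

For a pixel column whose lowest and highest samples are mn and mx, the top of the line is sampleToRow(mx, 240) and the bottom is sampleToRow(mn, 240). Graphics libraries like Adafruit_GFX have a drawFastVLine(x, y, h, color) call that can draw that line, with h being bottom minus top plus 1.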
No matter which display you use, this project is going to involve mapping large chunks of audio data onto each pixel column. In this example waveform, it's pretty clear the positive and negative peaks are drawn differently.
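To get a feel for how large those chunks are, here's the arithmetic as a tiny sketch. It assumes mono audio and that you use every horizontal pixel (320 columns for an ILI9341 in landscape); both function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Total samples in a clip of the given length (mono assumed).
uint32_t totalSamples(uint32_t seconds, uint32_t sampleRate) {
    return seconds * sampleRate;
}

// How many samples each pixel column must summarize, rounded up so the
// final column still covers the end of the file.
uint32_t samplesPerColumn(uint32_t total, uint32_t columns) {
    assert(columns > 0);
    return (total + columns - 1) / columns;
}
```

For 4 minutes at 44.1 kHz, totalSamples(240, 44100) is 10,584,000, and samplesPerColumn(10584000, 320) is 33,075. So every single pixel column has to represent 0.75 seconds of audio, which is why one number per column isn't going to look like a real waveform.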
So even if you use the audio lib, the peak object, which gives only a single number, probably won't be enough. You're probably going to have to get the raw audio data and write a loop to find the min (usually negative) & max (usually positive) values over each range of samples that corresponds to a pixel column on your display.
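Here's a minimal sketch of that min/max loop. For clarity it works on samples already in a std::vector, which is fine for testing on a PC; on a Teensy with a long file you'd apply the same idea block by block as the data arrives. ColumnPeaks and computePeaks are hypothetical names for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Peak values for one pixel column.
struct ColumnPeaks {
    int16_t min;  // lowest sample in the column's range (usually negative)
    int16_t max;  // highest sample in the column's range (usually positive)
};

// Splits `samples` into `columns` consecutive ranges, one per pixel column,
// and records the min and max sample in each range.
std::vector<ColumnPeaks> computePeaks(const std::vector<int16_t>& samples,
                                      std::size_t columns) {
    assert(columns > 0);
    std::vector<ColumnPeaks> peaks(columns, ColumnPeaks{0, 0});
    const std::size_t n = samples.size();
    for (std::size_t col = 0; col < columns; ++col) {
        // Proportional split: every sample falls in exactly one column,
        // even when n isn't a multiple of `columns`.
        const std::size_t begin = col * n / columns;
        const std::size_t end = (col + 1) * n / columns;
        if (begin == end) continue;  // fewer samples than columns: leave flat
        auto range = std::minmax_element(
            samples.begin() + static_cast<std::ptrdiff_t>(begin),
            samples.begin() + static_cast<std::ptrdiff_t>(end));
        peaks[col] = ColumnPeaks{*range.first, *range.second};
    }
    return peaks;
}
```

Drawing each column is then a vertical line from its max down to its min. The proportional split spreads any leftover samples evenly across the columns instead of dumping them all into the last one.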
How exactly you do that depends on whether you want gradual, real-time drawing or instant rendering, which display you have, and perhaps other details nobody can anticipate without knowing more about the purpose of your project.
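For the gradual, real-time case, the same min/max idea has to work incrementally, since the samples trickle in. If you use the audio lib, something like AudioRecordQueue hands your code blocks of 128 samples at a time. Here's a rough sketch of a tracker you'd feed one sample at a time; the class name PeakAccumulator and its methods are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical incremental min/max tracker for the "gradual" approach:
// feed it samples as they arrive and it reports when a full pixel
// column's worth has been seen.
class PeakAccumulator {
public:
    explicit PeakAccumulator(uint32_t samplesPerColumn)
        : perColumn_(samplesPerColumn) {
        assert(samplesPerColumn > 0);
        reset();
    }

    // Adds one sample. Returns true when this sample completes a column;
    // columnMin()/columnMax() then hold that column's peaks until the next add().
    bool add(int16_t s) {
        if (seen_ == perColumn_) reset();  // previous column already reported
        if (s < min_) min_ = s;
        if (s > max_) max_ = s;
        ++seen_;
        return seen_ == perColumn_;
    }

    int16_t columnMin() const { return min_; }
    int16_t columnMax() const { return max_; }

private:
    void reset() {
        min_ = INT16_MAX;  // any real sample will be <= this
        max_ = INT16_MIN;  // any real sample will be >= this
        seen_ = 0;
    }

    uint32_t perColumn_;
    uint32_t seen_ = 0;
    int16_t min_ = 0;
    int16_t max_ = 0;
};
```

Each time a block arrives, you'd loop over its 128 samples, and whenever add() returns true, draw the next column using columnMin() and columnMax(), then advance your x position. The render-instantly version can reuse the exact same class, just fed from the file as fast as you can read it.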