Conversation with Gemini

Here's a thinking process to arrive at the explanation and solution for the UnicodeEncodeError: 'latin-1' codec can't encode characters error in the Flask context:

  1. Understand the Error: The user is getting a UnicodeEncodeError: 'latin-1' codec can't encode characters... error specifically when decoding the standard output (stdout_info) from a subprocess.Popen call within a Flask endpoint, even though the same code works outside Flask. The error occurs at stdout_info.decode().

  2. Analyze the Error Message:

    • UnicodeEncodeError: This seems counter-intuitive at first because the code is calling .decode(). Decode errors are usually UnicodeDecodeError. An EncodeError during a decode operation suggests something else is happening around the decode call, or the default encoding being used for some other operation (perhaps implicitly) is causing the issue.
    • 'latin-1' codec: This tells us that Python, somewhere, is trying to encode data using the latin-1 codec. Latin-1 (ISO-8859-1) is a limited, 8-bit encoding that cannot represent many characters found in other languages (like Chinese, Japanese, Korean, Cyrillic, emojis, etc.).
    • can't encode characters in position...: This confirms that the data being encoded contains characters that don't exist in the latin-1 character set.
  3. Identify the Operation: The error occurs at stdout_info.decode(). The decode() method converts bytes into a string. It takes an optional encoding argument. If omitted, it uses the system's default encoding.

  4. Why the Difference (Standalone vs. Flask)?

    • Standalone: When running the script directly from a terminal, the environment (including default encodings) is often determined by the terminal's settings (locale, environment variables like LANG, LC_ALL, PYTHONIOENCODING). Modern systems and terminals often default to UTF-8, which can handle almost all Unicode characters. This is likely why it works standalone.
    • Flask (Web Server Context): When running under a web server (like Flask's development server, Gunicorn, uWSGI), the environment might be different. The server process might inherit or set a different default encoding. Older systems or improperly configured servers might default to ASCII or latin-1.
  5. Hypothesize the Cause:

    • The stdout_info contains bytes representing characters that are not latin-1.
    • When stdout_info.decode() is called without specifying an encoding, Python uses a default encoding.
    • In the Flask environment, this default encoding is apparently latin-1.
    • The decode() operation itself might not be the direct cause of the UnicodeEncodeError. It's more likely that the result of the decode() (a Unicode string) is then being implicitly encoded somewhere else by Flask or the underlying web server using the restrictive latin-1 codec, perhaps during logging, response generation, or interaction with the WSGI server. However, the user states the error happens at the decode line. This suggests that decode() is trying to use latin-1 as the source encoding, which would be a UnicodeDecodeError. Let's re-read the error: UnicodeEncodeError. This strongly points towards an implicit encoding step after or during the process.
    • Correction/Refinement: Could the decode() method itself trigger an encode error under specific circumstances? Maybe if there's an interaction with system locales or standard streams within the decode implementation? Less likely. Let's reconsider the initial hypothesis: The default encoding used by decode() is the issue. Perhaps the error message itself is misleading, or the traceback points to the decode line, but the actual encoding failure happens slightly later (e.g., when handling the exception message itself).
    • Let's focus on the most probable cause: stdout_info.decode() is attempting to decode bytes assuming they are encoded in the wrong format (likely the system default, which is latin-1 in the Flask environment), and it fails when it encounters bytes that don't fit that encoding. Wait, that would be a UnicodeDecodeError.
    • Second Correction/Refinement: What if the command being run (command_info) produces output that is encoded in something other than latin-1 (very likely, probably UTF-8)? When stdout_info.decode() is called without an argument in the Flask context, it defaults to latin-1. It tries to interpret UTF-8 (or other encoded) bytes as if they were latin-1. This should cause a UnicodeDecodeError. Why EncodeError?
      • Maybe the error reporting mechanism itself is trying to format the error message, which contains problematic characters from stderr_info or the offending part of stdout_info, and that formatting involves encoding to latin-1? Yes, this is plausible. The error occurs at the line, but the final exception object construction or reporting might involve encoding.
      • Consider the stderr_info.decode() call inside the if block. If an error occurs there, and stderr_info contains non-latin-1 characters, decoding it with the default latin-1 could fail (DecodeError), or trying to log or return that error string could involve an implicit encode to latin-1 (EncodeError).
      • Let's assume the most direct cause: stdout_info contains non-latin-1 bytes (likely UTF-8). Calling stdout_info.decode() implicitly uses latin-1 in the Flask environment. This should be a DecodeError. The fact it's an EncodeError is confusing but points to the environment's default encoding being the problem.
  6. Formulate the Explanation:

    • Explain that bytes.decode() converts bytes to a string using a specific encoding.
    • If no encoding is specified, it uses a default, which varies by environment.
    • In the Flask environment, the default seems to be latin-1.
    • The output (stdout_info) from the subprocess likely contains characters (e.g., from the video title) encoded in a different format (most commonly UTF-8) that cannot be represented in latin-1.
    • Trying to decode these bytes as latin-1 (or potentially an implicit encoding step during error handling/logging within Flask using latin-1) causes the error. The error type UnicodeEncodeError is slightly unusual here but points to latin-1 being the problematic codec invoked implicitly.
  7. Propose Solutions:

    • Explicit Decoding: The most robust solution is to explicitly specify the correct encoding when decoding. Since the subprocess likely outputs UTF-8 (especially if it's handling potentially international text like video titles), use stdout_info.decode('utf-8'). Apply the same to stderr_info.decode('utf-8').
    • Error Handling: Add error handling to the decode step itself using the errors parameter (e.g., errors='ignore' or errors='replace') if losing some characters is acceptable, although decoding with the correct encoding is preferred.
    • Check Subprocess Output Encoding: Verify what encoding the command run by subprocess.Popen actually produces. If it's not UTF-8, adjust the decode() call accordingly. yt-dlp (a common tool for this) usually outputs UTF-8.
    • Environment Configuration (Less Ideal): While possible to try and configure the Flask/server environment to use UTF-8 as default, explicitly handling encoding in the code is generally safer and more portable.
  8. Structure the Answer:

    • Start by explaining the error message and what it means in this context.
    • Explain why it likely happens in Flask but not standalone (different default encodings).
    • Pinpoint the problematic line (stdout_info.decode()) and the implicit reliance on the default encoding.
    • Provide the primary solution: explicit decoding with UTF-8.
    • Show the corrected code snippet.
    • Mention the need to apply the same fix to stderr_info.decode().
    • Briefly mention alternative solutions/considerations (error handling, checking source encoding).
    • Conclude with why explicit encoding is the best practice.
  9. Refine Language: Ensure the explanation is clear, avoids jargon where possible, and directly addresses the user's code and problem. Make sure to clarify the slightly confusing EncodeError vs DecodeError aspect by focusing on the role of the implicit default latin-1 codec.
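The EncodeError versus DecodeError question in the reasoning above can be checked directly. The following is a minimal sketch with a made-up title string (an assumption, not the user's data). It shows that decoding bytes as latin-1 does not raise at all, because latin-1 assigns a character to every byte value, while encoding a string with characters outside latin-1 is what produces UnicodeEncodeError.

```python
# Hypothetical title standing in for the yt-dlp output (not from the user's data).
title = "日本語のタイトル"
raw = title.encode("utf-8")  # what a UTF-8 emitting subprocess would hand back

# bytes -> str: in Python 3, decode() with no argument uses 'utf-8'.
round_trip = raw.decode()

# bytes -> str with latin-1: never raises, but yields mojibake instead of the title.
mojibake = raw.decode("latin-1")

# str -> bytes with latin-1: this is the direction that raises UnicodeEncodeError.
try:
    title.encode("latin-1")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    print(exc)
```

If this holds in the failing environment too, the error cannot come from the decode() call itself; something is encoding a str to latin-1 afterwards.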

Okay, let's break down this UnicodeEncodeError: 'latin-1' codec can't encode characters... error in your Flask application.

It's a bit counter-intuitive because you're calling .decode(), which usually raises UnicodeDecodeError if it fails. The fact that you're getting an EncodeError suggests that somewhere after or during the processing initiated by the .decode() call, Python is implicitly trying to encode the resulting string using the latin-1 codec, and failing because the string contains characters not representable in latin-1.

Here's the likely scenario:

  1. Subprocess Output (stdout_info): The command you're running (command_info, likely something like yt-dlp or ffmpeg to get video info) produces output as raw bytes (stdout_info). This output almost certainly contains text (like the video title) encoded using UTF-8, which is the standard for handling diverse characters (different languages, emojis, symbols).
  2. .decode() without Encoding: When you call stdout_info.decode() without specifying an encoding, Python uses a default encoding.
  3. Different Default Encodings:
    • Standalone Script: When you run your script directly from your terminal, the default encoding is often determined by your system's locale settings (e.g., LANG, LC_ALL). On modern systems (Linux, macOS, recent Windows), this default is frequently UTF-8. Since the output bytes are likely UTF-8, stdout_info.decode() works correctly because the default matches the actual encoding.
    • Flask Environment: When running inside a web server environment (like Flask's development server, Gunicorn, uWSGI), the environment might be more minimal or configured differently. In many such environments, the default Python encoding can fall back to something more basic, like ASCII or, in your case, latin-1 (also known as ISO-8859-1).
  4. The Error:
    • Your Flask app tries to run stdout_info.decode().
    • It implicitly uses the default encoding, which is latin-1 in this environment.
    • It attempts to interpret the UTF-8 bytes coming from the subprocess as if they were latin-1. This should typically cause a UnicodeDecodeError.
    • However, the specific error UnicodeEncodeError strongly suggests that the problematic step might be slightly later: perhaps Python successfully decodes some bytes, gets a Unicode string, and then Flask (or an underlying library, or even the error reporting mechanism itself) tries to implicitly encode this string using the default latin-1 codec for logging, creating the response, or interacting with the WSGI interface, and this encoding fails because the string contains characters (like non-European characters or special symbols from the video title) that don't exist in latin-1. The traceback points to the .decode() line as the originator of the data that ultimately causes the encoding failure.
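The default-encoding theory above can be tested rather than assumed. Below is a small sketch; the helper name encoding_report is hypothetical. Running it once from the terminal and once inside the Flask process shows whether the two environments actually differ. Note that in Python 3 bytes.decode() with no argument uses 'utf-8'; the locale mainly affects text streams such as sys.stdout and files opened without an explicit encoding.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings Python uses in the current process."""
    return {
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        "sys.getfilesystemencoding": sys.getfilesystemencoding(),
        # Streams may be replaced by the server, so read the attribute defensively.
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "sys.stderr.encoding": getattr(sys.stderr, "encoding", None),
    }

report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

Inside a Flask view the same dictionary could be returned with jsonify to compare against the terminal output.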

Solution: Explicitly Specify the Encoding

The most robust solution is to never rely on the default encoding. You should explicitly tell Python how to decode the bytes, based on what encoding the subprocess produces (which is almost always UTF-8 for JSON output containing text).

Modify the decoding lines like this:

Python
import subprocess
import json
# Assuming 'command_info' is defined elsewhere

process_info = subprocess.Popen(command_info, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout_info, stderr_info = process_info.communicate()

# Specify 'utf-8' for decoding stderr
if process_info.returncode != 0:
    # Use errors='replace' or errors='ignore' as a fallback if stderr might contain broken bytes
    error_message = stderr_info.decode('utf-8', errors='replace')
    return f"Error fetching video info: {error_message}", 500

# Specify 'utf-8' for decoding stdout
try:
    # Decode stdout using UTF-8
    decoded_stdout = stdout_info.decode('utf-8')
    # Load the JSON data from the decoded string
    video_info = json.loads(decoded_stdout)
    video_title = video_info.get('title', 'untitled_video')
    # ... rest of your code using video_title ...

# Add specific error handling for JSON decoding and potential unexpected output
except json.JSONDecodeError:
    return "Error: Could not decode JSON from video info.", 500
except UnicodeDecodeError:
    # This might happen if the subprocess output wasn't actually UTF-8
    return "Error: Could not decode video info output (invalid UTF-8).", 500
except Exception as e:
    # Catch other potential errors
    return f"An unexpected error occurred: {str(e)}", 500

# Example return for success (adjust as needed for your Flask endpoint)
# return jsonify({"title": video_title}), 200

Key Changes:

  1. stderr_info.decode('utf-8', errors='replace'): Decodes the standard error using UTF-8. errors='replace' is added as a safety measure – if stderr contains bytes that aren't valid UTF-8, they'll be replaced with the Unicode replacement character (U+FFFD, �) instead of crashing the decoding.
  2. decoded_stdout = stdout_info.decode('utf-8'): Explicitly decodes the standard output using UTF-8. This is crucial.
  3. video_info = json.loads(decoded_stdout): Loads the JSON from the correctly decoded string.
  4. Added try...except blocks for better error handling around decoding and JSON parsing.

By specifying 'utf-8', you tell Python exactly how to interpret the bytes from the subprocess, regardless of the Flask environment's default encoding, resolving the UnicodeEncodeError.
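As an alternative to decoding by hand, subprocess.run accepts encoding and errors arguments and returns already-decoded text. The sketch below is an illustration under assumptions: a child Python process stands in for yt-dlp (so the example runs anywhere), and child_code is a hypothetical stand-in that writes a small UTF-8 JSON document.

```python
import json
import subprocess
import sys

# Stand-in for the real command: a child process that writes UTF-8 JSON to stdout.
# The \u escapes keep the command line itself pure ASCII.
child_code = r"""import sys; sys.stdout.buffer.write('{"title": "\u65e5\u672c\u8a9e"}'.encode('utf-8'))"""
command_info = [sys.executable, "-c", child_code]

# encoding/errors make run() decode stdout and stderr for us, independent of the locale.
result = subprocess.run(
    command_info,
    capture_output=True,
    encoding="utf-8",
    errors="replace",
)

video_info = json.loads(result.stdout)
video_title = video_info.get("title", "untitled_video")
```

With this form result.stdout and result.stderr are str objects, so there is no bare .decode() call left that could pick up a different codec.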

Why does video_info = json.loads(stdout_info.decode('utf-8', errors='replace')) or video_info = json.loads(stdout_info.decode('utf-8')) get the same message, "UnicodeEncodeError: 'latin-1' codec can't encode characters in ..."?

Here's a thinking process to diagnose why the UnicodeEncodeError: 'latin-1' might persist even after explicitly decoding with utf-8:

  1. Analyze the New Information: The user has implemented the primary suggestion (decode('utf-8') or decode('utf-8', errors='replace')) but is still getting the exact same UnicodeEncodeError: 'latin-1' codec can't encode characters... error, seemingly originating from the json.loads(...) line or shortly after.

  2. Revisit the Error: UnicodeEncodeError means something is trying to encode a Unicode string into bytes using the latin-1 codec, and failing because the string contains characters not present in latin-1. The error isn't about decoding the initial bytes anymore (since we specified utf-8), but about what happens with the successfully decoded string.

  3. Consider the Code Flow:

    • stdout_info (bytes, likely UTF-8 encoded)
    • decoded_stdout = stdout_info.decode('utf-8') -> Creates decoded_stdout (Python Unicode string). This step should now succeed.
    • video_info = json.loads(decoded_stdout) -> Parses the Unicode string into a Python dictionary (video_info). This should also succeed if the decoded string is valid JSON.
    • video_title = video_info.get('title', 'untitled_video') -> Extracts a string value (video_title) from the dictionary. This is still a Python Unicode string.
    • Implicit Encoding Step: Something after this must be triggering an implicit encoding using latin-1.
  4. Where could the implicit latin-1 encoding happen after successful decoding and JSON parsing?

    • Logging: Is Flask or your code trying to log the video_info dictionary, the video_title, or maybe the original decoded_stdout? If the logging mechanism is configured (or defaults in the Flask/server environment) to use latin-1, it will fail when trying to log strings containing non-latin-1 characters.
    • Returning the Response: How is the Flask endpoint returning data? If it's trying to construct an HTTP response (e.g., using render_template, jsonify, or just returning a string) and the process involves an intermediate step that defaults to latin-1 encoding, the error could occur there. jsonify should handle Unicode correctly by encoding to UTF-8 and setting the Content-Type header appropriately, but maybe something else is interfering? If returning a plain string, Flask might try to encode it using a default.
    • Debugging Output: Is there a print() statement somewhere trying to print video_info or video_title to the console where Flask's output is being redirected? If that console/redirect mechanism expects latin-1, printing Unicode could trigger the error.
    • Error Handling Itself: Could the error be happening when Python tries to format the traceback or the exception message itself? If the original stdout_info or stderr_info bytes (which might be included in an error message) are causing issues during the error reporting phase that uses latin-1. This seems less likely if the traceback clearly points to the json.loads line after the decode was supposedly successful, but it's possible the error actually occurs slightly later and the traceback is slightly misleading.
    • Flask/WSGI Internals: Interactions between Flask, the WSGI server (like Gunicorn, uWSGI, or Flask's development server), and Python's standard streams (sys.stdout, sys.stderr) can sometimes lead to unexpected encoding issues if the environment isn't configured for UTF-8. The server might be configured with a latin-1 locale.
    • Database Interaction: Is the code trying to store video_title or parts of video_info in a database immediately after fetching? If the database connection isn't configured for UTF-8, an implicit encoding might occur. (Less likely based on the provided snippet, but possible in the full code).
    • External Libraries: Is another library being used immediately after getting video_title that might perform implicit encoding?
  5. Prioritize Likely Causes:

    • Logging seems very plausible.
    • Response generation/returning data from the Flask view.
    • print() statements directed to a non-UTF8 console/log.
    • Environment/WSGI server configuration.
  6. Formulate Diagnostic Steps:

    • Isolate the Line: Confirm exactly which line the traceback points to. Is it truly json.loads, or the line after it where video_info or video_title is first used?
    • Simplify: Temporarily comment out everything after the json.loads line within the try block. Does the error still occur? If not, uncomment lines one by one to find the trigger.
    • Check Logging: Review Flask logging configuration. Is it writing to a file? To the console? What encoding is it using? Explicitly configure logging to use UTF-8.
    • Check Response: How is the data returned? Ensure jsonify is used for JSON responses. If returning HTML via render_template, ensure the template engine and the data passed to it handle Unicode correctly. If returning a plain string, explicitly encode it: return video_title.encode('utf-8') (though this requires setting the content type manually and isn't standard practice for text).
    • Check prints: Remove any diagnostic print() statements that might be printing the problematic data. Use logging instead.
    • Check Environment: Verify the environment Flask is running in. Check locale settings on the server. Check how the WSGI server (Gunicorn, uWSGI) is launched – are locale environment variables (LC_ALL=C.UTF-8, LANG=C.UTF-8) being set correctly? Sometimes WSGI servers run in a very minimal C locale which defaults to ASCII or Latin-1.
    • Inspect the Data: Add a line before json.loads to log or print ascii(decoded_stdout). This shows the string with non-ASCII characters written as escape sequences (\uXXXX), which can reveal problematic characters without triggering encoding errors during the print/log itself. In Python 3, repr() leaves printable non-ASCII characters unescaped, so it is not safe for this purpose. Check if this string looks like valid JSON.
  7. Structure the Answer:

    • Acknowledge that this is strange since explicit decoding should fix decoding errors.
    • Explain that the UnicodeEncodeError now strongly suggests the problem is not decoding stdout_info, but rather an implicit encoding using latin-1 happening later in the Flask request lifecycle.
    • List the most likely places this implicit encoding could occur (logging, response generation, print statements, environment/WSGI).
    • Provide concrete diagnostic steps for the user to try:
      • Pinpoint the exact line.
      • Simplify the code block.
      • Examine logging.
      • Examine how the response is returned.
      • Remove print.
      • Check the server environment/locale.
      • Inspect the decoded data safely using ascii() (in Python 3, repr() does not escape non-ASCII characters).
    • Emphasize checking the WSGI server launch configuration and locale settings as a common culprit in web environments.
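The print/logging hypothesis in the list above can be reproduced without touching the server. This is a rough sketch that simulates a latin-1 output stream with io.TextIOWrapper; the helper name and the title are hypothetical. It also shows that repr() is not a safe way to print such data in Python 3, while ascii() is.

```python
import io

def write_to_latin1_stream(text):
    """Write text to an in-memory stream that encodes as latin-1, like a misconfigured stdout."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
    stream.write(text + "\n")
    stream.flush()
    return stream.buffer.getvalue()

title = "日本語のタイトル"  # hypothetical title

# repr() keeps the non-ASCII characters, so the write still fails.
try:
    write_to_latin1_stream(repr(title))
    repr_ok = True
except UnicodeEncodeError:
    repr_ok = False

# ascii() escapes them, so the same stream accepts the text.
ascii_bytes = write_to_latin1_stream(ascii(title))
```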

That's certainly puzzling! If explicitly decoding with stdout_info.decode('utf-8') still results in a UnicodeEncodeError: 'latin-1' codec can't encode..., it strongly suggests the problem isn't the decoding of stdout_info itself anymore.

Instead, the error is likely happening after the decoding and json.loads are successful, during a later step where Python or Flask is implicitly trying to encode the resulting Unicode string (or data containing it) using the environment's problematic default encoding (latin-1).

Here’s a breakdown of why this might still happen and how to troubleshoot:

  1. The Data is Now Unicode: After stdout_info.decode('utf-8'), you have a Python Unicode string (decoded_stdout). After json.loads(decoded_stdout), you have a Python dictionary (video_info) potentially containing Unicode strings (like video_title). These steps likely succeeded.

  2. Implicit Encoding Later: The UnicodeEncodeError indicates that somewhere after json.loads, this Unicode data is being processed in a way that triggers an automatic encoding attempt using latin-1. Common culprits in a Flask context include:

    • Logging: If you are logging the video_info, video_title, or even the decoded_stdout itself (perhaps in an exception handler or just for debugging), and the logging framework (Flask's default logger, or standard logging) is configured (or defaults in that environment) to use latin-1 for its output stream (file or console), it will fail when it encounters characters not in latin-1.
    • Returning the Response: How are you using video_title or video_info in your Flask return statement?
      • If returning a plain string (return video_title), Flask might try to encode it using a default encoding before sending it.
      • If using render_template and passing video_title to a template, the template rendering engine or Flask's response creation might involve an encoding step.
      • Even jsonify could potentially interact with other misconfigured parts of the stack, though it's usually robust with Unicode.
    • Printing to Console: Are there any print(video_info) or print(video_title) statements? If Flask's standard output is being redirected to a log file or console that expects latin-1, printing Unicode directly can cause this error.
    • Exception Handling/Tracebacks: Sometimes, the error occurs when Python tries to format the traceback or an error message that includes the problematic string data, and the mechanism for formatting that error uses the default encoding.
    • Environment/WSGI Server: The WSGI server (like Gunicorn, uWSGI, or Flask's dev server) might be running in an environment (e.g., a Docker container, a systemd service) that hasn't had its locale set correctly to support UTF-8 (e.g., it defaults to C or POSIX locale which implies ASCII/Latin-1). Flask might inherit this limited default encoding.
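If logging turns out to be the culprit, one option is to give the file handler an explicit encoding rather than relying on the environment. A minimal sketch follows; the logger name, helper name and temporary log path are assumptions for illustration only.

```python
import logging
import os
import tempfile

def make_utf8_file_logger(path, name="video_app"):
    """Create a logger whose file handler always writes UTF-8."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, handler

log_path = os.path.join(tempfile.mkdtemp(), "app.log")
logger, handler = make_utf8_file_logger(log_path)
logger.info("title=%s", "日本語のタイトル")

# Close the handler so the file is flushed before reading it back.
handler.close()
logger.removeHandler(handler)

with open(log_path, encoding="utf-8") as fh:
    logged = fh.read()
```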

Troubleshooting Steps:

  1. Pinpoint the Exact Line: Double-check the full traceback. Does the error really occur on the json.loads line, or is it on a subsequent line where video_info or video_title is used (e.g., in a log.info(), print(), return, or render_template call)?
  2. Isolate with ascii(): Temporarily modify the code to see the decoded data safely before json.loads and check the data after json.loads without triggering encoding. In Python 3, repr() leaves printable non-ASCII characters unescaped, so printing it can still fail on a latin-1 stream; ascii() escapes them:
    Python
    try:
        decoded_stdout = stdout_info.decode('utf-8')
        print("--- Decoded stdout (ascii): ---")
        print(ascii(decoded_stdout)) # ascii() escapes non-ASCII, so this print cannot hit an encoding error
        print("--- End Decoded stdout ---")
    
        video_info = json.loads(decoded_stdout)
        print("--- Video Info (ascii): ---")
        print(ascii(video_info)) # Check the structure and content safely
        print("--- End Video Info ---")
    
        video_title = video_info.get('title', 'untitled_video')
        print("--- Video Title (ascii): ---")
        print(ascii(video_title)) # Check the title safely
        print("--- End Video Title ---")
    
        # Now, comment out ALL subsequent lines that USE video_info or video_title
        # (logging, returning, printing directly, etc.)
        # and see if the error still happens. If not, uncomment them one by one.
    
        # Example: Temporarily return a simple string
        return "Processing potentially successful up to here", 200
    
    # Keep your existing except blocks
    except json.JSONDecodeError:
        # Maybe log decoded_stdout here (carefully) if JSON fails
        return "Error: Could not decode JSON from video info.", 500
    except UnicodeDecodeError:
        return "Error: Could not decode video info output (invalid UTF-8).", 500
    except Exception as e:
        # Log the exception properly here; !a escapes any non-ASCII in the message
        print(f"An unexpected error occurred: {type(e).__name__} - {str(e)!a}")
        # import traceback
        # traceback.print_exc() # This might also trigger encoding errors
        return f"An unexpected error occurred: {type(e).__name__}", 500
    
  3. Check Logging Configuration: Explicitly configure Flask/Python logging to use UTF-8 encoding, especially if logging to files.
  4. Check Response Generation: Ensure you're using jsonify(video_info) or similar for API responses. If rendering templates, ensure the template file itself is saved as UTF-8 and Flask is configured correctly.
  5. Check Environment/WSGI Launch: This is a common cause in web deployments. Ensure the environment where your Flask app runs has its locale set to UTF-8.
    • Check locale command output on the server.
    • When launching Gunicorn/uWSGI, explicitly set environment variables: LC_ALL=C.UTF-8 LANG=C.UTF-8.
    • If using Docker, ensure your base image has UTF-8 locales configured and use ENV LC_ALL=C.UTF-8 LANG=C.UTF-8 in your Dockerfile.

By explicitly decoding with 'utf-8' you've likely solved the initial decoding, but the environment's insistence on latin-1 is causing problems later in the request handling. Focus on where the successfully decoded Unicode data is used next.
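To pinpoint where the encode actually happens, the traceback object can be inspected programmatically instead of relying on the error message alone. Below is a sketch with hypothetical stand-in functions (parse_title, build_header, handler); in the real app the same innermost_frame helper could be called from the except block.

```python
import traceback

def innermost_frame(exc):
    """Return (function name, line number) of the frame where exc was raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1]
    return last.name, last.lineno

# Hypothetical stand-ins for the steps of a request handler.
def parse_title():
    return "日本語のタイトル"

def build_header(title):
    # Mimics a layer that encodes header text as latin-1.
    return f"attachment; filename={title}.webm".encode("latin-1")

def handler():
    title = parse_title()
    return build_header(title)

try:
    handler()
    culprit = None
except UnicodeEncodeError as exc:
    culprit, lineno = innermost_frame(exc)
    print(f"raised in {culprit}() at line {lineno}")
```

Here the innermost frame is build_header, not the parsing step, which is the kind of distinction the first troubleshooting step is after.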

Below is my complete Flask app for your reference.


from flask import Flask, Response
from flask_cors import CORS
import subprocess
import json

import os
os.environ['PYTHONIOENCODING'] = 'utf-8' # does not work

app = Flask(__name__)
CORS(app)

from flask import jsonify

@app.route('/', methods=['GET'])
def usage():
    USAGE = {'usage':'curl -o <output-file-name>.webm "http://your.host/download_audio/<YouTube-URL-encoded>"'}
    return jsonify(USAGE), 200

@app.route('/download_audio/<path:youtube_url>', methods=['GET'])
def download_audio(youtube_url):
    # Fetch video info (including title) using yt-dlp
    command_info = [
        'yt-dlp',
        '--no-check-certificate',
        '--skip-download',
        '--print-json',
        f'{youtube_url}'
    ]

    try:
        # Run yt-dlp to get the video info as JSON
        process_info = subprocess.Popen(command_info, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout_info, stderr_info = process_info.communicate()

        if process_info.returncode != 0:
            return f"Error fetching video info: {stderr_info.decode()}", 500

        # Extract video title from JSON output
        #video_info = json.loads(stdout_info.decode())
        video_info = json.loads(stdout_info.decode('utf-8', errors='replace'))
        video_title = video_info.get('title', 'untitled_video')

        # Now, download audio using yt-dlp
        command_audio = [
            'yt-dlp',
            '--no-check-certificate',
            '-f', 'bestaudio',
            '-o', '-',
            f'{youtube_url}'
        ]

        # Download the audio
        process_audio = subprocess.Popen(command_audio, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout_audio, stderr_audio = process_audio.communicate()

        if process_audio.returncode != 0:
            return f"Error downloading video: {stderr_audio.decode()}", 500

        # Set the filename to the video title
        response = Response(stdout_audio, mimetype='audio/webm')
        response.headers["Content-Disposition"] = f"attachment; filename={video_title}.webm"

        # Add Access-Control-Expose-Headers to allow the client to access Content-Disposition
        response.headers["Access-Control-Expose-Headers"] = "Content-Disposition"

        return response
    except Exception as e:
        return f"An error occurred: {str(e)}", 500

if __name__ == "__main__":
    app.run(port=5000)

  1. Analyze the Code: The user provided the full Flask app.

    • It uses yt-dlp twice: first to get video info (title) using --print-json, then to download the audio stream to stdout.
    • It decodes the JSON output using stdout_info.decode('utf-8', errors='replace') - this part seems correct now.
    • It extracts the video_title.
    • It runs the second yt-dlp command to get the raw audio bytes (stdout_audio).
    • It creates a Flask Response object with the audio bytes (stdout_audio) and mimetype audio/webm.
    • Crucially: It sets the Content-Disposition header using an f-string: f"attachment; filename={video_title}.webm".
    • It returns the Response.
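    • The PYTHONIOENCODING environment variable attempt is noted as not working. Python reads that variable once at interpreter startup, so assigning os.environ['PYTHONIOENCODING'] inside the already-running script has no effect on the current process, and even when set beforehand it only governs stdin/stdout/stderr.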
    • Error handling uses f-strings like f"Error fetching video info: {stderr_info.decode()}" and f"Error downloading video: {stderr_audio.decode()}".
  2. Identify the Remaining Potential Failure Point: Now that decoding stdout_info is explicitly UTF-8, where could the latin-1 encoding error still occur?

    • The stderr_info.decode() and stderr_audio.decode() calls in the error returns don't specify an encoding. If yt-dlp writes non-latin-1 characters to stderr and an error occurs, this could trigger the error when Flask tries to format the f-string response using the default latin-1.
    • The Content-Disposition header line: response.headers["Content-Disposition"] = f"attachment; filename={video_title}.webm"
      • video_title is a Unicode string (potentially containing non-latin-1 characters).
      • HTTP headers traditionally were expected to be ASCII or Latin-1. While modern standards (RFC 5987, RFC 6266) allow for encoding non-ASCII characters in filenames within headers (using filename* parameter with URL encoding), directly putting raw Unicode characters into the filename parameter is problematic and might cause issues depending on the HTTP server (Flask/Werkzeug) or client.
      • Werkzeug (Flask's underlying library) and the WSGI server encode header values as latin-1 when constructing the final HTTP response. WSGI (PEP 3333) requires header strings to be representable in ISO-8859-1, independent of the system locale, which would explain why the codec named in the error is always latin-1. If video_title contains incompatible characters, this step raises the UnicodeEncodeError. This is the most likely culprit.
  3. How to Fix the Content-Disposition Header: The standard way to handle non-ASCII filenames in Content-Disposition is to use RFC 5987 encoding. This involves specifying the charset (UTF-8) and percent-encoding the characters. The format looks like filename*=UTF-8''<percent-encoded-filename>.

    • Python's urllib.parse.quote can be used for percent-encoding.
    • Need to construct both a plain filename= (for older clients, potentially lossy) and a filename*= (for modern clients).
  4. Refine Error Handling Decoding: Add explicit 'utf-8', errors='replace' to the .decode() calls for stderr_info and stderr_audio as well, just to be safe and consistent.

  5. Structure the Solution:

    • Acknowledge the code structure and the explicit UTF-8 decoding for stdout_info.
    • Pinpoint the most likely remaining issue: setting the Content-Disposition header directly with a Unicode video_title containing non-latin-1 characters, which fails when the WSGI layer encodes header values as latin-1.
    • Explain the standard way (RFC 5987) to encode filenames in headers.
    • Provide the corrected code snippet for setting the header, using urllib.parse.quote. Include both filename= (best effort ASCII conversion) and filename*= (UTF-8 percent-encoded).
    • Recommend also adding explicit decoding to the stderr handling for robustness.
    • Mention that if this still fails, the full traceback should show which call raises; the header is the prime candidate now.
  6. Code Implementation Details (Header):

    • Import urllib.parse.
    • Create an ASCII-safe version for filename=: video_title.encode('ascii', errors='ignore').decode('ascii') might work, but just removing non-ASCII might be simpler or using a generic name if it contains only non-ASCII. A simpler approach for the fallback might be to sanitize the title heavily or just use a fixed name. Let's stick to the standard: provide both.
    • Generate the filename*= value: filename_encoded = urllib.parse.quote(video_title, safe='') (safe='' so that a '/' in the title is percent-encoded too) then f"filename*=UTF-8''{filename_encoded}".
    • Generate a safe filename= fallback. A common pattern is to replace non-ASCII: filename_ascii = ''.join(c if ord(c) < 128 else '_' for c in video_title). Ensure it's not empty.
    • Combine them: f"attachment; filename=\"{filename_ascii}.webm\"; filename*=UTF-8''{filename_encoded}.webm" (Need to handle the .webm extension correctly in both parts).
  7. Refine Header Implementation: Let's make the fallback safer. If the title contains non-ASCII, maybe the ASCII fallback is just downloaded_audio.webm. Or try to transliterate? Simpler is often better for the fallback. Let's try replacing non-ASCII with underscores.

    Python
    import re
    import urllib.parse
    
    # Sanitize for basic filename= parameter (replace anything outside printable ASCII with _)
    sanitized_title_ascii = re.sub(r'[^\x20-\x7E]+', '_', video_title)
    # Further sanitize for filesystem/header safety (remove slashes, etc.) - optional but good
    # sanitized_title_ascii = re.sub(r'[\\/*?:"<>|]', '_', sanitized_title_ascii)
    if not sanitized_title_ascii.strip('_'): # Handle case where title was ONLY non-ascii
        sanitized_title_ascii = 'downloaded_audio'
    
    # Encode for filename*= parameter (safe='' so '/' is percent-encoded as well)
    encoded_title_utf8 = urllib.parse.quote(video_title, safe='')
    
    disposition = f'attachment; filename="{sanitized_title_ascii}.webm"; filename*=UTF-8\'\'{encoded_title_utf8}.webm'
    response.headers["Content-Disposition"] = disposition
    

    This looks robust.

  8. Final Review: Add imports, apply the fix to the code, add the stderr decoding fixes, and write the explanation.
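
The header hypothesis in the reasoning above can be checked without Flask. The sketch below assumes only that the server encodes header values as latin-1 (as WSGI servers do); latin1_encodable and the sample title are hypothetical names chosen for illustration.

```python
import urllib.parse


def latin1_encodable(value: str) -> bool:
    """Return True if the string survives the latin-1 encoding applied to headers."""
    try:
        value.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


title = "Пример 🎵"  # sample title with Cyrillic and an emoji

# Raw Unicode placed straight into the header value
raw_header = f"attachment; filename={title}.webm"

# Percent-encoded UTF-8 in the filename* parameter (pure ASCII after quoting)
encoded_header = (
    "attachment; filename*=UTF-8''"
    + urllib.parse.quote(title, safe="")
    + ".webm"
)

print(latin1_encodable(raw_header))
print(latin1_encodable(encoded_header))
print(latin1_encodable("attachment; filename=café.webm"))
```

The last line is there because a title such as "café" fits in latin-1, which may explain why the endpoint works for some videos and fails for others.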

Okay, thank you for providing the full code. Looking at it, even with stdout_info.decode('utf-8', errors='replace'), there is one likely source of the remaining UnicodeEncodeError: 'latin-1', plus a related decoding weakness worth fixing at the same time:

  1. The Content-Disposition Header (Most Likely):

    Python
    response.headers["Content-Disposition"] = f"attachment; filename={video_title}.webm"
    

    Here, video_title is a Python Unicode string. If it contains characters that are not representable in ASCII or Latin-1 (e.g., emojis, Cyrillic, CJK characters), you are embedding that raw Unicode string directly into an HTTP header.

    HTTP headers have strict rules. Modern browsers support RFC 5987/6266, which allow UTF-8 filenames through a special filename* syntax, but putting raw Unicode into the older filename= parameter does not work. Under WSGI (PEP 3333), header values must contain only characters representable in ISO-8859-1, and the server encodes them as latin-1 when it writes the response, independent of the system locale. That encoding step fails on any character in video_title that latin-1 doesn't support, raising the UnicodeEncodeError. It also explains why the endpoint works for titles made up of ASCII or Western European characters only.

  2. Error Handling Decode Calls:

    Python
     if process_info.returncode != 0:
         # No encoding or errors argument: defaults to UTF-8 with errors='strict'
         return f"Error fetching video info: {stderr_info.decode()}", 500
     # ...
     if process_audio.returncode != 0:
         # No encoding or errors argument: defaults to UTF-8 with errors='strict'
         return f"Error downloading video: {stderr_audio.decode()}", 500
    

    You correctly decoded stdout_info with utf-8 and errors='replace', but the error handling calls .decode() on stderr_info and stderr_audio with no arguments. In Python 3 that defaults to UTF-8 with errors='strict', independent of the locale, so it will not raise the latin-1 UnicodeEncodeError. It can, however, raise UnicodeDecodeError if yt-dlp writes bytes that are not valid UTF-8 to stderr, which would hide the real error message.
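
To make the second point concrete, here is a small sketch of the two decode behaviours. The stderr bytes are made up for illustration and are not real yt-dlp output.

```python
# Simulated stderr containing a byte (0xFF) that is never valid in UTF-8.
stderr_bytes = b"ERROR: bad title \xff"

try:
    stderr_bytes.decode()  # same as decode('utf-8', errors='strict')
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = f"UnicodeDecodeError: {exc.reason}"

# errors='replace' substitutes U+FFFD for each undecodable byte instead of raising.
lenient_text = stderr_bytes.decode("utf-8", errors="replace")

print(strict_result)
print(lenient_text)
```

Note that the strict call fails with a decode error, not an encode error, so it is a separate problem from the header.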

Solutions:

  1. Fix Content-Disposition Header: Use the standard approach (RFC 5987) to encode the filename properly for HTTP headers. This involves providing both a basic ASCII-safe filename= for older clients and a filename*= with UTF-8 encoding for modern clients.

  2. Fix stderr Decoding: Explicitly decode stderr using utf-8 (with error handling like replace or ignore) just like you did for stdout.

Corrected Code:

Python
from flask import Flask, Response, jsonify
from flask_cors import CORS
import subprocess
import json
import re # Import re for sanitizing
import urllib.parse # Import quote for URL-encoding

# Setting os.environ['PYTHONIOENCODING'] at runtime does not change this interpreter's own
# streams (the variable is read at startup); it is only inherited by child processes, so it is omitted here.

app = Flask(__name__)
CORS(app, expose_headers=["Content-Disposition"]) # Expose the header so browser JavaScript can read it on cross-origin requests

@app.route('/', methods=['GET'])
def usage():
    USAGE = {'usage':'curl -o <output-file-name>.webm "http://your.host/download_audio/<YouTube-URL-encoded>"'}
    return jsonify(USAGE), 200

@app.route('/download_audio/<path:youtube_url>', methods=['GET'])
def download_audio(youtube_url):
    # Fetch video info (including title) using yt-dlp
    command_info = [
        'yt-dlp',
        '--no-check-certificate',
        '--skip-download',
        '--print-json',
        youtube_url
    ]

    try:
        # Run yt-dlp to get the video info as JSON
        process_info = subprocess.Popen(command_info, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout_info, stderr_info = process_info.communicate()

        if process_info.returncode != 0:
            # Decode stderr explicitly using utf-8
            error_message = stderr_info.decode('utf-8', errors='replace')
            return f"Error fetching video info: {error_message}", 500

        # Decode stdout and extract video title from JSON output
        video_info = json.loads(stdout_info.decode('utf-8', errors='replace'))
        video_title = video_info.get('title', 'untitled_video')

        # Now, download audio using yt-dlp
        command_audio = [
            'yt-dlp',
            '--no-check-certificate',
            '-f', 'bestaudio',
            '-o', '-', # Output to stdout
            youtube_url
        ]

        # Download the audio
        process_audio = subprocess.Popen(command_audio, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # Use communicate() to avoid deadlocks if stderr buffer fills up
        stdout_audio, stderr_audio = process_audio.communicate()

        if process_audio.returncode != 0:
            # Decode stderr explicitly using utf-8
            error_message = stderr_audio.decode('utf-8', errors='replace')
            return f"Error downloading video: {error_message}", 500

        # --- Prepare Headers Correctly ---
        response = Response(stdout_audio, mimetype='audio/webm')

        # Create a safe ASCII version for filename= (replace anything outside printable ASCII with _)
        # Also remove potentially problematic characters for filenames
        sanitized_title_ascii = re.sub(r'[^\x20-\x7E]+', '_', video_title)
        sanitized_title_ascii = re.sub(r'[\\/*?:"<>|]', '_', sanitized_title_ascii)
        sanitized_title_ascii = sanitized_title_ascii.strip(' ._') # Remove leading/trailing problematic chars
        if not sanitized_title_ascii: # Handle case where title was ONLY non-ascii/invalid chars
            sanitized_title_ascii = 'downloaded_audio'

        # Create the UTF-8 version for filename*= (percent-encode the original title;
        # safe='' so that '/' is encoded as well)
        encoded_title_utf8 = urllib.parse.quote(video_title, safe='')

        # Combine both using RFC 5987 syntax
        # Note the double quotes around the filename= value
        disposition = f'attachment; filename="{sanitized_title_ascii}.webm"; filename*=UTF-8\'\'{encoded_title_utf8}.webm'
        response.headers["Content-Disposition"] = disposition

        # expose_headers can also be set directly on the response or in the main CORS() call
        # response.headers["Access-Control-Expose-Headers"] = "Content-Disposition" # Already handled by CORS() above

        return response

    except json.JSONDecodeError:
        # Log the error and potentially the non-JSON output for debugging
        # print(f"Failed JSON Decode. Stderr: {stderr_info.decode('utf-8', errors='replace')}")
        # print(f"Failed JSON Decode. Stdout: {stdout_info.decode('utf-8', errors='replace')}")
        return "Error: Could not decode JSON from video info.", 500
    except UnicodeDecodeError as e:
        # This shouldn't happen now with errors='replace', but good to keep
        return f"Error: Could not decode subprocess output - {e}", 500
    except Exception as e:
        # Log the exception here for debugging
        # import traceback
        # traceback.print_exc()
        return f"An unexpected error occurred: {str(e)}", 500

if __name__ == "__main__":
    # host='0.0.0.0' listens on all interfaces (useful in a container or VM).
    # Use debug=True only for development, never in production.
    app.run(host='0.0.0.0', port=5000, debug=False)

Encoding the Content-Disposition header this way should eliminate the UnicodeEncodeError: 'latin-1', because the header value is now pure ASCII. Decoding stderr with errors='replace' separately guards against a UnicodeDecodeError if yt-dlp emits bytes that are not valid UTF-8.
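
One way to sanity-check the header logic without starting the server is to build the value and decode it again. This is a minimal sketch: build_disposition and extended_filename are hypothetical helpers that mirror the header code above, with the assumptions that '/' is percent-encoded (safe='') and that the fallback keeps printable ASCII only.

```python
import re
import urllib.parse


def build_disposition(title: str, ext: str = "webm") -> str:
    """Build a Content-Disposition value with an ASCII fallback and a UTF-8 filename*."""
    fallback = re.sub(r'[^\x20-\x7E]+', '_', title)        # printable ASCII only
    fallback = re.sub(r'[\\/*?:"<>|]', '_', fallback)      # drop filename-hostile chars
    fallback = fallback.strip(' ._') or 'downloaded_audio'
    encoded = urllib.parse.quote(f"{title}.{ext}", safe='')
    return f'attachment; filename="{fallback}.{ext}"; filename*=UTF-8\'\'{encoded}'


def extended_filename(disposition: str) -> str:
    """Recover the UTF-8 filename from the filename*= parameter."""
    _, _, value = disposition.partition("filename*=UTF-8''")
    return urllib.parse.unquote(value)


header = build_disposition("日本語 mix / demo")
header.encode("latin-1")  # would raise UnicodeEncodeError if the header were unsafe

print(header)
print(extended_filename(header))
```

If the round trip returns the original title plus extension, a client that understands filename* should save the file under its real name, while older clients fall back to the ASCII filename.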

