
Block invalid language tags #980

Merged
pmachapman merged 1 commit into main from source_target_language on Jun 29, 2026


@pmachapman pmachapman commented Jun 25, 2026

Collaborator

Fixes #967.

I also took the opportunity to improve unit testing and API documentation around this area.



@pmachapman pmachapman requested review from Enkidu93 and ddaspit June 25, 2026 03:40

codecov-commenter commented Jun 25, 2026


Codecov Report

❌ Patch coverage is 70.58824% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.45%. Comparing base (68e38d2) to head (553a981).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Serval/src/Serval.Client/Client.g.cs | 54.54% | 2 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #980      +/-   ##
==========================================
- Coverage   65.47%   65.45%   -0.02%     
==========================================
  Files         387      387              
  Lines       21704    21718      +14     
  Branches     2764     2769       +5     
==========================================
+ Hits        14210    14215       +5     
- Misses       6458     6466       +8     
- Partials     1036     1037       +1     


@ddaspit ddaspit left a comment

Contributor


@ddaspit reviewed 5 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93 and pmachapman).


src/Serval/src/Serval.Translation/Features/EngineTypes/GetLanguageInfo.cs line 85 at r1 (raw file):

            return NotFound();
        }
        catch (ArgumentException)

This is a very generic exception. I don't want to accidentally mask bugs in the code. Could we use a more specific exception or catch the exception closer to where the LanguageTagService is called and indicate that it is invalid in the response?

@Enkidu93

Collaborator

src/Machine/src/Serval.Machine.Shared/Services/LanguageTagService.cs line 5 at r1 (raw file):

public partial class LanguageTagService : ILanguageTagService
{
    [GeneratedRegex("^[A-Za-z0-9_-]+$")]

I wonder how restrictive we want to make this. I'm worried about long codes we get out of Paratext project settings files, like cod-Scrip-RG-dialect. We have a number of these engines already in production. I assume that if they are true language codes, we won't hit this regex in the code. But if they aren't (maybe a made-up code for an under-documented language or new dialect) and they use a dialect-differentiator code (not sure what the official name is?) that has characters other than these, we could be preventing them from running a build. Do you know whether only a-z can be used in these codes? None of the ones in production have anything other than a-z. I'm thinking here of the final segment of a <LanguageIsoCode> tag in a Settings.xml. I can't seem to find documentation for this.

@Enkidu93 Enkidu93 left a comment

Collaborator


@Enkidu93 reviewed all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on pmachapman).

@pmachapman pmachapman force-pushed the source_target_language branch from dfc4d6c to 553a981 Compare June 29, 2026 01:51

@pmachapman pmachapman left a comment

Collaborator Author


@pmachapman made 2 comments.
Reviewable status: 1 of 5 files reviewed, 2 unresolved discussions (waiting on ddaspit and Enkidu93).


src/Machine/src/Serval.Machine.Shared/Services/LanguageTagService.cs line 5 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I wonder how restrictive we want to make this. I'm worried about long codes we get out of Paratext project settings files, like cod-Scrip-RG-dialect. We have a number of these engines already in production. I assume that if they are true language codes, we won't hit this regex in the code. But if they aren't (maybe a made-up code for an under-documented language or new dialect) and they use a dialect-differentiator code (not sure what the official name is?) that has characters other than these, we could be preventing them from running a build. Do you know whether only a-z can be used in these codes? None of the ones in production have anything other than a-z. I'm thinking here of the final segment of a <LanguageIsoCode> tag in a Settings.xml. I can't seem to find documentation for this.

I tried to make this broadly support BCP-47 (https://en.wikipedia.org/wiki/IETF_language_tag), with the primary purpose of not allowing injection, rather than ensuring a valid language code (I think the TryParse above will take care of known validity). As far as I am aware (based on the Wikipedia article) language codes are alphanumeric with underscores and dashes.
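
For illustration, the allow-list check discussed in this thread can be sketched as below. This is a minimal sketch, not the code from the PR: the class `LanguageTagGuard` and the method `IsSafe` are hypothetical names, and only the pattern itself is taken from the snippet quoted in the review. It shows that a long Paratext-style tag passes, while input carrying shell metacharacters or non-ASCII letters is rejected.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; it is not the LanguageTagService from this PR.
public static class LanguageTagGuard
{
    // Same allow-list pattern as quoted in the review:
    // ASCII letters, digits, underscores, and dashes, at least one character.
    private static readonly Regex SafeTag =
        new Regex("^[A-Za-z0-9_-]+$", RegexOptions.CultureInvariant);

    // Returns true when the tag is non-empty and contains only allow-listed characters.
    // This guards against injection; it does not check that the tag is a registered language.
    public static bool IsSafe(string tag)
    {
        return !string.IsNullOrEmpty(tag) && SafeTag.IsMatch(tag);
    }
}
```

Under these assumptions, `LanguageTagGuard.IsSafe("cod-Scrip-RG-dialect")` is accepted, while a value such as `"en; rm -rf /"` is refused before it can reach any downstream command or path.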


src/Serval/src/Serval.Translation/Features/EngineTypes/GetLanguageInfo.cs line 85 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is a very generic exception. I don't want to accidentally mask bugs in the code. Could we use a more specific exception or catch the exception closer to where the LanguageTagService is called and indicate that it is invalid in the response?

Done. I have changed this to an InvalidOperationException, as there is already a BadRequestExceptionFilter which will catch this and output a 400 HTTP status code.
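
The idea behind this change can be sketched without ASP.NET Core. The following is a framework-free illustration under stated assumptions: `ErrorMapping`, `ValidateTag`, and `Handle` are hypothetical names, and `Handle` only stands in for what an exception filter such as the BadRequestExceptionFilter mentioned above might do. The real filter is not shown in this thread.

```csharp
using System;

// Hypothetical sketch; the real BadRequestExceptionFilter is not shown in this thread.
public static class ErrorMapping
{
    // Throws the exception type that the handling layer translates into a 400 response.
    public static string ValidateTag(string tag)
    {
        if (string.IsNullOrEmpty(tag))
            throw new InvalidOperationException("Language tag is empty.");

        foreach (char c in tag)
        {
            bool allowed = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '_' || c == '-';
            if (!allowed)
                throw new InvalidOperationException($"Invalid language tag: '{tag}'.");
        }
        return tag;
    }

    // Stand-in for the filter: only InvalidOperationException becomes a 400.
    // Any other exception propagates, so unrelated bugs are not masked.
    public static int Handle(Func<string> action)
    {
        try
        {
            action();
            return 200;
        }
        catch (InvalidOperationException)
        {
            return 400;
        }
    }
}
```

In this sketch an unexpected exception type still surfaces to the caller instead of being reported as an invalid tag, which is the concern raised in the review comment.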

@Enkidu93

Collaborator

src/Machine/src/Serval.Machine.Shared/Services/LanguageTagService.cs line 5 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I tried to make this broadly support BCP-47 (https://en.wikipedia.org/wiki/IETF_language_tag), with the primary purpose of not allowing injection, rather than ensuring a valid language code (I think the TryParse above will take care of known validity). As far as I am aware (based on the Wikipedia article) language codes are alphanumeric with underscores and dashes.

Right, I think it would be an odd case, but I'm just thinking of folks who might supply a yet-unregistered/made-up language code. Then it would get past the TryParse. If it also had a variant specified with non-ASCII letters/numbers, we would throw an error. From the Wikipedia article, it's not clear that the variant names are completely standardized. I guess I was mainly wondering if Paratext itself allows you to enter non-ASCII letters/numbers in these settings, because you just pass this tag from the Settings.xml on to Serval from SF, correct?

@Enkidu93 Enkidu93 left a comment

Collaborator


@Enkidu93 reviewed 5 files and all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on ddaspit and pmachapman).

@ddaspit ddaspit left a comment

Contributor


:lgtm:

@ddaspit reviewed 4 files and all commit messages, made 2 comments, and resolved 1 discussion.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93).


src/Machine/src/Serval.Machine.Shared/Services/LanguageTagService.cs line 5 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Right, I think it would be an odd case, but I'm just thinking of folks who might supply a yet-unregistered/made-up language code. Then it would get past the TryParse. If it also had a variant specified with non-ASCII letters/numbers, we would throw an error. From the Wikipedia article, it's not clear that the variant names are completely standardized. I guess I was mainly wondering if Paratext itself allows you to enter non-ASCII letters/numbers in these settings, because you just pass this tag from the Settings.xml on to Serval from SF, correct?

Paratext does not allow you to enter non-ASCII characters in a private-use subtag. All other subtags must be valid and known.

@Enkidu93 Enkidu93 left a comment

Collaborator


@Enkidu93 made 1 comment and resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on pmachapman).


src/Machine/src/Serval.Machine.Shared/Services/LanguageTagService.cs line 5 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Paratext does not allow you to enter non-ASCII characters in a private-use subtag. All other subtags must be valid and known.

OK, perfect, thank you!

@pmachapman pmachapman merged commit 4c3df05 into main Jun 29, 2026
2 checks passed
@pmachapman pmachapman deleted the source_target_language branch June 29, 2026 20:16

Development

Successfully merging this pull request may close these issues.

RCE via SourceLanguage / TargetLanguage

4 participants