From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
	ip-172-31-74-118.ec2.internal
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=3.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.6
Path: eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Newsgroups: comp.lang.ada
Subject: Re: Some advice required [OT]
Date: Tue, 28 Dec 2021 00:29:54 +0000
Organization: A noiseless patient Spider
Message-ID: <87y2456071.fsf@bsb.me.uk>
References: <7bede061-4b0f-4029-beb1-1056637e57d6n@googlegroups.com>
	<j2tlk8FneraU1@mid.individual.net>
	<49538254-21ed-4fd0-8316-1bccc7d3c635n@googlegroups.com>
	<87sfue8a0v.fsf@bsb.me.uk>
	<31332c61-a370-43a5-bbe0-efe338ee6d8fn@googlegroups.com>
	<87fsqd7oz8.fsf@bsb.me.uk>
	<e5ab38f4-b456-4e99-b702-be4f88e8b5c1n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="006a31cbf3e05dab45838058281df2a3";
	logging-data="26737"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/VEHGYNF7RRA8X0p/wfa6lGCTQnDSM5aA="
Cancel-Lock: sha1:GMX0Gf5HDh5CTgTex65RZgGdgZ8=
	sha1:VWlpvvPFxynu48+RCyk3a9GFzK0=
X-BSB-Auth: 1.58a2ef3709a584525dd7.20211228002954GMT.87y2456071.fsf@bsb.me.uk
Xref: reader02.eternal-september.org comp.lang.ada:63289
List-Id: <comp.lang.ada>

Laurent <lutgenl@icloud.com> writes:

> On Monday, 27 December 2021 at 21:49:18 UTC+1, Ben Bacarisse wrote:
>> Laurent <lut...@icloud.com> writes: 
>> 
>> > On Monday, 27 December 2021 at 14:14:42 UTC+1, Ben Bacarisse wrote: 
>> >> Laurent <lut...@icloud.com> writes: 
>> >> 
>> >> > On Monday, 27 December 2021 at 12:16:27 UTC+1, Niklas Holsti wrote: 
>> >> > 
>> >> >> Sorry, but I found your problem description impossible to understand. 
>> >> >> Try to describe more clearly the experiment that is done, the structure 
>> >> >> of the data the experiment provides (the meaning of the Excel rows and 
>> >> >> columns), and the statistic you want to compute. 
>> >> > 
>> >> > Sorry tried to keep it short, was too short. 
>> >> > 
>> >> > Columns are the antimicrobial drugs 
>> >> > Rows are the microorganism. 
>> >> > 
>> >> > So every cell contains a result of S, I, R or simply an empty cell 
>> >> > 
>> >> > S = Susceptible 
>> >> > I = Intermediate 
>> >> > R = Resistant 
>> >> > 
>> >> > empty cell <S<I<R 
>> >> > 
>> >> > If a patient has 3 strains of the same microorganism but with 
>> >> > different resistance profiles I have to find the most resistant 
>> >> > one. Or if they are different I keep them all. 
>> >> > 
>> >> > I have no idea how to explain what I am doing to the compiler. 
>> >> I think when you can explain it to people, you'll be able to code it. I 
>> >> am still struggling to understand what you need. 
>> >> > Why I would choose result from strain B over the result from strain A. 
>> >> > 
>> >> > strain A: SSSRSS 
>> >> > strain B: SSRRRS 
>> >> Let's space it out 
>> >> 
>> >>           drug 1  drug 2  drug 3  drug 4  drug 5  drug 6
>> >> strain A    S       S       S       R       S       S
>> >> strain B    S       S       R       R       R       S
>> >> 
>> >> You want to choose B because it is resistant to more drugs, yes? 
>> >> 
>> > 
>> > Yes indeed 
>> > 
>> >> I think, from the ordering you give, you need a measure that treats an R 
>> >> as "more important" than any "I" which is "more important" than an "S". 
>> >> (We will come to empty cells later.) 
>> >> 
>> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
>> >> number. In base 10, the strains score 
>> >> 
>> >>           R  S  I
>> >> strain A  1  5  0  = 150
>> >> strain B  3  3  0  = 330
>> >> 
>> >> Now, in fact, you don't need to use base 10. The smallest base you can 
>> >> use is one more than the maximum number of test results. If there can 
>> >> be up to 16 tests (say) the score is 
>> >> 
>> >> n(R)*17*17 + n(S)*17 + n(I). 
>> >> 
>> >> If this suits your needs, we can consider empty cells later on. It's 
>> >> not at all clear to me how to compare 
>> >> 
>> >> strain C R____ 
>> >> strain D RRSSSS 
>> >> 
>> >> Strain C is "less resistant" but only because there is not enough 
>> >> information. In fact it seems more serious as it is resistant to all 
>> >> tested drugs. 
>> >> 
>> > 
>> > Strain C is probably garbage and I would remove it. With a bit of luck I will have the result with the same sample Id which would be complete. 
>> > 
>> >> And then what about 
>> >> 
>> >> strain D SR 
>> >> strain E RS 
>> >> 
>> > 
>> > Yes those are the cases which are annoying me. 
>> > 
>> > That's why I came up with the idea of multiplying the value of the result 
>> > (S=1, I=2 and R=3) with the position of the value. Tried it with 
>> > triplets but there will still be cases where different results will
>> > give the same numeric value. Ignoring empty cells for the moment.
>> > 
>> > Strain F: SSR (1*1 + 2*1 + 3*3) = 12 and Strain G: RRS (1*3 + 2*3 + 3*1) = 12 
>> > give the same numerical value, but they are different resistance 
>> > profiles, so in this case I would keep both. 
>> > 
>> > How do I prevent that from happening?
>> Can you first say why the suggestion I made is not helpful? 
>> 
>> -- 
>> Ben.
>
> You mean that one:
>
>> >> I think you need to treat the number of Rs, Is and Ss like digits in a 
>> >> number. In base 10, the strains score 
>> >> 
>> >>           R  S  I
>> >> strain A  1  5  0  = 150
>> >> strain B  3  3  0  = 330
>> >> 
>
> Different resistance profiles same result:

I don't yet understand the requirements so I am taking it in stages.
The first requirement seemed to be "more or less resistant".  To do that
you can use digits in a large enough base but this will make the number
of Rs, Ss and Is paramount.  Is that acceptable as a first step?
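
A minimal sketch of that first step, in Python only for brevity.  The
names count_score and max_tests are made up for illustration, the digit
order R, I, S is an assumption taken from the empty < S < I < R ordering
given earlier in the thread, and empty cells are not handled:

```python
from collections import Counter

def count_score(profile: str, max_tests: int = 16) -> int:
    """Score a profile by its counts of R, I and S, treated as digits.

    The base is one more than the largest possible count, so one digit
    can never overflow into the next.
    """
    if len(profile) > max_tests:
        raise ValueError("profile longer than max_tests")
    base = max_tests + 1
    counts = Counter(profile)  # missing letters count as 0
    # Most significant digit first: R, then I, then S (assumed order).
    return (counts["R"] * base + counts["I"]) * base + counts["S"]

strain_a = count_score("SSSRSS")
strain_b = count_score("SSRRRS")
print(strain_a, strain_b, strain_b > strain_a)
```

With the default of 16 tests the base is 17, so strain A scores 294 and
strain B scores 870.  Profiles with the same counts, such as SSR and
RSS, still get the same score.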

In order to help people to be able to make further suggestions, maybe
you could give the relative ordering you would like to see between the
following sets of profiles.  For example, between SSR, SRS and RSS, I
think the order you want is RSS > SRS > SSR.

1: SSR, SRS, RSS

2: RSI, RIS, SRI, SIR, IRS, ISR

3: SSSR, SSRS, SRSS, RSSS

4: RRSSS, RSSSR, RIIII, SRIII, RSIII, IIIRS, IIISR

It's possible you could make do with an extra field (or digits) that
gives some measure of the relative ordering between otherwise similar
sequences.  For example, using base 10 (for convenience of arithmetic)
both RRSSI and RSRSI would score 212xx but the last xx would reflect the
positioning of the results in the sequence.  There are lots of ways to do
this.  One way would be to use, as you were thinking, some sort of weighted
count.  Using S=0, I=1 and R=2 with weights

54321
RRSSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+4) + 0*(3+2) + 1*1 = 21219
RSRSI scores 2*10000 + 1*1000 + 2*100 + 2*(5+3) + 0*(4+2) + 1*1 = 21217
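
The arithmetic above could be sketched like this (Python again, purely
as an illustration).  The name tiebreak_score is hypothetical, and the
sketch assumes profiles without empty cells that are short enough for
every field to fit in its decimal digits:

```python
from collections import Counter

# Values used only for the position-weighted part of the score.
VALUE = {"S": 0, "I": 1, "R": 2}

def tiebreak_score(profile: str) -> int:
    """Counts of R, I and S as three decimal digits, followed by a
    two-digit position-weighted sum (leftmost result weighs most)."""
    n = len(profile)
    # With at most 9 results every count fits in one digit and the
    # weighted sum is at most 2*45 = 90, so it fits in two digits.
    if n > 9:
        raise ValueError("profile too long for this base-10 layout")
    counts = Counter(profile)
    # Leftmost result gets weight n, rightmost gets weight 1.
    weighted = sum(VALUE[r] * (n - i) for i, r in enumerate(profile))
    return (counts["R"] * 10000 + counts["I"] * 1000
            + counts["S"] * 100 + weighted)

print(tiebreak_score("RRSSI"))  # 21219
print(tiebreak_score("RSRSI"))  # 21217
```

Duplicates are still possible with this weighting: RSSR and SRRS both
score 20210, since 2*(4+1) = 2*(3+2).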

If you absolutely must never get duplicate numbers, but you still want
to preserve a strict specified ordering, I think you will have much more
work to do.

Getting a unique number for each case is trivial (but the ordering will
be wrong) and getting an ordering that rates every R > every I > every S
is also trivial, but there will be lots of duplicates.  It's finding the
balance that's going to be hard.
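
To illustrate the first half of that trade-off: one trivial way to get a
unique number (an assumption for illustration, not something settled in
this thread) is to read a profile as a base-3 numeral with S=0, I=1 and
R=2.  The name unique_code is hypothetical, and uniqueness only holds
between profiles of the same length with no empty cells:

```python
VALUE = {"S": 0, "I": 1, "R": 2}

def unique_code(profile: str) -> int:
    """Read the profile as a base-3 numeral, leftmost result first."""
    code = 0
    for result in profile:
        code = code * 3 + VALUE[result]
    return code

# Distinct profiles of equal length always get distinct codes ...
print(unique_code("SSR"), unique_code("RRS"))     # 2 24
# ... but the order follows position, not the number of Rs.
print(unique_code("RSSS") > unique_code("SRRR"))  # True
```

Here RSSS (54) outranks SRRR (26) even though SRRR is resistant to
three drugs rather than one, which is the kind of wrong ordering meant
above.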

-- 
Ben.